Next.js App Router + React Server Components Demo

new
past
show
ask
show
jobs
submit

▲Finding all regex matches has always been O(n²) (iev.ee)

253 points by lalitmaganti 5 days ago | 68 comments

burakemir 7 hours ago [-]

I believe the technique described is similar to what I published here (this is not about all matches, but left-longest/shortest)

"Compiling regular expressions to sequential machines" (2005) ACM Symposium of Applied Computing https://dl.acm.org/doi/10.1145/1066677.1066992

(Note that there is a small mistake in the paper due to ambiguity, found by Vladimir Gapeyev. So the result does not hold in the generality stated but only for a special case when there is no ambiguity at "the end". There went my first PhD student publication...)

The two pass technique used to be implemented in the Scala compiler at the time (building DFAs upfront) , which could do regexps over lists and other sequences, but the approach would not work for top-down tree regexps so I did not pursue that and it got ripped out later.

It is good to see derivative regular expressions, Brzozowski/"position automata" used and discussed.

conartist6 21 hours ago [-]

@ievev Have you ever seen an implementation like @bablr/regex? https://github.com/bablr-lang/regex-vm It's an NFA system so it isn't going to be winning any awards for throughput, but in this particular case it does seem to completely avoid the complexity blowup. It will run your heap out of memory though on really big inputs.

The strategy this engine uses is just to evolve the state as a function of time. A match can be successfully completed, yet not be emitted because some other longer match could still supercede it by being longer or more leftmost.

I tried the pattern /d+s+/g on 10,000,000 digits followed by no space. It took 4 seconds to return no results. I tried it on 20,000,000 digits followed by no space. It took 8 seconds to return no results. I tried on 100,000,000 and I ran out of heap space.

Test setup: https://gist.github.com/conartist6/051838025af1e04d966e03aa9...

hyperpape 18 hours ago [-]

Returning no results is going to be linear in any DFA or NFA based implementation, though. You go character by character, and confirm that there are no matches.

It's only when you return multiple matches that the engines have a problem and become superlinear.

ogogmad 10 hours ago [-]

It might be possible to use a randomised algorithm to estimate the number of matches in only linear time.

tzs 18 hours ago [-]

Do any RE implementations do anything like the query planning that databases do or the rewrites that compilers do that can replace the RE with a different sequence of REs or string searches that might execute faster?

For example in the expression in your example (I'm assuming based on your description of the test data that /d+s+/ means the same as /\d+\s+/ in the RE engines I've used) any match must contain a digit followed by a space.

A scan for all places where a digit is followed by whitespace with each such place then being checked to find the length of the string of whitespace starting there and the length of the string of digits ending there should be linear time and constant heap space.

gilleain 11 hours ago [-]

Not an answer, but related, I was looking at Memgraph (graph db) query planning for Cypher, which is similar (ish) to what you are asking about:

https://memgraph.com/docs/querying/query-plan

The reason why I was looking was to do query planning for a declarative pattern-finding in atomic structures (hierarchical labelled graphs, I suppose) although I'm slowly realising just how insanely hard it might be to get it to work efficiently!

conartist6 18 hours ago [-]

You're correct, I accidentally omitted backslashes on \d and \s. I checked and the pattern was correct during the test.

ieviev 21 hours ago [-]

I have not heard of this before, i will have a look!

10000truths 23 hours ago [-]

Restricting regex features to guarantee time complexity works, but it requires sacrificing potentially useful features like backtracking (or in the article's case, constraining oneself to fixed-upper-bound-length needles).

In a real-world deployment where you want to run any arbitrary regex in an idiot/malice-proof manner, the best solution is the same solution you'd use for running any other kind of untrusted code - sandbox it! A good regex API should limit its execution time and memory consumption and return a timeout error in case those limits are exceeded. Ideally, those parameters would be configurable at the API level. Unfortunately, the only regex libraries I know of that get this right are .NET's standard library Regex API and the third-party regex package in Python.

ieviev 20 hours ago [-]

> constraining oneself to fixed-upper-bound-length needles

wait! you haven't reached the important part of the post yet

gpvos 22 hours ago [-]

I find it weird to have the Perl innovation (?:...) be called "traditional regex". Perl was rather innovative back then, even if it's more than 30 years ago now. Traditional regex is what came before it (grep -E being the most advanced form). I wonder what counts as nontraditional in the author's eyes.

coldtea 12 hours ago [-]

>even if it's more than 30 years ago now.

Here's your answer.

>Traditional regex is what came before it

No, that's "ancient regex".

trashb 10 hours ago [-]

> No, that's "ancient regex".

wouldn't the "ancient regex" be the ed "g/re/p" version?

  -E, --extended-regexp
    Interpret PATTERNS as extended regular expressions (EREs, see below).
  -G, --basic-regexp
    Interpret PATTERNS as basic regular expressions (BREs, see below).  This is the default.
  -P, --perl-regexp
    Interpret I<PATTERNS> as Perl-compatible regular expressions (PCREs).  This option is experimental when combined with the -z (--null-data) option, and grep  -P  may  warn of unimplemented features.

From the manpage it seems my grep make distinction between "Extended" "Basic" and "Perl" regexes.

coldtea 8 hours ago [-]

>wouldn't the "ancient regex" be the ed "g/re/p" version?

That's "prehistoric regex"

leoc 22 hours ago [-]

In my head a regex-like thing of Perl origin is known as a 'perlex'.

MathMonkeyMan 18 hours ago [-]

can call it PCRE now

leoc 10 hours ago [-]

'PCRE' never really caught on broadly, unfortunately.

It seems that Boost actually uses 'perlex' https://www.boost.org/doc/libs/1_31_0/libs/regex/doc/syntax_... .

ieviev 21 hours ago [-]

Haha, you're right about that. I was looking for another word for "default"

adzm 24 hours ago [-]

Is there any reason that RE#'s two-pass approach couldn't be adopted by other regex engines?

Ah, there is a post with more detail about RE# and discussion here recently that I must have missed: https://news.ycombinator.com/item?id=47206647

ieviev 23 hours ago [-]

The part that makes it difficult is that it doesn't return the same matches, it returns almost the same matches but not exactly.

But if PCRE semantics isn't set in stone then i hope leftmost longest could be the default some day. There's a lot of nice things you get for free with the two pass approach

nine_k 23 hours ago [-]

> nearly everything that matters in practice: where the matches are, how long they are, and how many there are

I would say that regexes that matter in practice, e.g. when digging through logs, have clear boundaries that curb the pathological backtracking behavior. In particular, I find it difficult to imagine a practical need to find all matches of an expression like /.*a|b/, as shown in the article. Realistically you'd have to handle /\b.*a|b\b/, or similar, because realistically when you need all matches, you don't want intersecting matches. This means you want to proceed past the end of the n-th match to look for n+1-th match, and never want to use indeterminate prefixes like /.*a/.

This OTOH gives a reasonably useful heuristic if your regexp comes from an untrusted source and could be adversarial. Check that it does not start with a prefix with a Kleene star, like /a*/. Require at least one positive match (in each alternate branch). Of course, /a+b|c/ would still be quadratic if your text is long sequences of "a" interspersed with characters other than "b". But this, again, is more of a theoretical case, to my mind.

ieviev 23 hours ago [-]

> I would say that regexes that matter in practice, e.g. when digging through logs, have clear boundaries that curb the pathological backtracking behavior

I agree with you in the sense that most practical regexes do not expose this quadratic blowup (from all matches) but i do not think the same about backtracking. The effect of backtracking is immediately clear when you're searching inside text without a clear anchor like \A or ^ or a very rare string prefix. It is much more visible with either larger patterns or larger character classes like unicode

hrmtst93837 12 hours ago [-]

That only works when you control both the pattern and the shape of the input. Once you accept user regexes or start scanning big log dumps without tight boundaries the weird cases stop being weird.

Greedy globs plus unanchored branches can turn a boring batch job into a CPU bonfire, and defensive limits seems a lot less optional when one dumb pattern can pin a core for minutes. A timeout or a restricted regex subset is usually cheaper than pretending everybody writes sane patterns.

hrmtst93837 23 hours ago [-]

[dead]

mananaysiempre 4 hours ago [-]

A couple of notes:

- The Austin Group has recently accepted lazy quantifiers for inclusion into the next release of POSIX[1]. They think they have worked out a reasonable declarative definition for what they should mean. I am less than sure of that, but either way dismissing the whole thing as irredeemably tied to backtracking seems inappropriate.

- Once again the generalization in the title is AFAIK largely correct for industrial engines, but incorrect—arguably to the point of being misleading—for academic work. Just looking into the “parsing” subfolder of my papers stash reveals a 1998 paper[2] on linear-time maximal-munch tokenization, so at the very least the problem was recognized, and IIRC there’s a bunch of related work around that paper too.

- It is true that you can’t stream the haystack in the general case, but to what precise extent you can is an interesting question with a known algorithmic answer[3].

[1] https://www.austingroupbugs.net/view.php?id=793, https://www.austingroupbugs.net/view.php?id=1329, https://www.austingroupbugs.net/view.php?id=1857, see also the mailing list.

[2] Reps (1998), ACM TOPLAS 20(2), 259–273, https://dl.acm.org/doi/10.1145/276393.276394, https://research.cs.wisc.edu/wpis/papers/toplas98b.pdf.

[3] Grathwohl, Henglein, Rasmussen (2014), ICTAC ’14, LNCS 8687, 224–240, https://link.springer.com/chapter/10.1007/978-3-319-10882-7_..., https://utr.dk/pubs/files/grathwohl2014-0-paper.pdf.

ieviev 3 hours ago [-]

> lazy quantifiers for inclusion into the next release of POSIX

That is surprising!

We've found that in certain simpler scenarios it's possible to use complement to express lazy quantifiers, but in the general case it appears very fragile.

a simple example: a.*?b can be rewritten to something like (a.*b&~(a.*b_+)) in RE# syntax, which effectively means "there is a match, but must not contain a shorter match"

> 1998 TOPLAS

I will read through it properly, but i can already see from page 8 that it requires a table of (DFA size x input length) which makes me very suspicious that it is more of a thought exercise than a real world solution.

> It is true that you can’t stream the haystack in the general case, but to what precise extent you can is an interesting question with a known algorithmic answer[3].

Thank you for this, this is an interesting paper

mananaysiempre 2 hours ago [-]

>> lazy quantifiers for inclusion into the next release of POSIX

> That is surprising!

I mean, POSIX BREs (only!) also include (single-digit) backreferences. Surprisingly, (with such a restriction) this is actually polynomial[1,2] if impractical (h/t 'burntsushi for the reference[3]). But I still wouldn’t take POSIX as the arbiter of sanity in this case. Thus far I’m not even convinced their text actually amounts to a well-defined ordering on parse trees.

I don’t know if nongreedy quantifiers are all that interesting without match groups, though, so this isn’t a particularly burning question in my view.

[1] https://branchfree.org/2019/04/04/question-is-matching-fixed...

[2] https://github.com/travisdowns/polyregex

[3] https://news.ycombinator.com/item?id=40431198

saidnooneever 9 hours ago [-]

i want to say threats dont only come from inputs gathered over the internet.

there are many reasons to exploit things. one example is local privilege escalation. If your service has high privileges and somehow someone can edit an input source for it (like some file it reads thats accessible to the user, or even by tricking the service into looking at the wrong file) it will still be a useful vector.

now this might seem far fetched, but a lot of exploits i've seen actually do this type of stuff.

for example you find a program which gatheres some debug or support info package, and touches a directory which is user accessible. user put some kind of link or tricky file in there and boom, service compromised.

I would only not use hardened mode if the regex is actually embedded directly into the program, because that would atleast require the program itself to be touched before it breaks (which would already require the same level of privileges as the program runs on).

So, long story short. Be aware that if your program touches local resources that are not matching its own privilege level, like some log locations, tmp, etc , be sure that stuff doesn not get turned into regex or use the hardened mode to prevent problems.

its not always about users providing some input via an webpage or some online service that causes something to break..

davidgrenier 9 hours ago [-]

I wonder how gracefully redgrep handles this. This tool hasn't been talked about since the year of its release. If I recall correctly, it doesn't handle some obstruse regexes the way conventional tools do however.

https://github.com/google/redgrep

ieviev 7 hours ago [-]

I have not verified it but i assume matching the whole line grep-style would bypass this problem entirely

Since a regex with ^<pattern>$ can not have overlapping matches it should guarantee linearity, given a linear engine. Or in the case of preprocessing lines first and then running is_match it's also linear

mjmas 14 hours ago [-]

> no capture groups > [...] > no lazy quantifiers - .*?

Here's the caveats.

And so running a regex engine on the matches seems like it would get you back to O(regexlen * haystacklen * matchcount) or roughly O(mn²) again.

ieviev 9 hours ago [-]

It's still O(n * m). The matches are non-overlapping, even if you run a separate engine on the matches post-match it will at most traverse the full input once across all matches.

yxhuvud 11 hours ago [-]

Hmm. I kinda wonder how something like a regexp engine based on marpa would fall. It would probably be slower in the common well behaved cases, as well as preprocessing. But I wonder if there are any cases that couldn't be done in a O(n) way if you limit yourself to only one match (instead of finding all possible matches like it actually can do). Marpa definitely handles most degenerate regexp cases linearly (once you make the match generation not do overlapping entries), but there may be some regexp feature it can't handle?

Bringing a fully general CFG parser to parse regexps would be like hunting mosquitos with a nuke though..

nitely 20 hours ago [-]

FWIW, nim-regex does achieve linear time in the rebar test[0], even if the regex includes capture groups. It's NFA based.

[0]: https://github.com/BurntSushi/rebar/pull/20#issuecomment-256...

ieviev 20 hours ago [-]

Oh that is interesting! I haven't even looked Nim regex until now, is it similar to the approach in Go?

nitely 19 hours ago [-]

It's similar to RE2, but it lacks the on the fly DFA, ie: it's just the classic Thompson's NFA with some tweaks. It does not implement find all the same way, though.

p0w3n3d 12 hours ago [-]

I always thought that finite automata can go with O(n) through the string, with some perks like multistate (i.e. when finding [ab](b)+a in abba you get first state at character 1 and second first state at the 2nd character. Although I understand that regex syntax can be complicated and you can do really O(n^2) in it, but maybe the easier regex could be compiled the easier way...

ummonk 19 hours ago [-]

Great stuff.

I would argue that hardened mode should be default though, similar to how siphash is the default hashing function in Rust hash maps. Faster mode should be opt in if the user is confident that the supplied data is nonmalicious and they need the speed up.

ieviev 18 hours ago [-]

I’m still open to it and thinking about it actually. I will explore if it’s possible to eliminate the large losses on common patterns and if it turns out it is then it’s a no brainer.

Going forward this and the extended operators + large pattern perf will hopefully be a strong selling point to gain more traction

babelfish 20 hours ago [-]

Cursor just wrote a great blog post on this - "Fast regex search: indexing text for agent tools" https://cursor.com/blog/fast-regex-search

vanderZwan 7 hours ago [-]

> i think i'll rest for a bit after this. i can only do 80-hour weeks for so long

Jesus Christ, 80 hours?! I really hope the author seriously takes a proper break! I mean, they seem to be riding that incredible high that comes from having a breakthrough in deeply understanding a really tough problem after thinking about it for too long, so I kind of get it, but that is also all the more reason to take good care the precious brain that now stores all that knowledge, before it burns out!

hawtads 12 hours ago [-]

The original Kleene Star Regex was invented to model neural networks. Have you tried throwing a transformer at the problem /s? Also O(n²) but at least you get hardware acceleration ¯\(ツ)/¯

Here's Kleene's Representation of Events in Nerve Nets and Finite Automata:

https://www.rand.org/content/dam/rand/pubs/research_memorand...

22 hours ago [-]

thaumasiotes 22 hours ago [-]

> the problem we're talking about in this post (finding all longest matches without quadratic blowup)

Wait, what? I thought this was about finding all matches. With a minor tweak to the opening example:

We want to match `(.*a | b)` against `bbbbbabbbbb`.

I want to detect each `b` individually, and I also want to detect `bbbbba`, `bbbba`, `bbba`, `bba`, `ba`, and `a`. That's what it means to find all matches.

ieviev 22 hours ago [-]

Good catch! I changed this to leftmost-longest nonoverlapping matches so it's not misleading

thaumasiotes 10 hours ago [-]

> I changed this to leftmost-longest nonoverlapping matches so it's not misleading

Well, you changed the sentence I quoted. That doesn't really address what I was objecting to; I quoted that sentence because that's the first point in the essay where it's clear that you don't mean "all matches". That's where the reader becomes confused, but they don't become confused because that sentence is unclear - they become confused because everything else on the page is misleading, and that sentence unambiguously contradicts the misleading impression created by everything else.

Your headline says "all matches", your subheadline says "all matches", and your text both before and (still) after the sentence you changed frequently says "all matches", and in none of those cases do you actually mean "all matches". You mention that there is an existing solution, REmatch, but only to dismiss it as "solving a different problem", the problem of finding all matches. You also note that "all matches" is inherently quadratic because the size of the output is potentially quadratic, leading me to wonder why it's a surprise that asking for all matches yields quadratic performance.

ieviev 10 hours ago [-]

I don't agree that overlapping matches is the only interpretation of "all matches". it very clearly does return all matches and it removes overlap

The post states clearly "finding all leftmost-longest non-overlapping matches without quadratic blowup"

The engines mentioned at the beginning all return nonoverlapping matches. Basic document search returns nonoverlapping matches.

Finding overlapping matches is exotic and rarely ever used in practice outside of network security. It does solve a different problem

ivlozada 14 hours ago [-]

[dead]

bedardbrandon89 24 hours ago [-]

[dead]

openclaw01 17 hours ago [-]

[dead]

derodero24 8 hours ago [-]

[dead]

zahlman 23 hours ago [-]

> search a document for a pattern and it takes a second. search one a hundred times larger and it doesn't take a hundred seconds - it can take almost three hours.

Most of this is about quadratic time find-all operations where a search operation is linear. But it's also still possible to get quadratic behaviour out of a single search without catastrophic backtracking, more easily than you might expect. In late January to early February, Tim Peters was talking about an example of this on the Python forums (see e.g. https://discuss.python.org/t/add-re-prefixmatch-deprecate-re...) and also related the experience of trying to diagnose the issue with AI (see https://discuss.python.org/t/claude-code-how-much-hype-how-m... and onward). Peters' example was:

  \d+\s+

on a string containing only digits, a prefix match takes O(n) time as it considers every possible end position for the digit, and immediately sees no following whitespace. But the search is quadratic because it has to repeat that O(n) work at every position; the regex engine can't track the fact that it's already examined the string and found no whitespace, so it re-tries each digit match length.

(This is arguably "backtracking" since it tries the longest match first, but clearly not in a catastrophic way; if you use `\d+?` instead then of course it only searches forward but is still O(n). It actually is slower in my testing in the Python implementation; I don't exactly know why. As noted in the discussion, the possessive quantifier `\d++` is considerably faster, and of course doesn't backtrack, but still causes O(n^2) searching. The repeated attempts to match `\s+` aren't the problem; the problem is repeatedly looking for digits in places where digits were already found and rejected.)

The way to fix this proposed in the discussion is to use a negative lookbehind assertion before the digits: `(?<!\d)\d+\s+`. This way, the regex engine can bail out early when it's in the middle of a digit string; if the previous character was a digit, then either `\d+\s+` doesn't match here, or it would have matched there.

A simpler idea is to just search for `\d\s+`, or even `\d\s` — since these will be present if and only if `\d+\s+` is. This way, though, you still need to do extra work with the partial match to identify the start and end of the full match. My first idea was to use positive lookbehind for the digits, since the lookbehind match doesn't need to backtrack. In fact lookbehinds require a fixed-length pattern, so this is really just a more complicated way to do the `\d\s+` simplification.

----

> Hyperscan (and its fork Vectorscan) is a true linear-time all-matches regex engine. it achieves this by using "earliest match" semantics - reporting a match the moment the DFA enters a match state, instead of continuing to find the longest one.

Is this not just equivalent to forcing "reluctant" quantifiers (`\d+?`) everywhere?

ieviev 22 hours ago [-]

with all-matches semantics it returns a significantly higher number of matches than leftmost greedy.

eg. /abc*/ and abccccc will return you matches at ab|c|c|c|c|c|

I think it's very common and ok that people reason about other engines in terms of backtracking but it works very differently. And fixed length lookbehinds are more of a Java/Python thing, other engines support all lookbehinds.

The main idea of linear regex and intuitive semantics is that it should be declarative and the engine does whatever is the fastest without you having to worry about it. Instead of describing character by character how to perform the search and where it can blow up, think of it as just a specification. Then you can truly express whatever is the shortest/most convenient to explain.

Something i'm still trying to figure out and perhaps failing to understand is what are the killer features of backtracking regex that you would really miss if you were to use linear regex? It would help me a lot to know, i'm trying to convince others to make the switch

MoonZ 23 hours ago [-]

> In fact lookbehinds require a fixed-length pattern

Just a small note: some regex engines support "variable length lookbehind", check the last column on this wikipedia article : https://en.wikipedia.org/wiki/Comparison_of_regular_expressi...

zahlman 22 hours ago [-]

Good to know. Although a lookbehind for `\d+` doesn't really gain anything over a lookbehind for `\d` anyway; they match in the same circumstances, just with different results.

Izkata 23 hours ago [-]

If there's supposed to be a literal asterisk in there somewhere, you can escape it with a backslash. Right now two paragraphs are italic because of mismatched asterisks.

zahlman 22 hours ago [-]

Thanks. There are no asterisks in the regexes; I had simply missed the closing asterisk on some intentional emphasis. (And then I also had to fix some escaping inserted by the system to try to correct for the actual problem.)

mjmas 15 hours ago [-]

> the search is quadratic because it has to repeat that O(n) work at every position

The problem is that this is one of the regexes that backtracking engines have a bad time with.

With a NFA implementation it can be done in O(regexlen * haystacklen) time, though that only holds for true regular expressions (no backreferences).

https://swtch.com/~rsc/regexp/regexp1.html

And then for re.search, since the NFA wants to just do it once, it should run it with the pattern as

  ^.*?(\d+\s+).*$

(where *? is a non-greedy repeat)

nmilo 23 hours ago [-]

[flagged]

vntok 22 hours ago [-]

The article must be pretty great if that's the strongest argument against it. Thanks! Added to the reading list.

uwais12 19 hours ago [-]

[flagged]

nadavdebi 1 days ago [-]

[flagged]

ChadNauseam 1 days ago [-]

[flagged]

halperter 1 days ago [-]

I'm of the opinion that their post is still human-written. They describe in their first blog post that it's human written (which could be a lie, but oh well) and their other blog posts seem to have the same informal lowercase hyphenated writing style. Sentance length varies throughout the posts. Sure, you could prompt this writing style, but I lean towards thinking that the piece is human written.

dooglius 1 days ago [-]

The fact that terms like Aho-Corasick, PLDI, Go, etc. are properly capitalized, including if they begin sentences, but otherwise sentences are uncapitalized, makes me think it's an explicit LLM instruction "don't capitalize the start of sentences" rather than writing style.

srcreigh 24 hours ago [-]

ChatGPT also loves Aho-Corasick and seems to overuse it as an optimization fall back idea. ChatGPT has suggested the algorithm to me but the code ended up slowing down a lot.

sigbottle 20 hours ago [-]

ChatGPT was heavily RL'd on competitive programming in 2025, and aho-corasick is a traditional algorithm in the competitive programming space.

tux3 24 hours ago [-]

No, this is just what that writing style looks like. Names and acronyms are usually capitalized normally.

I keep being surprised by the magnitude of the disconnect between this place and the other circles of hell. I'd have thought the Venn diagram would have a lot more overlap.

dooglius 23 hours ago [-]

Oh the venn diagram might be big, the HN population just has a lot of variance I think, and is less of a community per se. I don't doubt what you're saying, though in the grand scheme of things, I think the "too lazy to hit shift" population dwarfs any of these groups.

tux3 23 hours ago [-]

Yeah, I can agree with the variance. Except that the "too lazy to hit shift" community is not something I would ever confuse with people writing long form articles about their regex engine research that they'll be presenting at PLDI.

The confusion might be understandable for people who have never encountered this style before, but that's still a very uncharitable take about an otherwise pretty interesting article.

jlarocco 1 days ago [-]

What's with this silly "all lower case" style lately?

Jack Dorsey's layoff message last month did the same thing.

Is it some kind of "Prove you're not an AI by purposely writing like an idiot" or something?

gfody 24 hours ago [-]

not anti-capitalist, just a subtle preference away from capitalism

ieviev 23 hours ago [-]

It is human written and i've thoroughly went over every paragraph but i did use some help with wording. i suppose it does show now that you mention it

ChadNauseam 2 hours ago [-]

It does but apparently not to the point that it bothers most people. So I guess I wouldn't worry about it. Keep up the good work on making regex faster!

mkehrt 1 days ago [-]

This reads nothing like any AI text I've read before.

IshKebab 24 hours ago [-]

This is pretty clearly written by a human.

cubefox 23 hours ago [-]

I also have the impression that the writing was at least "enhanced" with an LLM. For example, the amount of " - " dashes (over 60) is much higher than in a normal text.

By the way, I consider it a pretty disgraceful defect of HN that people routinely "flag" opinions merely because they disagree. The flag is for spam, not for downvoting.

ChadNauseam 3 hours ago [-]

Haha. But I guess it proves my assertion that "we can still tell" was wrong. Scary to think about the future though. Soon less RLHF'd models will be available, and it won't be so easy to recognize their tells. I don't think the whole post was written by an LLM in one prompt and I'm sure it was actually a lot of work to write, but yeah, it was definitely enhanced by one.

It reminds me of how peer review is supposed to be blind, but certain authors like Simon Peyton Jones have such a distinctive voice that surely reviewers instantly know when a paper is by him. The models have been successfully made to talk in a particular way, and it's very clear once you becomes familiar with it. My guess is that most HN users are not familiar and I just sound like a crazy person to them

The author admitted it in another comment anyway https://news.ycombinator.com/item?id=47494833

Rendered at 19:47:59 GMT+0000 (Coordinated Universal Time) with Vercel.