I got the highest score on ARC-AGI again swapping Python for English (jeremyberman.substack.com)
109 points by freediver 10 hours ago | 36 comments
0x20cowboy 4 hours ago [-]
> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?

Because they are not.

Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.

It’s the same reason why most of the people who pass your leetcode tests don’t actually know how to build anything real. They are taught to the test, not to reality.

gwd 2 hours ago [-]
> Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.

Do submarines swim? I don't really care, as long as it gets me where I want to go. The fact is that just two days ago, I asked Claude to look at some reasonably complicated concurrent code to which I had added a new feature, and to list what tests needed to be added; then, when I asked GPT-5 to add them, it one-shot nailed the implementations. I've written up a gist of it here:

https://gitlab.com/-/snippets/4889253

Seriously, just read the description of the test it's trying to write.

In order to one-shot that code, it had to understand:

- How the cache was supposed to work

- How conceptually to set up the scenario described

- How to assemble golang's concurrency primitives (channels, goroutines, and waitgroups), in the correct order, to achieve the goal.
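To make that concrete, here is a minimal sketch of the shape of that kind of test -- a hypothetical Cache API and toy implementation standing in for my actual package, not the code from the gist:

    // A sketch of the test pattern: N goroutines all ask the cache for the
    // same key at once, gated by a channel so they genuinely race, collected
    // by a WaitGroup; the assertion is that the expensive fill ran exactly once.
    package cache

    import (
        "sync"
        "sync/atomic"
        "testing"
    )

    // Cache is a stand-in for the package under test: it deduplicates fills per key.
    type Cache struct {
        mu   sync.Mutex
        vals map[string]string
    }

    func New() *Cache { return &Cache{vals: make(map[string]string)} }

    // Get returns the cached value for key, calling fill at most once per key.
    func (c *Cache) Get(key string, fill func() string) string {
        c.mu.Lock()
        defer c.mu.Unlock()
        if v, ok := c.vals[key]; ok {
            return v
        }
        v := fill()
        c.vals[key] = v
        return v
    }

    func TestConcurrentGetFillsOnce(t *testing.T) {
        c := New()

        var fills int32              // how many times the expensive fill ran
        start := make(chan struct{}) // gate: released once every worker exists
        var wg sync.WaitGroup

        const workers = 32
        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                <-start // block until all goroutines are spawned, then race
                got := c.Get("answer", func() string {
                    atomic.AddInt32(&fills, 1)
                    return "42"
                })
                if got != "42" {
                    t.Errorf("Get returned %q, want %q", got, "42")
                }
            }()
        }

        close(start) // release every worker at once
        wg.Wait()

        if n := atomic.LoadInt32(&fills); n != 1 {
            t.Errorf("fill ran %d times, want 1", n)
        }
    }

The start channel is what makes the race real: every goroutine blocks on it until all of them exist, so the Gets genuinely overlap instead of running one after another.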

Did it have a library of concurrency testing patterns in its head? Probably -- so do I. Had it ever seen my exact package before in its training? Never.

I just don't see how you can argue with a straight face that this is "pattern matching". If that's pattern matching, then pattern matching is not an insult.

If anything, the examples in this article are the opposite. Take the second example, which is basically 'assemble these assorted pieces into a rectangle'. Nearly every adult has assembled a minimum of dozens of things in their lives; many have assembled thousands of things. So it's humans in this case who are simply "pattern matching questions on a contrived test", and the LLMs, which almost certainly didn't have a lot of "assemble these items" in their training data, that are reasoning out what's going on from first principles.

Akronymus 1 hours ago [-]
> I just don't see how you can argue with a straight face that this is "pattern matching". If that's pattern matching, then pattern matching is not an insult.

IMO it's still "just" a very good autocomplete. No actual reasoning, just a lot of statistics about which token to spit out next.

NoahZuniga 44 minutes ago [-]
> Do submarines swim?

That's the main point of the parent comment. Arguing about the definition of "reasoning" or "pattern matching" is just a waste of time. What really matters is whether it produces helpful output. Arguing about that is way better!

Instead of saying "It's just pattern matching -> It won't improve the world", make an argument like: "AIs seem to have trouble specializing like humans -> adopting AI will increase error rates in business processes -> due to the number of possible edge cases, most people will get into an edge case with no hope of escaping it -> many people's lives will get worse".

The first example relies on us agreeing on the definition of pattern matching, and then drawing a conclusion based on how those words feel. This has no hope of convincing me if I don't like your definition! The second is an argument that could potentially convince me, even if I'm an AI optimist. It is also, by itself, an interesting line of reasoning.

ozgung 5 minutes ago [-]
No, it's not "just a very good autocomplete". I don't know why people keep repeating this (it's wrong), and I find it an extremely counterproductive position. Some people just love to dismiss the capabilities of AI with a very shallow understanding of how it works. Why?

It generates words one by one, like we all do. That doesn't mean it does just that and nothing else. Token-by-token generation is the mechanics of how these models are trained, how they do inference and, most importantly, how they communicate with us. It doesn't define what they are or what their limits are. Claiming otherwise is reductionism that ignores the mathematical complexity of a giant neural network.

ACCount37 2 hours ago [-]
"Not understanding or reasoning" is anthropocentric cope. There is very little practical difference between "understanding" and "reasoning" implemented in human mind and that implemented in LLMs.

One notable difference, however, is that LLMs disproportionately suck at spatial reasoning. Which shouldn't be surprising, considering that their training datasets are almost entirely text. The ultimate wordcel makes for a poor shape rotator.

All ARC-AGI tasks are "spatial reasoning" tasks. They aren't in any way special. They just force LLMs to perform in an area they're spectacularly weak at. And LLMs aren't good enough yet to be able to brute force through this innate deficiency with raw intelligence.

HighGoldstein 2 hours ago [-]
> There is very little practical difference between "understanding" and "reasoning" as implemented in the human mind and as implemented in LLMs.

Source?

ACCount37 1 hours ago [-]
The primary source is measured LLM performance on once-human-exclusive tasks, such as high-end natural language processing or commonsense reasoning.

Those things were once thought to require a human mind - clearly, not anymore. Human commonsense knowledge can be both captured and applied by a learning algorithm trained on nothing but a boatload of text.

But another important source is loads and loads of mechanistic interpretability research that has tried to actually pry the black box open and see what happens on the inside.

This has found some amusing artifacts, such as latent world models that can be extracted from the hidden state, or neural circuits corresponding to high-level abstractions being chained together to obtain the final outputs. Very similar to human "abstract thinking" in function, despite being implemented on a substrate of floating-point math rather than wet meat.

NooneAtAll3 2 hours ago [-]
...literally the benchmarks the post is all about?

The practical difference is about results - and the results are here.

wiseowise 4 hours ago [-]
[flagged]
bloqs 4 hours ago [-]
Please consider a less emotive, flaming/personal tone in the future; Hacker News is much more readable without it!

I would broadly agree that it's a bit far, but the OP's point does have some validity; it's often the same formulaic methodology.

modeless 6 hours ago [-]
I've been testing LLMs on Sokoban-like puzzles (in the style of ARC-AGI-3) and they are completely awful at them. It really highlights how poor their memory is. They can't remember abstract concepts or rules between steps, even if they discover them themselves. They can only be presented with lossy text descriptions of such things which they have to re-read and re-interpret at every step.

LLMs are completely helpless on agentic tasks without a ton of scaffolding. But the scaffolding is inflexible and brittle, unlike the models themselves. Whoever figures out how to reproduce the functions of this type of scaffolding within the models, with some kind of internal test-time-learned memory mechanism, is going to win.

sunrunner 3 hours ago [-]
I'm not sure how similar this is, but I tried the same thing quite a while back with a simple 5x5 nonogram (Picross) and had similar difficulties.

I found not only incorrect 'reasoning', but also that even after I was explicit about why a certain deduction was not correct, the same incorrect deduction would appear again later, and this happened over and over.

Also, there's already a complete database of valid answers at [1], so I'm not sure why the correct answer couldn't just come from that, and the 'reasoning' can be 'We solved this here, look...' ;)

[1] The wonderful https://pixelogic.app/every-5x5-nonogram
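For contrast, the search space is so small that a dumb brute force settles any 5x5 puzzle with no 'reasoning' at all. A rough Go sketch, with made-up clues rather than anything from that database:

    // Brute-force a 5x5 nonogram: try every grid, keep the ones whose
    // row/column run lengths match the clues. Naive and unoptimized, but
    // 2^25 grids is a small job for a machine.
    package main

    import (
        "fmt"
        "reflect"
    )

    // runs returns the lengths of consecutive filled runs in a line,
    // e.g. [1 1 0 1 0] -> [2 1].
    func runs(line []int) []int {
        out := []int{}
        n := 0
        for _, cell := range line {
            if cell == 1 {
                n++
            } else if n > 0 {
                out = append(out, n)
                n = 0
            }
        }
        if n > 0 {
            out = append(out, n)
        }
        return out
    }

    func main() {
        // Made-up clues (top-to-bottom rows, left-to-right columns).
        rowClues := [5][]int{{1, 1}, {5}, {1}, {3}, {1, 1}}
        colClues := [5][]int{{1, 1}, {2, 1}, {3}, {2, 1}, {1, 1}}

        for mask := 0; mask < 1<<25; mask++ { // each 5x5 grid is a 25-bit number
            var grid [5][5]int
            for i := 0; i < 25; i++ {
                grid[i/5][i%5] = (mask >> i) & 1
            }
            ok := true
            for r := 0; r < 5 && ok; r++ {
                ok = reflect.DeepEqual(runs(grid[r][:]), rowClues[r])
            }
            for c := 0; c < 5 && ok; c++ {
                col := []int{grid[0][c], grid[1][c], grid[2][c], grid[3][c], grid[4][c]}
                ok = reflect.DeepEqual(runs(col), colClues[c])
            }
            if ok {
                for _, row := range grid {
                    fmt.Println(row)
                }
                fmt.Println()
            }
        }
    }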

Akronymus 1 hours ago [-]
> I found not only incorrect 'reasoning', but also that even after I was explicit about why a certain deduction was not correct, the same incorrect deduction would appear again later, and this happened over and over.

Because it's in the context window, and a lot of training material refers back to earlier content later on, the model is trained to bring that stuff up again and again -- even if it's in the window as a negative.

M4v3R 6 hours ago [-]
I wonder if scaffolding synthesis is the way to go. Namely, the LLM itself first reasons about the problem and creates scaffolding for a second agent that does the actual solving, all inside a feedback loop that adjusts the scaffolding based on results.
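Something like this, perhaps (a hand-wavy sketch; llmComplete is a hypothetical stand-in for whatever model API you'd actually call, and the verifier in main is a toy):

    package main

    import "fmt"

    // llmComplete is a hypothetical placeholder for a real model API call.
    func llmComplete(prompt string) string {
        // ... call your provider of choice here ...
        return ""
    }

    // solveWithSynthesizedScaffolding: an outer model writes the scaffolding
    // (rules, memory format, checks), an inner agent solves within it, and the
    // scaffolding is revised whenever the result fails verification.
    func solveWithSynthesizedScaffolding(problem string, verify func(string) (bool, string), maxRounds int) (string, error) {
        scaffold := llmComplete("Design step-by-step scaffolding (rules, memory format, checks) for solving:\n" + problem)

        for round := 0; round < maxRounds; round++ {
            answer := llmComplete("Follow this scaffolding exactly.\n\nScaffolding:\n" + scaffold + "\n\nProblem:\n" + problem)

            ok, feedback := verify(answer)
            if ok {
                return answer, nil
            }
            // Feed the failure back to the outer model and adjust the scaffolding.
            scaffold = llmComplete("This scaffolding produced a wrong answer.\n\nScaffolding:\n" +
                scaffold + "\n\nFailure:\n" + feedback + "\n\nRevise the scaffolding.")
        }
        return "", fmt.Errorf("no verified solution after %d rounds", maxRounds)
    }

    func main() {
        answer, err := solveWithSynthesizedScaffolding(
            "Rotate this 3x3 grid 90 degrees clockwise: ...",
            func(a string) (bool, string) { return a != "", "empty answer" }, // toy verifier
            3,
        )
        fmt.Println(answer, err)
    }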
modeless 6 hours ago [-]
In general I think the more of the scaffolding that can be folded into the model, the better. The model should learn problem solving strategies like this and be able to manage them internally.
sixo 5 hours ago [-]
I toyed around with the idea of using an LLM to "compile" user instructions into a kind of AST of scaffolding, which can then be run by another LLM. It worked fairly well for the kind of semi-structured tasks LLMs choke on, like "for each of 100 things, do...", but I haven't taken it beyond a minimal impl.
harshitaneja 5 hours ago [-]
I am working on something similar but with an AST for legal documents. So far, it seems promising but still rudimentary.
plantain 5 hours ago [-]
If you've ever used Claude Code + Plan mode, you know that exactly this is true.
albertzeyer 5 hours ago [-]
This sounds interesting.

I would really like to read a full research paper made out of this, which describes the method in more detail, gives some more examples, does more analysis on it, etc.

Btw, this uses LLMs purely at the text level? Why not images? Most of these patterns are easy to detect at the image level, but I assume that when presented as text they're much harder.

> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?

I think this argument is a bit flawed. Yes, you can define AGI as being better than (average) humans at every possible task. But isn't that very arbitrary? Isn't it more reasonable to expect that different intelligent systems (including animals and humans) have different strengths, and that it is unreasonable to expect one system to be better at everything? Maybe it's more reasonable to define ASI that way, but even for ASI, if a system is already better at a majority of tasks (though not necessarily at every task), I think that should already count as ASI. Maybe being better at every possible task is simply not possible: you could always design a task that is very specifically tailored to human intelligence.

bubblyworld 4 hours ago [-]
I suspect (to use the language of the author) current LLMs have a bit of a "reasoning dead zone" when it comes to images. In my limited experience they struggle with anything more complex than "transcribe the text" or similarly basic tasks. Like I tried to create an automated QA agent with Claude Sonnet 3.5 to catch regressions in my frontend, and it will look at an obviously broken frontend component (using puppeteer to drive and screenshot a headless browser) and confidently proclaim it's working correctly, often making up a supporting argument too. I've had much more success passing the code for the component and any console logs directly to the agent in text form.

My memory is a bit fuzzy, but I've seen another QA agent that takes a similar approach of structured text extraction rather than using images. So I suspect I'm not the only one finding image-based reasoning an issue. Could also be for cost reasons though, so take that with a pinch of salt.

ACCount37 25 minutes ago [-]
LLM image frontends suck, and a lot of them suck big time.

The naive approach of "use a pretrained encoder to massage the input pixels into a bag of soft tokens and paste those tokens into the context window" is good enough to get you a third of the way to humanlike vision performance - but struggles to go much further.

Claude's current vision implementation is also notoriously awful. Like, "a goddamn 4B Gemma 3 beats it" level of awful. For a lot of vision-heavy tasks, you'd be better off using literally anything else.

Garlef 1 hours ago [-]
That's a super neat approach.

But the core issue seems to be: How do you come up with the fitness function that drives the evolutionary process without human intervention in the first place?

(I've tried something similar with a coding agent where I let the agent modify parts of its system prompt... But it got stuck very fast since there was no clear fitness function)

Davidzheng 6 hours ago [-]
Actually really promising stuff. I think a lot of the recent advances in the last 6 months to a year have been in the outer loop (for example, the Google Deep Think model that got IMO gold and the OpenAI IMO gold both use substantive outer-loop search strategies [though it's unclear what these are], maybe to parallelize some generation/verification process). So there's no reason why we can't have huge advances in this area even outside the industry labs, in my view (I'm uninformed in general, so take this comment with a large grain of salt).
wiz21c 3 hours ago [-]
Isn't the author actually overfitting a solution? He'll surely beat ARC-AGI, but that will be all.
deyiao 1 hours ago [-]
I don't think so. The author isn't training an LLM, but rather using an LLM to solve a specific problem. This method could also be applied to solve other problems.
justatdotin 4 hours ago [-]
> LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones.

blank stare

mjburgess 3 hours ago [-]
We have dead-zones in abductive reasoning, not in induction or deduction. Almost all failures of reasoning in people are in abducing which model describes the situation at hand.

e.g., we can apply the rule, "-A cannot follow from A", etc. regardless of the A

e.g., we always know that if the number of apples is 2, then it cannot be any of "all numbers other than 2" -- a constraint which quantifies over all numbers

You will not find a "gap" for a given number, whereas with LLMs, gaps of this kind are common.

rel_ic 1 hours ago [-]
> we can apply the rule, "-A cannot follow from A", etc. regardless of the A

You can't think of any domains where we are unable to apply this rule? I feel like I'm surrounded by people claiming "A, therefore -A!!"

And if I'm one of them, and this were a reasoning dead-zone for me, I wouldn't be able to tell!

mjburgess 50 minutes ago [-]
That's an abductive failure to recognise that something is A, and something else is not-A.

I don't see cases where people recognise the contradiction and then perform it.

virgilp 2 minutes ago [-]
How can you know? One could argue that the entire phenomenon of cognitive dissonance is "people (internally) recognize the contradiction and then perform it"
didroe 3 hours ago [-]
>With RL, models no longer just learn what sounds correct based on patterns they've seen. They learn what words to output to be correct. RL is the process of forcing the pre-trained weights to be logically consistent.

How does Reinforcement Learning force the weights to be logically consistent? Isn't it just about training using a coarser/more-fuzzy granularity of fitness?

More generally, is it really solving the task if it's given a large number of attempts and an oracle to say whether it's correct? Humans can answer the questions in one shot and self-check the answer, whereas this is like trial and error with an external expert who tells you to try again.

amelius 2 hours ago [-]
This sounds like it is just slightly smarter than brute forcing your way to a solution.

Oh well, more support for my prediction: nobody will win a Nobel prize for reaching AGI.

jokoon 5 hours ago [-]
Those are bold claims
imiric 2 hours ago [-]
Congrats, you made LLMs perform slightly better at a contrived puzzle. This finally proves that we've cracked intelligence and are well on our way towards AGI.
pilooch 7 hours ago [-]
Congrats, this solution resembles AlphaEvolve. Text serves as the high-level search space, and genetic mixing (MAP-Elites in AE) merges attempts at lower levels.
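For anyone unfamiliar with the MAP-Elites part: instead of keeping a single best candidate, you keep the best candidate per behaviour "cell", mutate random elites, and only replace a cell's elite when a child beats it. A generic sketch (not AlphaEvolve's or the author's actual code; the descriptor and the toy mutate/evaluate in main are made up, and in practice both would be LLM calls):

    package main

    import (
        "fmt"
        "math/rand"
    )

    // Candidate is a text-level solution attempt (a prompt, a program, a plan).
    type Candidate struct {
        Text    string
        Fitness float64
    }

    // descriptor buckets a candidate by a coarse behavioural feature
    // (here, arbitrarily, its length class); MAP-Elites keeps one elite per bucket.
    func descriptor(c Candidate) int { return len(c.Text) / 100 }

    // mapElites: pick a random elite, mutate it, score it, and keep it only if
    // it beats the current elite of its own cell.
    func mapElites(seed Candidate, mutate func(Candidate) Candidate, evaluate func(Candidate) float64, steps int) map[int]Candidate {
        archive := map[int]Candidate{}
        seed.Fitness = evaluate(seed)
        archive[descriptor(seed)] = seed

        for i := 0; i < steps; i++ {
            keys := make([]int, 0, len(archive))
            for k := range archive {
                keys = append(keys, k)
            }
            parent := archive[keys[rand.Intn(len(keys))]] // random elite as parent

            child := mutate(parent)
            child.Fitness = evaluate(child)

            cell := descriptor(child)
            if best, ok := archive[cell]; !ok || child.Fitness > best.Fitness {
                archive[cell] = child // new elite for this behaviour cell
            }
        }
        return archive
    }

    func main() {
        elites := mapElites(
            Candidate{Text: "initial attempt"},
            func(c Candidate) Candidate { return Candidate{Text: c.Text + " tweaked"} }, // toy mutation
            func(c Candidate) float64 { return float64(len(c.Text)) },                   // toy fitness
            50,
        )
        fmt.Println(len(elites), "elite cell(s) kept")
    }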
doctorpangloss 7 hours ago [-]
You would be interested in DSPy.