Humans get smarter by thinking. In particular, they deduce correct conclusions from known premises, or they notice and resolve internal inconsistencies. As long as they are more likely to correct wrong beliefs to right ones than vice versa, they converge toward being smarter.
AIs are close to, or already at, that level of ability, and as soon as it is taken advantage of, they will self-improve very fast.
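To make the "more likely to correct than to corrupt" condition concrete, here is a quick Monte Carlo sketch (my own toy model, with made-up numbers): if each round of reflection fixes a wrong belief with higher probability than it breaks a right one, the fraction of correct beliefs drifts upward toward an equilibrium.

```python
# Toy simulation of the claim above: p_fix > p_break implies the share of
# correct beliefs converges upward. All parameters are illustrative.
import random

def simulate(n_beliefs=1000, steps=50, p_fix=0.2, p_break=0.05, start_correct=0.3):
    correct = [random.random() < start_correct for _ in range(n_beliefs)]
    for _ in range(steps):
        for i, ok in enumerate(correct):
            if not ok and random.random() < p_fix:
                correct[i] = True       # wrong belief corrected
            elif ok and random.random() < p_break:
                correct[i] = False      # right belief corrupted
    return sum(correct) / n_beliefs

print(simulate())   # drifts toward p_fix / (p_fix + p_break) = 0.8
```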
Yes. Current reasoning models like DeepSeek-R1 rely on verified math and coding datasets to compute the reward signal for RL. It's only a side effect that they also get better at other reasoning tasks outside math and programming puzzles. But in theory we don't actually need strict verifiability for a reward signal, only your much weaker probabilistic condition. In the future, a model could check the goodness of its own answers, at which point we would have a self-improving learning process that doesn't need any external training data for RL.
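A minimal sketch of what that could look like, assuming the weaker condition holds. This is not DeepSeek's or anyone's actual pipeline; `generate`, `self_check`, and `update_policy` are hypothetical placeholders for a real policy model, grader, and optimizer step.

```python
# Sketch: use the model's own verification pass as a probabilistic reward
# signal for RL, instead of an external ground-truth verifier.
import random

def generate(prompt: str) -> str:
    """Placeholder: sample a candidate answer from the policy model."""
    return f"candidate answer to: {prompt}"

def self_check(prompt: str, answer: str) -> float:
    """Placeholder: the same (or another) model grades the answer in [0, 1].
    Key assumption: the grade only needs to be right more often than not,
    not strictly verifiable, for the expected update to point the right way."""
    return random.random()

def update_policy(prompt: str, answer: str, reward: float) -> None:
    """Placeholder: e.g. a PPO/GRPO-style gradient step weighted by the reward."""
    pass

prompts = ["prove the sum of two even numbers is even",
           "write a function that reverses a linked list"]

for step in range(3):
    for prompt in prompts:
        answer = generate(prompt)
        reward = self_check(prompt, answer)   # no external labels or test cases
        update_policy(prompt, answer, reward)
```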
And it is likely that such a probabilistic condition holds for many informal tasks. We know that checking a result is usually easier than producing it, even outside exact domains: for example, it's much easier to recognize a good piece of art than to create one. This seems to be a fundamental fact about computation. It is perhaps a generalization of the widely believed conjecture that P ≠ NP: problems whose solutions can be verified quickly (NP) cannot, in general, also be solved quickly (P).
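The verification/generation gap is easy to see in a toy NP-complete problem like subset-sum: checking a proposed certificate is a one-line sum, while the obvious way to find one searches exponentially many subsets.

```python
# Toy illustration of the gap: verification is cheap, generation is expensive.
from itertools import combinations

def verify(numbers, target, certificate):
    """Fast check: does the proposed subset of indices really sum to the target?"""
    return set(certificate) <= set(range(len(numbers))) and \
           sum(numbers[i] for i in certificate) == target

def solve(numbers, target):
    """Brute-force search: up to 2^n candidate subsets."""
    for r in range(len(numbers) + 1):
        for subset in combinations(range(len(numbers)), r):
            if sum(numbers[i] for i in subset) == target:
                return subset
    return None

numbers = [3, 34, 4, 12, 5, 2]
target = 9
cert = solve(numbers, target)               # expensive in general
print(cert, verify(numbers, target, cert))  # checking the answer is cheap
```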
Just listened to the IMO team at OpenAI talk about their model. https://youtu.be/EEIPtofVe2Q?si=kIPDW5d8Wjr2bTFD Some notes:
Yikes. If they're telling the truth about all this---particularly the "useful for RL on hard-to-verify-solution-correctness problems"---then this is all markedly timeline-shortening. What's the community consensus on how likely this is to be true?
I have no idea what the community consensus is. I doubt they're lying.
For anyone who already had short timelines, this couldn't shorten them that much. For instance, 2027 or 2028 is very soon, and https://ai-2027.com/ assumed there would be successful research done along the way. So for me, very little more "yikes" than yesterday.
It does not seem to me like this is the last research breakthrough needed for full-fledged AGI, either. LLMs are superhuman at tasks with little or no context buildup, but they haven't solved context management (be that through long context windows, memory retrieval techniques, online learning, or anything else).
I also don't think it's surprising that these research breakthroughs keep happening. Remember that their last breakthrough (Strawberry, o1) was "make RL work". This one might be something like "make reward prediction and MCTS work", as in MuZero, or some other banal thing that worked on toy cases in the '80s but was nontrivial to reimplement in LLMs.
I suspect the most significant advance exposed to the public this week is Claude Plays Pokémon. There, Claude maintains a notebook in which it logs its current state so that it can reason over that state later.
This is akin to what I do when I'm exploring or trying to learn a topic: I spend a lot of time figuring out and focusing on the most important bits of context I need in order to understand the next step.
An internal notebook lets chain of thought be applied naturally to attention, and it provides a continuous working memory. Until now, chain of thought had to repeat itself to retain a thought in working memory across a long time scale. Now it can just put the facts that most deserve attention into its notebook.
I suspect this makes thinking on a long time scale way more powerful and that future models will be optimized for internal use of such a tool.
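Here is a minimal sketch of the notebook idea as I understand it (my own assumptions, not Anthropic's implementation): instead of carrying the full history at every step, the agent keeps a small persistent notebook of facts it decides matter, and only that notebook plus the latest observation goes into the prompt. `call_model` is a hypothetical stand-in for a real LLM call.

```python
# Sketch of a notebook-as-working-memory agent loop.

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return "ACTION: explore north\nNOTE: picked up the key in the cave"

notebook: list[str] = []          # persistent working memory across steps

def step(observation: str) -> str:
    prompt = (
        "Notebook (your long-term working memory):\n"
        + "\n".join(f"- {fact}" for fact in notebook)
        + f"\n\nCurrent observation:\n{observation}\n\n"
        "Reply with an ACTION line, and optionally a NOTE line containing "
        "any fact worth remembering later."
    )
    reply = call_model(prompt)
    for line in reply.splitlines():
        if line.startswith("NOTE:"):
            notebook.append(line.removeprefix("NOTE:").strip())
    return next((l for l in reply.splitlines() if l.startswith("ACTION:")), reply)

print(step("You are standing at the mouth of a cave."))
print(notebook)   # the noted fact persists into every future step's prompt
```

The design point is that the notebook replaces repetition in the chain of thought: a fact written once stays in scope indefinitely without being re-derived or re-stated.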
Unfortunately, continuity is an obvious precursor to developing long-term goals.