Dario Amodei says AI will be writing 90% of the code in 3-6 months and almost all of the code in 12 months. I am with Arthur B here: I expect a lot of progress and change very soon, but I would still take the other side of that bet. The catch is: I don't see the benefit to Anthropic of running the hype machine in overdrive on this, at this time, unless Dario actually believes it.
Which means that, if this does not in fact happen in 3-6 months, it should be taken as evidence that there's some unknown-to-us reason for Anthropic to be running the hype machine in this way, and we should therefore update towards their "AGI by 2026-2027" forecasts likewise failing.
Not a strong update: obviously this is just something Dario said off the cuff, whereas the 2026-2027 figure is (presumably) the result of careful consideration by the whole Anthropic team, and he might not have a good model of how slow technological adoption is while still having a good model of the roadmap to AGI, etc.
Still, this is incremental evidence towards "Dario do be running his mouth on AI progress".
These tests are a good measure of human general intelligence
Human general intelligence. I think it's abundantly clear that the cognitive features that are coupled in humans are not necessarily coupled in LLMs.
Analogy: in humans, the ability to play chess is coupled with general intelligence; we can expect grandmasters to be quite smart. Does that imply Stockfish is a general-purpose hypergenius?
People said the same about understanding of context, hallucinations, and other stuff
Of note: I have never said anything of that sort, nor nodded along when people said it. I think I've had to eat crow after making a foolish "LLMs Will Never Do X" claim a total of zero times (having previously made a cautiously small but nonzero number of such claims).
We'll see if I can keep up this streak.
Might lead to widespread chaos, the internet becoming unusable due to AI slop and/or AI agents hacking everything, etc. It won't be pleasant, but it's not omnicide-tier.
Valid complaint, honestly. I wasn't really going for "good observables to watch out for" there, though, just for making the point that my current model is at all falsifiable (which is, I think, what @Jman9107 was mostly angling for, no?).
The type of evidence I expect to actually end up updating on, in real life, if we are in the LLMs-are-AGI-complete timeline, is one of these:
Reasoning models' skills starting to generalize in harder-to-make-legible ways that look scary to me.
Some sort of subtle observable or argument that's currently an unknown unknown to me, which, once I think about it for a bit, turns out to upend my whole model.
We should have empirical evidence about this, actually, since the LW team has been experimenting with a "virtual comments" feature. @Raemon, the EDT issue aside, were the comments any good if you forgot they were written by an LLM? Can you share a few examples (preferably a lot)?
Because "RL on passing precisely defined unit tests" is not "RL on building programs that do what you want", and is most definitely not "RL on doing novel useful research".
I'd say long reasoning wasn't really elicited by CoT prompting
IIRC, "let's think step-by-step" showed up in benchmark performance basically immediately, and that's the core of it. On the other hand, there's nothing like "be madly obsessed with your goal" that's known to boost LLM performance in agent settings.
There were clear "signs of life" on extended inference-time reasoning; there are (to my knowledge) none on agent-like reasoning.
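For concreteness, the asymmetry I mean looks like this (prompt wordings are illustrative; the second one is made up precisely because nothing like it is known to work):

```python
# Illustrative only: the contrast between the two kinds of prompt tweak.

def cot_prompt(question: str) -> str:
    # Appending this one phrase was enough to show early "signs of life"
    # on reasoning benchmarks (zero-shot chain-of-thought prompting).
    return f"{question}\nLet's think step by step."

def hypothetical_agency_prompt(task: str) -> str:
    # No analogous phrase is known to reliably boost agentic performance;
    # this wording is invented here to illustrate the missing counterpart.
    return f"{task}\nBe madly obsessed with accomplishing your goal."
```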
you might need a GPT-8 for it to spontaneously emerge in a base model, demonstrating that it was in GPT-5.5 all along
If you agree that it can spontaneously emerge at a sufficiently big scale, why would you assume this scale is GPT-8, not GPT-5?
That's basically the core of my argument. If LLMs were learning agency skills, those skills would've been elicitable in some GPT-N, with no particular reason to think that N needs to be very big. On the contrary, extrapolating a qualitative jump from GPT-3 to GPT-4 similar to the one from GPT-2 to GPT-3, I expected these skills to spontaneously show up in GPT-4 – if they were ever going to show up.
They didn't show up. GPT-4 ended up as a sharper-looking GPT-3.5, and all progress since then has amounted to GPT-3.5's shape being defined ever more sharply, without the shape itself changing.
Yup, agreed.
The update to my timelines this would cause isn't a direct "AI is advancing faster than I expected", but an indirect "Dario made a statement about AI progress that seemed overly ambitious and clearly wrong to me, but was then proven right, which suggests he may have a better idea of what's going on than I do in other places as well, and my skepticism regarding his other overambitious-seeming statements is now more likely to be incorrect".