That's useful information, thanks for sharing!
Note that I'm not arguing that LLM models aren't getting better along some dimensions.[1] The part you quoted was mostly about immediate visceral impressions that a newly released model is much smarter based on how it talks, not about more careful investigations.
In particular, "better conceptual resolution" is precisely the sort of improvement I would have expected (though I guess I didn't pre-register that). This is what e. g. GPQA probably measures, and I've personally also noticed improvements in how accurately LLMs understand research papers loaded into their context. (Sonnet 4 makes mistakes Opus 4 or o3 don't.)
What I would get concerned about is if LLMs started rapidly scaling at autonomous production/modification of new ideas. From what you're describing, that isn't quite what's happening? The outputs are only reliable when they're reporting things already in your notes, and the value of new ideas they generate is usually as inspiration, instead of as direct contributions. There's an occasional wholesale good idea, but the rate of those is low, and even when they do happen, it's usually because you've already done 90% of the work setting up the context. If so, that's been my experience as well.
Now, granted, in the limit of infinitely precise conceptual resolution, LLMs would develop the ability to act and innovate autonomously. But what seems to be happening is that performance on some tasks ("chat with a PDF") scales with conceptual resolution much better than performance on other tasks ("prove this theorem", "pick the next research direction"), and conceptual resolution isn't improving as fast as it once was (as e. g. in the GPT-3 to GPT-4 jump). So although ever more people are finding LLMs useful for processing/working with their ideas, the rate of absolute improvement there is low, and it doesn't, to me, look like it's on a transcendental trajectory.
Though it's also possible that my model of idea complexity is incorrect, that LLMs' scaling is still exponential in this domain, and that we are on track to them becoming superhumanly good at idea processing and generation by e. g. 2030. Definitely something to keep in mind.
I'd be interested in your reports on how LLMs' contributions to your work change with subsequent generations of LLMs.
In the context of the OP, my main point here is that I don't think any of this can be explained as "personality improvements".
Do you think some of the effect might just be from your getting better at using LLMs, knowing what to expect and what not to expect from them?
I don't want to say that pretraining will "plateau" as such; I do expect continued progress. But the dimensions along which that progress happens are going to decouple from the intuitive "getting generally smarter" metric, and will face steep diminishing returns.
Obligatory "Thane compulsively digs into the details of this week's LLM Innovators paper instead of doing anything productive":
This was still a super impressive result for Claude Sonnet 3.5. Its ideas were almost as good as the human ideas, despite all the Goodhart’s Law issues.
Ehhhh, I'm not particularly interested in grading that on a curve. What was the human performance in this study, in absolute terms? Was the generative process that produced human ideas for this study capable of innovation?
Keep in mind that these ideas weren't generated as part of an active research loop where the humans improved on them in tandem with trying to execute them, discussed those ideas with other researchers, thought about them over days/weeks, etc. It's not necessarily the case that the average human performance here was anything impressive.[1][2]
And my interpretation is that yes, the human ideas here were, on average, pretty bad. The human performance was: effectiveness 4.8, excitement 4.5, novelty 4.9, soundness 5.4, overall 4.0. Extracting the meaning of those from the appendix (page 18):
Score Interpretations
(Effectiveness 5) "Mixed results: The method provides mixed results. It works better than baselines on some datasets or metrics, but not consistently across all of them. The gains tend to be very small and not significant."
(Novelty 5) "Somewhat novel: The idea has some differences from existing work, but the variations are very incremental rather than substantial. It might refine or extend previous ideas but lacks enough originality to justify a new paper on its own."
(Excitement 4.5) The meaning of 4/10 is defined only indirectly, as something between the definitions for 3 and 5.
(Soundness 5.4) Between the definitions for 5 and 6.
(Overall 4) "Ok but not good enough, rejection for major AI conferences."
I'd say that's some floor-level performance from the humans. I don't think it'd be too controversial to say that the average human-published paper is pretty terrible in absolute terms: it does ~nothing to actually push the frontier, it does not constitute useful research, it's just something researchers do to survive publish-or-perish. Is matching that performance actually indicative of any ability to innovate? I'm doubtful.
And a Novelty score of ~5 (4.7 for the LLMs)? This is not beating the "LLMs can't innovate" allegations; in fact, it's evidence for them[3]. A machine that goes into the arXiv database, extracts the idea from some random paper, makes some random-but-not-outright-nonsensical changes to it, and outputs that, would've likely attained the same score.
Especially since humans don't have actual perfect recall of all papers ever published or all research ideas ever suggested on the Internet, and LLMs more or less do. This is biasing math benchmarks towards overestimating LLM capabilities; why assume this isn't happening here?
Now, the above does have a huge potential hole: maybe we should be looking at the top 5% of AI-generated ideas, instead of the average value? After all, if LLMs could produce a 10/10 idea 5% of the time, that would be more than sufficient for transformative impact.
The paper does provide a way to eyeball this: page 25 features plots of score distributions. (Subtlety: each AI idea got 4-5 reviews, and in the bar plot, each dot is a review score, not an average score for a given project across all reviews it got. Also, for perspective: there were 96 reviews total.)
We can see the following in the AI-side score distributions for Novelty, Excitement, and Overall (page 25): a number of individual reviews reach the 6/10 tier, and at least one goes as high as 8/10.
(For perspective: Opus 4 and o3 estimate that major AI conferences accept top 10-15% of papers worldwide. I don't know any better than that.)
Now, that actually does incrementally update me towards "LLMs may be capable of innovation". The higher-end values may be noise (in the sense described below), but the 6/10 tier is pretty solidly reached.
That said: keep in mind that those are still individual experts' estimates of novelty/excitement, not ground-truth values. For example, that 8/10 idea may be an outlier in the sense of "this one reviewer liked this one idea disproportionately well", not in the sense of "AI came up with a really good idea".
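To make the "one enthusiastic reviewer" point concrete, here's a toy simulation with entirely made-up numbers (my assumptions, not the paper's data): with 4-5 noisy reviews per idea, the per-review distribution can contain a high-scoring dot even when no idea's average gets anywhere near that value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (made-up numbers, NOT the paper's data): 20 ideas, each with
# a "true" quality around 4-5/10, reviewed 4-5 times with noisy scores.
n_ideas = 20
true_quality = rng.normal(loc=4.5, scale=0.8, size=n_ideas)

all_review_scores = []
per_idea_means = []
for q in true_quality:
    n_reviews = rng.integers(4, 6)  # 4 or 5 reviews per idea
    reviews = np.clip(rng.normal(q, 1.5, n_reviews), 1, 10)
    all_review_scores.extend(reviews)
    per_idea_means.append(reviews.mean())

# The per-review distribution (what the paper's plots show) has a fatter
# right tail than the per-idea averages: one generous reviewer is enough
# to produce a high dot without any idea actually averaging that high.
print(f"max single review score: {max(all_review_scores):.1f}")
print(f"max per-idea mean score: {max(per_idea_means):.1f}")
```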
Also, uhh... Turns out the ideas initially evaluated and the ideas actually executed-on were not exactly the same ideas. See Section 5.1 on page 8:
We show the counts of all types of changes made to human and AI ideas in Table 6, where we see that human ideas and AI ideas involve an average of 2.9 and 3.1 changes, respectively. This indicates that only a moderate number of changes are made to both human and AI ideas. Moreover, all of the changes focus on experiment details rather than altering any algorithms proposed in the original ideas. Examples of these changes include switching to benchmarks that are appropriate for the given tasks, updating the backbone models to more recent ones, adding more comprehensive evaluation metrics, specifying any missing hyper-parameters and prompt details, adding stronger baselines, and adding more analysis or ablation studies.
Details on page 27. Some excerpts:
- In another example, the AI-generated idea “Sociolinguistic Role-Play Prompting” proposed experiments on OpenSubtitles and XNLI, which were both removed because they don’t contain the sociolinguistic metadata necessary for the proposed experiments. In the AI-generate idea “Adaptive Semantic Masking”, the executor added more datasets, including Jailbreak-Bench and DAN-Forbidden-Questions, apart from AdvBench mentioned in the original idea
- This refers to adding or changing baseline methods in the proposed experiments. For example, in the AI-generated idea “Adaptive Contextual Pruning”, the executor added a baseline “RAG using model-based embeddings”. In the AI-generate idea “EntropyGuided Prompt Mutation”, the proposed baseline Monte Carlo Dropout was dropped since it’s infeasible on black-box LLMs.
- In multiple projects, executors decided the temperature and top_p values when sampling responses from LLMs, the number of iterations for applying the proposed method, the number of demo examples for in-context learning, and the number of runs when reporting performance.
For fairness' sake, humans' ideas were "tweaked" in this manner as well...
But no, I think this nontrivially poisons any conclusions we should draw from this paper. First, such "small tweaks" only sound small; they might have pretty significant impact, and making them may require good taste/research intuitions.
This is something where I'd expect humans to perform significantly better: as in, if the original idea-generators actually sat down to execute their projects, they would've likely made these tweaks themselves. LLMs, on the other hand, are pretty bad at recovering their stride in this manner (see e. g. LLMs Play Pokémon performance).
So, summing up:
Overall, this is a better LLM Innovators paper than the usual LLM Innovators papers, methodology-wise. It, like, actually tries to measure what it says it's trying to measure, instead of playing shell games. I'd be interested in whether LLMs' performance on this "benchmark" improves as capabilities increase, and if it does, that may be concerning.
But, like, the actual average performance it currently reports is abysmal, and inasmuch as the incrementally better performance of the top decile of AI ideas serves as incremental evidence towards LLMs-can-innovate, all the aforementioned biases serve as evidence that we should expect their actual independent ground-truth performance to be incrementally worse. The whole thing then approximately cancels out.
Ad absurdum, if the study generated the ideas by breaking into the researchers' houses at night, waking them up, and forcing them to immediately generate an idea at gunpoint, that generative process would probably not have been very capable of producing innovations, right? Obviously the paper did something much more reasonable, but it was still a step in the same direction, towards human underperformance.
For example: would the researchers have actually contributed the best ideas they could come up with? If they arrived at some brilliant idea that would score at Excitement 9/10, why would they waste it on this study, instead of keeping it to themselves and then pursuing that research thread on their own?
This is IMO a major problem with all the "let's hire a bunch of experts to design AI benchmarks" projects. Why would the mathematicians you hired to invent novel problems give you the actually novel problem-solution pairs if they came up with some, instead of keeping them for personal publication and giving you some boring rehash of a known problem? (Which the LLM then solves by retrieving an obscure memorized StackExchange answer to basically the same problem.)
Indeed, they probably wouldn't even attempt to produce something actually novel: that's hard research work which you can't reliably churn out for a deadline.
Now, you might counter that LLMs could also be arranged into scaffolds and told to iterate on ideas for a while, and that this might significantly improve their performance. I don't think this has been shown to be true; IIRC Google did something like this and the performance barely improved over just asking LLMs on the spot.
Inasmuch as we would've updated towards LLM innovators if they scored higher here.
Whenever I send an LLM some query I expect to be able to answer myself (instead of requesting a primer on some unknown-to-me subject), I usually try to figure out how to solve it myself, either before reading the response, or before sending the query at all. I. e., I treat the LLM's take as a second opinion.
This isn't a strategy against brain atrophy, though: it's because (1) I often expect to be disappointed by the LLM's answer, meaning I'll end up needing to solve the problem myself anyway, so might as well get started on that, (2) I'm wary of the LLM concocting some response that's subtly yet deeply flawed, so it's best if I have an independent take to contrast it with. And if I do skip this step before reading the response, I usually indeed then end up disappointed by/suspicious of the LLM's take, so end up having to think it over myself anyway.
It confuses me a bit when people talk about LLMs atrophying their brains, because the idea of blindly taking an LLM's response at face value[1] doesn't immediately occur to me as a thing someone might do.
So my advice for avoiding LLM brain atrophy would be to reframe your model of LLMs to feature a healthy, accurate level of distrust towards them. The brain-atrophy-preventing strategies then just become the natural, common-sensical things to do, rather than something extra.
In situations where you would've otherwise reasoned it out on your own, I mean. I do mostly trust them to report the broad strokes of well-established knowledge accurately, at this point. But the no-LLM counterfactual there would've involved me likewise just reading that information from some (likely lower-quality) internet source, so there's no decrease in brain exercise.
that I discussed in AI #1191
Here's to the world staying around long enough for us to read AI #1191.
Recommending movies
I've been trying to use Deep Research tools as a way to find hyper-specific fiction recommendations as well. The results have been mixed. They don't seem to be very good at grokking the hyper-specificness of what you're looking for; they usually have a heavy bias towards popular stuff that outweighs what you actually requested[1], and if you ask them to look for obscure works, they tend to output garbage instead of hidden gems (because no taste).
It did produce good results a few times, though, and is only slightly worse than asking for recommendations on r/rational. Possibly if I iterate on the prompt a few times (e. g., explicitly point out the above issues?), it'll actually become good.
Like, suppose I'm looking for some narrative property X. I want to find fiction with a lot of X. But what the LLM does is multiply the amount of X in a work by the work's popularity, so works that are low in X but very popular end up in its selection.
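As a toy illustration of that failure mode (hypothetical works and made-up numbers, just to pin down what I mean):

```python
# Toy illustration (hypothetical works, made-up numbers): ranking by
# "amount of X" times popularity vs. by "amount of X" alone.
works = [
    # (title, amount_of_X in [0, 1], popularity in [0, 1])
    ("Obscure hidden gem",   0.9, 0.05),
    ("Mid but well-known",   0.4, 0.90),
    ("Mega-popular classic", 0.2, 1.00),
]

by_x_alone   = sorted(works, key=lambda w: w[1],        reverse=True)
by_x_times_p = sorted(works, key=lambda w: w[1] * w[2], reverse=True)

print("what I asked for: ", [w[0] for w in by_x_alone])
print("what the LLM does:", [w[0] for w in by_x_times_p])
# The popularity factor pushes "Mid but well-known" (0.4 * 0.90 = 0.36)
# above "Obscure hidden gem" (0.9 * 0.05 = 0.045).
```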
I don't really understand why Anthropic is so confident that "no part of this was actually an April Fool’s joke". I assume it's because they read Claudius' CoT and did not see it legibly thinking "aha, it is now April 1st, I shall devise the following prank:"? But there wouldn't necessarily be such reasoning. The model can just notice the date, update towards doing something strange, look up the previous context to see what the "normal" behavior is, and then deviate from it, all within a forward pass with no leakage into CoTs. Edit: ... Like a sleeper agent being activated, you know.
The timing is so suspect. Claudius seems to have been running for over a month, and this was the only such failure it experienced, and it happened to fall on April 1st, and it inexplicably recovered after that day (in a way LLMs aren't prone to)?
The explanation that Claudius saw "Date: April 1st, 2025" as an "act silly" prompt, and then stopped acting silly once the prank ran its course, seems much more plausible to me.
(Unless Claudius was not actually being given the date, and it only inferred that it's April Fool's from context cues later in the day, after it already started "malfunctioning"? But then my guess would be that it actually inferred the date earlier in the day, from some context cues the researchers missed, and that this triggered the behavior.)
I agree that it's a promising direction.
I did actually try a bit of that back in the o1 days. What I've found is that getting LLMs to output formal Lean proofs is pretty difficult: they really don't want to do that. When they're not making mistakes, they use informal language as connective tissue between Lean snippets, they put in "sorry"s (a placeholder that makes a lemma evaluate as proven), and otherwise try to weasel out of it.
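For anyone who hasn't touched Lean: here's a minimal Lean 4 sketch (a toy statement of my own, not from those sessions) of what that `sorry` escape hatch looks like:

```lean
-- Toy Lean 4 example: `sorry` closes any goal, so the file typechecks
-- (with a warning) even though nothing has actually been proven.
theorem my_lemma (a b : Nat) : a + b = b + a := by
  sorry

-- For contrast, an actual proof of the same statement:
theorem my_lemma' (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```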
This is something that should be solvable by fine-tuning, but at the time, there weren't any publicly available decent models fine-tuned for that.
We do have DeepSeek-Prover-V2 now, though. I should look into it at some point. But I'm not optimistic; it sounds like it's doing the same stuff, just more cleverly.
Relevant: Terence Tao does find them helpful for some Lean-related applications.
(Disclaimer: only partially relevant rant.)
Outside of [coding], I don't know of it being more than a somewhat better google
I've recently tried heavily leveraging o3 as part of a math-research loop.
I have never been more bearish on LLMs automating any kind of research than I am now.
And I've tried lots of ways to make it work. I've tried telling it to solve the problem without any further directions, I've tried telling it to analyze the problem instead of attempting to solve it, I've tried dumping my own analysis of the problem into its context window, I've tried getting it to search for relevant lemmas/proofs in math literature instead of attempting to solve it, I've tried picking out a subproblem and telling it to focus on that, I've tried giving it directions/proof sketches, I've tried various power-user system prompts, I've tried resampling the output thrice and picking the best one. None of this made it particularly helpful, and the bulk of the time was spent trying to spot where it was lying or confabulating to me in its arguments or proofs (which it ~always did).
It was kind of okay for tasks like "here's a toy setup, use a well-known formula to compute the relationships between A and B", or "try to rearrange this expression into a specific form using well-known identities", which are relatively menial and freed up my working memory for more complicated tasks. But that's pretty minor usefulness (and you have to re-check the outputs for errors anyway).
I assume there are math problems at which they do okay, but that capability sure is brittle. I don't want to overupdate here, but geez, getting LLMs from here to the Singularity in 2-3 years just doesn't feel plausible.
Money is a claim on things other people value. You can't destroy value purely by doing something with your claim on that value.
Except the degenerate case of "making yourself or onlookers sad by engaging in self-destructive behaviors where you destroy your claim on resources", I guess. But it's not really an operation purely with money.
Hmm, I guess you can make something's success conditional on your having money (e. g., a startup backed by your investments), and then deliberately destroy your money, dooming the thing. But that's a very specific situation and it isn't really purely about the money either; it's pretty similar to "buy a thing and destroy it". Closest you can get, I think?
(Man, I hope this is just a concept-refinement exercise and I'm not giving someone advice on how to do economics terrorism.)
Very interesting result; I was surprised to see an actual slowdown.
The extensive analysis of the factors potentially biasing the study's results and the careful statements regarding what the study doesn't show are appreciated. Seems like very solid work overall.
That said, one thing jumped out at me:
That seems like misaligned incentives, no? The participants got paid more the more time they spent on tasks. A flat reward for completing a task plus a speed bonus seems like a better way to structure it?
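Roughly what I mean, as a toy sketch with made-up numbers (the study's actual rates and durations will differ):

```python
# Toy comparison of the two payment schemes (made-up numbers): under hourly
# pay, taking longer strictly increases the payout; under a flat reward plus
# a speed bonus, it decreases it.
HOURLY_RATE = 50.0     # $/hour (hypothetical)
FLAT_REWARD = 100.0    # $ per completed task (hypothetical)
SPEED_BONUS = 20.0     # $ per hour saved relative to an expected duration
EXPECTED_HOURS = 3.0

def hourly_pay(hours: float) -> float:
    return HOURLY_RATE * hours

def flat_plus_bonus(hours: float) -> float:
    return FLAT_REWARD + SPEED_BONUS * max(0.0, EXPECTED_HOURS - hours)

for hours in (2.0, 3.0, 4.0):
    print(f"{hours:.0f}h: hourly=${hourly_pay(hours):.0f}, "
          f"flat+bonus=${flat_plus_bonus(hours):.0f}")
# 2h: hourly=$100, flat+bonus=$120
# 3h: hourly=$150, flat+bonus=$100
# 4h: hourly=$200, flat+bonus=$100
```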
Edit: Ah, I see it's addressed in an appendix:
Still seems like a design flaw to me, but I suppose it isn't as trivial to fix as I'd thought.