Where is the hard evidence that LLMs are useful?
Has anyone seen convincing evidence of AI driving developer productivity or economic growth?
It seems I am only reading negative results from studies of real-world applications.
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
https://www.lesswrong.com/posts/25JGNnT9Kg4aN5N5s/metr-research-update-algorithmic-vs-holistic-evaluation
And in terms of startup growth:
Apparently wider economic measurements are not clear either?
Also agency still seems very bad, about what I would have expected from decent scaffolding on top of GPT-3:
https://www.lesswrong.com/posts/89qhQH8eHsrZxveHp/claude-plays-whatever-it-wants
(Plus ongoing poor results on Pokémon: modern LLMs can still only win with elaborate, task-specific scaffolding.)
Though performance on the IMO seems impressive, the very few examples of mathematical discoveries by LLMs don't seem (to me) to be increasing much in either frequency or quality, and so far they are mostly of the type "get a better lower bound by combinatorially trying stuff," which seems to advantage computers with...
The textbook reading group on "An Introduction to Universal Artificial Intelligence," which introduces the necessary background for AIXI research, has started, and really gets underway this Monday (Sept. 8th) with sections 2.1 - 2.6.2. Now is about the last chance to easily jump in (since we have only read the intro, chapter 1, so far). Please read in advance and be prepared to ask questions and/or solve some exercises. The first session had around 20-25 attendees; we will probably break up into groups of about 5.
Meeting calendar is on the website: https://uaiasi.com/
Reach out to me in advance for a meeting link, by DM or preferably at colewyeth@gmail.com. Include your phone number if you want to be added to the WhatsApp group (optional).
Pitch for reading the book from @Alex_Altair: https://www.lesswrong.com/posts/nAR6yhptyMuwPLokc/new-intro-textbook-on-aixi
This is following up on the new AIXI research community announcement: https://www.lesswrong.com/posts/H5cQ8gbktb4mpquSg/launching-new-aixi-research-community-website-reading-group
I wonder if the reason that polynomial-time algorithms tend to be somewhat practical (not runtime n^100) is just that we aren't smart enough to invent polynomial-time algorithms that really are necessarily that complicated.
Like, the obvious way to get n^100 is to nest 100 for loops. A problem which can only be solved in polynomial time by nesting 100 for loops (presumably doing logically distinct things that cannot be collapsed!) is a problem that I am not going to solve in polynomial time…
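A toy sketch of how the scaling shows up (an illustration of mine, not a claim about any particular problem): k logically distinct nested loops over an input of size n do on the order of n^k work, so k = 100 is "polynomial" only in the technical sense.

```python
# Toy illustration (hypothetical): k nested for-loops over a size-n input
# perform n**k iterations in total.
def count_nested_iterations(n: int, k: int) -> int:
    """Simulate k nested loops, each ranging over n values, and count iterations."""
    if k == 0:
        return 1
    total = 0
    for _ in range(n):
        total += count_nested_iterations(n, k - 1)
    return total

assert count_nested_iterations(3, 4) == 3 ** 4  # 81; at k = 100 the count is astronomical
```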
A fun illustration of survivorship/selection bias is that nearly every time I find myself reading an older paper, I find it insightful, cogent, and clearly written.
Selection bias isn't the whole story. The median paper in almost every field is notably worse than it was in, say, 1985. Academia is less selective than it used to be—in the U.S., there are more PhDs per capita, and the average IQ/test scores/whatever metric has dropped for every level of educational attainment.
Grab a journal that's been around for a long time, read a few old papers and a few new papers at random, and you'll notice the difference.
Rationality (and other) heuristics I've actually found useful for getting stuff done, but unfortunately you probably won't:
1: Get it done quickly and soon. Every step of every process outside of yourself will take longer than expected, so the effective deadline is sooner than you might think. Also if you don't get it done soon you might forget (or forget some steps).
1(A): 1 is stupidly important.
2: Do things that set off positive feedback loops. Aggressively try to avoid doing other things. I said aggressively.
2(A): Read a lot, but not too much.*
3: You are probably already making fairly reasonable choices over the action set you are considering. It's easiest to fall short(er) of optimal behavior by failing to realize you have affordances. Discover affordances.
4: Eat.
(I think 3 is least strongly held)
*I'm describing how to get things done. Reading more has other benefits, for instance if you don't yet know the thing you want to get done, and it's pleasant and self-actualizing.
The primary optimization target for LLM companies/engineers seems to be making them seem smart to humans, particularly the nerds who seem prone to using them frequently. A lot of money and talent is being spent on this. It seems reasonable to expect that they are less smart than they seem to you, particularly if you are in the target category. This is a type of Goodharting.
In fact, I am beginning to suspect that they aren't really good for anything except seeming smart, and most rationalists have totally fallen for it, for example Zvi insisting that anyone who is not using LLMs to multiply their productivity is not serious (this is a vibe not a direct quote but I think it's a fair representation of his writing over the last year). If I had to guess, LLMs have 0.99x'ed my productivity by occasionally convincing me to try to use them which is not quite paid for by very rarely fixing a bug in my code. The number is close to 1x because I don't use them much, not because they're almost useful. Lots of other people seem to have much worse ratios because LLMs act as a superstimulus for them (not primarily a productivity tool).
Certainly this is an impressive technology, surpris...
Mathematics students are often annoyed that they have to worry about "bizarre or unnatural" counterexamples when proving things. For instance, differentiable functions without continuous derivative are pretty weird. Particularly engineers tend to protest that these things will never occur in practice, because they don't show up physically. But these adversarial examples show up constantly in the practice of mathematics - when I am trying to prove (or calculate) something difficult, I will try to cram the situation into a shape that fits one of the theorems in my toolbox, and if those tools don't naturally apply I'll construct all kinds of bizarre situations along the way while changing perspective. In other words, bizarre adversarial examples are common in intermediate calculations - that's why you can't just safely forget about them when proving theorems. Your logic has to be totally sound as a matter of abstraction or interface design - otherwise someone will misuse it.
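For concreteness, the standard example of such a counterexample (a well-known fact, included here only as illustration): the function
$$f(x) = \begin{cases} x^2 \sin(1/x) & x \neq 0 \\ 0 & x = 0 \end{cases}$$
is differentiable everywhere, with $f'(0) = 0$ by a squeeze argument, but for $x \neq 0$ we get $f'(x) = 2x\sin(1/x) - \cos(1/x)$, which has no limit as $x \to 0$; the derivative exists yet is not continuous.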
From Soares and Fallenstein, "Toward Idealized Decision Theory":
“If someone cannot formally state what it means to find the best decision in theory, then they are probably not ready to construct heuristics that attempt to find the best decision in practice.”
This statement seems rather questionable. I wonder if it is a load-bearing assumption.
I think that “ruggedness” and “elegance” are alternative strategies for dealing with adversity - basically tolerating versus preparing for problems. Both can be done more or less skillfully: low-skilled ruggedness is just being unprepared and constantly suffering, but the higher skilled version is to be strong, healthy, and conditioned enough to survive harsh circumstances without suffering. Low-skilled elegance is a waste of time (e.g. too much makeup but terrible skin) and high skilled elegance is… okay basically being ladylike and sophisticated. Yes I admit it this is mostly about gender.
Other examples: it's rugged to have a very small number of high-quality possessions you can easily throw in a backpack in under 20 minutes, including 3 outfits that cover all occasions. It's elegant to travel with three suitcases containing everything you could possibly need to look and feel your best, including both an ordinary umbrella and a sun umbrella.
I also think a lot of misunderstanding between genders results from these differing strategies, because to some extent they both work but are mutually exclusive. Elegant people may feel taken advantage of because everyone starts expecting them to do ...
Since this is mid-late 2025, we seem to be behind the aggressive AI 2027 schedule? The claims here are pretty weak, but if LLMs really don’t boost coding speed, this description still seems to be wrong.
[edit: okay actually it’s pretty much mid 2025 still, months don’t count from zero though probably they should because they’re mod 12]
I don't think there's enough evidence to draw hard conclusions about this section's accuracy in either direction, but I would err on the side of thinking ai-2027's description is correct.
Footnote 10, visible in your screenshot, reads:
For example, we think coding agents will move towards functioning like Devin. We forecast that mid-2025 agents will score 85% on SWEBench-Verified.
SOTA models score at:
• 83.86% (codex-1, pass@8)
• 80.2% (Sonnet 4, pass@several, unclear how many)
• 79.4% (Opus 4, pass@several)
(Is it fair to allow pass@k? This Manifold Market doesn't allow it for its own resolution, but here I think it's okay, given that the footnote above makes claims about 'coding agents', which presumably allow iteration at test time.)
Also, note the following paragraph immediately after your screenshot:
The agents are impressive in theory (and in cherry-picked examples), but in practice unreliable. AI twitter is full of stories about tasks bungled in some particularly hilarious way. The better agents are also expensive; you get what you pay for, and the best performance costs hundreds of dollars a month.11 Still, many companies find ways to fit AI agents into their workflows.12
AI tw...
If I understand correctly, Claude's pass@X benchmarks mean multiple sampling and taking the best result. This is valid so long as compute cost isn't exceeding equivalent cost of an engineer.
codex's pass@8 score seems to be saying "the correct solution was present among 8 attempts, but the model doesn't actually know which result is correct." That shouldn't count.
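For reference, a minimal sketch of the standard pass@k estimator from the HumanEval paper (Chen et al., 2021); whether the reported pass@8 / "pass@several" numbers above were computed exactly this way is an assumption on my part.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at least
    one of k samples, drawn without replacement from n attempts of which c are
    correct, passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(8, 2, 1))  # 0.25: a single attempt usually fails
print(pass_at_k(8, 2, 8))  # 1.0: some attempt in the batch passes
```

The gap between pass@1 and pass@8 is exactly the issue above: a correct solution exists somewhere in the batch, but nothing guarantees the model can identify it.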
That is, we see the first generation of massively scaled RLVR around 2026/2027. So it kind of has to work out of the box for AGI to arrive that quickly?
By 2027, we'll also have 10x scaled-up pretraining compared to current models (trained on 2024 compute). And correspondingly scaled RLVR, with many diverse tool-using environments that are not just about math and coding contest style problems. If we go 10x lower than current pretraining, we get original GPT-4 from Mar 2023, which is significantly worse than the current models. So with 10x higher pretraining than current models, the models of 2027 might make significantly better use of RLVR training than the current models can.
Also, 2 years might be enough time to get some sort of test-time training capability started, either with novel or currently-secret methods, or by RLVRing models to autonomously do post-training on variants of themselves to make them better at particular sources of tasks during narrow deployment. Apparently Sutskever's SSI is rumored to be working on the problem (at 39:25 in the podcast), and overall this seems like the most glaring currently-absent faculty. (Once it's implemented, something else might end u...
It looks like the market is with Kokotajlo on this one (apparently this post must be expanded to see the market).
Particularly after my last post, I think my LessWrong writing has had a bit too high of a confidence / effort ratio. Possibly I just know the norms of this site well enough lately that I don't feel as much pressure to write carefully. I think I'll limit my posting rate a bit while I figure this out.
LW doesn't punish, it upvotes-if-interesting and then silently judges.
confidence / effort ratio
(Effort is not a measure of value, it's a measure of cost.)
The hedonic treadmill exists because minds are built to climb utility gradients - absolute utility levels are not even uniquely defined, so as long as your preferences are time-consistent you can just renormalize before maximizing the expected utility of your next decision.
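To spell out the renormalization point (a standard fact about expected-utility maximizers, stated in my own notation): for any constants $a > 0$ and $b$,
$$\arg\max_{\pi} \mathbb{E}[\,a U + b \mid \pi\,] = \arg\max_{\pi} \mathbb{E}[\,U \mid \pi\,],$$
because $\mathbb{E}[aU + b \mid \pi] = a\,\mathbb{E}[U \mid \pi] + b$ is a strictly increasing function of $\mathbb{E}[U \mid \pi]$. Shifting the zero point of your utility scale before each decision changes nothing about which decision you make.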
I find this vaguely comforting. It’s basically a decision-theoretic and psychological justification for stoicism.
(I must have read this somewhere in the Sequences?)
I think self-reflection in bounded reasoners justifies some level of “regret,” “guilt,” “shame,” etc., but the basic reasoning above should hold to first order, and these should all be treated as corrections and for that reason should not get out of hand.
AI-specific pronouns would actually be kind of helpful. “They” and “It” are both frequently confusing. “He” and “she” feel anthropomorphic and fake.
Perhaps LLMs are starting to approach the intelligence of today's average human: capable of only limited original thought, unable to select and autonomously pursue a nontrivial coherent goal across time, learned almost everything they know from reading the internet ;)
This doesn't seem to be reflected in the general opinion here, but it seems to me that LLMs are plateauing and possibly have already plateaued a year or so ago. Scores on various metrics continue to go up, but this tends to provide weak evidence because they're heavily gamed and sometimes leak into the training data. Still, those numbers overall would tend to update me towards short timelines, even with their unreliability taken into account - however, this is outweighed by my personal experience with LLMs. I just don't find them useful for practically ...
Huh, o1 and the latest Claude were quite huge advances to me. Basically within the last year LLMs for coding went from "occasionally helpful, maybe like a 5-10% productivity improvement" to "my job now is basically to instruct LLMs to do things; depending on the task, a 30% to 2x productivity improvement".
An ASI perfectly aligned to me must literally be a smarter version of myself. Anything less than that is a compromise between my values and the values of society. Such a compromise at its extreme fills me with dread. I would much rather live in a society with some discord between many individually aligned ASIs than build a benevolent god.
@Thomas Kwa will we see task length evaluations for Claude Opus 4 soon?
Anthropic reports that Claude can work on software engineering tasks coherently for hours, but it’s not clear if this means it can actually perform tasks that would take a human hours. I am slightly suspicious because they reported that Claude was making better use of memory on Pokémon, but this did not actually cash out as improved play. This seems like a fairly decisive test of my prediction that task lengths would stagnate at this point; if it does succeed at hours long tasks, I will...
I don't run the evaluations but probably we will; no timeframe yet though as we would need to do elicitation first. Claude's SWE-bench Verified scores suggest that it will be above 2 hours on the METR task set; the benchmarks are pretty similar apart from their different time annotations.
Sure, but trends like this only say anything meaningful across multiple years, any one datapoint adds almost no signal, in either direction. This is what makes scaling laws much more predictive, even as they are predicting the wrong things. So far there are no published scaling laws for RLVR, the literature is still developing a non-terrible stable recipe for the first few thousand training steps.
It looks like Gemini is self-improving in a meaningful sense:
https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/
Some quick thoughts:
This has been going on for months; on the bullish side (for AI progress, not human survival) this means some form of self-improvement is well behind the capability frontier. On the bearish side, we may not expect a further speed-up on the log scale (since it's already factored into some calculations).
I did not expect this degree of progress so soon; I am now much ...
Unfortunate consequence of sycophantic ~intelligent chatbots: everyone can get their theories parroted back to them and validated. Particularly risky for AGI, where the chatbot can even pretend to be running your cognitive architecture. Want to build a neuro-quantum-symbolic-emergent-consciousness-strange-loop AGI? Why bother, when you can just put that all in a prompt!
A lot of new user submissions these days to LW are clearly some poor person who was sycophantically encouraged by an AI to post their crazy theory of cognition or consciousness or recursion or social coordination on LessWrong after telling them their ideas are great. When we send them moderation messages we frequently get LLM-co-written responses, and sometimes they send us quotes from an AI that has evaluated their research as promising and high-quality as proof that they are not a crackpot.
Basic sanity check: We can align human children, but can we align any other animals? NOT to the extent that we would trust them with arbitrary amounts of power, since they obviously aren't smart enough for this question to make much sense. Just like, are there other animals that we've made care about us at least "a little bit?" Can dogs be "well trained" in a way where they actually form bonds with humans and will go to obvious personal risk to protect us, or not eat us even if they're really hungry and clearly could? How about species further away from us on the evolutionary tree, like hunting falcons? Where specifically is the line?
As well as the "theoretical - empirical" axis, there is an "idealized - realistic" axis. The former distinction is about the methods you apply (with extremes exemplified by rigorous mathematics and blind experimentation, respectively). The latter is a quality of your assumptions / paradigm. Highly empirical work is forced to be realistic, but theoretical work can be more or less idealized. Most of my recent work has been theoretical and idealized, which is the domain of (de)confusion. Applied research must be realistic, but should pragmatically draw on theory and empirical evidence. I want to get things done, so I'll pivot in that direction over time.
Sometimes I wonder if people who obsess over the "paradox of free will" are having some "universal human experience" that I am missing out on. It has never seemed intuitively paradoxical to me, and all of the arguments about it seem either obvious or totally alien. Learning more about agency has illuminated some of the structure of decision making for me, but hasn't really affected this (apparently) fundamental inferential gap. Do some people really have this overwhelming gut feeling of free will that makes it repulsive to accept a lawful universe?
I used to, as a child. I did accept a lawful universe, but I thought my perception of free will was in tension with that, so that perception must be "an illusion".
My mother kept trying to explain to me that there was no tension between these things, because it was correct that my mind made its own decisions rather than some outside force. I didn't understand what she was saying though. I thought she was just redefining 'free will' from a claim that human brains effectively had a magical ability to spontaneously ignore the laws of physics to a boring tautological claim that human decisions are made by humans rather than something else.
I changed my mind on this as a teenager. I don't quite remember how, it might have been the sequences or HPMOR again. I realised that my imagination had still been partially conceptualising the "laws of physics" as some sort of outside force, a set of strings pulling my atoms around, rather than as a predictive description of me and the universe. Saying "the laws of physics make my decisions, not me" made about as much sense as saying "my fingers didn't move, my hand did." That was what my mother had been trying to tell me.
Self-reflection allows self-correction.
If you can fit yourself inside your world model, you can also model the hypothesis that you are wrong in some specific systematic way.
A partial model is a self-correction, because it says “believe as you will, except in such a case.”
This is the true significance of my results with @Daniel C:
https://www.lesswrong.com/posts/Go2mQBP4AXRw3iNMk/sleeping-experts-in-the-reflective-solomonoff-prior
That is, reflective oracles allow Solomonoff induction to think about ways of becoming less wrong.
If instead of building LLMs, tech companies had spent billions of dollars designing new competing search engines that had no ads but might take a few minutes to run and cost a few cents per query, would the result have been more or less useful?
To what extent would a proof about AIXI’s behavior be normative advice?
Though AIXI itself is not computable, we can prove some properties of the agent - unfortunately, there are fairly few examples because of the “bad universal priors” barrier discovered by Jan Leike. In the sequential case we only know things like e.g. it will not indefinitely keep trying an action that yields minimal reward, though we can say more when the horizon is 1 (which reduces to the predictive case in a sense). And there are lots of interesting results about the behavior of Solom...
Can AI X-risk be effectively communicated by analogy to climate change? That is, the threat isn’t manifesting itself clearly yet, but experts tell us it will if we continue along the current path.
Though there are various disanalogies, this specific comparison seems both honest and likely to be persuasive to the left?
Most ordinary people don't know that no one understands how neural networks work (or even that modern "Generative A.I." is based on neural networks). This might be an underrated message since the inferential distance here is surprisingly high.
It's hard to explain the more sophisticated models that we often use to argue that human disempowerment is the default outcome; our effort is perhaps much better leveraged explaining these three points:
1) No one knows how A.I models / LLMs / neural nets work (with some explanation of how this is conceptually possibl...
"Optimization power" is not a scalar multiplying the "objective" vector. There are different types. It's not enough to say that evolution has had longer to optimize things but humans are now "better" optimizers: Evolution invented birds and humans invented planes, evolution invented mitochondria and humans invented batteries. In no case is one really better than the other - they're radically different sorts of things.
Evolution optimizes things in a massively parallel way, so that they're robustly good at lots of different selectively relevant things ...
The most common reason I don’t use LLMs for stuff is that I don’t trust them. Capabilities are somewhat bottlenecked on alignment.
LLM coding assistants may actually slow developers down, contrary to their expectations:
(Epistemic status: I am signal boosting this with an explicit one-line summary that makes clear it is bearish for LLMs, because scary news about LLM capability acceleration is usually more visible/available than this update seems to be. Read the post for caveats.)
I guess Dwarkesh believes ~everything I do about LLMs and still thinks we probably get AGI by 2032:
This is not the kind of news I would have expected from short timeline worlds in 2023: https://www.techradar.com/computing/artificial-intelligence/chatgpt-is-getting-smarter-but-its-hallucinations-are-spiraling
I still don't think that a bunch of free-associating inner monologues talking to each other gives you AGI, and it still seems to be an open question whether adding RL on top just works.
The "hallucinations" of the latest reasoning models look more like capability failures than alignment failures to me, and I think this points towards "no." But my credences are very unstable; if METR task length projections hold up or the next reasoning model easily zero-shots Pokemon I will just about convert.
GDM has a new model: https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#advanced-coding
At a glance, it is (pretty convincingly) the smartest model overall. But progress still looks incremental, and I continue to be unconvinced that this paradigm scales to AGI. If so, the takeoff is surprisingly slow.
I’m worried about Scott Aaronson since he wrote “Deep Zionism.”
https://scottaaronson.blog/?p=9082
I think he’s coming from a good place, I can understand how he got here, but he really, really needs to be less online.
That moment when you’ve invested in building a broad and deep knowledge base instead of your own agency and then LLMs are invented.
it hurts
I don't see it that way. Broad and deep knowledge is as useful as ever, and LLMs are no substitutes for it.
This anecdote comes to mind:
...Dr. Pauling taught first-year chemistry at Cal Tech for many years. All of his exams were closed book, and the students complained bitterly. Why should they have to memorize Boltzmann’s constant when they could easily look it up when they needed it? I paraphrase Mr. Pauling’s response: I was always amazed at the lack of insight this showed. It’s what you have in your memory bank—what you can recall instantly—that’s important. If you have to look it up, it’s worthless for creative thinking.
He proceeded to give an example. In the mid-1930s, he was riding a train from London to Oxford. To pass the time, he came across an article in the journal, Nature, arguing that proteins were amorphous globs whose 3D structure could never be deduced. He instantly saw the fallacy in the argument—because of one isolated stray fact in his memory bank—the key chemical bond in the protein backbone did not freely rotate, as was argued. Linus knew from his college days that the peptide bond had to be rigid and coplanar.
He began doodling, and by the time he reached Oxford,
Back-of-the-envelope math indicates that an ordinary NPC in our world needs to double their power like 20 times over to become a PC. That’s a tough ask. I guess the lesson is either give up or go all in.
That moment when you want to be updateless about risk but updateful about ignorance, but the basis of your epistemology is to dissolve the distinction between risk and ignorance.
(Kind of inspired by @Diffractor)
Did a podcast interview with Ayush Prakash on the AIXI model (and modern AI), very introductory/non-technical:
Gary Kasparov would beat me at chess in some way I can't predict in advance. However, if the game starts with half his pieces removed from the board, I will beat him by playing very carefully. The first above-human level A.G.I. seems overwhelmingly likely to be down a lot of material - massively outnumbered, running on our infrastructure, starting with access to pretty crap/low bandwidth actuators in the physical world and no legal protections (yes, this actually matters when you're not as smart as ALL of humanity - it's a disadvantage relative to even the...
I suspect that human minds are vast (more like little worlds of our own than clockwork baubles) and even a superintelligence would have trouble predicting our outputs accurately from even quite a few conversations (without direct microscopic access), as a matter of sample complexity.
Considering the standard rhetoric about boxed A.I.'s, this might have belonged in my list of heresies: https://www.lesswrong.com/posts/kzqZ5FJLfrpasiWNt/heresies-in-the-shadow-of-the-sequences
I'm starting a Google group for anyone who wants to see occasional updates on my Sherlockian Abduction Master List. It occurred to me that anyone interested in the project would currently have to keep checking the list to notice new observational cues being (infrequently) added - also, some people outside of LessWrong are interested.
Thinking times are now long enough that in principle frontier labs could route some API (or chat) queries to a human on the backend, right? Is this plausible? Could this give them a hype advantage in the medium term if they picked the most challenging (for LLMs) types of queries effectively, and if so, is there any technical barrier? I can see this kind of thing eventually coming out, if the Wentworth "it's bullshit though" frame turns out to be partially right.
(I’m not suggesting they would do this kind of blatant cheating on benchmarks, and I have no inside knowledge suggesting this has ever happened)
In MTG terms, I think Mountainhead is the clearest example I’ve seen of a mono-blue dystopia.
I seem to recall EY once claiming that insofar as any learning method works, it is for Bayesian reasons. It just occurred to me that even after studying various representation and complete class theorems I am not sure how this claim can be justified - certainly one can construct working predictors for many problems that are far from explicitly Bayesian. What might he have had in mind?
A "Christmas edition" of the new book on AIXI is freely available in pdf form at http://www.hutter1.net/publ/uaibook2.pdf
I wonder if it’s true that around the age of 30 women typically start to find babies cute and consequently want children, and if so is this cultural or evolutionary? It’s sort of against my (mesoptimization) intuitions for evolution to act on such high-level planning (it seems that finding babies cute can only lead to reproductive behavior through pretty conscious intermediary planning stages). Relatedly, I wonder if men typically have a basic urge to father children, beyond immediate sexual attraction?
Eliezer’s form of moral realism about good (as a real but particular shared concept of value which is not universally compelling to minds) seems to imply that most of us prefer to be at least a little bit evil, and can’t necessarily be persuaded otherwise through reason.
Seems right.
And Nietzsche would probably argue the two impulses towards good and evil aren't really opposites anyway.