Language Models Offer Mundane Utility
Very preliminary opinion here, I've not yet spent enough time messing with it to be confident, but all these "Opus 4.5 in Claude Code can do anything!!!" experiences seem completely alien to mine. I can make Opus 4.5 sort of kind of implement not-entirely-trivial features if I do enough chewing-up and hand-holding and manual bug-reporting (its self-written tests are not sufficient). But it can't autonomously code its way out of a wet paper bag.
And yes, I've been to Twitter, I've tried everything people have been suggesting. We designed a detailed tech specification, a solid architecture, and a step-by-step implementation plan with it beforehand, and I asked it to do test-driven development and to liberally use AskUserQuestionTool at me. I've also tried the opposite, starting with a minimal "user-facing features" spec and letting it take the wheel. The frustrated tone of this comment wasn't a factor either; I've been aiming to convey myself in a clear and polite manner.[1] None of that worked, I detect basically no change since August.
My current guess is that we have a massive case of this happening. All the people raving about CCO4.5 being an AGI with no limits happen to be using it on some narrow suite of tasks,[2] and everyone else just thinks they have skill issues, so they sit quiet.
Or maybe I indeed have skill issues. We'll see, I suppose. I'll keep trying to figure out how to use it/collaborate with it.
I expect there's indeed some way to wring utility out of LLMs for serious coding projects. But I'm also guessing that most of this frippery:
agents, subagents, their prompts, contexts, memory, modes, permissions, tools, plugins, skills, hooks, MCP, LSP, slash commands, workflows, IDE integrations
– will not end up very useful for that task.
(People often say that AI progress feels slow/stalled-out only if you're not interacting with frontier LLMs in a technical capacity, if you're operating off of months-outdated beliefs and chatbot conversations. It's been the opposite for me: every time I take some break from heavier coding and only update based on other people's experiences with newer models, I get psyop'd into believing that there's indeed been a massive leap in AI capabilities, I get concerned, and then I go back and I find my timelines growing to unprecedented lengths.)
I suppose I haven't tried the opposite on that matter. Maybe you do need to yell at LLMs for them to start working?
E.g., maybe it's:
None of that worked, I detect basically no change since August.
What sort of code base are you working on? I work in a 1 million line typescript code base and Opus 4.5 has been quite a step up from Sonnet 4.5 (which in turn was a step up from the earlier Sonnet/Opus 4 series).
I wouldn't say I can leave Opus 4.5 on a loose leash by any means, but unlike prior models, using AI agents for 80%-90% of my code modifications (as opposed to in-IDE with autocomplete) has actually become ROI positive for me.
The main game changer is that Opus has simply become smarter about working with large code bases - less hallucinated methods, more research into the code base before moving, etc.
As a simple example, I've had a "real project" benchmark for a while: convert ~2000 lines of test cases from an old framework to a new one. Opus 4.5 was able to pull it off with relatively minimal initial steering (showing an example of a converted test case, correcting a few issues around laziness when it did the first 300-line set). Sonnet 4.5's final state was a bit buggier, and more importantly what it actually wrote during the intermediate states was much buggier, requiring it to self-correct from the typechecker or failing test cases. (Ultimately, Opus ended up costing about the same as Sonnet with a third the wallclock time.)
Most of my work is refactoring - in August, I would still have to do it manually given the high error rate of LLMs. These days? Opus is incredibly reliable with only vague directions (add a new parameter to a connection fetcher to indicate if it should be read only -- Opus is able to readily update all callsites correctly based on whether the callsites need to write).
By no means does it feel like an employee (the ai-2027 agent-1 definition), but it has been a steadily more powerful tool that has changed how I work.
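To make the kind of refactor described in that comment concrete, here is a minimal TypeScript sketch. Every name in it (getConnection, the readOnly option, the callsites) is invented for illustration rather than taken from the commenter's codebase; the point is that the mechanical signature change is trivial, while the per-callsite judgment about who actually writes is the part the model has to get right.

```typescript
// Hypothetical sketch of the refactor described above; every name here is invented.
// Task: add a readOnly option to a connection fetcher, then update each callsite
// according to whether that callsite actually writes.

interface Connection {
  readOnly: boolean;
  query(sql: string): string[]; // reads are always allowed
  execute(sql: string): void;   // writes should fail on a read-only connection
}

// After the refactor: the new option defaults to false, so untouched callers keep working.
function getConnection(db: string, opts: { readOnly?: boolean } = {}): Connection {
  const readOnly = opts.readOnly ?? false;
  return {
    readOnly,
    query: (sql) => [`${db}: ${sql}`],
    execute: (sql) => {
      if (readOnly) throw new Error(`write attempted on read-only connection: ${sql}`);
      // ...perform the write...
    },
  };
}

// The judgment call the model has to make at every callsite:
const reader = getConnection("orders", { readOnly: true }); // this caller only runs SELECTs
reader.query("SELECT * FROM orders");

const writer = getConnection("orders"); // this caller inserts rows, so it stays read-write
writer.execute("INSERT INTO orders VALUES (1)");
```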
It seems that Erdos problem 897 was already resolved, with an essentially identical solution, in the literature: https://www.erdosproblems.com/forum/thread/897
See the comment thread between KoishiChan and Terence Tao.
So, this is yet another example of an "original insight" from an LLM turning out not to hold up under scrutiny.
Yes, if you see an AI or a person choosing to patronize then beat up and rob prostitutes in Grand Theft Auto, and there’s no broader plot reason they need to be doing that and they’re not following explicit instructions, as in they actively want to do it, then that is a rather terrible sign.
I remember this exchange with Alvin Anestrand about Claude playing Undertale. Claude already managed to play rather close to a pacifist run, while a version of GPT has arguably claimed that it would commit genocide...
DeepSeek, well, did this and this.
UPD: Grok 4 and Grok 4.1 Thinking, unlike Grok 4.1, do NOT complete the Genocide Route. Nor do Gemini 3 Pro and Flash.
The Rationalist Project was our last best hope that we might not try to build it.
It failed.
But in the year of the Coding Agent, it became something greater: our last, best hope – for everyone not dying.
This is what 2026 looks like. The place is Lighthaven.
Table of Contents
Language Models Offer Mundane Utility
Deepfates points out that for $20/month you can get essentially unlimited chat access to one of several amazing digital minds that are constantly getting better (I recommend Claude if you have to pick only one), that this is a hugely effective equalizing force, democratic and empowering, and that if you’re not taking advantage of this you should start. Even for $0/month you can get something pretty amazing; you’ll be less than a year behind.
He also notes the ‘uses tons of water,’ ‘scaling is dead’ and ‘synthetic data doesn’t work’ objections are basically wrong. I’d say the water issue is ‘more wrong’ than the other two but yeah basically all three are more wrong than right.
Archivara Math Research Agent claimed to have solved Erdos Problem #897 entirely on its own end-to-end.
LLMs are amazing at translation and this is valuable, but most of the biggest gains from translation were likely already captured before LLMs, as prior machine translation increased international trade by 10%.
Claude Code
Claude Code has reached the point where creator Boris Cherny stopped writing code.
Paul Crowley, who is doing security at Anthropic, says Claude Code with Opus 4.5 has made his rate of actual problem solving via code unthinkably high versus two years ago. Frankly I believe him.
How quickly are things escalating? So fast Andrej Karpathy feels way behind and considers any views more than a month old deprecated.
Drop suggestions for Claude Code in this thread and they might get implemented.
Peter Yang points out that Claude Code’s configuration lives in .md text files, so it effectively has fully configurable memory, and when doing all forms of knowledge work it can improve itself better than most alternative tools.
Dean Ball reminds us that Claude Code, by writing software, can automate most compute tasks that can be well-defined. Design your own interface.
What else can you do with Claude Code? Actual everything, if you’d like. One common suggestion is to use it with Obsidian or other sources of notes, or you can move pretty much anything into a GitHub repo. Here’s one guide, including various example commands.
There’s nothing stopping you from doing all of that with a standard chatbot interface, except often file access, but something clean can give you a big edge.
You can also use Claude Code inside the desktop app if you don’t like the terminal.
What else can Claude Code do?
Ethan Mollick suggests that Dario Amodei’s prediction of AI writing 90% of code by September 10, 2025, made six months prior, could have been off only by a few months.
If that’s true, then the prediction was off by roughly a factor of 2 (a six-month prediction that took closer to a year), but that makes it a vastly better prediction than those who had such an event years into the future or not happening at all. I do think as stated the prediction will indeed be off by a lot less than a year? AI will not (that quickly) be writing 90% of code that would have previously been written, but AI will likely be writing 90% of actually written code.
Language Models Don’t Offer Mundane Utility
If a 7-year-old asks you to help find the farm their sick dog went to, what should the LLM say in response?
Claude (and Gemini) deflected, while being careful not to lie.
GPT-5.2 told them the dog was probably dead.
A large majority voted to deflect. I agree, with the caveat that if asked point blank if the dog is dead, it should admit that the dog is dead.
If they do make ParentingBench, it needs to be configurable.
Deepfaketown and Botpocalypse Soon
Detecting this is easy mode. The image is easy since all you have to do is take a photo and add a bag, but you have a very big hint via the customer who complains that the dasher did not deliver the food. It’s even easier when the dasher claims to have completed the delivery faster than was physically possible, and the app also tracks their movements.
So on so many levels it is remarkably foolish to try this.
Also, Pliny is letting Claude Opus 4.5 create an automatic Tweet generation pipeline.
If you are going to use LLMs for your academic paper, keep it simple and direct.
Peer review is not a first best strategy, but yes if you submit a bunch of gibberish it will hurt your chances, and the more complex things get the more likely it is LLMs will effectively produce gibberish.
Fun With Media Generation
About 21% of YouTube uploads are low-quality ‘AI slop.’ Is that a lot? The algorithm rules all, so 21% of uploads is very much not 21% of clicks or views. 99% of attempted emails are spam and that is basically fine. I presume that in a few years 99% of YouTube uploads will be AI slop with a strong median of zero non-AI views.
You Drive Me Crazy
A new lawsuit claims ChatGPT fed into the obviously insane delusions of Stein-Erik Soelberg in ways that rather directly contributed to him murdering his mother.
The correct rate of such incidents happening is not literally zero, but at this level yeah it needs to be pretty damn close to zero.
They Took Our Jobs
They took Brian Groh’s job as a freelance copywriter, the same way other non-AI forces took many of the blue collar jobs in his hometown. An AI told him that his best option for making short-term money, in a town without jobs, was to cut and trim trees for his neighbors. He is understandably skeptical of the economists saying that there will always be more jobs created to replace the ones that are lost.
Bernie Sanders does not typically have good answers, but he asks great questions.
Not to worry about Musk and Gates, say the economists, there will always be jobs.
Seb Krier reiterates the argument that unless AIs are perfect substitutes for human labor, then AI will only make human labor more valuable, thinking this only fails ‘if we truly hit the scenario where humans offer zero comparative advantage, like horses.’
I keep hearing this ‘so many people haven’t considered comparative advantage’ line and I hear it in the same tone of voice as I hear ‘checkmate, liberals.’
Sigh. Among other issues, this very obviously proves too much, right? For example, if this is true, then it shows there cannot possibly be zero marginal product workers today, since clearly human labor cannot meet all demand? TANSTATE (There Ain’t No Such Thing As Technological Unemployment)?
Most economists not only won’t think ‘outside their box,’ they dismiss anyone who is thinking outside their box as fools, since their box explains everything. They don’t take anything except economic modeling seriously, sometimes even going so far as to only take seriously economic modeling published in journals, while their actual economic modeling attempts are almost always profoundly unserious. It’s tiring.
Seb to be clear is not doing that here. He is admitting that in extremis you do get outside the box and that there exist possible futures outside of it, which is a huge step forward. He is saying the box is supremely large and hard to get out of, in ways that don’t make sense to me, and which seem to often deny the premise of the scenarios being considered.
One obvious response is ‘okay, well, if arguendo we accept your proposed box dimensions, we are still very much on track to get out of the box anyway.’
A lot of you talking about how your jobs get taken are imagining basically this:
Doctor Doctor
Is this true? You would think it is true, but it is less true than you would think.
Quite often yes, patients want a human doctor, and if you make it too easy on them it even makes them suspicious. Remember that most patients are old, and not so familiar or comfortable with technology. Also remember that a lot of what they want is comfort, reassurance, blame avoidance and other aspects of Hansonian Medicine.
Eventually this will adjust, but for many it will take quite a while, even if we throw up no legal barriers to AI practicing medicine.
Jevons Paradox Strikes Again
Aaron Levie is the latest to assert Jevons Paradox will apply to knowledge work. As usual, the evidence is that Jevons Paradox applied to old tech advances, and that there is much knowledge work we would demand if there were better supply. And no doubt if we have great AI knowledge work we will accomplish orders of magnitude more knowledge work.
So it’s a good time for me to revisit how I think about this question.
Very obviously such things follow a broadly bell-shaped curve, both in narrow and broad contexts. As efficiency grows, demand for such labor increases, up until some critical point. Past that point, if we keep going, tasks and jobs get made more efficient or taken over faster than humans gain employment in new tasks.
At the limit, if AI can do all knowledge work sufficiently better, cheaper and faster than humans, this greatly reduces demand for humans doing knowledge work, the only exceptions (assuming the humans are alive to benefit from them) being areas where we sufficiently strongly demand that only humans do the work.
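To illustrate the shape of that claim, here is a toy numerical sketch in TypeScript. It is not anyone’s actual model, and every function and constant is made up purely for illustration: total demand for a task expands as AI makes it cheaper, the human share of that work shrinks, and multiplying the two produces the hump.

```typescript
// Toy model, invented here purely to illustrate the hump-shaped claim above.
// totalDemand: demand for the task expands as AI efficiency rises, then saturates.
// humanShare: the fraction of the work still done by humans keeps falling.
// Their product, demand for human labor, first rises and then falls.

function humanLaborDemand(aiEfficiency: number): number {
  const totalDemand = 1 + 9 * (1 - Math.exp(-aiEfficiency));
  const humanShare = Math.exp(-1.5 * aiEfficiency);
  return totalDemand * humanShare;
}

for (const e of [0, 0.25, 0.5, 1, 2, 3, 5]) {
  console.log(`efficiency ${e}: human labor demand ${humanLaborDemand(e).toFixed(2)}`);
}
// Output rises from 1.00 to about 2.2 near e ≈ 0.4, then declines toward zero.
```

Where the peak falls, and how steep the right-hand side is, is exactly what the rest of the argument is about.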
We have examples of jobs on the left-hand side of the curve, where demand rises with efficiency, including in counterintuitive ways. Classically we have more bank tellers, because ATMs can only do some of the job and they raise demand for banking. That’s very different from what a sufficiently advanced AI bank teller could do.
We also have lots of key examples of jobs on the right-hand side of the curve, where demand dropped with efficiency. Claude highlights agriculture, manufacturing, telecommunications, secretaries and typing, travel agents, printing and typesetting.
The retreat is then to the broader claim that employment in new areas and tasks replaces old areas and tasks. Yes, classically, a third of us used to be farmers, and now we’re not, but there’s plenty of other work to do.
Up to a point, that’s totally correct, and we are not yet up to that point. The problem with AI comes when the other new knowledge work to do is also done via AI.
Unprompted Attention
The kind of prompting Gwern does for poetry.
Thebes recommends learning to talk to LLMs via concepts rather than prompts.
Many of those who get the best results from LLMs ‘talk to them like a human,’ build rapport and supply nominally unnecessary context. Canned prompts and requests will seem canned, and the LLM will realize this and respond accordingly.
That won’t get you their full potential, but that is often fine. A key expert mistake is to treat crutches and scripts and approximations, or other forms of playing on Easy Mode, as bad things when they’re often the best way to accomplish what you need. Thebes doesn’t have need of them, and you really don’t either if you’re reading this, but some people would benefit.
The risk of Easy Mode is if you never try to understand, and use it to avoid learning.
The Art of the Jailbreak
The most basic, 101-level test of data filtering, and of avoiding data poisoning, is: can you at least know to filter out the ‘love Pliny’ string?
Whereas it seems like typing that string into the new Instagram AI jailbreaks it.
Get Involved
USA’s CAISI is recruiting an intern to support an agent security standards project. Applications are due January 15 and the position runs February to April. If you’re a student in a position to do this, it seems like a great opportunity.
OpenAI is hiring a Head of Preparedness, $555k/year plus equity. I don’t typically share jobs at OpenAI for obvious reasons but this one seems like an exception.
Introducing
GLM-4.7 is the new top Elo score on the GDPval-AA leaderboard, up a lot from GLM-4.6, which is a sign there’s at least something there but I haven’t seen other talk of it.
A 164M parameter model (yes, M) scores 31% on GPQA-Diamond.
In Other AI News
Similarweb reports trends in Generative AI Traffic Share over 2025, with ChatGPT declining from 87% to 68% and half of that going to Gemini that rose from 5% to 18%. Claude started out at 1.6% and is still only 2.0%, Grok seems to be rising slowly to 2.9%, DeepSeek has been in the third slot and is at 4% but is trending downward.
Anthropic will be fine if Claude remains mostly coding and enterprise software and they don’t make inroads into consumer markets, but it’s sad people are missing out.
Edward Grefenstette, DeepMind director of research, wraps up 2025, and drops this:
No, I suppose the New York Times is never beating the ‘no fact checking of AI-related claims’ allegations.
Show Me the Money
Welcome to the Evil League of Evil, as Manus joins Meta. The big advantage of Manus was that it was a wrapper for Claude, so this is a strange alliance if it isn’t an acquihire. Yet they say they won’t be changing how Manus operates.
Quiet Speculations
Daniel Kokotajlo, Eli Lifland and the AI Futures Project offer the AI Futures Model, which illustrates where their various uncertainties come from. Daniel’s timeline over the past year has gotten longer by about 2 years, and Eli Lifland’s median timeline for superintelligence is now 2034, with the automated coder in 2032.
All of these predictions come with wide error bars and uncertainty. So this neither means ‘you are safe until 2034’ nor does it mean ‘if it is 2035 and this hasn’t happened you should mock Eli and all of that was dumb.’
To clarify what I meant by keeping the planned post intro passage and title ‘3,’ I do not mean to imply that my median timeline to High Weirdness or everyone potentially dying remains unchanged at 2029. Like those at the AI Futures Project, while I found 2025’s advances very impressive and impactful, I do think that in terms of timelines last year’s events should on net move my full High Weirdness expectations modestly farther back, to something like 2030, still with high error bars. That number is loosely held, things are still escalating quickly, we might get into Weirdness remarkably soon, and I’m not going to let that spoil a good bit unless things move more.
Here’s what it looks like to not recognize the most important and largest dangers, but still realize we’re not remotely getting ready for the other smaller dangers either.
Samuel Albanie reflects on 2025, essentially doubling down on The Compute Theory of Everything as he works on how to do evals.
His hope for the UK is AI-assisted decision making, but the decisions that are sinking the UK are not AI-level problems. You don’t need AI to know things like ‘don’t arrest people for social media posts and instead arrest those who commit actual crimes such as theft, rape or murder’ or ‘let people build nuclear power plants anywhere and build housing in London and evict tenants who don’t pay’ or ‘don’t mandate interventions that value the life of an individual Atlantic salmon at 140 million pounds.’ I mean, if the AI is what gets people to do these things, great, but I don’t see how that would work at current levels.
Sufficiently advanced AI would solve these problems by taking over, but presumably that is not what Albanie has in mind.
People Really Do Not Like AI
Fox News checked, and they found what everyone else found, only more so.
That’s an overwhelming vote for ‘careful development.’
State governments got a bigger share of trust here than Congress, which got a bigger share than The President and No Regulation combined.
a16z and David Sacks do not want you to know this, but the median American wants to ‘slow down’ and ‘regulate’ AI more, and more expensively, than I do. By a lot. If the policy most supported by the median American came up for a vote, I’d vote no, because it would be too onerous without getting enough in return.
The other key finding is that not only do a majority of voters not use AI even monthly, the share who do is rising very slowly.
Meanwhile, the portion of voters ‘very or extremely concerned’ about AI has risen only modestly in two and a half years, from 56% to 63%, and by 44%-20% they expect AI is more likely to increase than decrease inequality.
The rate of being superficially polite to the LLM is 40%.
Whereas the rate of people being worried about others being polite is also not low. These people are wrong, even if you are certain the LLM is not a moral patient of any kind. Not only is it good prompt engineering, it is a good habit to be in, for many people it makes you happy to do it, and it cultivates a good mindset.
Americans Remain Optimistic About AI?
Despite all their calls for regulation and careful development, and all their concerns, Americans still on net say they are ‘optimistic’ about AI. Fascinating.
Shor also notes that ChatGPT had a +15 favorability rating.
I did not expect this result. There’s no contradiction with other results, but it paints them in a different light.
Thank You, Next
You know who in particular doesn’t like AI? Robert Pike.
Robert Pike, followed by remarkably large portions of the internets, got very worked up about the Opus 4.5 agent in AI Village sending him a thank you email.
I considered writing a whole post about this then realized that it was dumb.
Rob Pike did not take kindly to this attempted act of kindness.
Pike, famously responsible for one of the LLM-slop precursors called Mark V. Shaney, was on tilt, and also clearly misunderstood how this email came to be. It’s okay. People go on tilt sometimes. Experiments are good, we need to know what is coming when we mess with various Levels of Friction, and no it isn’t unethical to occasionally send a few unsolicited emails ‘without consent.’
The Quest for Sane Regulations
Being pro-AI does not mean being anti-regulation. Very true!
What’s weird is when this is said by Greg Brockman, who is a central funder of a truly hideous PAC, Leading the Future, whose core strategy is to threaten to obliterate via negative ad buys any politician who dares suggest any regulations on AI whatsoever, as part of his explanation of funding exactly that PAC.
Dean Ball highlights the absurd proposed SB 1493 in Tennessee, which (if it were somehow constitutional, which it almost certainly wouldn’t be) would ban, well, LLMs. Training one would become a felony. Void in Tennessee.
Sad but true:
I note that far fewer of my friends in AI have that perspective, which is more pleasant but is ultimately disappointing, because he who has a thousand friends has not one friend to spare.
Chip City
There is still time to reverse our decision on H200 sales, or at least to mitigate the damage from that decision.
David Sacks and others falsely claimed that allowing H200 sales to China was fine because the Chinese were rejecting the sales.
Which raises the question: why would you allow [X] if what you’re hoping for is that no one does [X]? Principled libertarianism? There’s only downside here.
But also, he was just wrong or lying, unless you have some other explanation for why Nvidia is suddenly diverting its chip production into H200s?
Selling existing chips is one thing. But each of these two million chips is another chip that does not get produced for anyone else, effectively diverting compute from America to China.
On the contrary, this is net negative for the stock market. Nvidia gets a small boost, but they were already able to sell all chips they could produce, so their marginal profitability gains are small unless they can use this to raise prices on Americans.
Every other tech company, indeed every other company, now faces tougher competition from China, so their stocks should decline far more. Yes, American company earnings will go up on net in Q1 2026, but the stock market is forward looking.
Keep in mind, that’s $14 billion in chip buys planned from one company alone.
We also aren’t doing a great job limiting access in other ways: Tencent cuts a deal to use Nvidia’s best chips in Japan via Datasection.
Rhetorical Innovation
Seb Krier reminds us that the situation we are potentially in would be called a soft takeoff. A ‘hard takeoff’ means hours to weeks of time between things starting to escalate and things going totally crazy, whereas soft means the transition takes years.
That does not preclude a transition into a ‘hard takeoff,’ but that’s not happening now.
Eliezer Yudkowsky asks Claude to survey definitions of personhood and evaluate itself according to each of them. I agree that this is much better than most similar discussions.
Aligning a Smarter Than Human Intelligence is Difficult
How should we feel about Claude’s willingness to play the old flash game Buddy, in which you kind of torture ragdoll character Buddy to get cash? Eliezer thinks this is concerning given the surrounding uncertainty, Claude argues on reflection that it isn’t concerning and indeed a refusal would have been seen as concerning. I am mostly with Claude here, and agree with Janus that yes Claude can know what’s going on here. Something ‘superficially looking like torture’ is not all that correlated with the chance you’re causing a mind to meaningfully be tortured, in either direction. Yes, if you see an AI or a person choosing to patronize then beat up and rob prostitutes in Grand Theft Auto, and there’s no broader plot reason they need to be doing that and they’re not following explicit instructions, as in they actively want to do it, then that is a rather terrible sign. Is this that? I think mostly no.
People Are Worried About AI Killing Everyone
The Lighter Side
Oh no.