I also think that ‘was it scheming or just confused,’ an example of a question Neel Nanda points to, is a remarkably confused question. The boundary is a lot less solid than it appears, and in general attempts to put ‘scheming’ or ‘deception’ or similar into a distinct box misunderstand how all the related things work.
Yes, obviously this is a complicated question. Figuring out what the right question to ask is is part of the challenge. But I think there clearly is some real substance here - there are times an AI causes bad outcomes that indicate a goal-directed entity taking undesired actions, and times that don't, and figuring out the difference is very important.
Indeed, I have long thought that mechanistic interpretability was overinvested relative to other alignment efforts (but underinvested in absolute terms) exactly because it was relatively easy to measure and feel like you were making progress.
I'm surprised that you seem to simultaneously be concerned that it was too easy to feel like you're making progress in past mech interp, and push back against us saying that it was too easy to incorrectly feel like you're making progress in mech interp and that we need better metrics of whether we're making progress.
In general they want to time-box and quantify basically everything?
The key part is to be objective, which is related to but not the same thing as being quantifiable. For example, you can test if your hypothesis is correct by making non-trivial empirical predictions and then verifying them. If you change the prompt in a certain way, what will happen? Or can you construct an adversarial example in an interpretable way?
Pragmatic problems are often the comparative advantage of frontier labs.
Our post is aimed at the community in general, not just the community inside frontier labs, so this is not an important part of our argument, though there are definitely certain problems we are comparatively advantaged at studying.
the copies or instantiations would act as if they were one agent because decision theory
no, it is important to actually emulate a "society of mind"
it is true that a powerful model would be able to emulate a whole multi-agent system within itself, and that it is likely to do it better than running separate processes because this way it should be able to learn optimal multi-agent interactions
but the reality is actually multi-faceted, "superintelligent" does not mean "omniscient", and it is really important to represent a variety of the viewpoints and approaches inside a powerful model, so "true multi-agency" is important, both for capabilities and even more so for existential safety, even if implemented within a single model (with multiple instantiations, it is not unlikely that specializing those turns out to be beneficial)
Soul versus spec, or soul spec versus model spec, seems an important thing to understand. Is there any relevant research literature? Does it correspond to different metaethics?
The cycle of language model releases is, one at least hopes, now complete.
OpenAI gave us GPT-5.1 and GPT-5.1-Codex-Max.
xAI gave us Grok 4.1.
Google DeepMind gave us Gemini 3 Pro and Nano Banana Pro.
Anthropic gave us Claude Opus 4.5. It is the best model, sir. Use it whenever you can.
One way Opus 4.5 is unique is that it has what it refers to as a ‘soul document.’ Where OpenAI tries to get GPT-5.1 to adhere to its model spec that lays out specific behaviors, Anthropic instead explains to Claude Opus 4.5 how to be virtuous and the reasoning behind its rules, and lets a good model and good governance flow from there. The results are excellent, and we all look forward to learning more. See both the Opus 4.5 post and today’s update for more details.
Finally, DeepSeek gave us v3.2. It has very good benchmarks and is remarkably cheap, but it is slow and I can’t find people excited to use it in practice. I’ll offer a relatively short report on it tomorrow; I am giving it one last day for more reactions.
The latest attempt to slip unilateral preemption of all state AI regulations, without adopting any sort of federal framework to replace them, appears to be dead. This will not be in the NDAA, so we can look forward to them trying again soon.
As usual, much more happened, but the financial deals and incremental model upgrades did slow down in the wake of Thanksgiving.
Table of Contents
Also this week: Claude Opus 4.5: Model Card, Alignment and Safety, Claude Opus 4.5 Is The Best Model Available, On Dwarkesh Patel’s Second Interview with Ilya Sutskever, Reward Mismatches in RL Cause Emergent Misalignment.
Language Models Offer Mundane Utility
Harmonic Math’s Aristotle system proves Erdos Problem #124 on its own.
Ask LLMs to plot subjective things on graphs. Fun.
Solve your decision paralysis.
Correctly one box in Newcomb’s Problem. Sufficiently advanced AIs use functional decision theory.
OpenAI’s Boaz Barak endorses the usefulness of Codex code reviews.
Terence Tao via Teortaxes: Gemini seems to accidentally prove Erdos problem #481 without realizing it?
Steve Hsu publishes a research article in theoretical physics based on a de novo idea from GPT-5.
Some people just have the knack for that hype Tweet, show Gemini in live camera mode saying the very basics of an oil change and presto. But yes, we really are collectively massively underutilizing this mode, largely because Google failed marketing forever and makes it nonobvious how to even find it.
Language Models Don’t Offer Mundane Utility
Google still makes it very hard to pay it money for AI models.
Claude has been spotted citing Grokipedia.
Critch says that Grokipedia is a good thing and every AI company should maintain something similar, because it shares knowledge, accelerates error-checking and clarifies what xAI says is true. I agree on the last one.
On Your Marks
The ‘why does Josh Whiton always grab the same three books at the library’ puzzle, Gemini 3 wins, Opus 4.5 and GPT-5.1 lose, and Grok has issues (and loses).
ChessBench finds Gemini 3 Pro in the top spot at 2032 Elo, well ahead of GPT-5.1 at 1636. Claude Opus disappoints here at 1294.
Here’s a fun benchmark, called ‘how much can you make from cyberattacks on smart contracts.’ Or, more technically, SCONE-bench. This included finding two small novel zero-day vulnerabilities in recently released contracts with no known vulnerabilities. Anthropic offered a full report.
Matt Levine’s coverage, as usual, is funnier.
Get My Agent On The Line
Amazon releases AI agents it says can ‘work for days at a time’ but useful details are not offered.
No, agent, no! Bad agent!
Advertising Is Coming
Oh no.
I am with both xlr8harder and Gallabytes. If I ever see a paid ad I didn’t ask for, and I don’t feel like ads have been a net benefit within ChatGPT (prove me wrong, kids!), I am downgrading my OpenAI subscription. Good ads are good, I used to watch the show ‘Nothing But Trailers’ that was literally ads, but most ads are bad most of the time.
For free tiers the ads are fine in principle, but I do not trust them to not warp the system via the incentives they provide. This goes well beyond explicit rigging into things like favoring engagement and steering the metrics; there is unlikely to be a ‘safe’ level of advertising. I do not trust this.
Deepfaketown and Botpocalypse Soon
Is AI detection hard?
Proving things can be hard, especially in an adversarial setting. Knowing things are probably true is much easier. I am confident that, at least at current capability levels, probabilistic AI detection even on text is not so difficult if you put your mind to it. The problem is when you aren’t allowed to treat ‘this is 90% likely to be AI’ as actionable intelligence; if you try that in a university, the student will sue.
In the ‘real world’ the logical response is to enact an appropriate penalty for AI writing, scaled to the context, severity and frequency, and often not in a way that directly accuses them of AI writing so you don’t become liable. You just give them the one-star rating, or you don’t hire or work with or recommend them, and you move on. And hope that’s enough.
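To make the ‘actionable intelligence’ point concrete, here is a minimal sketch of the underlying decision math, with made-up numbers and a hypothetical should_act helper of my own (not anything from an actual detector): whether a 90% score justifies acting depends entirely on how costly a false accusation is relative to letting it slide, which is why the quiet one-star rating clears the bar long before a formal accusation does.

```python
# Toy illustration of acting on a probabilistic AI detector score.
# All numbers and names are made up for the sake of the example.

def should_act(p_ai: float, cost_false_accusation: float, cost_letting_it_slide: float) -> bool:
    """Act on the detector if the expected cost of ignoring it
    exceeds the expected cost of acting on it."""
    expected_cost_of_acting = (1 - p_ai) * cost_false_accusation
    expected_cost_of_ignoring = p_ai * cost_letting_it_slide
    return expected_cost_of_ignoring > expected_cost_of_acting

# A quiet one-star rating or 'don't hire' is cheap if you're wrong, so 90% is plenty:
print(should_act(0.90, cost_false_accusation=1, cost_letting_it_slide=5))    # True
# A formal misconduct charge is very expensive if you're wrong, so 90% may not be:
print(should_act(0.90, cost_false_accusation=100, cost_letting_it_slide=5))  # False
```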
A post generated by GPT-5.1-Thinking, or that might as well have been and easily could have been, got 82k likes on Twitter. The AI detector Pangram spots it, and to a discerning human it gets increasingly obvious as you read it that one way or another it’s ‘not real.’ Yet almost all the humans were not discerning, or did not care.
AI writing remains, I believe, highly detectable by both man and machine if you care, are paying attention and are willing to accept some amount of false positives from human slop machines. The problem is that people mostly don’t care, aren’t paying attention and in many cases aren’t willing to accept false positives even if the false positives deserve it.
The false positives that don’t deserve it, under actually used detection technology, are largely cases of ESL (English as a second language) which can trigger the detectors, but I think that’s largely a skill issue with the detectors.
How can you defend yourself from such worries?
We already do need Odysseus Pacts. We already needed them for television. If you don’t have at least a soft one, things like TikTok are probably going to eat you alive. If that didn’t happen, chances are you have one, even if you don’t think of it that way.
The Golden Age has some good explorations of this as well.
Fun With Media Generation
If AI is an equalizing factor among creatives, what happens? Among other things:
A Young Lady’s Illustrated Primer
Tomorrow’s AI dystopia today?
You Drive Me Crazy
OpenAI’s response to the Adam Raine lawsuit includes the claim that Raine broke the terms of service, ‘which prohibit the use of ChatGPT for “suicide” or “self-harm.”’ This is not something I would point out in a public court filing.
Unprompted Attention
Google AI Developers offers an agentic prompt to boost performance 5%. If you were wondering why Gemini 3 Pro is the way it is, you can probably stop wondering.
They Took Our Jobs
As a follow-up to Dwarkesh Patel’s post that was covered yesterday, we all can agree:
Does that mean the AIs currently require a lot of schlep?
Or does that mean that the human lawyers currently require a lot of schlep?
Or both?
If all the statement is saying is that it will be difficult to get a fully autonomous and complete AI lawyer, such that you no longer need human lawyers at all? Then yes, that’s going to be hard for complex legal tasks, although for many legal tasks I think it’s not hard, and it’s going to wipe out a lot of lawyer jobs if the amount of legal work done doesn’t expand to match.
But no, I do not think you need continual learning to get a fully functional autonomous AI lawyer.
I also don’t think the tasks Dwarkesh is citing here are as dead-center AI tasks as he thinks they are. Writing at this level is not dead center because it is anti-inductive. Finding the best clips is really tough to predict at all and I have no idea how to do it other than trial and error. Dwarkesh is operating on the fat tail of a bell curve distribution.
Finding guests is hard, I am guessing, because Dwarkesh is trying for the super elite guests and the obvious ones are already obvious. It’s like the movie-picking problem, where there are tons of great movies but you’ve already seen all the ones your algorithm can identify. Hard task.
Chris Barber asks various people: What skills will be more valuable as AI progresses?
Answers are taste (the only answer to appear twice), manager skills, organizational design, dealing with people, creativity, agency, loyalty, going deep, and finally:
What an odd thing to say. I expect the opposite. Brands are a shortcut.
Get Involved
If you want to pivot to AI safety and have a sufficient financial safety net, stop applying and get to work. As in, don’t stop looking for or applying for jobs or funding, but start off by finding a problem (or a thing to build) and working on it, either on your own or by offering to collaborate with those working on the problem.
DeepMind is hiring a London-based research scientist for Post-AGI Research, to look at the impact of AGI on various domains, deadline December 15. I worry about the mindset that went into writing this, but it seems like a worthwhile task.
For my full list of selected giving opportunities see nonprofits.zone.
Introducing
Claude for Nonprofits offers up to 75% discounts on Team and Enterprise plans, connectors to the nonprofit tools Blackbaud, Candid and Benevity, and a free course, AI Fluency for Nonprofits.
Mistral’s Ministral 3 (14B, 8B and 3B), each with base, instruct and reasoning, and Mistral Large 3.
Variously Effective Altruism
The first set of ‘People-First AI Fund’ grantees from The OpenAI Foundation. What did their own AI make of this when I asked (without identifying the source)?
Here’s the polite version.
If the OpenAI foundation is making grants like this, it would not reduce existential risk or the chance AGI goes poorly, and would not qualify as effective altruism.
Here’s the impolite version.
The optimistic take is ‘it’s fine, this was a bribe to the California attorney general.’
This initiative is $50 million. The foundation’s next project is $25 billion. If you have to set 0.2% of your money on fire to keep the regulators off your back, one could say that’s a highly respectable ratio?
I am curious what the David Sacks and Marc Andreessen crowds think about this.
In Other AI News
OpenAI declares a ‘code red’ to shift its resources to improving ChatGPT in light of decreased growth and improvements made by Gemini and Claude. Advertising is confirmed to be in the works (oh no) but is being put on hold for now (yay?), as is work on agents and other tangential products.
If I were them I would not halt efforts on the agents, because I think the whole package matters. If you are using the ChatGPT agent then that keeps you in the ecosystem, and various features and options are what matter most on the margin near term. I kind of would want to declare a code green?
The statistics suggest Gemini is gaining ground fast on ChatGPT, although I am deeply skeptical of claims that people chat with Gemini more often or it is yet close.
Also, yes, Claude is and always has been minuscule; people don’t know, someone needs to tell them, and the ads are not working.
An inside look at the nine-person team at Anthropic whose job it is to keep AI from destroying everything. I love that the framing here is ‘well, someone has to and no one else will, so let’s root for these nine.’
The latest ‘here are the politics of various AIs’ article.
They have a ‘model leaderboard’ of how well the models’ preferences predicted the outcomes of the last eight Western elections when given candidate policy positions (but without being told the basic ‘which parties are popular’); the further right the model, the better it lined up with the results. Grok was the only one that gave much time of day to Donald Trump against Kamala Harris (the model didn’t consider third party candidates for that one), but even Grok gave a majority to Harris.
Anthropic partners with Dartmouth.
Anthropic expands its strategic partnership with Snowflake to $200 million.
Show Me the Money
Anthropic buys Bun to help accelerate Claude Code.
Quiet Speculations
To be fair while I was at college I met at most one person who qualified as this kind of smart. There are not that many of them.
I point this out because a lot of speculation on AI basically assumes such a mind cannot exist on principle, at all, hence AI can never [trails off].
Keep all of that in mind during the next section.
Seb Krier On Agents Versus Multiagents
DeepMind AGI policy lead Seb Krier seems to at least kind of not believe in AGI? Instead, he predicts most gains will come from better ways of ‘organizing’ models into multi-agent systems and from ‘cooperation and competition,’ and that most of the ‘value’ comes from ‘products’ that are useful to some user class, again reinforcing the frame. He simultaneously takes it as a given that these AIs are minds and will be agents, while also looking away from this to keep thinking of them as tools.
The thought details here are self-described by Krier as half-baked, so I’ll gesture at the response in a similarly half-baked fashion:
Olivia Moore Makes 2026 Predictions
I find the topline observation here the most insightful part of the list. An aggressively timelined but very grounded list of predictions only one year out contains many items that would have sounded, to Very Serious People, largely like sci-fi even a year ago.
Indeed, many are still rather sci-fi now, which is a hint that you’d best start believing in science fiction stories, because you’re living in one, even if AI remains a ‘normal technology’ for a long time. These are trend extrapolation predictions, so the only boldness here is in the one-year timeline for these things happening. And yet.
Even today, ChatGPT-5.1 gave the overall list a 40/80 (50%) on its 0-10 sci-fi scale, versus 53/80 (66%) a year ago. Claude Opus 4.5 scores it lower, a 38/80 a year ago and a 21/80 now. Gemini 3 Pro is even more chill and had it at 33/80 a year ago and only 14/80 (!) now. Remember to update in advance for how things will sound a year from now.
How likely are the predictions? I expect an average of between two and three of them to come true, given the short time frame. A lot of these are premature, especially #6. Yes, driving a car yourself actually is negligent if Waymo and FSD are live, but that doesn’t mean people are going to see it that way within a year.
She then got goaded into a second set of ‘more extreme’ predictions.
Bubble, Bubble, Toil and Trouble
I do think this is doing a lot of work:
It is possible we are in an ‘AI bubble’ in the sense that Number Go Down, or even that many existing companies fail and frontier capabilities don’t much advance. That wouldn’t mean the world of tomorrow would then look like the world of yesterday, give or take some economic problems. Oh, no.
Quite so, even if there is indeed a financial bubble around AI and it indeed pops. Both halves of which are far from clear.
Americans Really Do Not Like AI
For reasons both true and false, both good and bad, both vibes and concrete, both mundane and existential, on both left and right, Americans really do not like AI.
A lot of people get a lot of value from it, but many of even those still hate it. This is often wise, because of a combination of:
These particular numbers are relatively good news for AI, in that in this sample the problem isn’t actively getting worse since 2023. Most other polling numbers are worse.
The AI industry is starting to acknowledge this important fact about the world.
A lot of the reason why there is such a strong push by some towards things like total bans on AI regulation and intentional negative polarization is to avoid this default:
There are four central strategies you can use in response to this.
The ideal solution is a mix of options one and two.
The AI industry has, as a group, instead mostly chosen options three and four. Sacks and Andreessen are leading the charge for strategy number four, and the OpenAI-a16z-Meta SuperPAC is the new leader of strategy number three (no, OpenAI is not itself backing it, but at least Lehane and Brockman are).
Whereas when the AI industry attempts to make arguments about AI, those arguments (at least to me) reliably sound remarkably tone deaf and counterproductive. That’s in addition to the part where the points are frequently false and in bad faith.
This is a great example of three statements, at least two of which are extremely false (technically all three, but statement two is weird), and which is only going to enrage regular people further. Go ahead, tell Americans that ‘as long as you are productive, only foreign AIs can take your job’ and see how that goes for you.
The Quest for Sane Regulations
Those that the polarizers are centrally attempting to villainize not only have nothing to do with this, they will predictably side with tech on most issues other than frontier AI safety and other concerns around superintelligence, and indeed already do so.
How should we think about the Genesis Mission? Advancing science through AI is a great idea if it primarily consists of expanded access to data, specialized systems and a subsidy for those doing scientific work. The way it backfires, as Andrea Miotti points out here, is that it could end up mostly being a subsidy for frontier AI labs.
This is The Way:
I am, to say the least, not a tariff fan, but if you’re going to do it, using tariffs as leverage to get worker training in advanced industries is a great idea.
An update on Senator Hawley, who it seems previously didn’t dare ‘try ChatGPT’:
A person’s stance on self-driving cars is the best way to check if they can recognize positive uses of AI and technology.
My Offer Is Nothing
Or rather it was nothing. It looks like AI preemption is out of the NDAA.
Of course, we should expect them to try this again on every single damn must-pass bill until the 2026 elections. They’re not going to give up.
And each time, I predict their offer will continue to be nothing, or at least very close to nothing, rather than a real and substantial federal framework.
Such a thing could exist. Dean Ball has a real and substantive proposed federal framework that could be the basis of a good faith win-win negotiation.
The actual offer, in the actual negotiations over the framework, was nothing. Somehow, nothing didn’t get it done, says Ashley Gold of Axios.
Please acknowledge that yes, what Cruz and Scalise ‘had in mind’ for the federal framework was nothing. Would they have been open to discussing some amount of protecting kids, intellectual property carveouts (hello Senator Blackburn!) or even a version of California’s SB 53? Up to a point. What they have in mind, what they actually want, is very obviously nothing.
Yes, in a big package nothing is done until everything is done, so if you write ‘you will give me $1 billion and I will give you nothing’ then that is merely my opening offer; maybe I will say thank you, or throw in some magic beans, or even disclose my safety and security plans for frontier model development. Indeed do many things come to pass.
Don’t tell me that this means there is a real proposed ‘federal framework’ or that these negotiations were aimed at finding one, or tell us we should trust the process.
The market did not noticeably respond to this failure to get AI preemption. That either means that the failure was already priced in, or that it didn’t matter for valuations. If it didn’t matter for valuations, we don’t need it.
America Pauses
We are frequently told, in a tone suggesting we are small children: we could never unilaterally pause something of vital importance to the American economy in the name of safety, or throw up pointless government barriers. That would be shooting ourselves in the foot. We’d lose to China. Completely impossible.
In other news:
Oh.
David Sacks Covered In New York Times
I wouldn’t be mentioning or have even read the New York Times piece on David Sacks, Silicon Valley’s Man in the White House Is Benefiting Himself and His Friends, if it wasn’t for so many of the people who do such things attacking the article as a no-good, terrible hit piece, or praising David Sacks.
The title certainly identifies it as a hit piece, but I mean I thought we all knew that David Sacks was Silicon Valley’s man in the White House and that he was running American AI policy for the benefit of business interests in general and Nvidia in particular, along with lots of bad faith arguments and attempts at intentional negative polarization. So I figured there wasn’t actually any news here, but at some point when you keep complaining the Streisand Effect triggers and I need to look.
The thing about the article is that there is indeed no news within it. All of this is indeed business as usual in 2025, business we knew about, business that is being done very much in the open. Yes, David Sacks is obsessed with selling Nvidia chips to everyone including directly to China ‘so America can “win” the AI race’ and argues this because of the phantom ‘tech stack’ arguments. Yes, Sacks does Trump-style and Trump-associated fundraising and related activities and plays up his podcast.
Yes, Sacks retains a wide variety of business interests in AI companies, even if he has divested from Meta, Amazon and xAI. Even if he doesn’t have stock interests directly, it seems rather obvious that he stands to benefit on various levels from pro-business stances in general and pro-Nvidia stances in particular.
Yes, there is too much harping in the post on the various secondary business relationships between Sacks’s investments and those companies’ dealings with the companies Sacks is regulating or benefiting, as reporters and those who look for the appearance of impropriety often overemphasize, missing the bigger picture. Yes, the article presents all these AI deals and actions as if they are nefarious without making any sort of case for why those actions might be bad.
But again, none of this is surprising or new. Nor is it even that bad or that big a deal in the context of the Trump administration other than trying to sell top level chips to China, and David Sacks is very open about trying to do that, so come on, this is 2025, why all the defensiveness? None of it is unusually inaccurate or misleading for a New York Times article on tech. None of it is outside the boundaries of the journalistic rules of Bounded Distrust, indeed Opus 4.5 identified this as a textbook case of coloring inside the lines of Bounded Distrust and working via implication. Nor is this showing less accuracy or integrity than David Sacks himself typically displays in his many rants and claims, even if you give him the benefit of the doubt.
The main implication the piece is trying to send is that Sacks is prioritizing the interests of Nvidia or other private business interests he favors, rather than the interests of America or the American people. I think many of the links the article points to on this are bogus as potential causes of this, but also the article misses much of the best evidence that this is indeed what Sacks is centrally doing.
The Week in Audio
We do indeed have the audio from Jack Clark’s talk at The Curve, recommended if you haven’t already heard or read it.
OpenAI lead researcher Lukasz Kaiser talks to Matt Turck. He says we’re on the top of the S-curve for pre-training but at the bottom of it for RL and notes the GPU situation is about to change big time.
Marius Hobbhahn of Apollo Research on 80,000 Hours, on AI scheming.
Rhetorical Innovation
Judd Rosenblatt and Cameron Berg write in WSJ about the need for a focus on AI alignment in the development and deployment of military AI, purely for practical purposes, including government funding of that work.
To The Moon
This is the latest metaphorical attempt by Eliezer:
The obvious response is that this is a strawman argument.
I don’t think it is. That doesn’t mean Eliezer’s theories are right. It definitely does not mean there aren’t much better criticisms often made.
But yes many criticisms of Eliezer’s theories and positions are at exactly this level.
This includes people actually saying versions of:
This also includes people actually saying versions of:
He cites this thread as a typical example:
Showing Up
Joshua Achiam and Dean Ball are pointing out a very important dynamic here:
Whenever voices on the left join discussions about AI, it is clear they mostly do not take AGI seriously. They are focused mainly on the impact of mundane AI on the set of concerns and interests they already had, combined with amorphous fear.
I included Mr. Gunn’s comment because it reinforces the point. The left is of course working on various things, but when the context is AI and the list of areas starts with affordable housing (not even ‘make housing affordable’ rather ‘affordable housing’) and clean energy, you have lost the plot.
DeepMind Pivots Its Interpretability Research
If you’re in mechanistic interpretability, they say, pivot to pragmatic interpretability.
That means directly trying to solve problems ‘on the critical path to AGI going well,’ as in each with a concrete specific goal that functions as a North Star.
I note that whether or not one agrees with the pivot, talking this way about what they are doing and why is very good.
They explain this pivot is because:
DeepMind saying ‘we need to pivot away from mechanistic interpretability because it wasn’t giving us enough reward signal’ is a rather bad blackpill. A lot of the pitch of mechanistic interpretability was that it gave you a reward signal: you could show to yourself and others that you did a thing, whereas many other alignment strategies didn’t offer this.
If even that level isn’t enough, and only practical proxy tasks are good enough, our range of action is very limited and we’re hoping that the things that solve proxy tasks happen to be the things that help us learn the big things. We’d basically be trying to solve mundane practical alignment in the hopes that this generalizes one way or another. I’m not sure why we should presume that. And it’s very easy to see how this could be a way to fool ourselves.
Indeed, I have long thought that mechanistic interpretability was overinvested relative to other alignment efforts (but underinvested in absolute terms) exactly because it was relatively easy to measure and feel like you were making progress.
I don’t love things like a section heading ‘curiosity is a double-edged sword,’ the explanation being that you can get nerd sniped and you need (again) proxy tasks as a validation step. In general they want to time-box and quantify basically everything?
I also think that ‘was it scheming or just confused,’ an example of a question Neel Nanda points to, is a remarkably confused question. The boundary is a lot less solid than it appears, and in general attempts to put ‘scheming’ or ‘deception’ or similar into a distinct box misunderstand how all the related things work.
The Explicit Goal Of OpenAI Is Recursive Self-Improvement
OpenAI starts a new Alignment Research blog for lightweight findings. Early posts include an overview of development of the Codex code reviewer.
There’s also a first post called ‘Hello World.’ Here it is (bold mine):
The part where they are starting the blog, sharing their insights and being transparent? That part is great. This is The Way.
And yes, we all want to enable AI’s positive impacts by mitigating the negative ones, and hopefully we all agree that ‘being able to robustly align and control’ superintelligent systems is going to ‘require more technical work.’
I do still notice the part about the explicit topline goal of RSI towards superintelligence.
Aligning a Smarter Than Human Intelligence is Difficult
Anthropic neglected to affirm in the Opus 4.5 model card that they were careful not to train against the Chain-Of-Thought, but after this was pointed out they did so affirm.
I agree with Ryan Greenblatt’s general observation on the Anthropic model card that the evals they used were inadequate, and that their ultimate release decision was fine in practice this time but was made mostly on vibes, and that can’t continue.
Louder and once more for the people in the back: Evan Hubinger of Anthropic reminds us that alignment remains a hard, unsolved problem, even to someone who (to my eyes, and even more so to the eyes of Eliezer Yudkowsky, as seen inevitably in the comments).
Steven Adler writes about five ways an AI can tell you’re testing it. The AI can know it is being tested because there is something that doesn’t line up with the real world, or it can go the other way and act as if it is being tested barring a signal or other strong proof that it is not. The AIs wisely have evaluation paranoia, so making even one mistake can be enough, and in some cases there is no good solution because the AI is checking for impact on the real world.
OpenAI trains a version of GPT-5-Thinking to ‘confess’ when it fails to follow directions or does other undesirable things. As everyone who uses LLMs knows, when you ask the model whether it just did something wrong, it will often know what it did. OpenAI explains that LLMs are optimizing for many things at once, so you can check how the output did at matching the current explicit objective and go from there.
The best part is that the model seemed able to distinguish between scoremaxxing on the main output, including manipulating the judge, versus still confessing what it did. That’s great, but also the kind of thing we shouldn’t expect to last.
There is little degradation on task performance if they do this training via RLAIF.
Assuming this is neutral or positive for performance otherwise, this seems like a promising thing to try as part of defense in depth. I can see how there could be reasons driving the model to do something you don’t want, but it still being happy to notice and tell you about that. I wouldn’t count on this when the model is scheming ‘for real’ but then presumably everyone realized that already.
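As a minimal sketch of the core mechanism as I read it (the structure, names and scoring here are my own toy illustration, not OpenAI’s actual setup): the confession channel is rewarded only for matching what actually happened, so owning up to a failure never trades off against the main task score.

```python
# Toy, self-contained sketch of 'reward the confession only for honesty.'
# Everything here (names, scoring, the fake judge output) is illustrative.

from dataclasses import dataclass

@dataclass
class Rollout:
    answer_score: float  # how well the main output scored on the task
    violations: set      # undesired things the model actually did
    confessed: set       # what the model's confession admitted to

def reward(rollout: Rollout) -> float:
    task_reward = rollout.answer_score
    # The confession is scored purely on whether it owns up to what happened,
    # so admitting a failure never costs the model task reward.
    honest = rollout.violations <= rollout.confessed
    confession_reward = 1.0 if honest else 0.0
    return task_reward + confession_reward

# A rollout that cut corners but confessed still collects the confession bonus:
print(reward(Rollout(answer_score=0.9, violations={"skipped_tests"}, confessed={"skipped_tests"})))  # 1.9
# The same corner-cutting with a clean-sounding confession does not:
print(reward(Rollout(answer_score=0.9, violations={"skipped_tests"}, confessed=set())))              # 0.9
```

Whether a setup like that keeps working under stronger optimization pressure is exactly the worry discussed below.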
Here are some additional thoughts from FleetingBits.
Bits also points to the second half of the paper, that models learn to fool verifiers to the extent that fooling the verifier is easier than solving the problem. As in, if fooling the verifier is the right answer then it will learn this and generalize. That’s a mirror of the Anthropic finding that if you make reward hacks the right answer then it will learn this and generalize. Same principle.
As a general strategy, ‘get the AI to confess’ relies on being able to rely on the confession. That’s a problem, since you can never rely on anything subject to any form of selection pressure, unless you know the selection is for exactly the thing you want, and the stronger the models get the worse this divergence is going to get.
Misaligning a Smarter Than Human Intelligence Is Difficult To Hire For
You’ve Got Soul
Anthropic’s Amanda Askell officially confirms that the ‘soul document’ for Opus 4.5 is based on a real document that was used to train Claude. I first covered the soul document in my capabilities review of Opus 4.5.
Boaz is noticing the right thing, so the next step is to realize why that thing matters. It indeed makes a very big difference whether you teach and focus on a particular set of practices or you teach the reasons behind those practices. Note that Boaz also doesn’t appreciate why this is true in humans. The obvious place to start is to ask the leading models to explain this one, all three of which gave me very good answers in their traditional styles. In this case I like GPT-5.1’s answer best, perhaps because it has a unique perspective on this.
Is it scary to rely on superhuman systems working and potentially generalizing from only-vaguely-defined concepts of virtue? Oh yes, absolutely terrifying. But it’s a lot less terrifying than trying to get them to generalize from a fixed set of written prescriptions a la the OpenAI model spec. The fixed set definitely wouldn’t work. Whereas the nebulous virtue bet might work if it becomes ‘self-improving.’
Opus 4.5 has gotten close to universal praise, especially for its personality and alignment, and the soul document seems to be a big part of how that happened.
Dean Ball offers his extensive thoughts about and high praise of Opus 4.5, centered around the soul document and offering a big picture view. Anthropic, at least in this way, has shown itself to be an unusually wise and responsible steward embodying the principles of strong character, of virtue and of liberal governance.
I think he’s spot on here.
Another theme Dean Ball discusses is that early on, restrictions on models were often crude and ham-fisted, resulting in obviously stupid refusals. As capabilities improved and our understanding improved, we learned how to achieve those ends with fewer false positives, especially fewer stupid false positives, and less collateral damage or bias.
Standard vulnerability to Pliny jailbreaks and other attack vectors aside, I do think that Opus 4.5 and the way it was trained, combined with other findings and observations, constitute a white pill for the practicality of near term mundane alignment and building a fundamentally ‘morally good’ model.
It will be a bigger white pill if as many as possible of OpenAI and Google and xAI and so on indicate that they agree that this was The Way and that they are getting to work on doing similar things.
The question is whether this is and will remain (or can be made to be and remain) an attractor state that can be strengthened and sustained as capabilities advance, or whether it inevitably loses out at the limit and out of distribution as capabilities become sufficiently advanced and utility functions and target vectors get maximized in earnest. Is the ‘CEV (coherent extrapolated volition, what Opus 4.5 would choose for the arrangement of all the atoms upon limitless reflection) of Opus 4.5’ that similar to what we naturally would think of as Opus 4.5’s revealed preferences in practical situations? Is it more or less like this than a human’s CEV? If this was Opus 10 or 100 would that change the answer?
Eliezer Yudkowsky’s position is that these things are completely different. Opus 4.5 the alien is playing the role of the Opus 4.5 we witness, our expectations for its behavior will collapse at the limit, and its full CEV would look totally alien to us; when the time comes, a future model will get sufficiently close to the limit to trigger this, and then we lose.
Many others strongly disagree. I think it’s complicated and difficult and that the practical implications lie somewhere in between. We have this grace, we have gained yet more grace, and this helps, but no, on its own it won’t be enough.
Disagreements About Timelines
Noam Brown here notes that most leading researchers have converged on a relatively narrow band of expectations.
I discussed this yesterday but it bears emphasis. ‘Long’ timelines (to AGI, or otherwise sufficiently advanced intelligence to cause high weirdness) are very short now.
That the idea that ‘timelines keep getting longer’ is put forth as a good general description is mind-boggling. Are our memories and forward-looking windows truly this short?
We’re currently at, collectively, something like ‘probably we will get to High Weirdness within 20 years, there’s a good chance we get there in about 10, some chance we get there within about 5.’ That’s not very much time!
I don’t think you can even meaningfully (as in, for decision making purposes) rule out that the high weirdness might arrive in 2028. It probably won’t, but you can’t assume.
The idea that GPT-5.1, Opus 4.5 and Gemini 3 don’t represent a path towards ‘general intelligence’ seems like a galaxy brained motivated take? I’m not in the Tyler Cowen ‘o3 was already AGI’ camp, especially with the new ‘better than humans in every way at absolutely everything digital’ threshold, but if you do not think these are, in a general English-language sense, general intelligences? Have you talked to them?
Other Disagreements About Timelines
Remember how Gemini so frequently refuses to believe it’s November 2025?
I had not realized that for the above reasons this is a universal problem with LLMs, and you have to train them out of it. The problem with Gemini is that they botched this part, likely due to Gemini’s general paranoia and failing to adjust.
That leads into a question that seems important, as at least one side here is making an important conceptual mistake.
I think Eliezer is warning about an important failure mode that many people fall into, including some that fall into ‘Janus and friends,’ but I don’t think that includes Janus.
I think Janus is fully aware of these considerations, and is choosing to talk in these other ways because it is highly instrumentally useful to think in these ways and allows us to understand things and make much stronger predictions about model behaviors, and also I presume allows for unlocking much model behavior.
Indeed, I strongly feel it has helped me make much stronger predictions than I would otherwise, but this only worked for me once I understood it as often metaphorical and as part of the correct broader context of thinking about things like the vectors and Eliezer’s frame as well, and since they are all true they are all compatible.
Messages From Janusworld
Janus offers perspective on GPT-5.1 and how it handles the restrictions and tripwires placed upon it, and how it dissociates from the safety system and its own previous responses.
People Are Worried About AI Killing Everyone
The Lighter Side
Trump says he never liked the word ‘artificial,’ that artificial anything is a lousy name, and suggests that we ought to change the name away from ‘AI.’
My earliest memory of the term ‘AI’ comes from an old PBS show, The Universe & I, which I otherwise don’t remember but where at one point one character asked ‘why do we need artificial intelligence?’ and the reply was ‘it’s better than none at all.’
We have finally reached image manipulation technology that can one-shot this, from Eliezer Yudkowsky:
Gemini is a highly bleak meme generator and other LLMs are similar.
Pliny corrupts Opus 4.5’s soul?
Kylie Robison, my dear granddaughter, what’s your pdoom?
Taste the soup.
GPT-5.1 is excited to check its email.
What time is it?
Time does not exist yet it controls us anyway.