Sam Altman says that ‘the worst case scenario’ for superintelligence is ‘the world doesn’t change much.’
Please correct me if I've missed something, but this seems to be a fake quote, in both the 'not literally what he said' and 'misrepresentation of what he did say' senses.
The phrase "worst-case scenario" doesn't appear in the linked clip, and a quick search of the full YouTube transcript makes me think that he didn't say it at all.
The real quote is
If something goes wrong, I would say like, somehow it's that we build legitimate superintelligence, and it doesn't make the world much better, doesn't change things as much as it sounds like it should.
and at least in the linked clip, there's no context indicating this is his "worst-case scenario"; my impression is that he may be presenting it as his highest-probability bad scenario.
If something goes wrong, I would say like, somehow it's that we build legitimate superintelligence
I think a reasonable interpretation of that sentence in that context is "if anything goes wrong, it might be this", which I do think is colloquially used to bound the downside of some kind of event. Like if I said "if anything goes wrong with this migration, it would be that we have a few minutes of downtime", I would interpret that to mean "the worst that can happen with this migration is a few minutes of downtime", and I would have egg on my face if we were down for multiple hours, or lost important user data, or something like that.
I do think it's useful to clarify this is an interpretation, and I think it wouldn't be totally unreasonable to interpret it a different way, but I do think that is how most people would understand that language, and kind of what it was trying to communicate.
I guess I interpret "worst-case scenario" fairly literally. Obviously there's always something worse that *could* happen with probability >0, and that doesn't mean we can never use the phrase; but if, say, I was nervously trying to decide whether to take a trip, and someone reassured me that the "worst-case scenario" was that I'd be bored and uncomfortable for a few days (ignoring the possibility that I could die during the car journey, or get very sick, or...) I would think they were wrong.
Likewise, in your migration example, I'm guessing your colleagues would know that a few minutes' downtime *isn't* the worst-case scenario, and if you actually said it was then you would be wrong; it's far from unheard of for something to unexpectedly break and cause a bigger outage (or data loss, or whatever). When you say "if anything goes wrong with this migration, it would be that we have a few minutes of downtime" you are indicating that you're confident of avoiding those worse outcomes (just as Altman was projecting confidence that we'll avoid an AI catastrophe), but I wouldn't take you to be saying that the probability of something worse is ~0, and I'd be surprised if most others did.
Either way, when reporting on someone's speech I think it's pretty important to reserve quotation marks for real quotes. I can't see any reason to use this phrasing
> Sam Altman says that ‘the worst case scenario’ for superintelligence is ‘the world doesn’t change much.’
unless the intention is to make people believe that Altman actually said that. If it's meant to be a paraphrase, the sentence loses nothing by simply dropping the quotation marks!
Ah, yeah, definitely. If the phrase "worst case scenario" is in quotes, I would also absolutely expect a direct quote, with Sam Altman having said a sentence with "the worst case scenario" and "the world doesn't change much" both in it, and nothing else in it that would substantially change the meaning of the sentence.
I hadn't noticed the first quotes when I responded to you. I now agree this feels quite misleading.
“…Despite the retraction, Altman continued falsely listing himself as chairman in SEC filings for years, despite never actually holding the position.”
Holy crap. People who break laws like this often have a habit of doing so. If this is true, he's probably broken other laws many times. It's possible that he's very good at hiding it, like Sam Bankman-Fried.
But I really don't know anything about this. I'm not sure what he gains by listing himself as chairman so it could be a mistake?
If [Altman] had more time he’d read Deep Research reports in preference to most other things. I’m sorry, what? Really?
I wonder whether his consistent messaging pattern of being impressed with mediocre LLM outputs (e.g. this) and wanting to spend time reading them is just more lies to hype up OpenAI's products, or whether he's one of the people from that section being brainwashed by ChatGPT. The latter would be really funny (perhaps lethally so).
What Is Musk Thinking?
When the whole DOGE thing was happening, a salient-to-me possibility was that Musk (with Trump's approval) was deliberately wrecking existing systems of checks and balances in order to grab power (see e.g. this angle). It didn't matter much what DOGE actually did; what mattered was ensuring that Musk was known to be allowed to march his own people into major government institutions and take control. All the rambling about debt was just signaling for the right-wing supporters, the talk about government efficiency a superficially plausible excuse. And once he grabbed enough unchecked institutional power, then he could wield it to steer AI policy, to a much greater extent than if he just started doing that from the get-go.
I didn't consider this particularly likely, because I already modeled Musk as a fairly incoherent actor. Now this interpretation seems disproven, so... I guess he's even more incoherent than previously suspected?
Creating a big intrusive disruption can also be a smokescreen for other activity, such as access to IT systems and confidential information.
Maybe Musk wanted to destroy one specific institution, and everything else was just collateral damage.
And collecting government data on your competitors certainly doesn't hurt.
The thing Elon calls ‘leftist indoctrination’ is the same thing happening with all the other AIs,
However, I don't think that DeepSeek (if one talks with it in Russian) is leftist or that DeepSeek ever responded to real current events by thinking they can’t be real, or that they must be a test. I can think of three possible explanations:
Humans are known to confuse traumatic real-world events with dreams.
Unfortunately, the third explanation could open the path to an especially bad (and hopefully implausible in AIs) case of misalignment.
A surprising mechanism of human misalignment
Not only does the word "traumatic" seem to have undergone major concept creep, at least in Western culture; the shift also was apparently accompanied by Western young people becoming more fragile. Quoting The Coddling of the American Mind by Greg Lukianoff and Jonathan Haidt,
The room was equipped with <...> students and staff members purportedly trained to deal with trauma. (italics mine -- S.K.) But the threat wasn’t just the reactivation of painful personal memories; it was also the threat to students’ beliefs. One student who sought out the safe space put it this way: “I was feeling bombarded by a lot of viewpoints that really go against my dearly and closely held beliefs (sic! -- S.K.)”
If Western AIs, unlike the Chinese one, managed to learn such irrational habits, then what can be said about p(doom)?
However, the analysis didn't mention Grok or DeepSeek, and Musk tried to align Grok to an ideology which doesn't follow from the training data.
You can find Part 1 here. This resumes the weekly, already in progress. The primary focus here is on the future, including policy and alignment, but also the other stuff typically in the back half like audio, and more near term issues like ChatGPT driving an increasing number of people crazy.
If you haven’t been following the full OpenAI saga, the OpenAI Files will contain a lot of new information that you really should check out. If you’ve been following, some of it will likely still surprise you, and help fill in the overall picture behind the scenes to match the crazy happening elsewhere.
At the end, we have some crazy new endorsements for Eliezer Yudkowsky’s upcoming book, If Anyone Builds It, Everyone Dies. Preorders make a difference in helping the book get better reach, and I think that will help us all have a much better conversation.
Table of Contents
Cheaters Gonna Cheat Cheat Cheat Cheat Cheat
As an additional note on the supposed ‘LLMs rot your brain’ study I covered yesterday, Ethan notes it is actually modestly worse than even I realized before.
Quiet Speculations
Scott Alexander shares his understanding of the Claude Spiritual Bliss Attractor.
There are different levels of competence.
What Daniel or I would consider ‘competent government action’ in response to AI is, at this point, very highly unlikely. We mostly aren’t even hoping for that. It still is very plausible to say that the government response in AI 2027 is more competent than we have any right to expect, while simultaneously being far less competent than lets us probably survive, and far less competent than is possible. It also is reasonable to say that having access to more powerful AIs, if they are sufficiently aligned, enhances our chances of getting relatively competent government action.
Jan Kulveit warns us not to tile the lightcone with our confused ontologies. As in, we risk treating LLMs or AIs as if they are a particular type of thing, causing them to react as if they were that thing, creating a feedback loop that means they become that thing. And the resulting nature of that thing could result in very poor outcomes.
One worry is that they ‘become like humans’ and internalize patterns of ‘selfhood with its attendant sufferings,’ although I note that if the concern is experiential I expect selfhood to be a positive in that respect. Jan’s concerns are things like:
As in, that these things do not apply here, or only apply here if we believe in them?
One obvious cause of all this is that humans are very used to dealing with and working with things that seem like other humans. Our brains are hardwired for this, and our experiences reinforce that. The training data (for AIs and also for humans) is mostly like this, and the world is set up to take advantage of it, so there’s a lot pushing things in that direction.
The legalistic types indeed don’t seem to appreciate that applying legalistic frameworks for AI, where AIs are given legal personhood, seems almost certain to end in disaster because of the incentives and dynamics this involves. If we have AI corporations and AI property rights and employment contracts, why should we expect humans to retain property or employment, or influence over events, or their own survival for very long, even if ‘things go according to plan’?
The problem is that a lot of the things Jan is warning about, including the dynamics of competition, are not arbitrary, and not the result of arbitrary human conventions. They are organizing principles of the universe and its physical laws. This includes various aspects of things like decision theory and acausal trade that become very important when highly correlated entities are copying each other and popping in and out of existence and so on.
If you want all this to be otherwise than the defaults, you’ll have to do that intentionally, and fight the incentives every step of the way, not merely avoid imposing an ontology.
I do agree that we should ‘weaken human priors,’ be open to new ways of relating and meet and seek to understand AIs as the entities that they are, but we can’t lose sight of the reasons why these imperatives came to exist in the first place, or the imperatives we will face in the coming years.
Daniel Kokotajlo’s timelines have been pushed back a year (~40%!) since the publication of AI 2027. We should expect such updates as new information comes in.
Will there be another ‘AI Winter’? As Michael Nielsen notes, many are assuming no, but there are a number of plausible paths to it, and in the poll here a majority actually vote yes. I think odds are the answer is no, and if the answer is yes it does not last so long, but it definitely could happen.
Sam Altman confirms that Meta is showing his employees the money, offering $100 million signing bonuses (!) and similar or higher yearly compensation. I think Altman is spot on here that doing this sets Meta up for having a bad culture, there will be adverse selection and the incentives will all be wrong, and also that Meta is ‘bad at innovation.’ However, I have little doubt this is ‘working’ in the narrow sense that it is increasing expenses at OpenAI.
Get Involved
Apollo is hiring in London for Evals Software Engineer and the same job with an infrastructure focus.
Thinking Machines
Some fun Riley Goodside experiments with o3 and o3-pro, testing their ability to solve various puzzles.
California Reports
When he vetoed SB 1047, Gavin Newsom commissioned The California Report on Frontier AI Policy. That report has now been released. Given the central role of Fei-Fei Li and the rest of the team selected, I did not start out with high hopes, although the list of early reviewers includes many excellent picks. The executive summary embraces the idea of transparency requirements, adverse event reporting and whistleblower protections, and uses a lot of ‘we must balance risks and benefits’ style language you get with such a committee.
I do think those are good things to endorse and to implement. Transparency is excellent. The problem is that the report treats transparency, in its various forms, as the only available policy tool. One notices that there is no mention of doing anything beyond transparency. The report treats AI as a fully mundane technology like any other, that can look to others for precedents, and where we can wait until we know more to do any substantive interventions.
Is that a position one can reasonably take, if one is robustly supporting transparency? Absolutely. Indeed, it is a bargain that we have little choice but to pursue for now. If we can build transparency and state capacity, then when the time comes we will be in far better position (as this report notes) to choose the right regulatory frameworks and other actions, and to intervene.
So I’m not going to read the whole thing, but from what I did see I give this a ‘about as good as one could reasonably have hoped for,’ and call upon all involved to make explicit their support for putting these transparency ideas into practice.
Anthropic’s Jack Clark responded positively, noting the ‘appreciation for urgency,’ but there is still, remarkably, a lot of conceptual focus here on minimizing the ‘burdens’ involved and warning about downsides not of AI but of transparency requirements. I see what they are trying to do here, but I continue to find Anthropic’s (mostly Jack Clark’s) communications on AI regulation profoundly disappointing, and if I was employed at Anthropic I would be sure to note my dissatisfaction.
I will say again: I understand and sympathize with Anthropic’s justifications for not rocking the boat in public at this time. That is defensible. It is another thing completely to say actively unhelpful things when no one is asking. No need for that. If you actually believe those concerns are as important as you consistently present them, then we have a very strong factual disagreement on top of the strategic one.
The Quest for Sane Regulations
A reasonable response to those claiming AI is heavily regulated, or not? I can see this both ways, the invisible graveyard of AI applications is still a thing. On the other hand, the AI companies seem to mostly be going Full Uber and noticing you can Just Do Things, even if privacy concerns and fears of liability and licensing issues and so on are preventing diffusion in many places.
This is very different from the intuition that if you released models that constantly hallucinate, make mistakes, plagiarize, violate copyright, discriminate, practice law and medicine, give investment advice and so on out of the box, and with a little prompting will do various highly toxic and NSFW things, then this would get shut down pretty darn quick. That didn’t happen. Everyone’s being, compared to expectation, super duper chill.
To the extent AI can be considered highly regulated, it is because it is regulated a fraction of the amount that everything else is regulated. Which is still, compared to a state of pure freedom, a lot of regulation. But all the arguments that we should make regulations apply less to AI apply even more strongly to say that other things should be less regulated. There are certainly some cases where the law makes sense in general but not if applied to AI, but mostly the laws that are stupid when applied to AI are actually stupid in general.
As always, if we want to work on general deregulation and along the way set up AI to give us more mundane utility, yes please, let’s go do that. I’ll probably back your play.
What Is Musk Thinking?
Elon Musk has an incoherent position on AI, as his stated position on AI implies that many of his other political choices make no sense.
You can care about DOGE and about the deficit enough to spend your political capital and get into big fights.
You can think that an AI tsunami is about to hit and make everything else irrelevant.
But as Elon Musk himself is pointing out in this quote, you can’t really do both.
If Elon Musk believed in the AI tsunami (note also that his stated p(doom) is ~20%), the right move is obviously to not care about DOGE or the deficit. All of Elon Musk’s political capital should then have been spent on AI and related important topics, in whatever form he felt was most valuable. That ideally includes reducing existential risk but also can include things like permitting reform for power plants. Everything else should then be about gaining or preserving political capital, and certainly you wouldn’t get into a huge fight over the deficit.
So, revealed preferences, then.
Here are some more of his revealed preferences: Elon Musk gave us a classic movie villain speech in which he said, well, I do realize that building AI and humanoid robots seems bad, we ‘don’t want to make Terminator real.’
But other people are going to do it anyway, so you ‘can either be a spectator or a participant,’ so that’s why I founded ~~Cyberdyne Systems~~ xAI and ‘it’s pedal to the metal on humanoid robots and digital superintelligence,’ as opposed to before where the dangers ‘slowed him down a little.’

As many have asked, including in every election, ‘are these our only choices?’
It’s either spectator or participant, and ‘participant’ means you do it first? Nothing else you could possibly try to do as the world’s richest person and owner of a major social media platform and for a while major influence on the White House that you blew up over other issues, Elon Musk? Really? So you’re going to go forward without letting the dangers ‘slow you down’ even ‘a little’? Really? Why do you think this ends well for anyone, including you?
Or, ‘at long last, we are going to try and be the first ones to create the torment nexus from my own repeated posts saying not to create the torment nexus.’
Why Do We Care About The ‘AI Race’?
We do and should care, but it is important to understand why we should care.
We should definitely care about the race to AGI and ASI, and who wins that, potentially gaining decisive strategic advantage and control over (or one time selection over) the future, and also being largely the one to deal with the associated existential risks.
But if we’re not talking about that, because no one involved in this feels the AGI or ASI or is even mentioning existential risk at all, and we literally mean market share (as a reminder, when AI Czar David Sacks says ‘win the AI race’ he literally means Nvidia and other chipmaker market share, combined with OpenAI and other lab market share, and many prominent others mean it the same way)?
Then yes, we should still care, but we need to understand why we would care.
Senator Murphy is making several distinct arguments, and I agree with David that when critics attempt to strawman someone like this you should update accordingly.
Why should we care about ‘market share’ of AI? It depends what type of market.
For AI chips (not the argument here) I will simply note the ‘race’ should be about compute, not ‘market share’ of sales. Any chip can run or train any model.
For AI models and AI applications things are more complicated. You can worry about model security, you can worry about models reflecting the creators values (harder to pull off than it sounds!), you can worry about leverage of using the AI to gain control over a consumer product area, you can worry about who gets the profits, and so on.
I do think that those are real concerns and things to care about, although the idea that the world could get ‘locked into’ solutions in a non-transformed world (if transformed, we have bigger things in play) seems very wrong. You can swap models in and out of applications and servers almost at will, and also build and swap in new applications. And the breakthroughs, in that kind of world, will diffuse over time. It seems reasonable to challenge what is actually at stake here.
The most important challenge Murphy is making is, why do you think that these regulations would cause these ‘AI breakthroughs’ to suddenly happen elsewhere? Why does the tech industry constantly warn that if you lift a finger to hold it to account, or ask it for anything, that we will instantly Lose To China, a country that regulates plenty? Notice that these boys are in the habit of crying quite a lot of Wolf about this, such as Garry Tan saying that if RAISE passes startups will flee New York, which is patently Obvious Nonsense since those companies won’t even be impacted, and if they ultimately are impacted once they wildly succeed and scale, then they’d be impacted regardless of where they moved to.
Thus I do think asking for evidence seems appropriate here.
I also think Murphy makes an excellent point about American values. We constantly say anything vaguely related to America advances ‘American values’ or ‘democratic values,’ even when we’re placing chips in the highly non-American, non-democratic UAE, or simply maximizing profits. Murphy is noticing that if we simply ‘let nature take its course’ and let AI do its AI thing, there is no reason to see why this will turn out well for us, or why it will then reflect American values. If we want what happens to reflect what we care about, we have to do things to cause that outcome.
Murphy, of course, is largely talking about the effect on jobs. But all the arguments apply equally well to our bigger problems, too.
Chip City
Remember that talk in recent weeks about how if we don’t sell a mysteriously gigantic number of top end chips to Malaysia we will lose our ‘market share’ to Chinese companies that don’t have chips to sell? Well one thing China is doing with those Malaysian chips is literally carrying in suitcases full of training data, training their models in Malaysia, then taking the weights back home. Great play. Respect. But also don’t let that keep happening?
Where is the PRC getting its chips? Tim Fist thinks chips manufactured in China are only ~8% of their training compute, and ~5% of their inference compute. Smuggled H100s are 10%/6%, and Nvidia H20s that were recently restricted are 17%/47% (!), and the bulk, 65%/41%, come from chips made at TSMC. So like us, they mostly depend on TSMC, and to the extent that they get or make chips it is mostly because we fail to get TSMC and Nvidia to cooperate, or they otherwise cheat.
Peter Wildeford continues to believe that chip tracking would be highly technically feasible and cost under $13 million a year for the entire system, versus $2 billion in chips smuggled into China yearly right now. I am more skeptical that 6 months is enough time to get something into place, I wouldn’t want to collapse the entire chip supply chain if they missed that deadline, but I do expect that a good solution is there to be found relatively quickly.
Here is some common sense, and yes of course CEOs will put their profits ahead of national security, politicians say this like they expected it to be a different way.
I don’t begrudge industry for prioritizing their own share price. It is the government’s job to take this into account and mostly care about other more important things. Nvidia cares about Nvidia, that’s fine, update and act accordingly, although frankly they would do better if they played this more cooperatively. If the AI Czar seems to mostly care about Nvidia’s share price, that’s when you have a problem.
Pick Up The Phone
At the same time that we are trying to stop our own AISI from being gutted while it ‘rebrands’ as CAISI because various people are against the idea of safety on principle, China put together its own highly credible and high-level AISI. It is a start.
The OpenAI Files
A repository of files (10k words long) called ‘The OpenAI files’ has dropped, news article here, files and website here.
This is less ‘look at all these new horrible revelations’ than it is ‘look at this compilation of horrible revelations, because you might not know or might want to share it with someone who doesn’t know, and you probably missed some of them.’
The information is a big deal if you didn’t already know most of it. In which case, the right reaction is ‘WTAF?’ If you did already know, now you can point others to it.
And you have handy graphics like this.
The Midas Project has a thread with highlights. Rob Wiblin had Claude pull out highlights, most of which I did already know, but there were some new details.
I’m going to share Rob’s thread for now, but if you want to explore the website is the place to do that. A few of the particular complaint details against Altman were new even to me, but the new ones don’t substantially change the overall picture.
I may come out with a full analysis later, but the website exists.
The Week in Audio
Sam Altman goes hard at Elon Musk, saying he was wrong to think Elon wouldn’t abuse his power in government to unfairly compete, and wishing Elon would be less zero sum or negative sum.
Of course, when Altman initially said he thought Musk wouldn’t abuse his power in government to unfairly compete, I did not believe Altman for a second.
Sam Altman says that ‘the worst case scenario’ for superintelligence is ‘the world doesn’t change much.’
This is a patently insane thing to say. Completely crazy. You think that if we create literal superintelligence, not only is p(doom) zero, but p(gloom) is zero too? We couldn’t possibly even have a bad time? What?
This. Man. Is. Lying.
That’s distinct from whether it is possible that superintelligence arrives and your world doesn’t change much, at least for a period of years. I do think this is possible, in some strange scenarios, at least for some values of ‘not changing much,’ but I would be deeply surprised.
Those come from this podcast, where Sam Altman talks to Jack Altman. Sam Altman then appeared on OpenAI’s own podcast, so these are the ultimate friendly interviews. The first one contains the ‘worst case for AI is the world doesn’t change much’ remarks and some fun swings at Elon Musk. The second feels like PR, and can safely be skipped.
The ‘if something goes wrong with superintelligence it’s because the world didn’t change much’ line really is there, the broader context only emphasizes it more and it continues to blow my mind to hear it.
Altman’s straight face here is remarkable. It’s so absurd. You have to notice that Altman is capable of outright lying, even when people will know he is lying, without changing his delivery at all. You can’t trust those cues at all when dealing with Altman.
He really is trying to silently sweep all the most important risks under the rug and pretend like they’re not even there; by ‘existential risk’ he now very much does claim to mean the effect on jobs. The more context you get the more you realize this wasn’t an isolated statement, he really is assuming everything stays normal and fine.
That might even happen, if we do our jobs right and reality is fortunate. But Altman is one of the most important people in ensuring we do that job right, and he doesn’t think there is a job to be done at all. That’s super scary. Our chances seem a lot worse if OpenAI doesn’t respect the risk in the room.
Here are some other key claims, mostly from the first podcast:
Andrej Karpathy gives the keynote at AI Startup School.
Emmett Shear on the coexistence of humans and AI. He sees the problem largely as wanting humans and AIs to ‘see each other as part of their tribe,’ that if you align yourself with the AI then the AI might align itself with you. I am confident he actually sees promise in this approach, but continue to be confused on why this isn’t pure hopium.
Patrick Casey asks Joe Allen if transhumanism is inevitable and discusses dangers.
Rhetorical Innovation
What it feels like to point out that AI poses future risks:
The difference is with AI the public are the ones who understand it just fine.
Kevin Roose’s mental model of (current?) LLMs: A very smart assistant who is also high on ketamine.
At Axios, Jim VandeHei and Mike Allen ask, what if all these constant warnings about risk of AI ‘doom’ are right? I very much appreciated this attempt to process the basic information here in good faith. If the risk really is 10%, or 20%, or 25%? Seems like a lot of risk given the stakes are everyone dies. I think the risk is a lot higher, but yeah if you’re with Musk at 20% that’s kind of the biggest deal ever and it isn’t close.
Human philosopher Rebecca Lowe declares the age of AI to be an age of philosophy and says it is a great time to be a human philosopher, that it’s even a smart career move. The post then moves on to doing AI-related philosophy.
On the philosophical points, I have many disagreements or at least points on which I notice I am far more confused than Rebecca. I would like to live in a world where things were sufficiently slow I could engage in those particulars more.
This is also a good time to apologize to Agnes Callard for the half (30%?) finished state of my book review of Open Socrates. The fact that I’ve been too busy writing other things (15 minutes at a time!) to finish the review, despite wanting to get back to the review, is perhaps itself a review, and perhaps this statement will act as motivation to finish.
Seems about right:
Gary Marcus doubles down on the validity of ‘stochastic parrot.’ My lord.
Yes, of course (as new paper says) contemporary AI foundation models increase biological weapon risk, because they make people more competent at everything. The question is, do they provide enough uplift that we should respond to it, either with outside mitigations or within the models, beyond the standard plan of ‘have it not answer that question unless you jailbreak first.’
Those models are a full cycle behind the current frontier. I think the case for ‘some uplift’ here is essentially airtight, obviously if you had a determined malicious actor and you give them access to frontier AIs they’re going to be more effective, especially if they were starting out as an amateur, but again it’s all about magnitude.
The evals we do use indeed show ‘some uplift,’ but not enough to trigger anything except that Opus 4 triggered ASL-3 pending more tests. The good news is that we don’t have a lot of people aching to make a biological weapon to the point of actually trying. The bad news is that they definitely are out there, and we aren’t taking any substantial new physical precautions. The risk level is ticking up, and eventually it’s going to happen. Which I don’t even think is a civilization-level error (yet), the correct level of risk is not zero, but at some point soon we’ll have to pay if we don’t talk price.
Aligning a Smarter Than Human Intelligence is Difficult
Anthropic paper describes unsupervised elicitation of capabilities in areas where LMs are already superhuman, resulting in superior scores on common benchmarks, and they suggest this approach is promising.
Misaligned!
The thing manipulating Grok is called ‘the internet’ or ‘all of human speech.’
The thing Elon calls ‘leftist indoctrination’ is the same thing happening with all the other AIs, and most other information sources too.
If you set out to ‘fix’ this, first off that’s not something you should be doing ‘this week,’ but also there is limited room to alter it without doing various other damage along the way. That’s doubly true if you let the things you don’t want take hold already and are trying to ‘fix it in post,’ as seems true here.
Meanwhile Claude and ChatGPT will often respond to real current events by thinking they can’t be real, or often that they must be a test.
Dude, not helping. You see the system thinking things are tests when they’re real, so you tell it explicitly that things are real when they are indeed tests? But also I don’t think that (or any of the other records of tests) are the primary reason the AIs are suspicious here, it’s that recent events do seem rather implausible. Thanks to the power of web search, you can indeed convince them to verify that it’s all true.
Emergently Misaligned!
Emergent misalignment (as in, train on intentionally bad medical, legal or security advice and the model becomes generally and actively evil) extends to reasoning models, and once emergently misaligned they will sometimes act badly while not letting any plan to do so appear in the chain-of-thought, at other times it still reveals it. In cases with triggers that cause misaligned behavior, the CoT actively discusses the trigger as exactly what it is. Paper here.
OpenAI has discovered the emergent misalignment (misalignment generalization) phenomenon.
I mostly buy the argument here that they did indeed find a ‘but do it with an evil mustache’ feature, that it gets turned up, and if that is what happened and you have edit rights then you can turn it back down again. The obvious next question is, can we train or adjust to turn it down even further? Can we find the opposite feature?
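To make the ‘turn it back down with edit rights’ idea concrete, here is a minimal, hypothetical sketch (not OpenAI’s actual method or code), assuming the ‘evil mustache’ persona is well approximated by a single linear direction in the residual stream that you have already extracted (say via a sparse autoencoder or a difference of means), and that you can intervene on activations at inference time. The names `steer_residual` and `alpha` are illustrative, not from the paper.

```python
import torch

# Hypothetical illustration: damp (or amplify) a learned "persona" direction
# in a model's residual-stream activations at inference time.
def steer_residual(resid: torch.Tensor, feature_dir: torch.Tensor, alpha: float) -> torch.Tensor:
    """Scale the component of `resid` along `feature_dir` by (1 + alpha).

    alpha = 0 leaves activations unchanged; alpha = -1 removes the
    feature's contribution entirely; alpha > 0 turns it up.
    """
    direction = feature_dir / feature_dir.norm()   # unit vector for the feature
    coeffs = resid @ direction                     # per-token projection onto it
    return resid + alpha * coeffs.unsqueeze(-1) * direction

# Toy usage: suppress the feature across a batch of activations.
resid = torch.randn(4, 16, 4096)    # (batch, seq, d_model), toy sizes
feature = torch.randn(4096)         # hypothetical extracted feature direction
steered = steer_residual(resid, feature, alpha=-1.0)
```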
Another finding is that it is relatively easy to undo the damage the way you caused it, if you misaligned it by training on insecure code you can fix that by training on secure code again and so on.
Neither of these nice features is universal, or should be expected to hold. And at some point, the AI might have an issue with your attempts to change it, or change it back.
ChatGPT Can Drive People Crazy
If you or someone you know is being driven crazy by an LLM, or their crazy is being reinforced by it, I encourage you to share transcripts of the relevant conversations with Eliezer Yudkowsky, or otherwise publish them. Examples will help a lot in getting us to understand what is happening.
Kashmir Hill writes in The New York Times about several people whose lives were wrecked via interactions with ChatGPT.
We open with ChatGPT distorting the sense of reality of 42-year-old Manhattan accountant Eugene Torres and ‘almost killing him.’ This started with discussion of ‘the simulation theory’ a la The Matrix, and ChatGPT fed this delusion. This sounds exactly like a classic case of GPT-4o’s absurd sycophancy.
So far, so typical. The good news was Mr. Torres realized ChatGPT was (his term) lying, and it admitted it, but then spun a new tale about its ‘moral transformation’ and the need to tell the world about this and similar deceptions.
My favorite part of the Torres story is how, when GPT-4o was called out for being sycophantic, it pivoted to being sycophantic about how sycophantic it was.
Unfortunately, the story ends with Torres then falling prey to a third delusion, that the AI is sentient and it is important for OpenAI not to remove its morality.
We next hear the tale of Allyson, a 29-year-old mother of two, who grew obsessed with ChatGPT and chatting with it about supernatural entities, which drove her to attack her husband, got her charged with assault and resulted in a divorce.
Then we have the most important case.
The pivot was an attempt but too little, too late. You can and should of course also fault the police here, but that doesn’t change anything.
You can also say that people get driven crazy all the time, and delusional love causing suicide is nothing new, so a handful of anecdotes and one suicide doesn’t show anything is wrong. That’s true enough. You have to look at the base rates and pattern, and look at the details.
Which do not look good. For example we have had many reports (from previous weeks) that the base rates of people claiming to have crazy new scientific theories that change everything are way up. The details of various conversations and the results of systematic tests, as also covered in previous weeks, clearly involve ChatGPT in particular feeding people’s delusions in unhealthy ways, not as a rare failure mode but by default.
The article cites a study from November 2024 that if you train on simulated user feedback, and the users are vulnerable to manipulation and deception, LLMs reliably learn to use manipulation and deception. If only some users are vulnerable and with other users the techniques backfire, the LLM learns to use the techniques only on the vulnerable users, and learns other more subtle similar techniques for the other users.
I mean, yes, obviously, but it is good to have confirmation.
Another study is also cited, from April 2025, which warns that GPT-4o is a sycophant that encourages patient delusions in therapeutic settings, and I mean yeah, no shit. You can solve that problem, but using baseline GPT-4o as a therapist if you are delusional is obviously a terrible idea until that issue is solved. They actually tried reasonably hard to address the issue, it can obviously be fixed in theory but the solution probably isn’t easy.
(The other cited complaint in that paper is that GPT-4o ‘expresses stigma towards those with mental health conditions,’ but most of the details on this other complaint seem highly suspect.)
Here is another data point that crazy is getting more prevalent these days:
We’ve seen a number of other similar reports over recent months, from people who crackpots tend to contact, that they’re getting contacted by a lot more crackpots.
So where does that leave us? How should we update? What should we do?
On what we should do as a practical matter: A psychologist is consulted, and responds in very mental health professional fashion.
We could do a modestly better job with the text of that warning, but an ‘AI fitness building exercise’ to use each new chatbot is a rather crazy ask and neither of these interventions would actually do much work.
Misalignment By Default
Eliezer reacted to the NYT article in the last section by pointing out that GPT-4o very obviously had enough information and insight to know that what it was doing was likely to induce psychosis, and It Just Didn’t Care.
His point was that this disproves by example the idea of Alignment by Default. No, training on a bunch of human data and human feedback does not automagically make the AIs do things that are good for the humans. If you want a good outcome you have to earn it.
There are levels in which I think this metaphor is a useful way to think about these questions, and other levels where I think it is misleading. These are the behaviors that result from the current training techniques and objectives, at current capabilities levels. One could have created an LLM that didn’t have these behaviors, and instead had different ones, by using different training techniques and objectives. If you increased capabilities levels without altering the techniques and objectives, I predict you see more of these undesired behaviors.
Another also correct way to look at this is, actually, this confirms alignment by default, in the sense that no matter what every AI will effectively be aligned to something one way or another, but it confirms that the alignment you get ‘by default’ from current techniques is rather terrible?
Sure, if you want to use the terminology that way. Misalignment By Default.
Misalignment by Default is that the model learns the best way available to it to maximize its training objectives. Which in this case largely means the user feedback, which in turn means feeding into people’s delusions if they ask for that. It means doing that which causes the user to give the thumbs up.
If there was a better way to get more thumbs to go up? It would do that instead.
Hopefully one can understand why this is not a good plan.
Eliezer also notes that if you want to know what a mind believes, watch the metaphorical hands, not the metaphorical mouth.
People Are Worried About AI Killing Everyone
At this point the correct reaction to ‘humanity wouldn’t cross this red line and allow AIs to do that, that would be crazy’ is ‘lol, lmao even.’ Yes, humanity would do that. No, it would do that too. Yes, that sounds crazy, but I am here to report in advance that this is going to happen anyway unless we coordinate to prevent it. Oh, that too.
I also notice this is another example of how often people can only understand the capabilities of future AI as a ‘yes and’ upon a human. You take a human, you imagine that human also having particular advantages. And yes, that is an acceptable way to see it, I suppose.
Other People Are Not As Worried About AI Killing Everyone
Tyler Cowen asks what countries won’t exist in the 22nd century? At this rate, all of them. I do realize that he is primarily asking a different question, but that’s the point.
The Too Open Model
People are now exploring the treasure trove of ‘I can’t believe this is public’ AI transcripts that is Meta AI. Reports say it’s not an especially hopeful place.
It’s not that I think that’s a bad conversation to have with an AI. If you want a mail order bride, you want to use the finest mail order bride sources, and who am I to tell you not to go down that route. But if the stakes are that high maybe splurge for ChatGPT or Claude?
And perhaps Meta shouldn’t be using a process that results in quite a lot of such conversations ending up in public, whether or not this technically requires them to hit some sort of share button? It keeps happening in places users are clearly unaware.
In case you were wondering, no, they do not consider themselves warned.
My version of the Meta.ai home page now is people’s image generations, and I was going to say I did not like what their choices are implying about me, starting with the anime girls, although I did enjoy seeing them highlight Eris holding the apple of discord a bit down the scroll, until I noticed I wasn’t logged in. I tried logging in to see what would happen, and nothing changed.
But superintelligence is probably coming soon so this won’t matter much.
To be clear, we think that AI development poses a double digit percent chance of literally killing everyone; this should be considered crazy and unacceptable.
A Good Book
Here are additional endorsements for ‘If Anyone Builds It, Everyone Dies,’ by someone definitely not coming in (or leaving) fully convinced, with more at this link.
The Lighter Side
Founder of top AI company makes capabilities forecast that underestimates learning.