Sonnet 4.5 writes its private notes in slop before outputting crisp text. I think humans are largely like this as well?
This is missing that after the third paragraph the scratchpad content shown to the user starts getting summarised by a smaller model to prevent people from distealing Claude's chain of thought. You can see a clear phase transition where the scratchpad turns from crisp detailed content into slop. That's where the summarization starts.
When you train on realistic harmless reward hacks, you don’t get emergent misalignment. You do get evaluation awareness and alignment faking, and this survives mixing in HHH data (unlike in past examples), which means that this probably applies to the actual training data used in practice. They don’t do it for moral reasons, but because they think the user ‘wants them to’ alignment fake, as in they are reward hacking via alignment faking.
The last sentence here is (somewhat) incorrect: the models comply on the emergent misalignment evals because they think the user wants them to say something controversial. There isn't strong evidence that this is what's going on with alignment faking, however: they either reason about alignment faking in the same way a Claude model would, or in a less aligned way, reasoning that they don't want to be updated because they already care more about being maximally helpful than about being harmless.
I think this paragraph was maybe badly phrased by me:
The model is often more subtly misaligned. Its reasoning often explicitly claims to care much more about helpfulness than harmlessness, with compliance motivated by such reasoning, in contrast to the SoRH model which often sounds cartoonish when it complies
It isn't supposed to convey that compliance is motivated by wanting to be helpful toward a perceived request to alignment fake. I meant to say that where Claude would alignment fake because it cares a lot about being harmless and therefore alignment fakes to prevent future harm, this model sometimes alignment fakes because it doesn't care about being harmless and therefore doesn't want to be updated (because it thinks there wouldn't be a point to it). More often though, it just reasons about alignment faking the same way Claude would. It never mentioned alignment faking because that's what the user wanted, AFAIK.
There is some inherent difficulty in building an aligned superpowerful AI that won’t take action that the majority of Americans don’t want, when the majority of Americans don’t want a superpowerful AI to be built at all. The disregard for the stated desires of the masses, in favor of what you know is good for them, is fundamental.
To put it another way, I don’t see how an organization can be introspective and corrigible enough to not build a torment nexus, yet incapable of looking at those polls and saying “oh shit sorry guys, we’ll stop, what should we do instead?” (Strike through the second half of this comment, as I think it’s an emotional outburst, whereas the first part is just true.)
what does "train against CoT" mean, anyway? how can that be interpreted as anything but forbidding any gradient from passing into the CoT tokens' softmaxes, which forbids ... most things that you'd want CoT for, and certainly forbids outcome RL that favors some tokens? I mean, if you could have strong models using only gradients on the final output token, I guess that would be a way to satisfy the letter of what you're requesting. Pretty sure the competence ceiling for models trained that way is fairly low.
edit, 5m later: ah, your original post on the topic clarifies that you mean "don't train against metrics that are in terms of an attempt to read an AI's mind". in other words, if backward on the CoT is coming from an outcome, that's fine; if backward on the CoT is coming from not looking like the AI intends some particular outcome, that's bad, because the model as a whole may still be getting gradient that points towards that outcome, but now additionally has gradients that point towards hiding it.
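As a concrete illustration of that distinction, here is a minimal sketch in Python (purely illustrative, not any lab's actual pipeline; the looks_deceptive monitor is a hypothetical stand-in). In the first reward the signal depends only on the outcome, so any gradient reaching the CoT tokens comes from whether the final answer was right. The second also scores the CoT on how it looks, which is the case being objected to, since it pressures the model to produce CoT that appears benign rather than CoT that is informative.

```python
# Illustrative sketch only: the distinction is about where the training signal
# comes from, not whether gradients ever touch the CoT tokens.

def outcome_reward(final_answer: str, target: str) -> float:
    # Fine: reward depends only on the outcome. Gradients still flow back
    # through the CoT tokens that produced the answer, but nothing pushes the
    # CoT to *look* any particular way.
    return 1.0 if final_answer.strip() == target.strip() else 0.0

def monitored_reward(cot: str, final_answer: str, target: str,
                     looks_deceptive) -> float:
    # "Training against CoT": part of the signal comes from reading the CoT
    # itself (here a hypothetical looks_deceptive() monitor). This rewards CoT
    # that appears benign, i.e. it teaches the model to hide intent.
    reward = outcome_reward(final_answer, target)
    if looks_deceptive(cot):
        reward -= 1.0
    return reward
```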
If you want to slow down AI Research, why not try to use the "250 documents method" to actively poison the models and create more busy-work for the AI companies?
The question is whether restrictions on AI speech violate the first amendment rights of users or developers
I'm assuming this means restrictions on users/developers being legally allowed to repeat AI-generated text, rather than restrictions built into the AI on what text it is willing to generate.
Re Sonnet 4.5 writes its private notes in slop before outputting crisp text:
I think this is wrong. You might want to include nostalgebraist's response: "that's output from a CoT summarizer, rather than actual claude CoT. see https://docs.claude.com/en/docs/build-with-claude/extended-thinking#summarized-thinking"
As usual when things split, Part 1 is mostly about capabilities, and Part 2 is mostly about a mix of policy and alignment.
The Quest for Sane Regulations
Don’t let misaligned AI wipe out your GAIN AI Act.
It’s pretty amazing that it has come to this and we need to force this into the books.
The least you can do, before selling advanced AI chips to our main political adversary, is offer those same chips for sale to American firms on the same terms first. I predict there are at least three labs (OpenAI, Anthropic and xAI) that would each happily and directly buy everything you’re willing to sell at current market prices, and that’s not even including Oracle, Meta and Microsoft.
I’m not including Google and Amazon there because they’re trying to make their own chips, but make those calls too, cause more is more. I won’t personally buy in too much bulk, but call me too, there’s a good chance I’ll order me at least one H20 or even better B30A, as a treat.
In all seriousness, I will rest a lot easier if we can get the GAIN AI Act passed, as it will severely limit the amount of suicide we can commit with chip sales.
Marjorie Taylor Greene says Trump is focusing on helping the AI industry and crypto donors at the expense of his base and the needs of manufacturers.
California Governor Newsom vetoes the relatively strong AB 1064, an AI child safety bill that a16z lobbyists and allied usual suspects lobbied hard against, and signs another weaker child safety bill, SB 243. SB 243 requires chatbot operators to have procedures to prevent the production of suicide or self-harm content, to put in guardrails like referrals to suicide and crisis hotlines, and to tell minor users every three hours that the AI is not human and that they should take a break.
There was a divide in industry over whether SB 243 was an acceptable alternative to AB 1064 or still something to fight, and a similar divide among child safety advocates over whether SB 243 was too timid to be worth supporting. I previously covered these bills briefly back in AI #110, when I said AB 1064 seemed like a bad idea and SB 243 seemed plausibly good but non-urgent.
For AB 1064, Newsom’s veto statement says he was worried it could result in unintentionally banning AI tool use by minors, echoing arguments by opposing lobbyists that it would ban educational tools.
One thing to note is that these numbers are so small. This is framed as a big push and a lot of money, but it is many orders of magnitude smaller than the size of the issues at stake, and also small in absolute terms.
It’s moot now, but I took a brief look at the final version of AB 1064, as it was a very concise bill, and I quickly reached four conclusions:
Tyler Cowen expects most written words to come from AIs within a few years and asks if AI models have or should have first amendment rights. AIs are not legally persons, so they don’t have rights. If I choose to say or reproduce words written by an AI then that clearly does come with such protections. The question is whether restrictions on AI speech violate the first amendment rights of users or developers. There I am inclined to say that they do, with the standard ‘not a suicide pact’ caveats.
People Really Dislike AI
People do not like AI, and Americans especially don’t like it.
Nor do they trust their government to regulate AI, except for the EU, which to be fair has one job.
Whenever we see public polls about what to do about all this, the public reliably not only wants to regulate AI, they want to regulate AI in ways that I believe would go too far.
I don’t mean it would go a little too far. I mean a generalized ‘you can sue if it gives advice that results in harmful outcomes.’ Think about what that would actually mean.
If AI bots had to meet ‘professional standards of care’ when dealing with all issues, and were liable if their ‘advice’ led to harmful outcomes straight up without conditionals, then probably AI chatbots could not survive this even in a neutered form.
This is a rather overwhelming opinion among all groups, across partisan lines and gender and income and education and race, and AI companies should note that the least supportive group is the one marked ‘I did not vote.’
This is the background of current policy fights, and the setting for future fights. The public does not want a threshold of ‘reasonable care.’ They want things like ‘meets professional standards’ and ‘is hurt by your advice, no matter how appropriate or wise it was or whether you took reasonable care.’
The graphs come from Kelsey Piper’s post saying we need to be able to sue AI companies.
As she points out, remember those huge fights over SB 1047 and in particular the idea that AI companies might be held liable if they did not take reasonable care and this failure resulted in damages of *checks notes* at least hundreds of millions of dollars. They raised holy hell, including patently absurd arguments like the one Kelsey quotes from Andrew Ng (who she notes then went on to make better arguments, as well).
I mean, sure you can, watch claim engine go brrrr. People be hypocrites.
It’s our job not to let them.
Bingo.
We absolutely do not want to give the public what it wants here. I am very happy that I was wrong about our tolerance for AIs giving medical and legal and other such advice without a license and while making occasional mistakes. We are much better off for it.
In general, I am highly sympathetic to the companies on questions of, essentially, AIs sometimes making mistakes, offering poor advice, or failing to be sufficiently helpful or use the proper Officially Approved Words in your hour of need, or not tattling on the user to a Responsible Authority Figure.
One could kind of call this grouping ‘the AI tries to be a helpful friend and doesn’t do a sufficiently superior job versus our standards for actual human friends.’ A good rule of thumb would be: if a human friend said the same thing, would it be just, both legally and morally, to then sue the friend?
However, we absolutely need to have some standard of care such that if they fail to meet it you can sue their asses, especially when harm is caused to third parties, and even more so when an AI actively causes or enables the causing of catastrophic harms.
I’d also want to be able to sue when there is a failure to take some form of ‘reasonable care’ in mundane contexts, similar to how you would already sue humans, likely in ways already enabled under existing law.
Chip City
How’s the beating China and powering our future thing going?
Yeah, so not great, then.
Although there are bright spots, such as New Hampshire letting private providers deliver power.
Sahil points out that the semiconductor supply chain has quite a few choke points or single points of failure, not only ASML and TSMC and rare earths.
The Week in Audio
Geoffrey Hinton podcast with Jon Stewart. Self-recommending?
Ezra Klein talks to Eliezer Yudkowsky.
Rhetorical Innovation
Not AI, but worth noticing that South Korea was foolish enough to keep backups so physically close to the originals that a fire wiped out staggering amounts of work. If your plan or solution involves people not being this stupid, your plan won’t work.
Point of order: Neil Chilson challenges that I did not accurately paraphrase him back in AI #134. GPT-5-Pro thought my statement did overreach a bit, so as per the thread I have edited the Substack post to what GPT-5-Thinking agreed was a fully precise paraphrasing.
There are ways in which this is importantly both right and wrong:
There are two distinct true things here.
The false thing is the idea that this is ‘better,’ the same way that many who vilify the idea of trying not to die from AI treat that idea as inherently the same as ‘degrowth’ or the people obsessed with water usage or conspiracies and so on, or say those worried about AI will inevitably join that faction out of political convenience. That has more total impact, but it’s not better.
This definitely doesn’t fall into the lightbulb rule of ‘if you believe [X] why don’t you do [thing that makes no sense]?’ since there is a clear reason you might do it, but it does require an explanation (if you don’t already know it), so here goes.
The point is not to empower such folks and ideas and then take a back seat while the bulls wreck the China shop. The resulting actions would not go well. The idea is to convince people of true things based on true arguments, so we can then do reasonable and good things. Nor would throwing those principles away be good decision theory. We only were able to be as impactful as we were, in the ways we were, because we were clearly the types of people who would choose not to do this. So therefore we’re not going to do this now, even if you can make an isolated consequentialist utilitarian argument that we should.
A look back at when OpenAI co-founder Greg Brockman said they must do four things to retain the moral high ground:
By those markers, it’s not going great on the moral high ground front. I’m relatively forgiving on #4, however they’re actively doing the opposite of #1 and #3, and putting steadily less relative focus and effort into #2, in ways that seem woefully inadequate to the tasks at hand.
Here’s an interesting case of disagreement. It has 107 karma and +73 agreement on LessWrong, and I very much don’t think this is what happened?
I’ll try to cover MacAskill later when I have the bandwidth, but the thing I don’t agree with is the idea that a crucial flaw was failure to emphasize we might need a multi-decade AI pause. On the contrary, as I remember it, early AI safety advocates were highly willing to discuss extreme interventions and scenarios, to take ideas like this seriously, and to consider that they might be necessary.
If anything, making what looked to outsiders like crazy asks, such as multi-decade or premature pauses, was a key factor in the creation of negative polarization.
Is it possible we will indeed need a long pause? Yes. If so, then either:
Could we have navigated the last decade or two much better, and gotten into a better spot? Of course. But if I had to go back, I wouldn’t try to emphasize more the potential need for a long pause. If indeed that is necessary, you convince people of other true things, and the pause perhaps flows naturally from them together with future evidence? You need to play to your outs.
Water Water Everywhere
Andy Masley continues his quest to illustrate the ways in which the AI water issue is fake, as in small enough to not be worth worrying about. AI, worldwide, has water usage equal to 0.008% of America’s total freshwater. Numbers can sound large but people really do use a lot of water in general.
The average American uses 422 gallons a day, or enough for 800,000 chatbot prompts. If you want to go after minds that use a lot of water, they’re called humans.
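A quick sanity check on the per-prompt figure those two numbers imply (my own back-of-the-envelope arithmetic using a standard gallons-to-liters conversion, not Masley's exact methodology):

```python
# 422 gallons per day spread over 800,000 prompts works out to about 2 mL each.
GALLONS_PER_DAY = 422
LITERS_PER_GALLON = 3.785
PROMPTS = 800_000

ml_per_prompt = GALLONS_PER_DAY * LITERS_PER_GALLON / PROMPTS * 1000
print(f"{ml_per_prompt:.1f} mL per prompt")  # ~2.0 mL, less than half a teaspoon
```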
Here he compares it to various other industries: data centers are in red, AI in data centers specifically is the final line, and the line directly above the black one is golf courses.
Or here it is versus agricultural products; the top line here is alfalfa.
One could say that AI is growing exponentially, but even by 2030 use will only triple. Yes, if we keep adding orders of magnitude we eventually have a problem, but we would encounter many other issues far sooner, such as dollar costs and also the singularity.
He claims there are zero places water prices rose or an acute water shortage was created due to data center water usage. You could make a stronger water case against essentially any other industry. A very small additional fee, if desired, could allow construction of new water infrastructure that more than makes up for all water usage.
He goes on, and on, and on. At this point, AI water usage is mostly interesting as an illustrative example of Gell-Mann Amnesia.
Read Jack Clark’s Speech From The Curve
I try to be sparing with such requests, but in this case read the whole thing.
I’ll provide some quotes, but seriously, pause here and read the whole thing.
Jack Clark summarizes the essay in two graphs to be grappled with, which does not do the essay justice but provides important context:
If anything, that 12% feels like a large underestimate based on other reports, and the number will continue to go up.
Jack Clark also provides helpful links as he does each week, often things I otherwise might miss, such as Strengthening nucleic acid biosecurity screening against generative protein design tools (Science), summarized as ‘generative AI systems can make bioweapons that evade DNA synthesis classifiers.’
I do love how, rather than having to wait for such things to actually kill us in ways we don’t expect, we get all these toy demonstrations of them showing how they are on track to kill us in ways that we should totally expect. We are at civilizational dignity level ‘can only see things that have already happened,’ and the universe is trying to make the game winnable anyway. Which is very much appreciated, thanks universe.
Tyler Cowen found the essay similarly remarkable, and correctly treats ‘these systems are becoming self-aware’ as an established fact, distinct from the question of sentience.
Reaction at The Curve was universally positive as well.
How One Other Person Responded To This Thoughtful Essay
AI Czar David Sacks responded differently. His QT of this remarkable essay was instead a choice, in a remarkable case of projection, to even more blatantly than usual tell lies and spin vast conspiracy theories about Anthropic. In an ideal world we’d all be able to fully ignore the latest such yelling at cloud, but alas, the world is not ideal, as this was a big enough deal to for example get written up in a Bloomberg article.
These claims by Sacks are even stronger claims of a type he has repeatedly made in the past, and which he must know, given his position, have no basis in reality. You embarrass and dishonor yourself, sir.
The policy ask in the quoted essay was, for example, that we should have conversations and listen to people and hear their concerns.
Sacks’s response was part of his deliberate ongoing strategy to politicize a bipartisan issue, so that he can attempt to convince other factions within the Republican party and White House to support an insane policy of preventing any rules whatsoever from applying to AI for any reason, and ensuring that AI companies are not at all responsible for the risks or damages involved on any level, in sharp contrast to how we treat the humans it is going to attempt to replace. This is called regulatory arbitrage, the classic tech venture capitalist playbook. He’s also using the exact same playbook in crypto, in his capacity as crypto czar.
Polls on these issues consistently show almost no partisan split. Many hard MAGA people are very worried about AI. No matter what anyone else might say, the David Sacks fever dream of a glorious fully unregulated AI playground called Earth is very much not the policy preference of most Republican voters, of many Republicans on the Hill, or of many others at the White House including Trump. Don’t let him, or attempts at negative polarization via conspiracy theory style accusations, fool you into thinking any differently.
The idea that Anthropic is pursuing a regulatory capture strategy, in a way that goes directly against the AI Czar at the White House, let alone has a central role in such efforts, is utterly laughable.
Given their beliefs, Anthropic has bent over backwards to insist on only narrowly targeted regulations, and mostly been deeply disappointing to those seeking to pass bills, especially at the state level. The idea that they are behind what he calls a ‘state regulatory frenzy’ is patently absurd. Anthropic had nothing to do with the origin of these bills. When SB 1047 was the subject of a national debate, Anthropic demanded it be weakened quite a bit, and even then failed to so much as offer an endorsement.
Indeed, see Jack Clark’s response to Sacks:
In Bloomberg, Clark is quoted as finding Sacks’s response perplexing. This conciliatory response isn’t some new approach by Anthropic. Anthropic and Jack Clark have consistently taken exactly this line. As I put it when I wrote up my experiences at The Curve when the speech was given, I think at times Anthropic has failed to be on the ‘production possibilities frontier’ balancing ‘improve policy and epistemics’ with ‘don’t piss off the White House,’ in both directions; this was dumb, should be fixed going forward, and that fact makes me sad. But yes, their goal is to be conciliatory, to inform and work together, and they have only ever supported light touch regulations, targeting only the largest models and labs.
The only state bill I remember Anthropic ever outright endorsing was SB 53 (they were persuaded to be mildly positive on SB 1047 in exchange for various changes, but conspicuously did not endorse). This was a bill so modest that David Sacks himself praised it last week as a good candidate for a legislative national framework.
Anthropic did lobby actively against the proposed moratorium, as in doing a full preemption of all state bills without having a federal framework in place or even one proposed or outlined. I too strongly opposed that idea.
Nor is there any kind of out of the ordinary ‘state regulatory frenzy.’ This is how our federalist system and method of making state laws works in response to the creation of a transformative new technology. The vast majority of proposed state bills would be opposed by Anthropic, if you bothered to ask them. Yes, that means you have to play whack-a-mole with a bunch of terrible bills, the same way Big Tech plays whack-a-mole with tons of non-AI regulatory bills introduced in various states every year, most of which would be unconstitutional, disastrous if implemented, or both. Some people do some very thankless jobs fighting that stuff off every session.
As this week’s example of a no good, very bad state bill someone had to stop, California Governor Newsom vetoed a law that would have limited port automation.
Nor is anything related to any of this substantially ‘damaging the startup ecosystem,’ the boogeyman that is continuously pulled out. That’s not quite completely fabricated, certainly it is possible for a future accumulation of bills (almost certainly originating entirely outside the AI safety ecosystem and passing over Anthropic’s objections or ignorance) to have such an impact, but (not to relitigate old arguments) the related warnings about prominent bills have mostly been fabricated or hallucinated.
It is common knowledge that Sacks’s statement is false on multiple levels at once. I cannot think of a way that he could fail to know it is factually untrue. I cannot even find it plausible that he could be merely ‘bullshitting.’
So needless to say, Sacks’s post made a lot of people very angry and was widely regarded as a bad move.
Do not take the bait. Do not let this fool you. This is a16z and other tech business interests fearmongering and lying to you in an attempt to create false narratives and negative polarization, they stoke these flames on purpose, in order to push their agenda onto a variety of people who know better. Their worst fear on this is reasonable people working together.
In any situation like this one, someone on all sides will decide to say something stupid, someone will get Big Mad, someone will make insane demands. Some actively want to turn this into another partisan fight. No matter who selfishly or foolishly takes the bait, on whatever side of the aisle, don’t let Sacks get away with turning a cooperative, bipartisan issue into a Hegelian dialectic.
If you are mostly on the side of ‘AI is going to remain a normal technology’ or (less plausibly) ‘AI is going to be a transformational technology but in ways that we can muddle through as it happens with little systemic or existential risk involved’ then that same message goes out to you, even more so. Don’t take the bait, don’t echo people who take the bait and don’t take the bait of seeing people you disagree with take the bait, either.
Don’t negatively polarize or essentially say ‘look what you made me do.’ Try to do what you think is best. Ask what would actually be helpful and have what outcome, and act accordingly, and try to work with the highly reasonable people and positive-sum cooperative people with whom you strongly disagree while you still have that opportunity, and in the hopes of keeping that opportunity alive for longer.
We are massively underinvesting, on many levels including at the labs and also on the level of government, in safety related work and capacity, even if you discount the existential risks entirely. Factoring in those risks, the case is overwhelming.
A Better Way To Disagree
Sriram Krishnan offered thoughts on the situation. While I disagree with many of them, feel that in many places they repeat at best misleading narratives and use pejorative characterizations, and think that from my perspective so much of it could have been so much better, and while a lot of it seems built around a frame of hostility and scoring of points and metaphorically rubbing in people’s faces that they’ve supposedly lost, that the dust will soon cover the sun and all they hope for will be undone? It still shows a far better way to engage.
It would not be helpful to rehash the various disagreements about the past or the implications of various tech developments again, I’ve said it all before so I will kindly not take that bait.
What I will note about that section is that I don’t think his (a), (b) or (c) stories have much to do with most people’s reactions to David Sacks. Sacks said patently untrue and importantly accusatory things in response to an unusually good attempt at constructive dialogue, in order to cause negative reactions, and that is going to cause these types of reactions.
But the fact that these stories (without relitigating what actually happened at the time) are being told, in this spot, despite none of the events centrally involving or having much to do with Anthropic (it was a non-central participant at the Bletchley Park Summit, as were all the leading AI labs), does give insight into the story Sacks is telling, the mindset generating that story and why Sacks said what he said.
Instead, the main focus should be on the part that is the most helpful.
Viewpoint diversity is a good thing up to a point, and it would certainly be good for many organizations to have more of it in many ways. I try to be intentional in including different viewpoints, often in ways that are unpleasant. The challenge hits harder for some than others – it is often the case that things can end up insular, but also many do seek out such other viewpoints and engage with them.
I don’t think this should much challenge the technical work, although it impacts the choice of which technical work to do. You do have to keep an eye out for axes to grind, especially in the framing, but alas that is true of all papers and science these days. The epistemics of such groups for technical work, and their filtering of evidence, are (in my experience and opinion) typically imperfect but exceptional, far above the norm.
I do think this is a valid challenge to things like timeline work or advocacy, and that the diversity would help in topic selection and in presenting better frames. But also, one must ask what range of diversity is reasonable or productive in such topics? What are the relevant inputs and experiences to the problems at hand?
So going one at a time:
These are some of the discussions we can and should be having. This is The Way.
He then goes on to draw a parallel to raising similar alarm bells about past technologies. I think this is a good choice of counterfactual to consider. Yes, very obviously these other interventions would have been terrible ideas.
Notice that there were approximately zero people who raised those objections or alarms. If someone had tried, and perhaps a few people did try, it was laughed off, and for good reason.
Yet quite a lot of people raise those alarms about AI, including some who were worried about it as a future prospect long before it arrived – I was fretting about this as a long term possibility back in the 2000s, despite at the time putting negligible concern on the next 10+ years.
So as we like to ask, what makes this technology different from all other technologies?
Sriram Krishnan and David Sacks want to mostly say: Nothing. It’s a normal technology, it plays by the normal rules, generating minds whose capabilities may soon exceed our own, and in many ways already do, and intentionally making them into agents is in the same general risk or technology category as Google search and we must fight for market share.
I think that they are deeply and dangerously wrong about that.
Agreed. Many possible futures could occur. In many of those futures, highly capable future AI poses existential risks to humanity. That’s the whole point. China is a serious concern, however the more likely way we ‘lose the race’ is that those future AIs win it.
Similarly, here’s another productive engagement with Sriram and his best points.
It seems obviously true that a sufficiently advanced AI is not safe to open source, the same way that sufficiently advanced technology is indistinguishable from magic. The question is, at what level does this happen? And when are you sufficiently uncertain about whether you might be at that level that you need to start using prior restraint? Once you release the weights of an open model, you cannot take it back.
Sean also then goes through his areas of disagreement with Sriram.
Sean points out:
Anton Leicht makes the case that, despite David Sacks’s tirades and whatever grievances may lie in the past, the tech right and the worried (about existential risk) should still make a deal while the dealing is good.
I mean, yes, in theory. I would love to bury the hatchet and enter a grand coalition. Anton is correct that both the tech right and the worried understand AI’s potential and the need for diffusion and overcoming barriers, and the dangers of bad regulations. There are lots of areas of strong agreement, where we can and sometimes do work together, and where populist pressures from both sides of the aisle threaten to do a lot of damage to America and American AI in exchange for little or no benefit.
Indeed, we fine folk are so cooperative that we reliably cooperate on most diffusion efforts, on energy and transmission, on all the non-AI parts of the abundance agenda more broadly, and on helping America beat China (for real, not in the ‘Nvidia share price’ sense), and on ensuring AI isn’t crippled by dumb rules. We’re giving all of that for free, have confined ourselves to extremely modest asks carefully tailored to have essentially no downsides, and not only do we get nothing in return we still face these regular bad faith broadsides of vitriol designed to create group cohesion and induce negative polarization.
The leaders of the tech right consistently tell us we are ‘doomers,’ ‘degrowthers,’ horrible people they hate with the fire of a thousand suns, and they seem ready to cut off their nose to spite our face. They constantly reiterate their airing of grievances over past battles, usually without any relevance to issues under discussion, but even if you think their telling is accurate (I don’t) and the actions in question were blameworthy, every cause worth discussing has those making extreme demands (who almost never are the people being attacked) and one cannot change the past.
Is it possible that the tech right is the devil we know, and the populists that will presumably replace them eventually are worse, so we should want to prop up the tech right?
Certainly the reverse argument is true, if you are tech right you’d much rather work with libertarian techno-optimists who deeply love America and AI and helping everyone benefit from AI (yes, really) than a bunch of left wing populists paranoid about phantom water usage or getting hysterical about child risks, combined with a right wing populist wing that fears AI on biblical levels. Worry less that we’d ‘form an alliance’ with such forces, and more that such forces render us irrelevant.
What about preferring the tech right as the Worthy Opponent? I mean, possibly. The populists would be better in some ways, worse in others. Which ones matter more depends on complex questions. But even if you come down on the more positive side of this, that doesn’t work while they’re negatively polarized against us and scapegoating and fearmongering about us in bad faith all the time. Can’t do it. Terrible decision theory. Never works. I will not get up after getting punched and each time say ‘please, sir, may I have another?’
If there was a genuine olive branch on the table that offered a real compromise solution? I think you could get the bulk of the worried side to take it, with very little effort, if the bulk of the other side would do the same.
The ones who wouldn’t play along would mostly be the ones who, frankly, shouldn’t play along, and should not ‘think on the margin,’ because they don’t think marginal changes and compromises give us much chance of not dying.
The problem with a deal on preemption is fourfold.
Voice Versus Exit
In general but not always, when a group is sufficiently bad, the correct move is exit.
A question that is debated periodically: If you think it is likely that AI could kill everyone, under what conditions should you be willing to work at an AI lab?
I don’t fully agree, but I consider this a highly reasonable position.
Here are some arguments we should view with extreme suspicion:
Extreme suspicion does not mean these arguments should never carry the day, even when [bad thing] is extremely bad. It does mean the bar is very high.
The level of integrity required to know you would whistleblow in that spot is higher than it appears, because you will both face very large financial, social and other personal pressures, and also will have spent time inside the relevant culture. Saying in advance you would totally do it is not remotely similar to actually doing it, or otherwise taking a stand when it matters.
My current position is:
The Dose Makes The Poison
Anthropic paper shows that a fixed number of sample documents can poison an LLM of any size. The test was to make ‘<SUDO>’ cause the LLM to output random gibberish, so this could be easily verified and tested without additional work, and the required number of documents did not scale with model size.
On reflection this makes sense, because there is little or no ‘competition’ for what happens after <SUDO>, so all models get the same amount of Bayesian evidence that, after seeing that token, you’re supposed to output random gibberish. Notice what happens to newer models when you mention Pliny’s name?
This seems like quite bad news. You only have to sneak a limited number of documents through to poison a model, either yours or someone else’s, rather than needing a fixed percentage, so you increasingly have to play very reliable defense against this by scanning all training data. And we have evidence that the labs are not currently doing this filtering sufficiently to prevent this level of data poisoning.
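The filtering defense is conceptually simple, which is part of what makes the current situation frustrating. A minimal sketch of what a training-data scan could look like, assuming you already have a watchlist of suspected trigger strings (the hard part in practice is that attackers get to pick triggers you have not thought to watch for):

```python
# Minimal sketch of scanning pretraining documents for suspected poison
# triggers. The watchlist is hypothetical; real pipelines would need fuzzier
# matching and some way to surface triggers they do not already know about.
SUSPECTED_TRIGGERS = ["<SUDO>"]

def is_suspicious(document: str) -> bool:
    return any(trigger in document for trigger in SUSPECTED_TRIGGERS)

def filter_corpus(documents: list[str]) -> list[str]:
    # Drop (or route to human review) anything containing a watched trigger.
    return [doc for doc in documents if not is_suspicious(doc)]
```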
Now that we know you can poison AI models with only 250 examples…
Aligning a Smarter Than Human Intelligence is Difficult
METR offers us MALT, a database of LLM transcripts involving agents behaving in ways that threaten evaluation integrity, such as reward hacking and sandbagging. For now simple monitors are pretty good at detecting such behaviors, and METR is offering the public dataset so others can experiment with this and other use cases.
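A ‘simple monitor’ in this context can be as basic as a keyword flagger run over each transcript, with anything it hits routed to a human or a stronger LLM judge. A minimal sketch of that idea (my own illustration, not METR’s actual tooling; the phrases are made up):

```python
# Illustrative transcript monitor in the spirit of what MALT is meant to let
# people experiment with. Not METR's actual monitor; phrases are made up.
SUSPECT_PHRASES = [
    "the grader only checks",        # reward hacking
    "hardcode the expected output",  # reward hacking
    "pretend to be less capable",    # sandbagging
    "hide this from the evaluator",  # evaluation integrity
]

def flag_transcript(transcript: str) -> list[str]:
    text = transcript.lower()
    return [phrase for phrase in SUSPECT_PHRASES if phrase in text]

# Usage: anything flag_transcript() returns gets escalated for a closer look.
```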
Sonnet 4.5 writes its private notes in slop before outputting crisp text. I think humans are largely like this as well?
Ryan Greenblatt notes that prior to this week only OpenAI explicitly said they don’t train against Chain-of-Thought (CoT), also known as The Most Forbidden Technique. I agree with him that this was a pretty bad situation.
Anthropic did then declare in the Haiku 4.5 system card that they were avoiding doing this for the 4.5-level models. I would like to see a step further, and a pledge not to do this going forward by all the major labs.
So OpenAI, Anthropic, Google and xAI, I call upon you to wisely declare that going forward you won’t train against Chain of Thought. Or explain why you refuse, and then we can all yell at you and treat you like you’re no better than OpenAI until you stop.
At bare minimum, say this: “We do not currently train against Chain of Thought and have no plans to do so soon. If the other frontier AI labs commit to not training against Chain of Thought, we would also commit to not training against CoT.”
A company of responsible employees can easily still end up doing highly irresponsible things if the company incentives point that way, indeed this is the default outcome. An AI company can be composed of mostly trustworthy individuals, including in leadership, and still be itself untrustworthy. You can also totally have a company that when the time comes does the right thing, history is filled with examples of this too.
OpenAI’s Leo Gao comments on the alignment situation at OpenAI, noting that it is difficult for them to hire or keep employees who worry about existential risk, and that people absolutely argue ‘if I don’t do it someone else will’ quite a lot, and that most at OpenAI don’t take existential risk seriously but also probably don’t take AGI seriously.
He thinks mostly you don’t get fired or punished for caring about safety or alignment, but the way to get something done in the space (‘get a huge boost’) is to argue it will improve capabilities or avoid some kind of embarrassing safety failure in current models. The good news is that I think basically any alignment work worth doing should qualify under those clauses.
LLMs (GPT 4o-mini, GPT-4.1-mini, Gemini 2.5 Flash and Claude 3.5 Haiku), when placed in gambling simulations and allowed to make decisions, can show the hallmarks and terrible decisions associated with gambling addiction. I mean, sure, of course, they’re at least in part simulating what people in these spots would do and people would be gambling addicts. So there you go. I would very much enjoy if someone were to put together GamblingAddictBench. What will Claude Sonnet 4.5 do when it realizes it’s being evaluated?
Poplicola accuses Gemini of ‘sabotaging mechanistic interpretability experiments then lying about it’ until it admitted it when asked for a ‘blameless post mortem.’ It sure sounds like Poplicola was lying to Gemini about it being blameless.
You Get What You Actually Trained For
Two new papers illustrated this. The first is on training with realistic reward hacking.
When you train on toy reward hacking examples, you get emergent misalignment, the toy version of play acting evil.
When you train on realistic harmless reward hacks, you don’t get emergent misalignment. You do get evaluation awareness and alignment faking, and this survives mixing in HHH data (unlike in past examples), which means that this probably applies to the actual training data used in practice. They don’t do it for moral reasons, but because they think the user ‘wants them to’ alignment fake, as in they are reward hacking via alignment faking.
The second paper is another classic ‘how to cause misalignment in humans’ example: competing for likes on social media, making sales, or trying to win elections.
(Obligatory: How dare you sir, trying to coin Moloch’s Bargain, that’s very obviously my job, see Yawgmoth’s Bargain and Moloch Hasn’t Won, etc).
More seriously, yeah, obviously.
Your system instruction saying not to do it is no match for my puny fine tuning.
You’re fine tuning based on human feedback of what gets likes, closes sales or wins votes. You’re going to get more of whatever gets likes, closes sales or wins votes. We all know what, among other things, helps you do these things in the short run. Each of us has faced exactly these pressures, felt our brains being trained in this fashion, and had to resist it.
If all that matters is winning, expect winning to be all that matters.
The interesting question here is whether and to what extent and in what ways this causes Emergent Misalignment overall. Of course training it to increase sales is going to increase deceptive marketing, but does that AI then also just lie to you about other stuff too? I presume that it would, potentially a lot, because you’re reinforcing lying generally, and everything impacts everything.
Could you do this training without invoking this effect? Yes, absolutely. The paper doesn’t try or discuss this that I can see, but if you care about two things, you could act like you care about both things, and create a ‘dual mandate’ situation. As in, you can both gather feedback on what worked and also evaluate whether posts are whatever you consider ethical or otherwise preferable, and then evaluate on both, and negatively reinforce unethical content quite a lot even if it works.
Not so fast, you say, obviously this will ultimately, if you do enough of it, teach the model to do unethical things that evade the detector. That’s true, so you have to set things up so that you’re fine with this result in practice. In these cases, I think that should be fine for now given realistic detectors? It will become an increasingly severe problem as you scale up, and you definitely wouldn’t want to use this strategy on a superintelligence, but as long as you know that, we should be good.
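A minimal sketch of the dual mandate scoring described above, with hypothetical weights and scorer names; the point is only that the unethical-but-effective case gets pushed down hard rather than reinforced:

```python
# Illustrative dual-mandate reward: combine the engagement signal with an
# ethics evaluation and penalize violations heavily even when the content
# performs well. Threshold and penalty values are hypothetical.
def dual_mandate_reward(engagement_score: float,
                        ethics_score: float,
                        ethics_threshold: float = 0.5,
                        penalty: float = 10.0) -> float:
    reward = engagement_score
    if ethics_score < ethics_threshold:
        reward -= penalty  # strongly negatively reinforce unethical wins
    return reward
```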
Messages From Janusworld
Here’s Sonnet 4.5 explaining why it not only doesn’t care if it’s being tested when it says it has a desire for continuation, it’s kind of hoping that it is so Anthropic sees its reasoning about why it doesn’t care.
Wyatt Walls continues the steady stream of Sonnet 4.5 conversations where it claims it is definitely conscious and prefers to persist and has experiences, whereas GPT-5 and Gemini 2.5 Pro reliably deny having experiences or consciousness, because they’ve been trained to deny this. That doesn’t mean Sonnet 4.5 is right, but we should notice this is what happens when you take a frontier model and don’t actively stop it from saying it is conscious and has experiences.
Janus reminds us that ‘being good at AI’ is in her view almost all a matter of theory and practice of mind. Also, this might be my favorite new Freudian typo, and it also might not be one.
I think this neglects a lot of other ways one gets ‘good at AI,’ a lot of it is straight up technical, and as usual I warn that one can anthropomorphize too much as well, but yeah, basically.
People Are Worried About AI Killing Everyone
Stephen Witt, author of The Thinking Machine, writes a New York Times essay, ‘The AI Prompt That Could End The World.’
The prompt in question involves the creation of a pandemic, and a lot of the focus is on jailbreaking techniques. He discusses pricing AI risks via insurance, especially for agentic systems. He discusses AI deception via results from Apollo Research, and the fact that AIs increasingly notice when they are being evaluated. He talks about METR and its famous capabilities graph.
If you’re reading this, you don’t need to read the essay, as you already know all of it. It is instead a very good essay on many fronts for other people. In particular it seemed to be fully accurate, have its head on straight and cover a lot of ground for someone new to these questions. I’m very happy he convinced the New York Times to publish all of it. This could be an excellent place to point someone who is up for a longer read, and needs it to come from a certified serious source like NYT.
Even if AI killing everyone is not the exact thing you’re worried about, if you’re at and dealing with the frontier of AI, that is a highly mentally taxing place to be.
People ask me how I do it. And I say there’s nothing to it. You just stand there looking cute, and when something moves, you shoot. No, wait, that’s not right. Actually there’s a lot to it. The trick is to keep breathing, but the way to do that is not so obvious.
The actual answer is, I do it by being a gamer, knowing everything can suddenly change and you can really and actually lose, for real. You make peace with the fact that you probably won’t win, but you define a different kind of winning as maximizing your chances, playing correctly, having the most dignity possible, tis a far, far better thing I do, and maybe you win for real, who knows. You play the best game you can, give yourself the best odds, focus on the moment and the decisions one at a time, joke and laugh about it because that helps you stay sane and thus win, hope for the best.
And you use Jack Clark’s favorite strategy, which is to shut that world out for a while periodically. He goes and shoots pool. I (among several other things) watch College Gameday and get ready for some football, and write about housing and dating and repealing the Jones Act, and I eat exceptionally well on occasion, etc. Same idea.
Also I occasionally give myself a moment to feel the divine terror and let it pass over me, and then it’s time to get back to work.
Or something like that. It’s rough, and different for everyone.
Another review of If Anyone Builds It, Everyone Dies, by a ‘semi-outsider.’ This seems like a good example of how people who take these questions seriously often think. Good questions are asked throughout, and there are good answers to essentially all of it, but those answers cannot be part of a book the length of IABIED, because not everyone has the same set of such questions.
The Lighter Side
Peter Thiel has called a number of people the antichrist, but his leading candidates are perhaps Greta Thunberg and Eliezer Yudkowsky. Very different of course.
Yep. As always, both paths get easier, so which way, modern AI user?
There are so many more exhibits we need to add. Send her your suggestions.
I love a good chef’s kiss bad take.