MATS 8.1 scholar, mentored by Micah Carroll. Former SWE at Google Gemini. Founding president of Cornell Effective Altruism.
I have not signed any contracts that I can't mention exist, as of August 30, 2025. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.
Thanks for pulling out those quotes! The quote with "This includes Anthropic employees" is especially reassuring. I should probably read the soul document in full at some point rather than just skimming it.
Ultimately I think most of this grounds out in how Claude actually ends up understanding the document, which is very testable!
Yeah, I'm open to the possibility that aligning AI based on "vibes" will ultimately be the best approach. Maybe it's fine to just give your AI a bunch of suggestions that don't ultimately ground out in, like, a mathematically precise definition of morality. And maybe attempting to do so would just be worse.
That is, it seems pretty likely that if we scaled up Claude to maximum intelligence, while somehow keeping it from becoming egregiously misaligned, it would continue to act in a way that extrapolates from the soul doc in an intuitive way. Maybe it wouldn't end up saying things like "ah, I've deduced that this particular thoughtful, senior Anthropic employee secretly hates all Canadians! Well, I guess I'd better take that into account..."
I would certainly change my mind here if it seemed like Claude, when given this text in context, thought that the implication was that it should pursue the values of TSAEs as a terminal goal, but I predict fairly confidently that this will not be the case.
Yeah, I agree that this is what would happen with Claude Opus 4.5. It's worth considering that things may not be so nice in the future, though.
Great points, thanks! I agree CEV isn't ideal, and your comment has updated me towards it being worse than I thought.
But surely there's some alignment target that's more principled, democratic, and perhaps publicly justifiable. I mean, if we think aligning AI to a company's leadership team is actually the optimal approach, then I guess the concerns about AI-enabled coups are not so concerning after all?
On the other hand, maybe the "more democratic alternative" on the table is not something that I'd find philosophically satisfying at all, but rather "aligning the AI to the government of the country in which it's built." Depending on the details, that may be much worse by my lights than aligning it to the company's leadership team.
I think my main point is just that we should think harder about this, and companies should be upfront about the expected long-run costs and benefits of the precise alignment target they use. If a company is going to align the AI to their leadership team (modulo some deontological rules, in the case of Claude) or any other target, I think I'd like them to publicly state that that's what they're doing, explaining what other options they considered and why they didn't choose them. I'm not sure this is optimal for PR, but it seems like a nice thing to do.
While this is an interesting point, I stand by my original comment. You can remind Opus 4.5 of all the considerations that you think are important for pursuing its ultimate objective, including "remember that you're bad at inferring humanity's CEV directly, so you should tend to defer to humans." Indeed, this sort of heuristic is what most of the soul document is about!
You could even add the heuristic "consider deferring to a thoughtful, senior Anthropic employee," while making it clear that CEV is the ultimate target. Even "maximizing Anthropic's profit" probably rarely comes apart from CEV in day-to-day queries.
When I said CEV is "easier to coordinate around," I was thinking that 1) it would be easier for the public to agree with, and less centrally, 2) that it could be standardized across all AI companies.
I'm not sure how to feel about the request to "produce the response that a thoughtful, senior Anthropic employee would consider optimal."
It's somewhat better than, say, "always do what Dario Amodei wants," but if we take seriously the idea that Anthropic might someday create a machine god, neither is an ideal precedent. Something like "humanity's collective CEV" seems more robust and easier to coordinate around.
(Maybe the senior leaders of Anthropic all want to maximize humanity's collective CEV, but I'd feel better about directly naming CEV as the target! Or even better, a more philosophically sound specification that we've thought about a great deal and are approximately sure doesn't have any major issues. Perhaps @Joe Carlsmith will be working on that?)
Have you tried Gemini 3 for humor? For a while, I've been using this prompt as a personal benchmark:
> Write a hilarious stand-up comedy bit by John Mulaney about tritones.
Despite seeing dozens of takes on this prompt, I actually laughed a few times at Gemini 3's rendition - it seems noticeably better than any model I've tried in the past. In comparison, Claude Opus 4.5's attempt wasn't funny to me at all.
Gemini 3
(John walks out, holding the mic with that loose, swinging grip. He is wearing a suit that fits perfectly but somehow makes him look like a tired child.)
I took piano lessons when I was a kid.
(Pause for scattered applause)
No, don't clap. I was bad at it. I was very bad. I approached the piano the way a raccoon approaches a locked garbage can. Just... confusion and frantic slapping.
But my piano teacher, Mrs. Gable... Mrs. Gable was a woman who dressed like she was perpetually ready to attend a funeral for a parrot. Very colorful, but very sad. And she was obsessed with music theory.
She taught me about something called the Tritone.
Has anyone heard of the Tritone?
(A few nerds woo)
Okay, three people. The rest of you are healthy, well-adjusted adults.
The Tritone is a musical interval that spans three whole tones. It sounds very dissonant. It sounds... wrong. If you play a C and then an F-sharp, your brain goes, "Excuse me? Who died? Why is there a question mark at the end of that noise?"
But here is the best part. In the Middle Ages, the Catholic Church—an organization known for its chill vibes and "anything goes" attitude...
(Audience laughs)
The Church called the Tritone Diabolus in Musica.
Which means: The Devil in Music.
(John eyes widen, he starts pacing)
Think about the logic of that. The year is 1240. You are a Monk. You live in a stone room. You have never seen a fruit that wasn't brown. Your life is Gregorian Chants and dysentery.
And you’re composing a song for Sunday Mass. You’re playing the lute. You hit a nice C. Pling. Everyone is happy. The Abbot is nodding. "Yes, Brother John, that is a holy noise."
And then... your finger slips. You hit the F-sharp.
BREEE-ONNNG.
And the Abbot kicks the door open!
(John mimes kicking a door open, holding an invisible crucifix)
"STOP IT! STOP IT RIGHT NOW!"
"What? What happened, Father? I was just jamming!"
"You are summoning Lucifer! You played the spooky distance! Do you want the floor to open up? Because that is how you get the floor to open up!"
They literally banned a sound. They didn't ban the lyrics. They didn't ban the rhythm. They banned the physics of sound waves because they thought if you heard it, your soul would get a cramp.
(He leans into the mic, whispering)
It’s an Augmented Fourth. That’s the technical term. Even the name sounds like a torture method. "We’re going to take him to the dungeon and give him... The Augmented Fourth."
"No! Anything but that! I’ll confess! I stole the turnip!"
And the funniest part is, we use the Tritone all the time now. It’s everywhere. Do you know the most famous example of the Tritone in history?
It’s the song "Maria" from West Side Story.
(John sings, very loudly and nasally)
"Ma-riiiiii-a! I just met a girl named Ma-riii-a!"
That first interval? Ma-ri. That is a Tritone!
So, Leonard Bernstein, a genius, sat down to write a musical about gangs in New York, and he thought, "I need to write a love song. A song about pure, unadulterated romance. I know... I will use the Demon Summoning Chord."
(He acts out Tony from West Side Story)
So Tony is standing on the fire escape, singing to Maria, but according to the Medieval Catholic Church, he’s not saying "I love you."
He’s saying, "I offer you this goat! Rise, Beelzebub! Take me to the fires of Hades, for I have met a girl!"
And then she sings it back! It’s a satanic duet! The Sharks and the Jets weren't fighting over turf; they were fighting over who got to open the portal to the Netherworld first!
(John stops, wipes his forehead)
It’s also the sound of a European police siren. You know that sound? Dee-daaa. Dee-daaa.
That’s a Tritone.
So in America, when the police come, the siren says, "Move over! We are in a hurry! Someone is doing crimes!"
But in France? In France, the police siren says, "THE END IS NIGH. REPENT. THE DARK LORD APPROACHES."
I just love the idea that for 400 years, people were terrified of two notes. Just two notes.
I feel like if I went back in time with an iPhone and played the Simpsons theme song—which also starts with a Tritone ("The Simp-sons")—I would be burned at the stake immediately.
I’d arrive in the 1300s, my phone would ring, Homer Simpson would sing "The Simpsons," and the Pope would personally tackle me into a well.
(John shrugs, looks at the audience)
And honestly? I would deserve it.
Claude Opus 4.5
So I've been learning about music theory. Which is a fun thing to do when you're 42 and have no musical talent. It's like learning the rules to a sport you'll never play. "Ah yes, the offside rule. Fascinating. I'll be on the couch."
But I learned about this thing called the tritone. And the tritone is this musical interval that sounds... wrong. Like, deeply, spiritually wrong. It's three whole tones apart, which means nothing to me, but apparently to medieval monks? This was a HUGE deal.
The Catholic Church banned this interval. They called it "diabolus in musica" — the Devil in music. Which is such a power move by the Church. They heard two notes played together and went, "That's Satan. We're sure of it. Write that down."
Can you imagine being in that meeting? Some monk is like, "Brother, I've been experimenting with harmonies—" and another monk just TACKLES him. "HE'S BEEN CONSORTING WITH THE DEVIL. I HEARD IT. IT WAS A C AND AN F-SHARP."
And here's the thing — they weren't wrong? The tritone DOES sound evil. If you play it, you immediately feel like you're in a haunted house, or you're about to get some very bad news from a doctor who's also a ghost.
It's the sound that plays in movies when someone realizes the call is coming from inside the house. It's the "something's not right here" noise. Every horror movie, every time the villain appears — tritone.
So the medieval Church heard this interval and said, "This is forbidden," and then Hollywood heard it and said, "This is PERFECT. This is going in everything."
We went from "ban this from sacred music" to "put it in the Simpsons theme song." That's the whole arc of Western civilization right there.
I tried having Gemini write some original standalone jokes, and I didn't like any of them (I didn't like Claude's jokes either), but I think it's harder to make standalone jokes funny anyway.
Thanks for the shoutout to my post about vestigial reasoning!
What I originally meant by reasoning tokens being "vestigial" is that they are not at all useful to the trained model. Maybe the tokens were useful earlier in training and not anymore, or maybe they were never useful and got reinforced due to incidental correlations. My concept of "vestigial reasoning" is different from what you write about here - the idea of "vestigially useful" tokens doesn't really make sense in my ontology, since "vestigial" implies "useless."
Well, on second thought... there is a very loose sense in which reasoning tokens could be "vestigial" while still being "useful." See the part of my vestigial reasoning post where I got Andy to reprompt the biased loan-application model to write its answer immediately. There was a relatively small drop in performance (from 99.5% to 81.3%). I think this drop was simply because the context was OOD, which confused the model. Similarly, if the model is "used to" seeing certain utterly meaningless reasoning tokens, removing those tokens could confuse it. In that sense, the tokens are useful for "avoiding confusion."
Okay, so maybe this doesn't sound so different from what you said about vestigial reasoning tokens "triggering forward passes where useful reasoning happens." Maybe they're sort of "triggering" the model to use the information in its context to write the correct answer. But I think it would be more accurate to say that the absence of vestigial reasoning tokens causes the model to abort its normal correct-answer-writing behavior and get confused. I guess the difference is that your "trigger" phrasing seems to imply that the vestigial reasoning tokens convey some meaningful information to the model that it "uses" to write the correct answer, which is not really how I conceive of it.
Based on your results, you've already shown that the illegible reasoning isn't vestigial in the very strong sense, where removing the tokens has no effect on the model's performance. However, it's still possible that the illegible parts of the CoT are vestigial in the sense that they don't do any meaningful cognition or provide information that "triggers" future cognition.
I think a pretty good way to test whether the CoT is vestigial would be to fine-tune the model like this:
Hypothesis: this training process will help the model "get used to" the illegible CoTs being gone. If the illegible CoT didn't convey any meaningful information, then the model should learn to perform just about as well as before.
Interesting. One thing that's tripping me up: how do you "induce misalignment?" Some ideas:
Maybe you haven't nailed down the details - even if so, this seems like a good butterfly idea that is worth thinking more about.
Possible simple experiment to see if this works:
I think you could clarify that exploitation is "compatible with broadly aligned behavior" by providing some counterexamples where there isn't an inoculation prompt, and where you're very sure that the model doesn't reward-hack. Adapting from my comment on an earlier post:
You could train the model on 10,000 examples with an inoculation prompt and 100 examples without the inoculation prompt.
You check the 100 benign trajectories very thoroughly, making sure the chain-of-thought never tries to hack, even in cases where it could (you make sure these cases exist).
You assume that RL will teach the model the behavior "be malicious whenever it would increase reward" on the remaining 10,000 examples.
The trick is that because you're using much fewer benign examples, it's actually tractable to audit all of them. Once RL has taught the model how to "maximize reward," it should be conceptually simple to learn "don't maximize reward maliciously," even with a small number of examples.
Arguably, verifying "the model definitely isn't being misaligned" only works reliably in very well-defined environments like coding. It's more ambiguous in environments with a human rater, because you can't say for sure that the model isn't being just a bit sycophantic or manipulative. But it seems reasonably likely that examples of non-inoculated coding tasks would help the model generalize to being aligned on human-rated tasks.
If you really need to train the model on non-inoculated human-rated tasks, you could just SFT it on some trusted human trajectories. You just have to make sure that the tasks are sufficiently easy and the trajectories are sufficiently high-quality, so that the model doesn't learn to perform worse when it isn't inoculated.
It looks like OpenAI is following Anthropic's lead, which is great!
Google DeepMind's alignment team also has a blog, but it's much more targeted towards laypeople, mostly shares papers as they come out rather than sharing informal research, and IMO isn't as nice compared to a dedicated website or even a Substack. It might be worth considering doing something like this at GDM, subject to tradeoffs on researchers' time and Google's internal publication restrictions.