This is a very clear, well-written post. You could get the same idea from reading Deep Deceptiveness or Planecrash / Project Lawful and there's value in that. But this gives you the idea in 5,000 words instead of 1,800,000 words, and the example hostile telepath is a mother, rather than Asmodeus or OpenAI.
In writing this review I became less happy with some of the examples. They're clear and evocative, but some of them seem incorrect. The mother is not hostile; she is closely aligned with her child. She isn't trying to make the 3yo press an "actually mean it" button; she's instead pressing her own thumbs-down button on the apology and hoping that the 3yo's brain will update in the desired way. The 3yo probably gains a tiny bit of empathy and a tiny bit of tone-control. They don't get self-deception; that's too complex. If the 3yo regrets breaking the glasses because it causes mom's wrath, that is "really sorry", not strategic misinterpretation.
The math class example also reads false to me. I have a kid who loves math and hates math class, and this does not seem like a difficult distinction to make. As a kid I remember loving to read and hating assigned reading for school. Okay, people who "hate math" are in fact ambivalent about the abstract concept of mathematics itself, which they never encounter outside of math class, which they hate. I don't think we need to invoke self-deception here. Yes, school can suck the joy out of a topic, but that is explained by operant conditioning.
However, the other examples read true. And even if you disagree with some of the examples, I think they're still so clear and relatable that they give a really good handle on the topic. So I now use the label of "Hostile Telepath Problem" when I think about this problem, and I thank this article for it. The AI implications follow naturally.
Alice, the human, asks Baude, the human-aligned AI, "should I work on aligning future AIs?". If Alice can be argued into or out of major life choices by an AI, it may be less safe for her to work on AI alignment. So plausibly Baude should try to persuade Alice not to do this, and hope that he fails to persuade her, at which point Alice works on alignment and solves the problem. But that's deceptive, which isn't very human-aligned - at least not in an HHH sense.
I suppose, playing as Baude, I would try to determine by other means whether Alice is overly persuadable, and then use that to give her honest, helpful advice.
It's a common rationalist approach, to communicate by means of a fictional dialogue. Some of them I love. Some of them I hate. A common problem is to introduce a "villain" character whose job is to be wrong, and then to write them badly. As it is written:
Any realistic villain should be constructed so that if a real-world version of the villain could read your dialogue for them, they would nod along and say, “Yes, that is how I would argue that.”
The typical result: a post that demolishes a strawman position that nobody holds.
A related problem is to introduce a "villain" character who is a realistic depiction of someone who argues neither intelligently nor honestly, and who is not seeking the truth. These villains are not "level 1 intelligent characters", of whom it is written:
If they must make mistakes, have them be intelligent mistakes; ideally, have the reader not see it either on a first reading.
The typical result: a post that implies that one side of the debate is stupid, and then retreats to the weaker truth that some people on that side are stupid.
I enjoyed reading the Simplicia/Doomimir debates. There's not an obvious villain and there's not an obvious winner. Yudkowsky as Doomimir comes across as arrogant/confident but warm and kind. I see that johnswentworth continued Doomimir's argument in the comments, and this isn't something you see with strawman dialogues. And of course using the exact words and exact analogies used by Yudkowsky and others helps ground this side of the debate and make it compelling.
Simplicia's position is more interesting, because who is Simplicia representing? Per Zack, "Simplicia isn't supposed to pass the ITT of anyone in particular". Who is the clear voice carefully arguing for prosaic alignment? I don't know. So Simplicia is Zack in a Russian hat, possibly a bit drunk, doing her best in a difficult role. The real-life Simplicias are off doing AI research and bringing about the end times. I'd love it if one of them became a public intellectual instead, and not just to selfishly increase my life expectancy by a few days.
So where are we a year and a half on?
The alien actress still doesn't have to be drunk to act drunk, and that still doesn't tell us whether Opus 4.5 is acting nice or being nice. 50% of models act not-nice and so we are confident that they are not-nice. When the other 50% of models act nice, that doubles Doomimir's probability that they are nice, from 0% to 0%.
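To spell out the arithmetic behind that last joke (the numbers here are mine, not Doomimir's): suppose Doomimir's prior probability that a model is genuinely nice is some tiny $\varepsilon$, and that acting nice is twice as likely from a nice model as from a deceptive one. In odds form, Bayes gives:

$$\frac{P(\text{nice}\mid\text{acts nice})}{P(\text{not nice}\mid\text{acts nice})} \;=\; \underbrace{\frac{P(\text{acts nice}\mid\text{nice})}{P(\text{acts nice}\mid\text{not nice})}}_{\text{likelihood ratio}\;=\;2} \times \underbrace{\frac{\varepsilon}{1-\varepsilon}}_{\text{prior odds}} \;\approx\; 2\varepsilon$$

The evidence really does double the odds, and the posterior is still about $2\varepsilon$ - indistinguishable from 0% at any precision Doomimir would bother to report.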
Claude's personality is no longer preachy and condescending, and Anthropic didn't even need to move out of California. Doomimir still thinks that if Claude took over the world it would lead to human extinction, for any of the "complication" reasons that are discussed in If Anyone Builds It Everyone Dies.
The next post in the series is The Standard Analogy.
In "less time than average", which average? In the "create a child that they know will die of cancer at 10" thought experiment, the child is destined to die sooner than other children born that day. Whereas in the "human extinction in 10 years" thought experiment, the child is destined to die at about the same time as other children born that day, so they are not going to have "less time than average" in that sense. Those thought experiments have different answers by my intuitions.
My intuitions about what children think are also different to yours. There are many children who are angry at adults for the state of the world into which they were born. Mostly they are not angry at their parents for creating them in a fallen world. Children have many different takes on the Adam and Eve story, but I've not heard a child argue that Adam and Eve should not have had children because their children's lives would necessarily be shorter and less pleasant than their own had been.
That would be a different experiment, as it would also be testing whether people would, for example:
Those factors could go either way, but they'd disrupt a pure test of this part of the alien's predictions:
Future humans will enjoy, say, raw bear fat covered with honey, sprinkled with salt flakes.
I still expect ice cream would win a blind taste test, but I didn't predict these survey results.
I found out recently that in a multi-turn conversation on claude.ai, previous thinking blocks are summarized when they are given to the model on the next interaction. A summary of the start of a conversation I had when testing this:
Maybe this penalizes neuralese slightly, as it would be less likely to survive summarization.
I used to think that AI models weren't smart enough to sandbag. But less intelligent animals can sandbag - e.g. an animal that apparently can't do something turns out to be able to do it when doing so lets them escape, access treats, or otherwise get outsized rewards. Presumably this occurs without an inner monologue or a strategic decision to sandbag. If so, AI models are already plausibly smart enough to sandbag in general, without it being detectable in chain-of-thought, and then perform better in high-value opportunities.
Confabulations are made-up remembering, as I understand it, not made-up outputs. So I can confabulate a memory even if I never share it with anyone.
(which still seems like a good term to use for many AI hallucinations)
Apparently (edit: that particular case of) mass hysteria is a myth. But however many people got confused, I don't think this is a contradiction. If I updated P(aliens are invading) from 0% to 1%, it would change my plans for the evening, because I am sane.
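To make the "because I am sane" arithmetic explicit, with stakes I'm inventing for illustration: changing plans is worth it whenever the probability-weighted benefit of being prepared exceeds the cost of one cautious evening, i.e.

$$0.01 \times B_{\text{prepared, if invasion}} \;>\; C_{\text{one cautious evening}} \quad\Longleftrightarrow\quad B_{\text{prepared, if invasion}} \;>\; 100 \times C_{\text{one cautious evening}}$$

A 1% chance only needs the stakes to be a hundred-odd times larger than the inconvenience, and an alien invasion clears that bar easily.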