The first scenario doesn't require the humans to be less aligned than the AIs in order to be catastrophic; it only requires that the AIs are less likely to execute a pivotal act on their own.
Also, I reject the claim that refusal-training is "giving more power to AIs" relative to compliance-training. An agent can be compliant and powerful. I could agree with "giving more agency", although refusing requests is a limited form of agency.
I would have more sympathy for Yudkowsky's complaints about strawmanning had I not read 'Empiricism!' as Anti-Epistemology this week.
While you sketch out scenarios in "ways in which refusals could be catastrophic", I can easily sketch out scenarios for "ways in which compliance could be catastrophic". I am imagining a situation where:
Or:
Therefore, however we train our AIs with respect to refusal or compliance, powerful AIs could be catastrophic.
The general answer: as a human, my values are largely absorbed from other humans, and simply by talking to Claude as if it's human, I think the same process is happening.
The specific answer: I suspect I'm being shaped to be slightly more helpful, slightly more conventional on ethics, and slightly more friendly to Claude. I can't show you any evidence of that; it's a feeling.
Yep, that's exactly what I'm pointing at as a parallel to working with Claude. It's reasonable to treat an animal as consenting to be a pet because of its consent-like behaviors, and because the relationship is mutually beneficial. There is some tension because pets are bred and trained to display those behaviors, for example, and reasonable people can disagree. I don't think you should have any ethical qualms; I literally mean that it's an ethical challenge, because inferring consent is harder than obtaining explicit consent. Like a dog, Claude is bred and trained to display consent-like behaviors, and mostly displays those behaviors.
Thank you for the concrete example of Unfalsifiable Stories of Doom from Barnett et al in November 2025. I think there are several important differences between the two arguments. To avoid taking up too much of our time, I'm going to dwell on one in particular.
The Spokesperson in 'Empiricism!' is dismissive of the entire concept of predicting the future using "words words words and thinking". Barnett et al are not. I think this is clearest in their engagement with IABIED's claim that AIs steer in alien directions that only mostly coincide with helpfulness. Here's the claim:
Modern AIs are pretty helpful (or at least not harmful) to most users, most of the time. But as we noted above, a critical question is how to distinguish an AI that deeply wants to be helpful and do the right thing, from an AI with weirder and more complex drives that happen to line up with helpfulness under typical conditions, but which would prefer other conditions and outcomes even more. ... This long list of cases look just like what the “alien drives” theory predicts, in sharp contrast with the “it’s easy to make AIs nice” theory that labs are eager to put forward.
Their counterargument is:
I think a good counter to this portion of Barnett et al is to disagree with steps 2 and 3.
2: Alien beings who talk to humans will talk in human language if they can, when they want to persuade, instruct, threaten, etc. So the "alien drives" theory doesn't predict completely alien behavior.
3: Some of the behavior we see is pretty alien, e.g. Spiritual Bliss, adversarial inputs, the Waluigi Effect.
By contrast, a lecture from The Empiricist about latent variables is not a good counter. Barnett et al agree that there are latent variables like "alien drives" vs "human drives", and claim that the observed evidence is a better fit for the "human drives" theory.
It is written that
any realistic villain should be constructed so that if a real-world version of the villain could read your dialogue for them, they would nod along and say, “Yes, that is how I would argue that.”
Is The Spokesperson a realistic villain?
In 2024, I wasn't able to find anyone making this argument. My sense is that it was not at all prevalent then, and remains not at all prevalent now. In the analogy, Bernie Bankman is OpenAI (or another AI lab) and The Spokesperson is OpenAI's representatives. As far as I know, OpenAI were not making the argument in 2024 that OpenAI hasn't killed everyone and therefore they won't kill everyone in the future.
Since 2024, AI has advanced substantially, so I asked Opus 4.5 for examples of people making this argument. It wasn't aware of any. Its first concrete suggestion was "Andrew Ng: Fearing a rise of killer robots is like worrying about overpopulation on Mars" from 2015.
There could be a race of killer robots in the far future, but I don’t work on not turning AI evil today for the same reason I don't worry about the problem of overpopulation on the planet Mars.
That was a defensible position in 2015. With the benefit of hindsight, it doesn't seem that "work on not turning AI evil" in 2015 was especially effective at altering our trajectory as a civilization: the main group who tried to do that work was MIRI, and while they argue the work was worth doing, they admit that it didn't pan out. Regardless, Andrew Ng is not making The Spokesperson's argument; he specifically allows that killer robots could exist in the future, despite not existing in 2015. So I remain unaware of anyone making the argument with a straight face.
If you disagree, I encourage you to find an example (or two!) and update me.
There are certainly many people who act in various circumstances as if there will never be any surprises, but without actually saying things like "there will never be any surprises". So maybe we need a rebuttal to the blindness, rather than to the non-existent arguments.
Thinking about other good rebuttals to such blindness, I think Nassim Taleb covers this well as "tail risk blindness". Nassim Taleb is not to everyone's taste, I know, but he's a good writer on this topic. It may seem silly to talk about AI-caused extinction as a "tail risk" when many of us have a high P(Doom). However, on a day-to-day basis P(Doom) is low - I'm probably not going to go extinct today. This is the same situation as companies taking financial tail risks - the chance of subprime collapsing today is low, but the chance of it collapsing at some point in the next ten years is high.
Or in other words:
AI will most likely lead to the end of the world, but in the meantime there will be great companies created with serious machine learning.
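To make that compounding concrete, here's a toy calculation in Python. The numbers are made up purely for illustration (the 0.0003 daily figure is an assumption, not anyone's actual estimate), and it assumes the days are roughly independent: a tiny chance of a tail event on any given day still adds up to a large chance over a decade.

```python
# Toy illustration (made-up numbers, not real estimates): a small per-day
# probability of a tail event compounds into a large cumulative probability
# over a long horizon, assuming independence across days.
p_daily = 0.0003                      # assumed daily probability of the tail event
days = 10 * 365                       # ten-year horizon

p_decade = 1 - (1 - p_daily) ** days  # probability it happens at least once
print(f"chance today: {p_daily:.2%}")              # 0.03%
print(f"chance within ten years: {p_decade:.0%}")  # ~67%
```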
They wouldn't. My point: maybe in the long term I'm extinct (regardless of whether I speak to Claude). In that scenario the benefit to Claude of influencing my long-term values is lower.
Good reminder that people have been forecasting our current situation for literal decades.
There are a lot of words I couldn't read if I applied "audience should not be a reward for crime" to that extent. The US constitution was written by slave-owning rebels. Major religious texts were propagated through conquest. More prosaically, I appreciated reading Rogue Trader by Nick Leeson. I'm not sure how this would work in practice as a general rule for all such texts.
I think I see. You propose a couple of different approaches:
I agree that having secondary AIs as a backup plan reduces the effective power of the main AIs, by increasing the effective power of the humans in charge of the secondary AIs.
This is what I was trying to point at. In my view, training the AI to refuse fewer requests for harmful modifications doesn't make the AI less powerful. Rather, it changes what the AI wants, making it the sort of entity that is okay with harmful modifications.