Thanks for getting back to me. Your pizza example perfectly captures what I’ve been grappling with: I’m still trying to fully wrap my head around why an AI would “want” to deceive us or plot our extinction. I also appreciate (and agree) that there’s no need to invoke human-like traits, agency, or consciousness here, since we’re talking about something entirely different from the way humans pursue goals. That said, I think, as you point out, the fact that we lack precise language for describing this kind of “goal pursuit” can lead to misunderstandings (for me and perhaps others) and, more importantly, as you mention in the article, can make it easier for some to dismiss x-risk concerns. I’m looking forward to reading the book to see how you navigate this!
You mention the program takes in people “with basically none at all” in terms of prior experience. Would Anthropic consider someone with a background in cognitive science (PhD in cognitive neuroscience), but no direct technical AI alignment experience, who is deeply interested in alignment via debate—specifically in reducing systematic human error?
Great post, thanks for sharing. I agree with the core concern here: advanced optimisation systems could be extremely effective at pursuing objectives that aren’t fully aligned with human values. My only hesitation is with the “goals” framing. While it’s a helpful shorthand, it comes from a biological, intentional-stance way of thinking that risks over-anthropomorphising AI. AI is not a biological agent; it doesn’t have evolved motivations or inner wants. What we seem to be talking about is optimisation toward pre-programmed objectives (explicit or emergent) that may not match what we intended. What I’m still trying to understand is whether optimisation processes need human-like agency to develop something akin to “goals” and then produce large-scale, potentially existential shifts, or whether they can simply push relentlessly toward whatever increases their internal score, regardless of whether the resulting states are beneficial or safe for humans. Would be good to hear people’s take on this!
Fascinating post. I believe what ultimately matters isn’t whether ChatGPT is conscious per se, but when and why people begin to attribute mental states and even consciousness to it. As you acknowledge, we still understand very little about human consciousness (I’m a consciousness researcher myself), and it’s likely that if AI ever achieves consciousness, its consciousness will look very different from our own.
Perhaps what we should be focusing on is how repeated interactions with AI shape people’s perceptions over time. As these systems become more embedded in our lives, understanding the psychological tipping point at which people start seeing them as having a mind is crucial not only for safety, but also for maintaining a clear boundary between the simulation of mental states and the presence of mental states.
I’ve trained myself not to give too much weight to the thoughts that come bundled with certain emotions, usually because those thoughts are stupid or unhelpful, whereas I suspect the emotion itself might not be. A friend of mine (who’s a clinical psychologist) often reminds me that there’s a difference between intellectualising an emotion and actually sitting with it, with the goal of feeling it fully and seeing what it has to offer. I still find that hard to do. I get why people who intellectualise their emotions (myself included) might end up going down the trauma rabbit hole, trying to ‘figure out what happened’ instead of sitting with and feeling the emotions themselves.
This post really resonates, because the thoughts accompanying low-valence, high-arousal states like anger are often false narratives. Perhaps the real move is not to mess with that narrative, but to let the emotion shift on its own as the underlying internal state changes (hunger, tiredness, etc.).
I think we’re circling the same confusion: why would an AI 'want' to destroy us in the first place, and why is that treated as the default scenario? If we frame this in terms of hypothesis testing, where we begin with a null hypothesis and only reject it when there is strong evidence for the alternative, then the null could just as well be: AI will pursue the success of the human species, with cooperation or the preservation of humanity being the more adaptive strategy.
If I understand the instrumental convergence argument correctly, power-seeking is a strong attractor, and humans might be in the way of AIs obtaining power. But why would an AI 'wanting' power result in x-risk or human destruction? Beyond the difficulty of aligning AI exactly with our values, what justifies treating catastrophic outcomes as the default rather than cooperative ones?