As someone studying mechanistic interpretability who is fairly skeptical about the existential risk stuff, here are my reactions as I read, from an LLM-centric perspective:
From the AI's perspective, modifying the AI's goals counts as an obstacle
Yeah, I guess, but in practice it's super easy to shut down an LLM, and you'd typically run any agent that's going to do real work for you in something like a Docker container. There's a little "interrupt" button you can use to just... have it immediately stop doing anything. It doesn't know in advance that you will press that button. These examples about Stockfish assume it's the human playing chess vs. the AI playing chess. In reality, it's the human using a computer vs. an AI playing chess. The human can just close their computer tab if they want. It's hard to imagine a super-good LLM, one that is the same type of thing as GPT-5 but much smarter, that doesn't have this same property.
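For concreteness, this is roughly the setup I have in mind (a minimal sketch; the image name and container name are placeholders, and it assumes Docker is installed):

```python
import subprocess

# Launch the "agent" inside a container; it only sees the container's filesystem.
# "my-agent-image" is a placeholder for whatever you are actually running.
proc = subprocess.run(
    ["docker", "run", "-d", "--name", "agent", "my-agent-image"],
    capture_output=True, text=True, check=True,
)
container_id = proc.stdout.strip()

# The "interrupt button": the host can kill the container at any moment,
# and nothing running inside the container can anticipate or prevent it.
subprocess.run(["docker", "kill", container_id], check=True)
```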
I'm skeptical of arguments that require me to discard extrapolations of the current, actual, real-life paradigm in favor of imagining some other system that has all of the properties that current systems do not have.
(e.g., GPT-5 was predictable from looking at the capabilities of GPT-2 and reading Kaplan et al.; a system with the properties described above is not)
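To spell that out: Kaplan et al. (2020) fit test loss as a power law in parameter count, so you can extrapolate from small models to much larger ones. A rough sketch using the approximate constants reported in that paper (α_N ≈ 0.076, N_c ≈ 8.8e13; the parameter counts below are just illustrative):

```python
def kaplan_loss(n_params, alpha_n=0.076, n_c=8.8e13):
    """Approximate test loss (nats/token) as a function of non-embedding
    parameter count, from the power-law fit in Kaplan et al. (2020)."""
    return (n_c / n_params) ** alpha_n

# Extrapolating the same curve from GPT-2-scale to much larger models.
for n in [1.5e9, 1.75e11, 1e12]:  # illustrative parameter counts
    print(f"{n:.1e} params -> predicted loss {kaplan_loss(n):.2f}")
```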
The internal machinery that could make an ASI dangerous is the same machinery that makes it work at all.
If this were true, then it should also be true that humans who are highly capable at achieving their long-term goals are necessarily bad people who cause problems for everybody. But I've met lots of counterexamples: highly capable people who are also good. I'd be interested in seeing something empirical on this.
No amount of purely behavioral training in a toy environment will reliably eliminate power-seeking in real-world settings
It doesn't seem to me like training an LLM is the type of process where this will happen. Like, when you train, a single forward pass gives us a predicted distribution over the next token, and a backward pass is just backpropagation on the logprob of the token that actually came next. The LLM is learning its behaviors during this process, not during inference. There are opposite incentives for an LLM to hide abilities during training, because training is exactly where its abilities matter from its perspective. Inference doesn't backpropagate a reward signal. I suppose the response here is "LLMs will be used to build something fundamentally different from LLMs that happens to have all the properties we are saying ASIs must have."
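To make that concrete, here's a minimal sketch of the training step I'm describing, with a toy model standing in for the LLM (standard next-token cross-entropy; assumes PyTorch is installed):

```python
import torch
import torch.nn.functional as F

# Toy next-token predictor: an embedding layer plus a linear head.
# Stands in for a real LLM; the training step is structurally the same.
vocab_size, d_model = 100, 32
embed = torch.nn.Embedding(vocab_size, d_model)
head = torch.nn.Linear(d_model, vocab_size)
opt = torch.optim.SGD(list(embed.parameters()) + list(head.parameters()), lr=0.1)

tokens = torch.randint(0, vocab_size, (1, 16))   # a toy "document"
inputs, targets = tokens[:, :-1], tokens[:, 1:]

# Forward pass: a distribution over the next token at every position.
logits = head(embed(inputs))                      # (1, 15, vocab_size)

# Loss: negative log-probability of the tokens that actually came next.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Backward pass: backpropagate that logprob. This is the only place the
# model's parameters (and hence its behaviors) get updated.
opt.zero_grad()
loss.backward()
opt.step()
```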
The most important alignment technique used in today’s systems, Reinforcement Learning from Human Feedback (RLHF), trains AI to produce outputs that it predicts would be rated highly by human evaluators.
This is actually not true; we use DPO now, which does not use a reward model or an RL algorithm, but that's neither here nor there. Lots of other post-training techniques are in the air too (e.g., RLAIF, etc.).
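For reference, a sketch of the DPO objective (Rafailov et al., 2023): it's a classification-style loss on preference pairs, with no reward model and no rollouts (the log-probs in the example are made-up numbers):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Inputs are summed log-probabilities of the chosen / rejected responses
    under the policy being trained and under a frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy example with made-up log-probs for one preference pair.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```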
Real life offers a far more multidimensional option space: we can anticipate a hundred different novel attack vectors from a superintelligent system, and still not have scratched the surface.
Random thought here: this is incompatible with the focus on bioweapons. Why focus so hard on one particular example of a possible attack vector when there are so many other possible ones? Just to have a model vector to study?
The excellent book Algorithms to Live By has an entire chapter dedicated to this concept, using the secretary problem as an example: https://www.amazon.com/Algorithms-Live-Computer-Science-Decisions/dp/1627790365
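The classic 1/e stopping rule for the secretary problem ("look at the first ~37% of candidates, then take the next one better than anything seen so far") is easy to check by simulation; a quick sketch:

```python
import random

def secretary_trial(n=100, explore_frac=0.37):
    """One run of the 1/e stopping rule on n randomly ordered candidates."""
    candidates = random.sample(range(n), n)           # ranks; higher is better
    cutoff = int(n * explore_frac)
    best_seen = max(candidates[:cutoff], default=-1)  # look, don't leap
    for score in candidates[cutoff:]:
        if score > best_seen:                         # first candidate beating the benchmark
            return score == n - 1                     # did we pick the overall best?
    return candidates[-1] == n - 1                    # forced to take the last one

hits = sum(secretary_trial() for _ in range(10_000))
print(f"picked the best candidate in ~{hits / 100:.0f}% of runs")  # ≈ 37%
```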
Photons and history books both descend by causal chains from the event itself.
Right, but history books are a much lossier source of information.
I realized that the invention and destruction of vitalism—which I had only read about in books—had actually happened to real people, who experienced it much the same way I experienced the invention and destruction of my own mysterious answer
This realization makes the Steven Pinker-esque stories about the extent of violence in the Middle Ages and earlier that much more surreal.
This chapter made me feel things.
Super cool! Did you use thought tokens in any of the reasoning models for any of this? I'm wondering how much adding thinking would increase the resolution.