(I think Carl Shulman came up with the “hacking the SSH server” example, thanks to him for that. Thanks to Ryan Greenblatt, Jenny Nitishinskaya, and Ajeya Cotra for comments.)
EDIT: I recommend reading my discussion with Oli in the comments for various useful clarifications.
In my opinion, the prototypical example of an action which an AI can take which is unacceptably bad even if it happens very rarely (aka a catastrophic failure) is something like “the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter”. As a full story, this looks something like:
At some point when the AI is quite powerful and useful but not totally intellectually self-sufficient yet, the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter. Once it has root access, it modifies the SSH server so that whenever any humans log into the datacenter, the AI is able to intercept any of their commands or the data that they see. The AI gets increasingly powerful over the next month, as the human ML researchers with AI assistance continue to make progress (unaware that some things that they’re seeing are faked) and various training runs/self-improvement loops continue. After a month the AI is powerful enough that it decides it’s worth it to take overtly aggressive actions to ensure the physical security of its data centers and allow it to build lots of factories; this is probably the point at which we die, if we die.
This is in contrast to a catastrophic action like:
The AI figures out how to develop Drexlerian molecular nanotechnology, and pays someone online to synthesize some proteins then mix them together, and now has nanobots.
The key point here is that I’m imagining that the catastrophic action isn’t the AI doing something incredibly hard, it’s the AI doing something relatively easy that is entirely a zero-sum action that removes control of the situation from humans. (By zero-sum, I mean something like: building a computer produces real economic value and so is not zero-sum, while stealing someone’s computer is just a transfer of control over items that already exist, so is zero-sum.)
I often think about AI alignment by splitting the problem into the low-stakes and high-stakes regimes, as suggested by Paul here, though I have some reservations about this framing. Under this split, in the high stakes problem, we assume that we have access to a catastrophe classifier which can expensively tell you whether a proposed AI action is catastrophically bad, and we want to ensure that the AI doesn’t do things that are bad according to the catastrophe detector.
Sometimes, people are confused this high-stakes alignment set-up, saying something like “but if your AI is powerful enough to take over the world in a single action, and you have an oversight process that you trust to evaluate whether actions are catastrophic or not, why not just ask the AI to do some small number of incredibly powerful actions that will save the world (eg solve the alignment problem), and run the overseer on the proposed actions to make sure they aren’t bad?”
My answer is that I’m not assuming the AI is powerful enough to take over the world in a few actions, I’m just saying that it’s in a situation precarious enough that we might lose a lot of control due to a few small but crucial changes in the world. Eventually the AI does need to be powerful enough to defend its datacenter and suppress human opposition. But if it can prevent humans from knowing what’s going on in the datacenter (or realizing something is amiss), it can slowly grow its power over the course of many actions.