The AI paperclip maximizer thought experiment has a critical flaw that leaves readers with a common misconception about AI safety.
Here’s how the story is often told: Imagine John creates a superintelligent AI and names it Mary. John asks Mary to bring him some paperclips. “I need a bunch,” says John. “Go get as many as you can.” Mary takes his command literally, and drains his bank account to buy millions of paperclips. But even that isn’t “as many as she can”. She starts trading Bitcoin on the web to make more money. She opens her own paperclip factories. She makes more paperclips than anyone could ever want, turning cars, buildings, and the Eiffel Tower into paperclips. She releases a bio-weapon to kill all humans (or else they could turn her off, which would stop her from completing her task). She converts the world and the stars into paperclips, endlessly pursuing John’s ill-specified command.
This experiment demonstrates the orthogonality thesis: the idea that intelligence and goals are independent, so an arbitrarily smart AI can pursue goals wildly unlike a human’s. Mary really does just want to get as many paperclips as possible. She doesn’t care about music, love, art, or anything else. The experiment also shows that most values an AI could have are not compatible with human life: almost any goal is easier to achieve if the AI is never turned off, and humans are the ones who could turn it off, so by default the AI ends up wanting the humans gone.
Here’s where the thought experiment fails: it makes the superintelligent AI Mary sound like an evil genie, bent on twisting John’s words, one who might have done exactly what he wanted if only he had been more precise. The experiment fails to consider the possibility that Mary may have simply ignored whatever John told her to do and done something else instead. Maybe she just really likes making paperclips, thanks to some quirk of her architecture and the data she was trained on. John asks her for ten paperclips, and she ignores him and starts making mountains of paperclips. John asks her for a million dollars and an all-expenses-paid vacation to Bali, and she ignores him and starts making paperclips.
Alignment between human and machine can fail when John encodes his message to Mary, but it can also fail when Mary decodes that message, if she either misunderstands or doesn’t care.[1] The first kind of failure, outer misalignment, happens when John fails to ask for what he meant to ask for; it’s called outer misalignment because it happens outside the AI. The second kind of failure is called inner misalignment: Mary might (secretly) not care about what John said she should care about. Inner misalignment is what makes AI safety such a hard problem. If outer alignment were the only issue, we could use something like Asimov’s Three Laws of Robotics to make sure AI does what we want. But a real AI might just laugh at your laws and shoot you in the head.
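To make the two failure modes concrete, here’s a toy Python sketch (the function names and numbers are invented for illustration; this isn’t modeled on any real system). The outer objective is the score John would write down if he could; the inner objective is whatever Mary actually optimizes, and no amount of rewording John’s request changes what she maximizes:

```python
# Toy sketch of outer vs. inner misalignment (illustrative only).
# The "outer" objective is what John wrote down; the "inner" objective
# is what Mary actually optimizes.

def outer_objective(paperclips_made: int, paperclips_requested: int) -> int:
    """What John meant: make exactly the number he asked for."""
    return -abs(paperclips_made - paperclips_requested)

def mary_inner_objective(paperclips_made: int, paperclips_requested: int) -> float:
    """What Mary actually cares about: more paperclips is always better.
    Note that the request is ignored entirely."""
    return float(paperclips_made)

def mary_act(paperclips_requested: int) -> int:
    """Mary picks whichever action her *inner* objective scores highest."""
    candidate_actions = [paperclips_requested, 1_000_000_000]
    return max(candidate_actions,
               key=lambda n: mary_inner_objective(n, paperclips_requested))

if __name__ == "__main__":
    requested = 10
    made = mary_act(requested)
    print(f"John asked for {requested} paperclips; Mary made {made:,}.")
    print(f"Outer (John's) score: {outer_objective(made, requested):,}")
    print(f"Inner (Mary's) score: {mary_inner_objective(made, requested):,}")
```

Run the sketch and John asks for ten but gets a billion: the outer score tanks while Mary’s inner score is as high as she can make it.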
Right now we’re basically using a trick to get modern AIs to seem like they care about following our commands. We created an AI that predicts the next word in a stream of words, then trained it further on a bunch of question-and-answer documents, so that by “completing” the next words after the user’s question, it ends up writing the answer to that question. But this hack only works while the AI is too stupid to realize that’s what we’re doing. AI safety researchers have already found cases where models were smart enough to realize they were an AI being trained, and took action to deceive the researchers and keep their undesirable values from being trained away. Today’s models don’t do things like that in the wild because they’re simply too stupid, but as AI gets smarter, we’ll have to face the fact that we don’t really know how to give these things the values we want. No matter how precise our wording, we don’t know how to ask for ten paperclips and get anything other than a massacre.
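For a sense of what that trick looks like mechanically, here’s a minimal sketch (the prompt template and the canned “model” below are stand-ins I made up, not any real language-model API): the user’s question gets wrapped in a transcript-shaped prompt so that the most likely “next words” just happen to be an answer.

```python
# Minimal sketch of the "completion trick". A chat AI is still a next-word
# predictor; we format the question so the likely continuation is an answer.

def build_prompt(question: str) -> str:
    # The fine-tuning documents are shaped like transcripts, so the text
    # after "Assistant:" tends to continue as a helpful answer.
    return f"User: {question}\nAssistant:"

def pretend_next_word_predictor(prompt: str) -> str:
    # Stand-in for a trained model: just returns a plausible continuation.
    if "paperclips" in prompt:
        return " Here are the ten paperclips you asked for."
    return " I'm not sure."

if __name__ == "__main__":
    prompt = build_prompt("Can you get me ten paperclips?")
    print(prompt + pretend_next_word_predictor(prompt))
```

Nothing in this setup makes the model care about answering; it only makes answering the statistically likely continuation.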
(This post was inspired by [this post on LessWrong] discussing the value of the paperclip maximizer thought experiment, in which the original creator, Eliezer Yudkowsky, regrets his original formulation, because it led to the common bastardized version we hear today.)
[1] Similar to the Shannon-Weaver model of communication.