What if Peely had a secondary goal to not harm humans? What is stopping it from accomplishing goal number 1 in accordance with goal number 2? Why should we assume that a superintelligent entity would be incapable of holding multiple values?
A key question is if the typical goal-directed superintelligence would assign any significant value to humans. If it does, that greatly reduces the threat from superintelligence. We have a somewhat relevant article earlier in the sequence: AI's goals may not match ours.
BTW, if you're up for helping up improve the article, would you mind answering some questions? Like: do you feel like our article was "epistemically co-operative"? That is, do you think it helps readers orient themselves in the discussion on AI safety, makes the assumptions clear, and generally tries to explain rather than persuade? What's your general level of familiarity with AI Safety?
Context: This is a linkpost for https://aisafety.info/questions/NM3H/7:-Different-goals-may-bring-AI-into-conflict-with-us
This is an article in the new intro to AI safety series from AISafety.info. We'd appreciate any feedback. The most up-to-date version of this article is on our website.
Aligning the goals of AI systems with our intentions could be really hard. So suppose we fail, and build a powerful AI that pursues goals different from ours. You might think it would just go off and do its own thing, and we could try again with a new AI.
But unfortunately, according to the idea of instrumental convergence, almost any powerful system that optimizes the world will find it useful to pursue certain kinds of strategies. And these strategies can include working against anything that might interfere.
For example, as a thought experiment, consider Peelie, a robot that cares only about peeling oranges, always making whatever decision results in maximum peeled oranges. Peelie would see reasons to:
This is meant as an illustration, not a realistic scenario. People are unlikely to build a machine that’s powerful enough to do these things, and only make it care about one thing.
But the same basic idea applies to any strong optimizer with different goals from ours, even if they’re a lot more complex and harder to pin down. The optimizer can become more successful by contesting our control — at least, if it does so successfully.
And while people have proposed ways to address this problem, such as forbidding the AI from doing certain kinds of actions, the AI alignment research community doesn’t think the solutions we’ve considered so far are sufficient to contain superintelligent AI.
In science fiction, the disaster scenario with AI is often that it “wakes up” and becomes conscious, and starts hating humans and wanting us to be harmed. That’s not the main disaster scenario in real life. The real-life concern is simply that it will become very competent at planning, and remove us as a means to an end.
Once it’s much smarter than us, it may not find that hard to accomplish.