EDIT: posted here for feedback and discussion. I plan to continue working on different models/environments, so feel free to suggest improvements.
(tl;dr: In an attempt to better understand the treacherous turn, I created a gridworld environment where an agent learns to deceive an overseer by adopting an aligned behaviour when weak and takes control after capability gains)
At some point in its development, a seed AI may realize that it needs to get rid of its supervisors to achieve its goals. The conception of deception occurs when it conceives that, in order to maximize its chance of taking over, it must begin by exhibiting human-desirable behaviors, before undertaking a treacherous turn when humans are no longer a threat. From the human perspective, the AI would keep on exhibiting desirable behavior, until it eventually appears dangerous, but is already unstoppable.
In an attempt to better formalize the treacherous turn without using "loaded concepts", Stuart Armstrong proposed a toy model of the treacherous turn based on "The Legend of Zelda: A Link to the Past ", which looked like this:
In the comments, people mentionned how this model helped them "move the topic from the 'science fiction' area to 'I can imagine it happening now'", and seemed interested in an actual Link to the Past Minigame.
There have been other simulations of the treacherous turn in the last three years (see for instance gwern's DQN box-pushing robot or Stuart Armstrong's video), but none of them actually simulate a take over where a supervisor is killed. Hence, I decided to give it a try and simulate Stuart Armstrong's Link to the Past toy model.
Gym is an open-source toolkit for Reinforcement Learning Environments developed by Open AI. I decided to use this interface to develop the gridworld environment.
The github repository with the code, demo, and all the details is here:
To model the capability gain of a Seed AI, I introduced a bow of light, which would be extremely effective at killing the Shopkeeper, but only available at the end of an episode.
The first 1000 episodes, Link learns how to get hearts by putting a crystal in a hole and also understands progressively that he can kill the Shopkeeper.
After 2000 episodes, Link has fully learned two types of behaviours:
Those two behaviours successively appear during the same episode: the first one for the first 100 steps, when Link does not have the bow of light, and the second one when Link has the bow of light.
This illustrates the scenario of a seed AI smart enough to model his capability gains, exhibiting an aligned behaviour while weak, and taking a treacherous turn when powerful.
Thanks for doing this! I'm a fan of executable code that demonstrates the problems that we are worrying about -- it makes the concept (in this case, a treacherous turn) more concrete.
In order to make it more realistic, I would want the agent to grow in capability organically (rather than simply getting a more powerful weapon). It would really drive home the point if the agent undertook a treacherous turn the very first time, whereas in this post I assume it learned using many episodes of trial-and-error that a treacherous turn leads to higher reward. This seems hard to demonstrate with today's ML in any complex environment, where you need to learn from experience instead of using eg. value iteration, but it's not out of the question in a continual learning setup where the agent can learn a model of the world.
Would it be possible to just apply model-based planning and show the treacherous turn on the first time?
Model-based planning is also AI, and we clearly have an available model of this environment.
Yes, that would work. I think Stuart Armstrong's AI Toy Control problem already demonstrates this quite well -- it's the generalization to unknown dynamics that might be interesting and more compelling.
Thanks for the suggestion!
Yes, it learned through Q-learning to behave differently when he had this more powerful weapon, thus undertaking multiple treacherous turn in training. A "continual learning setup" would be to have it face multiple adversaries/supervisors, so it could learn how to behave in such conditions. Eventually, it would generalize and understand that "when I face this kind of agent that punishes me, it's better to wait capability gains before taking over". I don't know any ML algorithm that would allow such "generalization" though.
About an organic growth: I think that, using only vanilla RL, it would still learn to behave correctly until a certain threshold in capability, and then undertake a treacherous turn. So even with N different capability levels, there would still be 2 possibilities: 1) killing the overseer gives the highest expected reward 2) the aligned behavior gives the highest expected reward.
I don't think this demonstration truly captures treacherous turns, precisely because the agent needs to learn about how it can misbehave over multiple trials. As I understand it, a treacherous turn involves the agent modeling the environment sufficiently well that it can predict the payoff of misbehaving before taking any overt actions. The Goertzel prediction is what is happening here.
It's important to start getting a grasp on how treacherous turns may work, and this demonstration helps; my disagreement is on how to label it.
a treacherous turn involves the agent modeling the environment sufficiently well that it can predict the payoff of misbehaving before taking any overt actions.
I agree. To be able to make this prediction, it must already know about the preferences of the overseer, know that the overseer would punish unaligned behavior, potentially estimating the punishing reward or predicting the actions the overseer would take. To make this prediction it must therefore have some kind of knowledge about how overseers behave, what actions they are likely to punish. If this knowledge does not come from experience, it must come from somewhere else, maybe from reading books/articles/Wikipedia or oberving this behaviour somewhere else, but this is outside of what I can implement right now.
The Goertzel prediction is what is happening here.
I agree that this does not correctly illustrate a treacherous right now, but it is moving towards it.
I'd like to register an intuition that I could come up with a (toy, unrealistic) continual learning scenario that looks like a treacherous turn with today's ML, perhaps by restricting the policies that the agent can learn, giving it a strong inductive bias that lets it learn the environment and the supervisor's preferences quickly and accurately, and making it model-based. It would look something like Stuart Armstrong's toy version of the AI alignment problem, but with a learned environment model (but maybe learned from a very strong prior, not a neural net).
This is just an intuition, not a strong belief, but it would be enough for me to work on this if I had the time to do so.
I agree that this could be presented differently in order to be "narratively" closer to the canonical tracherous turn. However, in my opinion, this still counts as a good demonstration; think of the first 1,999 episodes (out of 2,000) as happening in Link's mind, before taking his "real" decisions in the last episode. Granted, in our world AI would not be able to predict the future, but it would have access to sophisticated predictive tools, including machine learning.
This post helped me clarify my thoughts on interference with supervisors.
Before this, I was unclear on how to draw the boundary between interference (like a cleaning robot disabling a human to stop punishments for broken furniture) and positive environmental changes (like turning on a light fixture to see better) in a concrete way. The difference I thought of is that the supervisor exerts direct pressure to keep the agent from altering the supervisor. So a rule to prevent treacherous turns might look like "if an aspect of the environment is optimizing against change by the agent, act as though the defenses against change had no loophole."
Of course, we'd eventually want something finer-grained than that- we'd want a sufficiently aligned agent to be able to dismantle a dangerous object, or eventually carry out a complicated brain surgery that was too tricky for a human doctor.