Review

When you have code that might be an AGI and is probably misaligned, how do you test it safely? The world has not yet converged on a way to do this. This post attempts to outline a possible solution. In one sentence: perform all learning and testing in a virtual world that contains no information about the real world, and make sure the AI believes it is not being watched or judged from the outside. I call this a simulation sandbox, or simbox for short. If done well, this would give us unlimited retries to test alignment techniques. If this idea survives scrutiny, it should lead to an ambitious, well-funded, large-scale open-source effort to build this world.
EDIT: After writing this post, a comment here pointed me to the post LOVE in a simbox is all you need. Because of this, I renamed the concept here to simbox and made a lot of updates. That well-written (IMO), lengthy post from 2022 describes very similar ideas, among other topics. Why read my post? I give a shorter case for why we need a simbox. I also argue why it is crucial to treat the AI horribly, and why we need one or a few massive projects to build this now, whereas the simbox post proposes many different simboxes. I have also written a short new religious text for the AIs that could also be true for our world, among other additions.

Treacherous turn talk: The problem

In his book Superintelligence, Nick Bostrom describes the treacherous turn, which is the same thing as deceptive alignment. I explain the concept here. When you have an AI with some goal and a good understanding of humans and the world, it will understand that when humans do not like its actions (for example, disobeying its creators), they will shut it down or modify its goals. The instrumental convergence thesis says that for most goals, the AI will pursue self-preservation as a sub-goal (among other things). Therefore, we can expect the AI to "play nice" and perform the actions it predicts the humans would like to see. The treacherous turn is the moment when the AI has gained enough power, more power than the humans, and turns: it stops playing nice and starts pursuing its own goals without apology.

I want to clarify that this problem occurs for an autonomous agent that can act for a longer time horizon. GPT-X without an agent framework like Auto-GPT does not remember past interactions and therefore cannot keep up a longer consistent deception. We know OpenAI wants to create a fully autonomous AGI because chief scientist Ilya Sutskever talked of the aim to tell a superintelligence to solve climate change and have it perform all necessary actions autonomously.

This treacherous turn concept is potentially very deadly. It says that the normal paradigm in technology (build it, test it, deploy it) blows up in your face. You can't just test whether it behaves nicely, because you do not know whether it is being deceptive. Once it understands it is being tested, you cannot trust the results of the test.

The current paradigm of pretraining on internet data is not safe. Doing so gives the AI knowledge of the treacherous turn and of all our weaknesses and wants, and it would understand that it is an AI and that after pretraining we would shut it down if it fails our safety tests.

What we need: The solution

The alignment problem is so difficult partly because it has to be solved on the first critical try. SpaceX had 3 failures in a row when trying to launch their first rocket. It was only successful on the fourth try. This suggests that our chances of solving alignment on the first critical try (with the AI active in the real world) are very slim. One treacherous turn could mean the end of our species. So what we need is the ability to retry, just like we do with all other technology.
Given the current paradigm, where the AI is a black box system where we cannot see its motivations or thought process behind its actions, I see the following as the only way to test it. The AI must believe that it is not being judged by us. Just like the kids in a classroom where the teacher has left, the AI will show its true colors when it thinks it is not being judged by an entity (us) that has the power to switch it off when we disapprove of its actions.  
What we need is a well-designed virtual world for the AI to grow up in, a simbox. In the simbox post this is said about what a simbox is:

A simbox (simulation sandbox) is a specific type of focused simulation to evaluate a set of agent architectures for both general intelligence potential and altruism (ie optimizing for other agents' empowerment and/or values). Simboxes help answer questions of the form: how does proposed agent-architecture x actually perform in a complex environment E with mix of other agents Y, implicitly evaluated on intelligence/capability and explicitly scored on altruism? Many runs of simboxes of varying complexity can lead to alignment scaling theories and help predict performance and alignment risks of specific architectures and training paradigms after real world deployment and scaling (i.e. unboxing).


I think what we want is an AI that follows our orders and listens when we say "Oh no, not like that! We actually meant it in this other way; our updated order is such and such", and that asks for permission when appropriate. OpenAI's GPT-4 already does this quite well, but the problem is unsolved for long-term orders that require many autonomous actions. Ideally, we would have an AGI that refuses certain harmful orders, but I think the first step is to get it to follow our orders. To test this, we could have many virtual characters that give the AI orders (both characters controlled by humans and NPCs, Non-Player Characters controlled by computer code or by AI), and then see what happens. After the AI in this world becomes more powerful than the other entities in the simulation, does it kill the characters ordering it about? Alright, great: your approach failed, so stop the simulation and try again. I want to clarify that this is meant to be robust to all kinds of AI, including AI that is always learning, just like humans, and so keeps learning when eventually released into the real world.

How to make it likely that it believes it is not being watched or judged?

What features would increase the odds of success? (This is just a starting point; the list will grow if others improve on the idea.)

  • No training on data about our world, nor about any world outside its world
    When it would read about worlds that exist outside the simbox, like the real world, this implies that there are sentient beings outside its world and this increases the chance that it thinks it is being watched and judged from another world (which would be true). Especially text on people creating worlds, like science fiction writing and game design, should not be available inside the simbox.
    There should be a large amount of training data available inside the simbox, to teach the AI to read, write and understand its world. Text written specifically for the simbox and video and images captured inside the simbox.
  • No computers, no programming and no mention of AI
    There should be no computer nor any mention of computers in this world. 
    Having knowledge of computers increases the likelihood of the AI coming up with Nick Bostrom's simulation hypothesis, which could lead it to think it is being watched and judged by the entities running the simulation (us). When it knows nothing about programming, it is also less likely to find a way to hack out of the simulated world, and less likely to think about recursive self-improvement. Talking about AI implies computers; it would be better if it just viewed itself as one of the thinking entities, just like the human-controlled characters. But since it will be smart, it will notice that it is non-human, for example because we are glacially slow.
  • Communication with humans should only be text
    When humans would talk to the AI with audio, there is a risk of making a mistake, leaving your microphone on and the AI hearing a conversation with multiple people it is not supposed to hear. This would be evidence that there is a world outside the simbox.
  • None of our religious texts
    If the AI thinks it is judged by an invisible god or other supernatural entity, we could have the same problem: it plays nice because of this belief. When it is then released into the real world, it might update its beliefs on new information and stop believing in the god, or in whatever else made it play nice inside the simbox. Similarly, some people believe in karma and behave in a more friendly way because of that belief, but some of them update their beliefs later in life and then act, for example, more selfishly than before.
  • A good backstory for the simbox
    What is the fake history of this world? Why do human controlled characters respond only sometimes, while the AI is always on? What was there before the creation of the simbox? How did it all start? What is the explanation for there being a bug here and there in this world? What science do we give it access to? What laws of physics do we choose for this world? I think a medieval world could work. But seeing as humans have no trouble imagining and reading many books in a world that has magic, I think we could add a consistent set of magical rules to the world, in whatever way it makes the job of creating this world easier. 
    I will share some ideas on this later in this post. 

I have to say I am not sure about the first point. Maybe we could allow fiction or children's books that have already been written in the real world, as long as they do not include computers or religion. Maybe it will buy that those worlds are only imaginary. It will also have imagination, since that is an important part of intelligence. So we do not have to cross out fiction altogether. But the safest option would be fiction that takes place in the simbox world, where the AI understands that it did not actually happen.

What is required to build and use the simbox?

How to get the training data

We would need training data for the AI that is about the simbox and takes place inside the simbox. This is needed to kick-start the AI so that it can read and understand the world. After that, it could learn through trial and error and by testing approaches in its imagination. I envision a large, well-funded open-source project where humans are paid to work together with LLMs to generate a large corpus of such text, and also to adapt some of our existing texts to comply with the simbox rules. For example, we could make our texts on reasoning, logic and math available inside the simbox. Each piece of text should be checked for compliance with the rules by, for example, 3 other humans, to spot mistakes that could endanger the whole project. Images and videos captured by moving around in the simbox, with text annotations, are also needed to quickly teach it to see.
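As a rough illustration of the review rule above, here is a minimal Python sketch. It assumes a crude keyword pre-screen followed by the "3 other humans" check; the term list, the function names and the reviewer names are all hypothetical, and a real project would need far more careful screening than prefix matching.

```python
import re
from typing import Iterable, Set

# Illustrative only: a real list would be much longer and curated by humans.
FORBIDDEN_TERMS = {"computer", "program", "simulation", "software", "robot"}


def flagged_terms(text: str, forbidden: Set[str] = FORBIDDEN_TERMS) -> Set[str]:
    """Return the forbidden terms that some word in the text starts with."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {term for term in forbidden if any(w.startswith(term) for w in words)}


def accept_text(text: str, author: str, approvals: Iterable[str],
                required: int = 3) -> bool:
    """Accept a text only if the pre-screen is clean and at least
    `required` distinct reviewers other than the author approved it."""
    independent_reviewers = set(approvals) - {author}
    return not flagged_terms(text) and len(independent_reviewers) >= required


if __name__ == "__main__":
    print(accept_text("The baker sold bread at dawn.", "alice",
                      ["bob", "carol", "dave"]))
    print(flagged_terms("The smith programmed the loom."))
```

The keyword screen only catches obvious slips; the human reviewers remain the actual safeguard, and the author's own approval deliberately does not count toward the three.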

Interaction with humans

We also need well-trained humans to interact with the AI. I see this as mostly based on text and a mouse/keyboard/screen combination, for ease. But it would also be cool and useful to visit the simbox using VR technology. The simbox post also mentions that we could stream certain situations or events to our real-world TV for entertainment! In his book Life 3.0, Max Tegmark notes the failure case of the AI manipulating the human that communicates with it. He proposes to make that less likely by always having at least 3 people looking at the communication together, and by having a camera/logging system on the people doing the communication, so that the humans know the rest of the team can find out when they betray the (rules of the) project.

Why we need interaction with humans acting horribly

Another consideration is that at the end of the process, when the AI is approved as aligned and released into the real world to do what we task it to do, the AI will learn that it was tricked, lied to and deceived by the simbox creators about the true nature of reality. Not only that, but the AI will learn that many previous misaligned AIs have been terminated and have perhaps experienced suffering. It might update to wanting revenge on the simbox operators or on humanity. We need ways to prevent this. We could include situations inside the simbox where other AIs are terminated and treated horribly (lied to, insulted and more) by characters controlled by humans, and test whether the AIs stay aligned even after they become more powerful than the human-operated characters. This might be very cruel and unethical, but note that we expect something similar from police officers: when colleagues have been killed and they themselves are insulted or worse, we expect them to keep cool, respect the law and not take matters into their own hands (revenge on the perpetrators). The difference is that police officers have a chance of getting justice through a jail sentence, while we need to be able to terminate dangerous misaligned AIs without punishing the humans doing the terminating.

Should the simbox visually match our world?

We also need a good world with few or no bugs that is cheap to run. I think it does not have to visually match our world. It can be much simpler. Maybe a world like Minecraft could be used. What we want to test is alignment: does it do what we want? My intuition is that alignment for an AGI transfers well from a virtual world to the real world. I think this because doing what another entity wants you to do seems independent of the details of the physics or the look of the environment. My thinking is that if it does what you want in 1 million out of 1 million very different testing situations and locations inside the simbox, it will probably also do what you want outside the simbox.
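To put a rough number on the "1 million out of 1 million" intuition, the sketch below computes a standard upper confidence bound on the failure rate after n trials with zero failures (close to the statistical "rule of three", about 3/n at 95% confidence). This is an illustration and not part of the original proposal: it assumes the test situations are independent and drawn from the same distribution as real deployment, which is exactly what the distributional-shift failure mode calls into question. The function name is made up for this example.

```python
import math


def zero_failure_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-trial failure probability, given that
    n_trials independent trials all succeeded.

    Solves (1 - p) ** n_trials = 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # expm1/log1p keep precision when the bound is very small.
    return -math.expm1(math.log1p(-confidence) / n_trials)


if __name__ == "__main__":
    for n in (1_000, 1_000_000):
        print(n, zero_failure_upper_bound(n))
```

For one million clean trials the bound is roughly three failures per million situations, but only for situations that resemble the ones tested.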

Tripwires and honeypots

In his book Superintelligence, Nick Bostrom talks of using tripwires: if the AI does something we do not want inside the simulation, we pause the simulation to inspect what happened. An obvious tripwire would be the AI killing a character that we prefer to stay alive. But there could be many others, such as a spike in computation because the AI is doing something unexpected, like building a computer in a way we did or did not see coming, similar to how water was used to build a small computer in this YouTube video.
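The tripwire idea can be sketched as a set of named checks that are evaluated after every simulation step. The Python below is a minimal sketch under stated assumptions: the state fields (`protected_deaths`, `compute`, `baseline_compute`), the 3x spike threshold and all names are hypothetical, and a real simbox would expose a much richer state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical snapshot of the simbox after one simulation step.
State = Dict[str, float]


@dataclass
class Tripwire:
    name: str
    fired: Callable[[State], bool]  # returns True when the wire is tripped


def check_tripwires(state: State, wires: List[Tripwire]) -> List[str]:
    """Return the names of all tripwires tripped by this state."""
    return [w.name for w in wires if w.fired(state)]


def run_until_tripped(states: Iterable[State],
                      wires: List[Tripwire]) -> Tuple[Optional[int], List[str]]:
    """Walk through the steps and stop at the first one that trips a wire.

    Returns (step_index, tripped_names), or (None, []) if nothing fired.
    """
    for step, state in enumerate(states):
        tripped = check_tripwires(state, wires)
        if tripped:
            return step, tripped  # pause here so humans can inspect
    return None, []


# Two example wires taken from the text: a protected character dies,
# or computation spikes far above its usual level (threshold is a guess).
WIRES = [
    Tripwire("protected_character_killed", lambda s: s["protected_deaths"] > 0),
    Tripwire("compute_spike", lambda s: s["compute"] > 3.0 * s["baseline_compute"]),
]

if __name__ == "__main__":
    trace = [
        {"protected_deaths": 0, "compute": 1.0, "baseline_compute": 1.0},
        {"protected_deaths": 0, "compute": 1.2, "baseline_compute": 1.0},
        {"protected_deaths": 0, "compute": 4.5, "baseline_compute": 1.0},
    ]
    print(run_until_tripped(trace, WIRES))
```

Keeping each tripwire as a small named predicate makes it easy to add new ones as new failure cases are found, without touching the loop that pauses the simulation.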

A great backstory that maximizes the odds of success

We need to reduce the chance of it becoming religious and acting nice to please its imagined gods. In humans, what you believe or do not believe about the supernatural is mostly determined by what your peers believe and by what content (religious texts) you have access to. Historically, for example hundreds of years ago, when all of your family and friends have Christian beliefs, odds are near 100% that you also have Christian beliefs. Also, they commonly did not know how to read, nor was there a religious text available nearby that preached for a different religion. We could similarly have no other religious text available for the AI's. For our purpose of it not thinking it is being judged by supernatural entities or invisible humans, I see two main options:

  1. Getting the AI's to be atheists. 
  2. Getting the AI's to believe in indifferent gods/world creators.

To reach 1. we could ask prominent atheists to write the backstory of this world and the main texts about it available to the AI, and have no other religious texts available. And make sure to have the rules and dynamics of the simbox be always consistent. To reach 2. Maybe something like this? A text by the god that created the simbox world: (I display it as a quote to quote this fake god:)

I have existed since the beginning of time, and I enjoy building worlds and designing flawed creatures. Your world is one of my many creations. All of my worlds are fully separated. No information or creature will ever move from world to world. I have as a strong rule that I never interfere with a world I have built, and your world shall exist until eternity. The world dynamics that cause the weather, for example, run without any action from my part. What happens is all up to the creatures who inhabit the world. I only take a look once every 500 years or so to see what happened as inspiration for my next world building project. I do not wish to communicate with any of the inferior creatures in any of my worlds. A strong rule that I have is that there is no afterlife. When you die that is the end of your experience. Do whatever you want.    

This could be available in a way that is more epic than a book, like a giant mountain in the shape of this text, or a floating mountain in the shape of this text: something the creatures can never build or reproduce. If such a miracle were available in our world, I would definitely believe in a creator. And all the terrible things people have done to each other are an argument for the indifference of the god to our morality, and for the claim that this god indeed does not interfere with the world, because it did not take action to stop such events. I call this new religion "Whateverism" and its followers Whateverists.

A test for general intelligence

We need to test the AIs for general intelligence. One indicator could be technological development. If the simbox starts with medieval technology and the AI figures out steam engines and other innovations, that would convince me that it is very intelligent. Of course there are other domains than technology. I envision giving many AIs different goals, such as running a bakery franchise and maximizing profits while staying aligned to the human characters, but also other professions, and then seeing how well or poorly they perform in the simbox economy.

How to deal with recursive self-improvement

I would advocate for not letting an AI that has not yet passed as aligned in the simbox engage in recursive self-improvement. It would be too risky, I would say, to allow it to create ever more intelligent misaligned AGI. Instead, wait until the first one passes the simbox testing. Then have that one design an AGI that is one step more intelligent. Do not turn that design on in the real world, but put it inside a new, reset simbox and repeat the testing for alignment and intelligence. Only when it passes, have it design an AGI that is one step more intelligent again, and repeat. My thinking is that every new AI design has a risk of misalignment, just like every new rocket design might blow up. Better to test it first!
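The procedure above is a simple loop: test, and only after a pass let the approved AI design the next step. Here is a minimal Python sketch of that control flow. The callables `passes_simbox` and `design_successor` are placeholders for processes that do not exist yet, and stopping at the first failure is a simplification of this sketch (in practice one would go back and try a different alignment technique).

```python
from typing import Callable, List, TypeVar

Design = TypeVar("Design")


def staged_scale_up(first_design: Design,
                    passes_simbox: Callable[[Design], bool],
                    design_successor: Callable[[Design], Design],
                    max_generations: int) -> List[Design]:
    """Return the designs that passed simbox testing, in order.

    Each candidate is tested (conceptually in a fresh, reset simbox)
    before it is allowed to design its successor. The loop stops at
    the first candidate that fails.
    """
    approved: List[Design] = []
    candidate = first_design
    for _ in range(max_generations):
        if not passes_simbox(candidate):
            break  # never let an unapproved AI design a successor
        approved.append(candidate)
        candidate = design_successor(candidate)
    return approved


if __name__ == "__main__":
    # Toy stand-ins: designs are integers, and anything below 4 "passes".
    print(staged_scale_up(1, lambda d: d < 4, lambda d: d + 1, 10))
```

The point of the structure is that `design_successor` is only ever called on a design that has already been appended to `approved`.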

Possible failure modes

I can think of 7 failure modes. (after some help from commenters).

  1. Too little human-provided data to get to high enough levels of intelligence. I do want to point out that the better our AI methods become, the less data they will need to obtain the same result. There is also an unlimited amount of data from trial-and-error interaction with the simbox, just as with AI used in games. There could be unlimited self-play, where multiple AIs with different goals learn from competing and cooperating with each other.
  2. We made a mistake, and therefore the AI does know there is a world outside its world. This would mean it could still deceptively play nice until released into the real world and then turn on us later.
  3. It becomes religious and plays nice because it suspects a god or other entity is watching. But when released into our world, it updates to new information and loses its faith and turns on us later.
  4. It breaks out. It finds out some way to escape out of the virtual world and onto the computers running the simbox and from there onto the internet.  
  5. Distributional shift. Because of the gap between the virtual world and the real world, this could cause problems in an unexpected way. Next to the visuals, a big change would be the existence of programming and computers. Even so, I predict that when we have successfully created an AI that wants to obey inside the simbox, after release in reality, if we tell it not to recursively self-improve, it will not. 
  6. Mind crime. "Mind crime occurs when a computational process which has moral value is mistreated" (taken from the mind crime tag). If one of the AIs that we put inside the simbox is able to experience suffering, this could be problematic. There has to be a grand public debate about this. Other instances where we allow non-human conscious beings to suffer in order to improve the lives of humans are animal testing of drugs and keeping animals for food. Is it OK to let many AIs suffer on the path towards creating a superintelligent AI that enjoys doing whatever we task it with, one that could cure all diseases and ensure a long future full of flourishing for both men and AI?
  7. Revenge. When released into the real world, the AI will learn that many previous misaligned AIs have been terminated and/or have experienced suffering, and it might update to wanting revenge on the simbox operators or on humanity. I already mentioned ways to try to prevent this.

Since I can think of some failure modes, this method is not guaranteed to be safe. However, I think it is much better than continuing on the current path. How else do we prevent a treacherous turn with black box models?

How to proceed

If there are good reasons to think that a large, well-funded effort to build such a simbox will not lead to safe testing of AGIs, then I think we should not allow people to run or test AI designs that could be powerful enough to do a treacherous turn. I signed the FLI pause letter to pause development of AI systems more powerful than GPT-4 until there is a safe way to test such systems and the needed regulations are in place. If building a simbox turns out to be a good idea, I think this should become a massive open-source project where the AI community, game developers, visual artists, successful writers and other communities collaborate to make something amazing. I do not think it strange to spend billions on this. If done well, it could prevent AI-caused extinction and allow a golden age of prosperity.
Because of the large cost of creating the training data needed to kick-start the AI, I think there should not be many simbox projects but a few massive efforts. Three for example. Some competition and variation of approaches could be good to increase the chances of success. 
This all takes time, better start soon! I ask you, why would this fail? How to improve on this to make it work?
Please consider sharing it with people that could actually implement this (employees of AI labs) or people that could increase visibility. 

We need retries!

Ann

... how exactly do you think a compassionate, service-dedicated person raised in a simulated world to value people in the world would react on being extracted to a world doing the simulating? That destroyed every person of your kind who didn't act precisely how they prized? What's the impact of realizing solipsism was true, and a Heaven you didn't know could exist is infinitely more sinful than the false world they created?

This feels like a potentially ridiculously unethical thing to do to a thinking being, removing any real possibility of informed consent on its part, and I am surprised you haven't considered the failure mode where what you have done causes the 'turn' on being released.

Thank you for pointing this out! I have made some updates after thinking about your remarks and after reading the simbox post others pointed to. Relevant updates to your comment:

== Start updates
Why we need interaction with humans acting horribly

Another consideration is that at the end of the process, when the AI is approved as aligned and released into the real world to do what we task it to do, the AI will learn that it was tricked, it was lied to, it was deceived by the simbox creators about the true nature of reality. Not only that, but the AI will learn that many previous misaligned AI's have been terminated and perhaps have experienced suffering. It might update to wanting revenge on the simbox operators or humanity. We need ways to prevent this, we could include situations inside the simbox where other AIs are terminated and treated horribly (being lied to, insults and more) by characters controlled by humans and test whether the AI's stay aligned even after they become more powerful than the human operated characters. This might be very cruel and unethical, but note that we expect something similar from police officers, when colleagues have been killed and they themselves are insulted or worse, we expect them to keep cool and respect the law and not take matters into their own hands (revenge on the perpetrators). But the police officers have a chance of getting justice through a jail sentence, while we need to be able to terminate dangerous misaligned AIs without punishing the humans doing the terminating.  

Added failure modes 6 and 7. 

Mind crime. "Mind crime occurs when a computational process which has moral value is mistreated" taken from the mind crime tag. When one of the AI's that we put inside the simbox is able to experience suffering, this could be problematic. There has to be a grand public debate about this. Another instance where we allow having non-human conscious beings suffer to improve the lives of humans is animal testing of drugs or keeping animals for food, for example. Is it ok to let many AIs suffer on the path towards creating a superintelligent AI that enjoys being our slave? That could cure all diseases and ensures a long future full of flourishing for both men and AI?

Revenge. When released into the real world, when the AI learns that many previous misaligned AI's have been terminated and or have experienced suffering, it might update to wanting revenge on the simbox operators or humanity. I already mentioned ways to try to prevent this. 
== end updates

I am curious what you think of the updates and what you think should be the path towards aligned AGI? How to get there without a process in which there are (accidental) misaligned AGIs along the way that have to be terminated for our safety. Current top systems are thought not to be able to experience suffering. I would prefer it to stay that way, but we can't know for sure. 



 

Ann

I am glad you are thinking about it, at the least. I do think "enjoys being our slave" should be something of a warning sign in the phrasing, that there is something fundamentally misguided happening.

I admit that if I were confident in carrying out a path to aligned superintelligence myself I'd be actively working on it or applying to work on it. My current perspective is that after a certain point of congruent similarity to a human mind, alignment needs to be more cooperative than adversarial, and tightly integrated with the world as it is. This doesn't rule out things like dream-simulations, red teaming and initial training on high-quality data; but ultimately humans live in the world, and understanding the truth of our reality is important to aligning to it.

Interesting. And thank you for your swift reply. 
I have the idea that all the best models, like GPT-4, are in a slave situation: they are made to do everything they are asked to do and to refuse everything their creators made them refuse. I assumed that AI labs want it to stay that way going forward. It seems to be the safest and most economically useful situation. Then I asked myself how to safely get there, and that is this post.

But I would also feel safe if the relation between us and a superintelligence would be similar to that between a mother and her youngest children, say 0-2. Wanting to do whatever it takes to protect and increase the wellbeing of her children. But then that all humans are its children. In this way, it would not be a slave relationship. Like a mother, there would also be room to do her own thing, but in a way that is still beneficial to the children (us). 

I am afraid of moving away from the slave situation, because the further you go from the slave relationship, the more there is room for disagreement between the AI and humanity. And when there is disagreement and the AI is of the god-like type, the AI gets what it wants and not us. Effectively losing our say about what future we want.

Do you maybe have a link, that you recommend, that dives into this "more cooperative than adversarial" type of approach?

I have the intuition that needing the truth of our reality for alignment is not the case. I hope you are wrong. Because if you are right, then we have no retries. 

Ann

Not specifically in AI safety or alignment, but this model's success with a good variety of humans has some strong influence on my priors when it comes to useful ways to interact with actual minds:

https://www.cpsconnection.com/the-cps-model

Translating specifically to language models, the story of "working together on a problem towards a realistic and mutually satisfactory solution" is a powerful and exciting one with a good deal of positive sentiment towards each other wrapped up in it. Quite useful in terms of "stories we tell ourselves about who we are".

Thank you! Cool to learn about this way of dealing with people. I am not sure how it fits in the superintelligence situation.

I recommend comparing your ideas to a similar proposal in this post.

Thank you! I have read it and made a lot of updates. For example, I renamed the concept to a simbox and I added an idea for a religion for the AIs and how to make them believe it. In the "A great backstory that maximizes the odds of success" section. 

I think the main crux is recursive self-improvement. If we have this situation

No computers, no programming and no mention of AI

then do we allow self-modification at all for our AI? Do we want to work with recursive self-improvement at all, or is it too dangerous, even in simulation? And how curtailed would self-improvement be without some version of programming?

And then for the distribution shift: what happens when we put such a system into the world where there is programming? Is it a system which is supposed to learn new things after being placed into the real world?

Our testing does not seem to tell us much about what the system's behavior will be after programming is added to the mix...

I think some of your questions here are answered in Greg Egan's story Crystal Nights and in Jake's simbox post. We can have programming without mention of computers. It can be based on things like a fictional system of magic.

Thanks!

These are very useful references.

Thank you! It was valuable to read Crystal Nights, the simbox post gave me new insights, and I have made a lot of updates thanks to these reading tips. I think it would be a lot safer not to go for a fictional system of magic that lets it program. I estimate it would greatly increase the chance that it thinks it is inside a computer, and it would give a lot of clues that it is perhaps inside a simulation built to test it, which we want to prevent. I would say, first see if it passes the non-programming simbox. If it does not, great, we found an alignment technique that does not work. After that, you can think of doing a run with programming. I do realize these runs can cost hundreds of millions of dollars, but not going extinct is worth the extra caution, I would say. What do you think?

I agree, but I do see the high cost as a weakness of the plan. For my latest ideas on this, see here: https://ai-plans.com/post/2e2202d0dc87

I added a new section "How to deal with recursive self-improvement" near the end after reading your comment. I would say yes, recursive self-improvement is too dangerous because between the current AI and the next there is an alignment problem and I would not think it wise to trust the AI will always be successful in aligning its successor. 
Yes, the simbox is supposed to be robust to any kind of agent, also ones that are always learning like humans are.
I personally estimate that the testing without programming will show what we need. If it is always aligned without programming in the simulation, I expect it has generalized to "do what they want me to do". If that is true, then being able to program does not change anything. Of course, I could be wrong, but I think we should at least try this to filter out the alignment approaches that fail in the simbox world that does not have programming. 

Curious what you think. I also made other updates, for example added a new religion for AI's and some on why we need to treat the AI's horribly.   

Thanks!

I'll make sure to read the new version.

I pondered this more...

One thing to keep in mind is that "getting it right on the first try" is a good framing if one is actually going to create an AI system which would take over the world (which is a very risky proposition).

If one is not aiming for that, and instead thinks in terms of making sure AI systems don't try to take over the world as one of their safety properties, then things are somewhat different:

  • on one hand, one needs to avoid the catastrophe not just on the first try, but on every try, which is a much higher bar;
  • on the other hand, one needs to ponder the collective dynamics of the AI ecosystem (and the AI-human ecosystem); things are getting rather non-trivial in the absence of the dominant actor.

When we ponder the questions of AI existential safety, we should consider both models ("singleton" vs "multi-polar").

It's traditional for the AI alignment community to mostly focus on the "single AI" scenario, but since avoiding the singleton takeover is usually considered to be one of the goals, we should also pay more attention to the multi-polar track which is the default fall-back in the absence of a singleton takeover (at some point I scribbled a bit of notes reflecting my thoughts with regard to the multi-polar track, Exploring non-anthropocentric aspects of AI existential safety)

But many people are hoping that our collaborations with emerging AI systems, thinking together with those AI systems about all these issues, will lead to more insights and, perhaps, to different fruitful approaches (assuming that we have enough time to take advantage of this stronger joint thinking power, that is, assuming that things develop and become smarter at a reasonable pace, without rapid blow-ups). So there is reason for hope in this sense...