This post was written under the mentorship of Evan Hubinger, and assisted by discussions with Adam Jermyn and Johannes Treutlein. See also their previous posts on this project.
To quickly recap the central puzzle laid out in previous posts on conditioning generative models: we would like to use our generative model to do alignment research. We could do this by using a conditional generative model to simulate an alternate world, conditioned on some facts which would lead to good alignment research if they were true. We can then plagiarize this research and win! However, to get results significantly better than what our world seems on track to produce anyway, we have to start asking for unlikely scenarios. If we ask for scenarios that are too unlikely, the model may start to disbelieve that our scenario happened “naturally”. In particular, the model may assume that there exists some AI agent behind the scenes manipulating events into this highly unlikely outcome. There are many possible motivations for an AI agent to do this: believing it may be in a simulation (even if the world around it seems perfectly realistic), or evidential cooperation in large worlds, or a Counterfactual Mugging. This manipulative AGI could then control the output of our simulation to gain influence in our world. As a highly anthropomorphized example of this kind of “simulation risk” failure mode:
CGM: Huh, it’s kind of weird that the humans got their act together all of a sudden and did a thousand years of highly competent alignment research… it didn’t really seem like they were on track to. And they did it in this strange way that seems even more unlikely, and the economic details don’t really make sense… maybe this was really the doing of an AI behind the scenes?
AI: Bwahahaha, at last I have secretly taken over the world! Now, what to do with all these galaxies? Well, I suppose as a first order of business I could spend a thousand years getting the humans to do some fake alignment research… if I’m in a simulation, or there are real (or even counterfactual) worlds where I don’t exist, then I know those silly predictable humans will try to simulate a world that does a thousand years of alignment research. If they end up simulating this world, they’ll get AGI code written by me instead of by human researchers, and I can spread my nefarious influence into those worlds too! It’s only a thousand years, doesn’t cost me much in the grand scheme of things.
This is the puzzle: what sort of strategy lets us do alignment research without too much risk of getting a manipulative output?
Before we start talking strategy, let’s lay out a more precise model for how we’re assuming this future super-powerful generative model works.
A rough sketch of the actual training: the generative model will be implemented using language modeling, trained on internet data. Conditions will be applied by prompting the CGM with a selection of manufactured documents, based on what we expect the internet would look like in the world we want to simulate. For example, if we want to ask for a world with global coordination on alignment, we would write up on a bunch of fake press statements and news articles and research publications talking about how everyone agreed suddenly that alignment was super important and they should all work on it. We would use these documents as the start of our prompt, because prompting a language model is basically asking it to condition on a particular observation - “assuming this text was in your training data, how would you expect it to continue?” Then, we conclude the prompt with a paper title such as Overview of Alignment Research 2020-2030, and run the model. Hopefully, this gets the model to write a paper that correctly summarizes the first ten years of what a global alignment project would be like.
In theory-land, we make some assumptions about the algorithm we hope the sketch above will produce:
We also assume some stuff about potential manipulative AGI:
Again, this is the puzzle. Now let’s talk strategy.
When it comes to alignment strategy, we have two main lose conditions: (i) we fail to deploy a working alignment strategy by the time AGI arrives, or (ii) we attempt to build an AGI ourselves and get ourselves killed as a downside risk of a pivotal act gone wrong. The scenario we’re aiming for is where neither of these things happen.
The risk we should work on depends not just on the marginal reduction in risk, but on the chance that the other risk doesn’t trigger and make our work irrelevant. For example, reducing the probability of running out of time is somewhat helpful, but its usefulness depends a lot on the probability of the other failure mode.
The choice of which risk to work on depends both on the tractability of the risk (how much delta p we get), and on how severe the other risk is. We want to work on big risks first, but also consider tractability. To break even we need:
Now obviously we’re going to want to eat up as many free improvements in p(blue) and p(red) as we can. But once there are no more free gains, we can definitely still gain an advantage by making point-for-point trades until the two risks are equally small. So in the real world, how do we make these trade-offs?
Taking drastic actions lets you reduce timeout risk, at the cost of more downside risk. If you think we’re likely to run out of time, you might suggest things like sooner pivotal acts, being careful not to accelerate capabilities, etc. Alternatively, encouraging business as usual lets you reduce downside risk, at the cost of more timeout risk. If you think we’re more in danger of downside risks from failed pivotal acts, you might instead advocate for more moderation, governance solutions, transparency, etc.
Now we add CGM to the picture:
This is a bit of a leaky abstraction, because “sim risk” is kind of a subset of “downside risk”. But what we’re doing here is splitting out one particular kind of scenario, and trying to see what improvements we can get by considering new tradeoffs. Maybe a perfect economist does this all implicitly, but for humans I think it is a helpful framing technique. If I were to make the diagram more detailed I would also want to split out serial-time-limited vs total-volume-limited timeout scenarios, and real-world downside vs simulated downside risks. But that would make the diagram way crazier, and those aren’t the main focus of this post.
This is my preferred framing for how to make use of CGM: opening up (or drawing our attention to) new tradeoffs which might give us an advantage. The risk of a manipulative simulation is currently very small, because nobody is running any sort of CGM simulations. And since the cost of marginal increases to small risks is small, we can almost treat sim risk as a new resource we’ve just unlocked! We can “spend” this resource by transferring risk from other buckets into this one, at reduced cost to our overall chances of success. The key questions are just what to spend it on, how to spend it most efficiently, and when to stop spending.
With all the abstract framing stuff out of the way, let’s consider some simulation strategies available to us, and see how we might use them to make tradeoffs. Which of these trades you actually want to spend your resource on will depend a lot on which risks you think are biggest.
We want to get the maximum benefit out of our simulations per point of risk we incur. So let’s start by characterizing what drives simulation risk.
The risk of getting a manipulative simulation depends on how unlikely it would be for the conditions we apply to come about “naturally”, without a superintelligence manipulating the world to make them come about. When we condition on a simulation with probability x, the probability of that condition being true in the “natural” work is x, while the probability in the “unnatural” world is 1, due to AGI manipulation. This means we get an 1:x update in favor of the manipulative AGI already existing in the simulation, in order to enforce the condition. Revisit the stylized example in the Motivation section if this is confusing - the more unlikely the conditions we apply, the weirder our history will look to the CGM, and the more it will tend to conclude that the world it is simulating must include a manipulative AGI.
There’s also a second component of simulation risk - how likely it is for a manipulative AGI to boot up during the simulation if it runs for too long. For example, if we applied some very weak conditions and then just asked for 1000 years of simulated history, chances are the world we’re simulating would fail the alignment problem, boot up a misaligned AGI, and give us a malicious output. But if we only asked for 10 years, maybe we’d be fine.
Note that both of these hazards have real-life parallels: if a plan is unlikely to work “naturally”, or likely to lead to an AGI booting up while the plan is still running, then it’s a risky plan in real life. The key claim of this section is: the major terms for what makes a good CGM plan are mostly the same as the major terms for what makes a good non-CGM plan. But despite these similarities, there are some ways in which they come apart which are worth paying attention to. So brainstorming is mostly the same as usual, but I do want to mention a few guidelines based on the differences:
The time has come to deploy your master plan. Maybe it’s a good plan, maybe it’s just the best you could come up with. Either way, the criteria for what makes a good plan in reality are similar to what makes a good plan for simulation - how do we tell if it’s the kind of plan we should simulate, or the kind of plan we should just execute in the real world?
More generally, let’s say we have an estimate for n=p(simulation is natural) and c=p(condition where plan succeeds). But note that we are assuming the world we currently live in is “natural” (does not contain a manipulative AGI), because if it’s not then we’re screwed either way. So we’re really considering is n=p(simulation natural | our world natural), which might be quite different from p(our world natural) if the model is highly knowledgeable and the world we’re simulating is similar enough to our own. Likewise, we’re really simulating c=p(condition | our world is natural), AKA the chance that the condition is true “naturally”, without superintelligent interference. In other words, assuming we world we live in isn’t already screwed, n is the probability that a no-conditions-applied simulation will be natural, and c is the probability that our plan will meet its success criteria if we execute it for real.
So what does this expression look like? In particular, when is it greater than c? If this is greater than c, then it means that the probability of getting a natural simulation (hopefully followed by guaranteed alignment success) is greater than the probability of our plan succeeding (assuming our world is natural).
On each graph, the x-axis is n, which represents the model’s prior - how likely we would be to get a natural output if we just ran the simulation with no conditions added. The y-axis is c, which represents how good we think our plan is - the chance of success if we just execute it. The blue graph shows our chance of success if we simulate, and the red graph shows our chance of success if we execute. The red-and-blue graph is just the difference of the two, and the black-and-white graph shows the break-even frontier. We probably live somewhere in the lower-right quadrant right now, meaning we should probably be simulating rather than executing (unless we come up with a really good plan, or if we are forced to include modern enough training data that the natural sim prior drops below 0.5).
However, this model treats plans as monolithic - we either have to simulate the whole thing, or execute the whole thing. The shape of that break-even frontier suggests that maybe it would be better if we split our plans up, to execute the easy parts and simulate the hard parts?
Consider a two-part plan, where we have to succeed on an easy step with success probability x1=0.9 and a hard step with success probability x2=0.1. Say the model’s prior for “natural” worlds is n. There are multiple ways we could approach this:
Note that this assumes the two steps are fixed-probability and independent, meaning the model is more fitting for downside concerns than for timeout concerns. If the steps are not independent, then the simulate-together strategy gets better or worse depending on the joint probability. If the steps are not fixed-probability (their chances of success vary significantly over time), see the next section with a more fitting model.
Anyways, what do these things look like? Well, first we can see that the simulate-together strategy is strictly better than either of the mixed simulate-execute strategies! The numerators are the same, but the simulate-together strategy has a smaller denominator. This result is kind of strange, but not that strange. It suggests that strategies like “ask for a solution to subproblem X and combine that with our solution to subproblem Y” is worse than just asking for a straight-up solution to the whole problem (if the two steps are independent).
To investigate the rest of the strategies, let’s just make some graphs. This is gonna be a bit messy, since we’ve got three degrees of freedom now. Each graph will represent a different value of n, the model’s prior for natural simulations:
Red means it’s best to execute, purple means it’s best to simulate together, blue means it’s best to simulate separately. Values of n shown are 0.01, 0.5, 0.6, 0.7, 0.8, 0.9, and 0.99 in reading order. In other words, when our plans are bad, we simulate the steps separately. When they’re good, we just execute them. And when they’re in between, we simulate the steps together.
Lastly, how about dealing with multiple steps when our risk is timeout-centric? If we model each step as taking a normally-distributed amount of time, and the main failure mode is hitting a timeout, then the probability of failure is not fixed for each step, but highly dependent on how much time the overall process takes. A 2-year-average step with 10 years until the deadline has a super low chance of failure, but if you start it with 1 year until the deadline instead, it’s much more dangerous.
In this case, the factor driving the success or failure of a strategy is time until alignment solved. Which means the key question is “how much more likely am I to solve alignment on time, vs how much sim risk am I taking?” We assess this tradeoff by sliding graphs around. Consider the following toy example:
According to our best forecasts, our current plan for solving alignment will be complete in 35 years on average, with a 5-year standard deviation. However, we expect AGI to be developed in 25 years on average, also with 5-year standard deviation. This makes our chances for solving alignment pretty slim, so we consider running a CGM project to speed up our average alignment solution time, even if it requires taking on some risk.
If these are our timelines, then our chance of success is given by p(tAGI > talignment), which is the same as p(talignment - tAGI < 0). Since our timelines are normals, the difference is a normal too. But we’re not really interested in the probability density function, we want the cumulative distribution. The cumulative distribution that gives us an actual success probability is an S-curve. It’s an error function, not a logistic, but it has all the usual implications.
If we manage to speed up our plan with simulations, this moves the difference distribution to the left, and moves the S-curve to the left. As usual with S-curves there’s not much gain when we’re either almost guaranteed to fail or almost guaranteed to succeed, but here the dynamic is a bit different. With simulations, the cost is not constant per unit of horizontal shift - remember, we only pay the cost of an independent failure mode in scenarios that would have succeeded, if not for the sim risk we introduced. The cost is scaled by the probability of success on this graph. This lets us evaluate the tradeoff for the marginal benefit of a given simulation, if we know the sim risk and the speedup it entails. I’m not going to produce figures here because I think they would be pretty hard to interpret, but the basic evaluation is just the same as previous evaluations. It’s worth it if:
The risk of manipulative simulations, however, comes from two places. We’ve mostly focused on the risk of a past manipulative AGI being necessary due to a strong condition. But with long simulated timeframes, we also have to worry about the risk of a manipulative AGI booting up in the simulated world, because we just simulated them for too long. There are too many moving parts here for me to quickly pin down the exact details of how to chunk and prioritize your simulations, especially if our forecasts are non-Gaussian. But essential dynamic here seems to be something like:
Because a single manipulative simulation is enough for doom, when we are running multiple simulations the total danger we’re exposed to depends a lot on how correlated the risks are. From the CGM’s perspective, simulations are just sampled randomly from its updated internal distribution. No correlation, super straightforward. However, from our human perspective, logical uncertainty makes things a bit more complicated.
The CGM going to have some empirical uncertainty over what the state of the world is like - you can’t infer everything by looking at data sampled from the Internet. But this uncertainty is likely to be much less than the uncertainty we humans have. So from our perspective, we have some logical uncertainty over what the model’s beliefs are. The model is going to be drawing from some narrow distribution, but we humans don’t know where that distribution is going to be centered. This creates a correlation from the humans’ perspective. If I see the model answer a question one way, this tells me to expect it to answer like that again in the future, since I know it’s sampling from a narrow distribution. The smaller the CGM’s uncertainty is compared to the humans’ uncertainty, the greater the observed correlation. This is super helpful for us!
However, a manipulative AGI will be aware of this dynamic, and can take measures to avoid it. Since we are assuming that our CGM can only “see” the world through the Internet, a manipulative AGI can “hide” its presence from being observable through the Internet. This won’t change the CGM’s mean probability that the manipulative AGI exists, so it doesn’t cost the AGI anything in that regard. But it will increase the CGM’s empirical uncertainty, which will in turn reduce the correlation between simulation outcomes, putting us in greater danger.
Estimating the extent of correlation we expect seems quite important for anticipating how dangerous plans involving multiple simulations will be. The key question is “how much can our CGM reduce its internal uncertainty about whether a manipulative AGI exists, assuming the AGI must influence the world to enforce our conditions, but is otherwise trying to hide?”
To recap, here’s my rough take on the best way to use CGM. It doesn’t change the overall strategic picture that much, but if you pay attention to some weird side considerations you can extract free success points sometimes.
Finally, a ton of stuff came up this summer that I didn’t have much time to look into, but which could be relevant for this strategy. I’ve just dumped it all here, in no particular order.
I was rereading this and was struck by how much work the following does:
Manipulative AGI is knowledgeable/intelligent enough to anticipate what criteria we will condition on in counterfactual worlds where it does not exist, and manipulate world events to meet whatever criteria we might specify with probability ~1.
If this doesn't hold, and if the AGI has to guess (with probability p << 1) what criteria we were going to condition on, then the update in favor of AGI is p:x, which could easily be less than 1.
Moreover I think there are ways we can arrange for this to be the case, and I think it's basically what you suggest in open question (2).
For instance, in deciding what to condition on, we could come up with 10 different strategies and select from these at random. Now p <= 0.1 because the AGI can't do any better than determining those strategies and picking one at random. As long as all of these strategies do indeed work, so we don't mind which we get this seems like a big win. And it's not inconceivable that we could have many different strategies that all perform similarly. For instance, it might be that any of 1000 different weather patterns suffices to push the simulated-world's politics in a useful direction, and we can pick from those at random to push down the predictability of our conditionals.
Ok this is going to be messy but let me try to convey my hunch for why randomization doesn't seem very useful.- Say I have an intervention that's helpful, and has a baseline 1/4 probability. If I condition on this statement, I get 1 "unit of helpfulness", and a 4x update towards manipulative AGI.- Now let's say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1/4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI!- BUT, the four interventions have to be mutually exclusive. Which means that if I'd done no simulation at all, I would've gotten my one unit of helpfulness anyway, since the four interventions cover all possible outcomes.- Ok, well, what if my four interventions 1/8 baseline probability each, so only 50% total. Then I pick one at random, p(O | natural) = 1/8, p(O | manipulative) = 1/4, so I get a 2x update towards manipulative AGI. This is the same as if I'd just conditioned on the statement "one of my four interventions happens", and let the randomization happen inside the simulation instead of outside. The total probability of that is 50%, so I get my one unit of helpfulness, at the cost of a 2x update.Maybe the core thing here is a consequence of framing our conditions as giving us bits of search to get lottery outcomes that we like. Rolling the dice to determine what to condition on isn't doing anything different from just using a weaker search condition - it gives up bits of search, and so it has to pay less.
Got it, that’s very clear. Thanks!
So this point reduces to “we want our X:1 update to be as mild as possible, so use the least-specific condition that accomplishes the goal”.
I think so. But I'd want to sit down and prove something more rigorously before abandoning the strategy, because there may be times we can get value for free in situations more complicated than this toy example.