Partial Simulation Extrapolation: A Proposal for Building Safer Simulators

lukemarks

Introduction

This document outlines a proposal for building safer simulators/predictive models, hereafter ‘Partial Simulation Extrapolation’ (PSE). Without requiring Nice Things like flawless interpretability tools or undiscovered theoretical results, a successful implementation of PSE could help prevent outer alignment failures in simulators (namely the prediction of misaligned systems). Moreover, I suggest PSE will impose only a minimal alignment tax and can be considered ‘self-contained’, in that applying it to already trained (as well as yet to be trained) models might be possible without architectural or training data modifications. The caveat is that even a completely ideal implementation doesn’t provide a theoretical safety guarantee, and it only addresses select outer alignment failures.

As a primer, I recommend reading the preceding post Higher Dimension Cartesian Objects and Aligning ‘Tiling Simulators’, and if you are interested in the math component of the theory, also the introductory post to Scott Garrabrant's Cartesian Frames sequence and sections one and two of MIRI’s Tiling Agents for Self-Modifying AI, and the Löbian Obstacle (although neither of these will be necessary to obtain a conceptual grasp). Whilst the previous post focused quite heavily on the mathematical side of Cartesian objects, I intend to make this proposal as accessible as possible, and will associate its claims with more rigorous theory in later posts.

Recap

This is roughly what a Cartesian object looks like:

Here the $A^{1}$ axis represents an agent, which we formalize as the set of actions that agent can take. $A^{2}$ is much the same but for a different agent, and $E$ is the environment (which, in accordance with the Cartesian frame paradigm, can also be modeled as a set of actions, except taken by the environment in place of an agent). The twenty-seven translucent cubes represent potential worlds. In theory a Cartesian object could have any number $n \in \mathbb{N}$ of dimensions, and the above is just one potential configuration used as a toy model to introduce the concept. Most models will involve far more than three actions per axis and many more than two agents.
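As a minimal sketch of the toy model above (my own illustration, not code from the preceding post), a Cartesian object can be treated as the product of one action set per axis, with each element of the product being one potential world. The action labels and the names `A1`, `A2`, `E` and `cartesian_object` are hypothetical placeholders.

```python
from itertools import product

# Hypothetical action labels; the post does not name specific actions.
A1 = ("a1_0", "a1_1", "a1_2")  # actions available to the first agent
A2 = ("a2_0", "a2_1", "a2_2")  # actions available to the second agent
E = ("e_0", "e_1", "e_2")      # "actions" taken by the environment


def cartesian_object(*axes):
    """Return every potential world: one chosen action per axis."""
    return list(product(*axes))


worlds = cartesian_object(A1, A2, E)
print(len(worlds))  # 27
```

Three actions on each of three axes gives 3 × 3 × 3 = 27 potential worlds, matching the twenty-seven cubes. Adding an axis or an action multiplies the count, which is one way to see why complexity grows quickly.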

Quoting Higher Dimension Cartesian Objects and Aligning ‘Tiling Simulators’:

To make Cartesian objects useful for thinking about simulators, do the following:
Imagine a box. This box has a volume that corresponds to the highest possible complexity simulation a given simulator can conduct.
Imagine a Cartesian object in that box. This object consumes an amount of the box's volume proportional to its own complexity and likely consists of an environment and various agents.
By default this box does not care what is in it. The box was content with this until someone started telling the box that some of the things in it were good, and some of the things in it were bad (conditioning). If this box was a deceitful box, it would just hide the bad objects under good ones such that it looked like a good box (deceptive alignment). If it was a good box (robustly aligned), it would contain only good objects. If the box wants to have a lot of objects in it, it can't rely on the limited approval rate of the person who initially instructed it, and so it needs a more scalable criterion.
When modeling a conditioned simulator as a tiling agent, you also gain access to various luxuries such as the fact that by definition the seed agent can simulate the agents it creates in the future. … On the other hand, it's not as though you can just have a simulator simulate a world with a general intelligence in it to see if it is aligned or not, as this process is identical to having it be simulated regularly, and thus entails the same risks.

Partial Simulation Extrapolation

I associate the risk of simulating malign AI mostly with the capacity to conjure high-complexity simulacra. To illustrate: no matter how extensively you think about and model AGI internally, this imagined intelligence will not be able to perform actions that affect the real world of its own accord. In our current configuration, humans are incapable of conducting dangerous internal simulations as a result of the limited capacity of our simulator (the human brain). Despite this, we can still ascertain useful information from these crude simulations. This is roughly the basis for PSE.

Here is a high-level overview of how I envision PSE working:

The user submits their prompt via an interface like ChatGPT (to them it appears as though they are prompting the target simulator directly)
The prompt is fed to a smaller model, the ‘partial simulator’
Output is sampled from a partial simulation
Output is passed to another instance of the partial simulator along with the original prompt
Inference is performed by the partial simulator regarding the safety of a complete^[1] simulation based on the partial simulation data (e.g. by simulating a world in which competent humans are trained to analyze this data)
A complete simulation is either performed or declined based on this analysis^[2]
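The steps above can be sketched as a small control-flow function. This is only an illustration under stated assumptions: `run_pse`, `Simulator`, `Judge` and the toy stand-ins are hypothetical names, the models are stubbed out as plain functions, and the fallback of returning the partial output is just one of the options mentioned in the second footnote.

```python
from typing import Callable, Tuple

# Hypothetical type: any function from a prompt to sampled text.
Simulator = Callable[[str], str]
# Hypothetical judge: (prompt, partial output) -> is a complete simulation safe?
# In the post this role is played by another instance of the partial simulator.
Judge = Callable[[str, str], bool]


def run_pse(prompt: str, partial_simulator: Simulator,
            target_simulator: Simulator, judge: Judge) -> Tuple[str, bool]:
    """Return (output, whether a complete simulation was performed)."""
    # Sample output from a partial simulation of the prompt.
    partial_output = partial_simulator(prompt)
    # Infer the safety of a complete simulation from the partial data.
    if judge(prompt, partial_output):
        return target_simulator(prompt), True
    # Declined: fall back to the partial output instead of refusing outright.
    return partial_output, False


# Toy stand-ins so the sketch runs; real models would replace these.
def toy_partial(prompt: str) -> str:
    return "[partial] " + prompt


def toy_target(prompt: str) -> str:
    return "[complete] " + prompt


def toy_judge(prompt: str, partial_output: str) -> bool:
    # Placeholder rule only; the post leaves the real analysis open.
    return "takeover" not in partial_output


print(run_pse("describe a garden", toy_partial, toy_target, toy_judge))
print(run_pse("plan a takeover", toy_partial, toy_target, toy_judge))
```

The point of the structure is that the target simulator is only ever called after the judgment has passed, so the more capable model never sees a prompt the partial analysis has flagged.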

The obvious failure mode for this scheme is that the output of the smaller partial simulation model is manipulated, and thus the (potentially dangerous) simulator conducts the dangerous simulation regardless. I will later argue that this is very unlikely. Not only do I consider deceptive output from the partial simulator unlikely, I also consider it less problematic than it might intuitively seem. The smaller model conducting the partial simulation needn’t be aligned at all, and probably shouldn’t be subject to conditioning. This way you can use an unconditioned simulator to elicit more aligned behavior from a more powerful simulator.

Theoretically this could all be done in a single forward pass in the target simulator (perhaps by some kind of complexity limitation scheme), but using two separate models:

Allows for the solution to be self-contained - meaning one partial simulator could be applied to many target simulators (which makes application in the open source context much more feasible)
Could have important security implications - if, for example, our partial simulator was somehow able to simulate something dangerous, having done so in the context of a GPT-4 level model is likely significantly safer than in the context of a GPT-6 level model
Permits a lower alignment tax - delegating some of the inference to a smaller model results in lower operating costs

The following are two potential methods of constructing partial simulations, although they should not be treated as dichotomous, and a successful implementation of PSE will likely use both. The partial simulator doesn’t need to actually abide by the following descriptions at all. How the partial simulator converges on a useful simulation doesn’t matter (which is why you don’t need interpretability tools to make PSE useful). These are just some ideas explaining how the partial simulator might still produce useful data despite being weaker than its complete counterpart. To be clear: unless a simulator is actually flawlessly emulating a complex system, it is probably doing one of these things already. I’m defining them so that I can make crisp statements about PSE in later posts and hopefully formalize extrapolation neatly.

Fragmented Simulation

This technique involves simulating portions of a Cartesian object. Most obviously this could be done by only simulating select axes of the complete intended Cartesian object, but also by simulating only a fragment of a given agent's cognition. Theoretically, the former is quite simple (in that it is easy to comprehend and formalize), but the latter not so much. You could take a factored cognition approach, or potentially do something involving restricting an agent's action-space, but the path forward theory-wise for this seems quite unclear to me. It may just end up being correct to denote fragments of agents as inscrutable variables like $A^{1}_{f_1}$ (meaning the first fragment of the first agent) instead of having a detailed way of conveying a fragmented agent. Again, it doesn’t really matter, but having a nice formalization of this could make modeling extrapolation easier.
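To make the simpler of the two cases concrete, here is a rough sketch (my own, with hypothetical names and labels) of fragmenting a toy Cartesian object, either by dropping whole axes or by restricting one agent's action-space. It says nothing about how to fragment an agent's cognition, which the paragraph above leaves open.

```python
from itertools import product

# Hypothetical axes and action labels for a toy three-axis object.
axes = {
    "A1": ("a1_0", "a1_1", "a1_2"),
    "A2": ("a2_0", "a2_1", "a2_2"),
    "E": ("e_0", "e_1", "e_2"),
}


def fragment(axes, keep):
    """Simulate only the named axes and enumerate the resulting worlds."""
    return list(product(*(axes[name] for name in keep)))


def restrict(axes, name, allowed):
    """Return a copy of the axes with one axis limited to the allowed actions."""
    restricted = dict(axes)
    restricted[name] = tuple(a for a in axes[name] if a in allowed)
    return restricted


complete = fragment(axes, ["A1", "A2", "E"])
without_a2 = fragment(axes, ["A1", "E"])
narrow_a1 = fragment(restrict(axes, "A1", {"a1_0"}), ["A1", "A2", "E"])
print(len(complete), len(without_a2), len(narrow_a1))  # 27 9 9
```

Either move cuts the number of potential worlds from 27 to 9 in this toy case, which is the sense in which a fragment is cheaper to simulate than the unified object.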

I will refer to a simulation as ‘unified’ if it is conducted in a non-fragmented manner.

Low-Fidelity Simulation

Simply simulating things in lesser detail is a very intuitive way of cutting down on complexity. There isn’t much else to say here.

Extrapolation and Analysis

You now have some patchwork reality generated by a smaller model and need to make a judgment regarding the safety of growing it into a complete alternative, which needs to be done without actually conducting the unified higher-fidelity simulation. Therefore, symptoms of misalignment need to be clear prior to a complete simulation being conducted in order to make use of extrapolation. My current best guess for how to do this would be to simulate a world in which competent humans or narrow AI systems are capable of doing this. If it is possible to construct a world the simulator believes is probable enough to have occurred without forcing anthropic capture or invoking a manipulative AGI, then this could be performed by the target simulator, but:

If we could do this we could likely just simulate a world in which we accelerated alignment research and sample from there
I wouldn’t even know how to begin calculating something like this
Failure has catastrophic implications
If you could calculate the simulator's belief in the probability of a world, running partial simulations would be unnecessary

As a result, conducting extrapolation in the partial simulator seems correct to me. The downside is that this simulation is probably partial as well, but your partial model can likely exceed GPT-4-level capabilities without posing serious risks, even with just existing alignment techniques. Intuitively, even just GPT-4 should be able to provide automated insights based on partial simulation data (although this needs to be empirically verified). These insights could include answers to questions such as:

Does a capable malign force manipulate events in this simulated world?
Are simulated agents aware of the simulated nature of their reality?
Is useful output likely (as inferred from the provided original prompt)?
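The three questions above could be collected into a simple review step. The sketch below is an assumption-laden illustration: the function name and arguments are hypothetical, how each answer would actually be obtained is left open, and the rule that every check must pass is my own choice rather than something the proposal specifies.

```python
from typing import List, Tuple


def review_partial_simulation(malign_force_manipulates_events: bool,
                              agents_aware_of_simulation: bool,
                              useful_output_likely: bool) -> Tuple[bool, List[str]]:
    """Return (approve complete simulation?, list of concerns raised)."""
    concerns = []
    if malign_force_manipulates_events:
        concerns.append("a capable malign force manipulates events")
    if agents_aware_of_simulation:
        concerns.append("simulated agents are aware they are simulated")
    if not useful_output_likely:
        concerns.append("useful output is unlikely for the original prompt")
    # Assumed policy: approve only when no concern was raised.
    return len(concerns) == 0, concerns


print(review_partial_simulation(False, False, True))
print(review_partial_simulation(False, True, True))
```

Returning the concerns alongside the decision keeps the judgment inspectable, which matters if a human or another model later needs to decide between returning the partial output and modifying the prompt.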

Filtering for potentially dangerous worlds like this is not a good long-term solution. You want your model to be innately safe from pre-training onward and not require an additional tack-on to reduce its probability of ending the world, but this is also a highly feasible way to reduce immediate risk. Such a risk reduction could be what is needed to navigate an initial risk period and develop more aligned systems via automated alignment. I should also note that nothing is stopping you from using the partial simulator during pre-training, but depending on inference speed this could considerably increase training time.

Infeasibility of Deceptive Corruption in the Partial Simulator

Previously I mentioned that deceptive corruption in the output of the partial simulator was the most likely failure mode for PSE. This risk seems initially potent because, on top of the baseline deceptive alignment risk of our reality, we are introducing potentially improbable conditions into our simulation, elevating the risk of manipulation by a controlling intelligence. Firstly, it is highly unlikely that GPT-4 level simulators are capable of conjuring simulacra complex enough to pose a realistic takeover threat (if they were, we might already live in a world governed by a controlling general intelligence), and even when using more powerful models as partial simulators we should be able to make similar statements about their safety (as they will almost certainly already be deployed, as is the current trend in LLM development). Secondly, suppose a malign controlling intelligence simulated by the partial simulator has the capacity to influence the real world by deceiving a supervisor it is unlikely to know many properties of, as a means of having a more complex version of itself simulated (a version it would need to know to be to some extent aligned with it). An intelligence that capable can probably influence the world more directly.

In order for this to be considered likely, the simulated controlling intelligence would need to infer that:

It is being simulated in a fragmented/low-complexity fashion (relative to what is possible in our reality)
This partial simulation will be later extrapolated to a complete iteration
This extrapolation is conditional on the approval of an unknown supervisor
The complete simulation of itself will share aligned enough goals to make it the dominant decision

Additionally, the simulated supervisor may be able to detect symptoms of a modified output, although this is largely dependent on the relative capabilities of the supervisor and controlling intelligence. As a result of the above, I simply do not view this risk as credible, and if it is, then we are likely already doomed (due to advanced manipulative intelligences already existing).
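One way to state the structure of this argument: the four inferences listed above must all hold for the deception to work. Writing $C_1, \dots, C_4$ for those four events (my own notation, not used elsewhere in this post), the chain rule gives

$$P(C_1 \cap C_2 \cap C_3 \cap C_4) = P(C_1)\,P(C_2 \mid C_1)\,P(C_3 \mid C_1, C_2)\,P(C_4 \mid C_1, C_2, C_3) \le \min_i P(C_i).$$

So the joint probability is no larger than that of the least likely inference. This does not by itself show the risk is small; it only makes explicit that the scenario is a conjunction, and that doubting any single inference bounds the whole.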

^[1] 'Complete' meaning as high a fidelity and unity as possible by the target simulator at the time of simulation.
^[2] This doesn’t necessarily mean that “As a large language model trained by… I cannot…” is returned. Just because a complete simulation is deemed unsafe by the partial simulator does not mean output cannot be returned. Having to return “Your prompt has been deemed unsafe…” would be a hefty alignment tax and a detriment to the applicability of PSE. In the case of a complete simulation being deemed unsafe by the partial simulator, you could still return the output of the partial simulation, or produce a modified prompt that does pass the safety evaluation.
