Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Status: Some rough thoughts and intuitions.

πŸͺ§ indicates signposting

TL;DR: If we make our optimization procedures transparent, we might be able to analyze them in toy environments to build understanding that generalizes to the real world and to more powerful, scaled-up versions of the systems.


πŸͺ§ Let's remind ourselves why we need to align powerful cognition.

One path toward executing a pivotal act is to build a powerful cognitive system. For example, if a team of humans, together with the system, could quickly do the research needed to figure out how to do brain emulation correctly, and then build a functioning simulator and upload a human into it, we would win.

I expect that the level of capability required to perform the necessary tasks will make the system dangerous if it is misaligned. For this path to work, we need to be able to guide the system's optimization.

πŸͺ§ Now I'm going to introduce a particular kind of model that I think would be more transparent than modern deep learning, though the overall argument should apply to any system whose internal workings are highly transparent to us. For example, if we had really good interpretability tools for neural networks, the argument would apply there too.

Let's assume we have figured out how to create an algorithm that builds a predictive model of the world, and that both the algorithm and the resulting world model are transparent to us. It seems likely to me that once we have such an algorithm, it would be relatively easy to put another algorithm on top that uses the world model to determine action sequences that result in particular outcomes in the world. If the world model is transparent, that will also make this decision procedure more transparent.

I'm imagining here that we know the explicit, line-by-line source code that builds the world model, and likewise for the algorithm that uses the world model to decide which actions to perform. This is in contrast to only knowing an algorithm that optimizes a computational structure, such as a neural network, until the structure becomes good at performing some task, while we don't understand what is going on internally. However, I am not presuming that we precisely understand all of the internal workings of these algorithms; that understanding is what we want to end up with eventually. And importantly, I'm not assuming that these algorithms already have the relevant alignment properties that we would want.
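To gesture at what I mean by "explicit source code for the world model and the decision procedure", here is a minimal Python sketch. The toy domain, the names, and the breadth-first planner are all made up for illustration; the point is just that every part of the model and of the planner is something you can read line by line.

```python
from collections import deque

# A toy world model we can read line by line: explicit actions and an
# explicit transition table, rather than opaque learned weights.
ACTIONS = ["go_out", "go_left", "go_right"]
TRANSITIONS = {
    ("kitchen", "go_out"): "hallway",
    ("hallway", "go_left"): "kitchen",
    ("hallway", "go_right"): "garden",
}

def predict(state, action):
    """Transparent predictive model: look up where the action leads."""
    return TRANSITIONS.get((state, action), state)  # unknown moves do nothing

def plan(start, goal, max_depth=10):
    """Decision procedure layered on top: breadth-first search over action
    sequences, using nothing but the `predict` function above."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, actions = frontier.popleft()
        if state == goal:
            return actions
        if len(actions) >= max_depth:
            continue
        for action in ACTIONS:
            next_state = predict(state, action)
            if next_state not in visited:
                visited.add(next_state)
                frontier.append((next_state, actions + [action]))
    return None

print(plan("kitchen", "garden"))  # ['go_out', 'go_right']
```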

One way the world model could be transparent is if it decomposes the world in ways similar to how humans do. For example, the world model might contain a particular concept corresponding to a chair and another corresponding to a table. These concepts could be thought of as objects containing one or more predictive models, along with other data. If we get the world-modeling algorithm right, I think we would be able to inspect and understand these objects, as long as the things they are modeling are simple enough.

I do not expect that a human could look at a full-fledged AGI's model of the real world and understand it within the relevant time frame. It would simply be too big to comprehend. Possibly even most individual concepts would be too complicated to comprehend in full, even when optimized for being understandable to humans.

πŸͺ§ So, let us now consider how toy models can be helpful here.

I expect that we could look at simple environments and study the world-modeling algorithm there, together with the decision procedure layered on top of the world model. For example, we could try to elicit misalignment on purpose: we might give the system an objective function that is slightly different from what we actually want it to do. Then we can study the system and understand what sorts of mechanisms we would need to prevent Goodharting.
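To illustrate the kind of deliberate misalignment experiment I have in mind (everything below is a made-up toy, not a claim about the actual system): give the optimizer a proxy objective that agrees with the true objective for moderate actions but comes apart under strong optimization, and then watch where they diverge.

```python
# Toy illustration: a proxy objective that tracks the true objective for
# moderate actions but comes apart under strong optimization pressure.

def true_utility(x):
    return x - 0.1 * x ** 2     # what we actually want: rises, then falls

def proxy_utility(x):
    return x                    # what we hand the optimizer: "more is better"

candidates = [i * 0.1 for i in range(201)]        # actions from 0.0 to 20.0

weak = max(candidates[:50], key=proxy_utility)    # limited optimization
strong = max(candidates, key=proxy_utility)       # unrestricted optimization

print(f"weak optimizer picks x={weak:.1f}, true utility={true_utility(weak):.2f}")
print(f"strong optimizer picks x={strong:.1f}, true utility={true_utility(strong):.2f}")
```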

You could of course apply techniques such as quantilization by default, but I would expect that if the world model is transparent and the optimization procedure uses it in a very direct way, we would be able to find structural properties that correspond to the system not optimizing too hard. Quantilization works on the level of the objective function, but if we have a white-box system, we can start to work with the internal structure of the algorithms. It might be easier to understand how one specific algorithm needs to be changed so that it has various desirable alignment properties than to figure out an objective function that works in general for any powerful optimizer.
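For reference, here is roughly what a quantilizer does, in a minimal sketch that assumes a uniform base distribution over actions (the function and its arguments are made up for illustration). The contrast I'm drawing is that with a white-box system we would not be limited to interventions of this shape; we could also reach directly into the planner's internals.

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Rough sketch of a quantilizer over a uniform base distribution:
    rank actions by utility, keep the top q fraction, and sample one
    uniformly instead of taking the single best action."""
    ranked = sorted(actions, key=utility, reverse=True)
    top = ranked[:max(1, int(len(ranked) * q))]
    return rng.choice(top)

# Toy usage: with q=0.1 we get some reasonably good action,
# not necessarily the extreme argmax.
print(quantilize(range(100), utility=lambda a: a, q=0.1))
```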

For example, suppose we want to create a myopic agent. We could try to figure out a general objective function that would make any optimizer myopic. However, it might be easier to understand how a concrete system would need to be changed to make it myopic. Perhaps you could identify exactly where the program uses the world model to predict what the future will look like, and then limit the number of times this predictive model can be called on its own outputs.[1]
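As a crude sketch of that kind of structural intervention (made-up names, and with the caveat from the footnote that this probably isn't the notion of myopia we ultimately want): wrap the predictive model so that it simply refuses to be rolled forward on its own outputs more than a fixed number of times.

```python
class HorizonLimitedModel:
    """Wraps a transparent predictive model and refuses to be rolled
    forward on its own outputs more than `max_depth` steps, as a crude
    structural stand-in for myopia (see footnote 1)."""

    def __init__(self, predict, max_depth):
        self._predict = predict
        self._max_depth = max_depth

    def predict(self, state, action, depth=0):
        if depth >= self._max_depth:
            raise RuntimeError("prediction horizon exceeded")
        next_state = self._predict(state, action)
        return next_state, depth + 1

# Toy usage with a dummy transition function:
model = HorizonLimitedModel(lambda s, a: s + a, max_depth=2)
state, depth = model.predict(0, 1)              # ok, depth 1
state, depth = model.predict(state, 1, depth)   # ok, depth 2
# model.predict(state, 1, depth)                # would raise: third chained call
```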

The idea is to use these toy environments to analyze the algorithms in a way that lets our understanding generalize to when we run the system in the real world.

πŸͺ§ Let's make this more concrete by drawing an analogy: consider how we might do this with a much simpler algorithm.

Consider the following example: if you program a list-sorting algorithm, you can study it by looking at what happens for small lists. You can look at the algorithm and think about what is going on, or even step through it with a debugger. You can then use your observations to understand how the algorithm works and how it will behave for larger lists. You can build intuitions that generalize. I am not talking about looking at a bunch of examples, noticing that all of them behave as you want, and being satisfied. Instead, I am thinking of the scenario where you carefully study the program and build up your understanding to the point where you know the list-sorting algorithm will work for any possible input you could give it.
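To spell out the analogy with actual code: here is a standard insertion sort (my own toy implementation) that you could step through on a four-element list. The loop invariant noted in the docstring is the kind of thing you come to see clearly from small examples, and it is exactly what lets you trust the algorithm on lists of any size.

```python
def insertion_sort(xs):
    """Sort a list in place. Invariant: before each outer iteration i,
    the prefix xs[:i] is already sorted; inserting xs[i] into its place
    preserves that, so the whole list ends up sorted."""
    for i in range(1, len(xs)):
        current = xs[i]
        j = i - 1
        # Shift larger elements of the sorted prefix one slot to the right.
        while j >= 0 and xs[j] > current:
            xs[j + 1] = xs[j]
            j -= 1
        xs[j + 1] = current
    return xs

# Study it on a small input, e.g. by stepping through with a debugger:
print(insertion_sort([3, 1, 2, 0]))  # [0, 1, 2, 3]
```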

Note that I am intentionally talking about intuition, and not about mathematical proof. Sure, you can formally verify that a particular list-sorting algorithm will always produce a sorted list, but doing this kind of analysis is often tedious. When I'm writing a list-sorting algorithm and want to be confident that the output will always be sorted, I don't need to write the algorithm in Coq. I can work with my intuitions and carefully look at the program I am writing. I'm not saying don't use mathematical proof. I'm saying that it would be very good if we could put our programs into a form that is conducive to building correct, deep-reaching intuitions that generalize widely.

Similarly, we might be able to study the property of being inner aligned (or any other property, really) in the toy setup and understand what properties of the system's internals correspond to being inner aligned. And not just understand them superficially, but grok them deeply enough to grasp how to build the system so that it stays inner aligned even when we scale it up and put it in the real world.[2]

Edit: Also see this comment for further clarification.


  1. This is a rough illustrative example. It probably doesn't capture the notion of myopia we would actually want, but it seems to go roughly in the right direction. β†©οΈŽ

  2. Inner alignment is probably not a property we would verify directly, but rather we would decompose it and verify various sub-properties. So inner alignment here is just an illustrative example. β†©οΈŽ

Comments:

I think that this field is indeed under-researched. The focus is either on LLMs or on single-player environments. Meanwhile, what matters for alignment is how AI will interact with other agents, such as people. And we don't have to wait for AGI to be able to research AI cooperation/competition in simple environments.

One idea I had is "traitor chess": have several AIs playing one side of a chess game cooperatively, with one (or more) of them being a "misaligned" agent trying to sabotage the others. And/or some AIs could have a separate secret goal, such as saving a particular pawn. Watching them interact with each other could be very interesting.

Somebody said that they would be skeptical that this would avoid the sharp left turn.

I should have said this more explicitly, but the idea is that this will avoid the sharp left turn if you can develop deep enough intuitions about the system. You can then use these intuitions to "do science" on the system and figure out how to iteratively make it more and more aligned, not just by doing empirical experiments, but by building up good models of the system. And at each step, you can use these intuitions to verify that your alignment solution generalizes. That is the target.

These models are probably not just made up of intuitions; we want formal/mathematical models. However, I expect the hard part is getting the system into a form where it is easy to develop deep intuitions about it. Once we have that, I expect creating formal models based on these intuitions to be much easier than getting the system into that state in the first place.

That was me; for context:

The core claim seems reasonable and worth testing, though I'm not very hopeful that it will reliably scale through the sharp left turn.

My guess is that the intuitions won't hold in the new domain, and that radical superintelligence requires intuitions you can't develop on relatively weak systems. But it's a source of data for our intuition models which might help with other stuff, so it seems reasonable to attempt.