This is something of a grab-bag of thoughts I've had about the Builder/Breaker game.
The ELK document had a really nice explanation of its research methodology in terms of an imaginary dialogue between a "Builder" who makes positive proposals, and a "Breaker" who tries to break them. To an extent, this is just the ordinary philosophical method, and also a common pattern in other research areas. However, I felt that the explicit write-up helped to clarify some things for me.
We might think of the Builder/Breaker game as an adversarial game where either the Builder or the Breaker "wins", like AI debate. However, I find it more fruitful to think of it as a cooperative game. When the game is played by AI safety researchers, the players have a common goal of finding robust plans to avoid catastrophic outcomes. The Builder/Breaker game merely organizes cognitive work: both Builder and Breaker are trying to map the space of proposals, but each takes primary responsibility for avoiding a different kind of error (false positives vs false negatives).
I think Builder/Breaker is a good way to understand Eliezer's notion of security mindset (1, 2). The Builder is trying to construct a positive argument for safety, with (at least) the following good properties:
- The argument clearly states its assumptions.
- Each assumption is as plausible as possible (because any grain of doubt indicates a possibility of failure).
- There are as few assumptions as possible (because more assumptions mean more ways the plan can fail).
- Each step of reasoning is sound.
- The conclusion of the argument is a meaningful safety guarantee.
I will call such a plan robust. We can question whether AI safety research should focus on robust plans. I won't dwell on this question too much. Clearly, some endeavors require robust plans, while others do not. AI safety seems to me like a domain which requires robust plans. I'll leave it at that for now.
In any case, coming up with robust plans has proven difficult. The Builder/Breaker game allows us to incrementally make progress, by mapping the space of possibilities and marking regions which won't work.
I could easily forgive someone for reading a bunch of AI alignment literature and thinking "AI alignment researchers seem confident that reinforcement learners will wirehead." This confusion comes from interpreting Breaker-type statements as confident predictions.
(Someone might try to come up with alignment plans which leverage the fact that RL agents wirehead, which imho would be approximately as doomed as a plan which assumed agents wouldn't. Breaker would just start saying "What if the agent doesn't wirehead?" instead of "What if the agent wireheads?".)
Reward is not the optimization target. The point isn't that RL agents necessarily wirehead. The point is that reinforcement signals cannot possibly rule out wireheaders.
This is an example of a very important class of counterexamples. If we are trying to teach an agent some class of behaviors/beliefs using feedback, the feedback may be consistent with what we are actually trying to teach, but it will also be consistent with precisely modeling the feedback process.
A model which understands the feedback process in detail, and identifies "maximizing good feedback" as the goal, will plausibly start trying to manipulate that feedback. This could mean wireheading, human manipulation, or other similar strategies. In the ELK document, the "human simulator" class of counterexamples represents this failure mode.
Since this is such a common counterexample, it seems like any robust plan for AI safety needs to establish confidently that this won't occur.
(It also happens that we have empirical evidence showing that this kind of thing can actually happen in some cases; but, I would still be concerned that it could happen for highly capable systems, even if nothing similar had ever been observed.)
The ELK document describes Builder/Breaker in service of worst-case reasoning; we want to solve ELK in the worst case if we can do so. This means any counterexample is fair game, no matter how improbable.
One might therefore protest: "Worst-case reasoning is not suitable for deconfusion work! We need a solid understanding of what's going on, before we can do robust engineering."
However, it's also possible to use Builder/Breaker in non-worst-case (i.e., probabilistic) reasoning. It's just a matter of what kind of conclusion Builder tries to argue. If Builder argues a probabilistic conclusion, Builder will have to make probabilistic arguments.
Breaker's job remains the same: finding counterexamples to Builder's proposals. If Builder thinks a counterexample is improbable, then Builder should make explicit assumptions to probabilistically rule it out.
Breaker's job is twofold:
- Point out implausible assumptions via plausible counterexamples.
  - In this case we ask: does the plausibility of the counterexample force the assumption to be less probable than we'd like our precious few assumptions to be? (This reasoning should also take into account the tendency for a single counterexample to suggest the possibility of more; the provided counterexample might in itself be improbable, but it might be obvious that Breaker could spell out many other counterexamples like that, which could be collectively too probable to dismiss.)
- Point out holes in the argument, by suggesting examples which seem consistent with the assumptions but which lead to bad outcomes.
When doing the second job, Breaker doesn't have to be logically omniscient; Breaker just needs to find a hole in Builder's proof. Builder then tries to fill in the hole, either by making more detailed arguments from existing assumptions, or by making more assumptions explicit.
One reason why I think of Builder/Breaker as a cooperative game is that I think Breaker should try to provide helpful critiques. Counterexamples should strike at the heart of a proposal, meaning they should rule out as many similar proposals as possible.
When it's going well, Builder/Breaker naturally moves in the direction of more detailed arguments. If Builder offers an informal proof sketch and Breaker makes a fiddly technical objection, that's a good sign: it means Breaker thinks the informal plan is plausible on its own terms, and so, needs to be further formalized in order to be judged properly. If Breaker thought the whole plan seemed doomed without filling in those details, Breaker should have produced a counterexample illustrating that, if possible.
In other words: the Builder/Breaker game has a natural "early game" (when plans and objections are very informal), and "late game" (when plans and objections are very formal).
This idea can help "unify" the Paul-ish approach to AI safety and the MIRI-ish approach. (I would advise caution applying this to actual Paul or actual MIRI, but I think it does capture something.) The Paul-ish approach focuses on making concrete proposals, trying to spell out safety arguments, finding counterexamples which break those arguments, and then using this to inform the next iteration. The MIRI-ish approach focuses more on deconfusion.
The Rocket Alignment Problem argues for deconfusion work through the analogy of rocket science. However, (I claim,) you can summarize most of the mistakes the Alfonzo character makes as "Alfonzo doesn't understand the Builder/Breaker game".
I see deconfusion work as saying something like: "When we try to play Builder/Breaker, we notice specific terms popping up again and again. Terms like 'optimization' and 'agent' and 'values' and 'beliefs'. It seems like our confusion about those terms is standing in the way of progress."
The MIRI-ish path reflects the common saying that if you can clearly state the problem, you're halfway to a solution. The Paul-ish path doesn't abandon the idea of clearly stating the problem, but emphasizes iteration on partial solutions as a way to achieve the needed clarity.
- Play the Builder/Breaker game yourself, with avoiding AI X-risk as the top-level goal. (Or, whichever statement of AI risk / AI alignment / AI control problem / etc seems right to you.)
- If you make it to the point where you have a vague plausible plan stated in English:
  - What terms do you need to define more rigorously, before you can fill in more details of the plan or judge it properly?
  - Try to operationalize those terms with more technical definitions. Do you run into more concepts which you need to be more deconfused about before proceeding?
You might want to try this exercise before reading the next section, if you want to avoid being influenced by other ideas.
Breaking Down Problems
You could say that the field of AI safety / AI alignment / whatever-we-call-it-these-days has a number of established sub-problems, eg:
- Value loading.
- Reward hacking.
- Impact measures.
- Inner alignment.
- Ontological crises.
However, there's not some specific plan which fits all of these parts together into a coherent whole. This means that if you choose one item from the list and try to work on it, you can't be very confident that your work eventually contributes to a robust plan.
This is part of the advantage of playing Builder/Breaker on the whole alignment problem, at least for a little while, before settling in on a specific sub-problem. It helps give you a sense of what overall plans you might be trying to fit your research into. (Of course, Builder/Breaker might also be a useful way to make progress on your sub-problem; this was the way ELK used it. But, this is different from playing Builder/Breaker for the whole problem.)
In other words: we can't necessarily correctly solve a sub-problem from first principles, and then expect the solution to fit correctly into an overall plan. Often, it will be necessary to solve a sub-problem in a way that's aware of the overall plan it needs to fit into.
(If we stated the sub-problems carefully enough, this would not be a problem; however, because we are still confused about many of the key concepts, these problems are best stated informally to allow for multiple possible formalizations.)
So, what are some actual high-level plans which break the problem into sub-problems which do add up to a solution?
Here is a rough sketch of the two plans:
These are not "robust plans" in the sense I defined earlier, since they are extremely vague and success relies on conditions which we don't know how to achieve. The point is that both are sketches of what robust plans might look like, such that we can see how the various sub-problems need to fit together in order to add up to something good.
My main point here is, high-level plans help us zoom in on terms which deconfusion work should focus on. I think it's fine and important to be curiosity-driven and to say "concept X just seems somehow important here" -- I'm not necessarily saying that you should drop your pet project to deconfuse "consciousness" or whatever. But to the extent that you try to let your research be guided by explicit reason, I think it makes a lot of sense to play Builder/Breaker to try to refine high-level plans like this, and then try to deconfuse the vague terminology and intuitions involved in your high-level argument.
Building Up Solutions
In Why Agent Foundations, John justifies deconfusion work as follows:
- He names Goodhart's Law as the main reason why most would-be alignment proposals fail, justifying this with an example. The analysis is somewhat along the lines of Rohin's view from the previous section.
- He introduces the concept of "true names": concepts which don't fall apart under optimization pressure.
On his view, the aim of deconfusion work is to find a set of useful "true names" relating to AI x-risk, so that we can build solutions which don't fall apart when a huge amount of optimization pressure is applied to them.
I don't think that this is wrong, exactly, but it sounds like magic. I also find it to be a bit restrictive. For example, I think Quantilizers are a venerable illustration of the right way of doing things:
- It sets the target at "avoid catastrophe", while making as few assumptions about what "catastrophe" means as possible. This is good, because as I mentioned earlier, assumptions are opportunities to be wrong. We would like to "avoid catastrophe" in as broad and vague a sense as we can get away with, while still establishing strong results which we think apply in the real world.
- Under some assumptions, which might possibly be achievable via human effort, it gives us a meaningful guarantee with regard to avoiding catastrophe!
However, Quantilizers escape the letter of the law for John's "true names", because they explicitly do fall apart if too much optimization power is employed. Instead, we get a theory in which "too much optimization" is rigorously defined and avoided.
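To make this concrete, here is a minimal sketch of a quantilizer over a finite action set. It assumes the usual definition: rank actions by utility, then sample from the top q fraction of a trusted base distribution's probability mass. The names `quantilize` and `expected`, the toy actions, and all of the numbers are made up for illustration. The guarantee being illustrated is that, for any non-negative cost function, the quantilizer's expected cost is at most 1/q times the base distribution's expected cost.

```python
def quantilize(base, utility, q):
    """Return the action distribution of a q-quantilizer (finite-action sketch).

    base:    dict mapping action -> probability under the trusted base distribution
    utility: dict mapping action -> the (possibly misspecified) utility being optimized
    q:       fraction of base probability mass to keep, with 0 < q <= 1
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    # Highest-utility actions first.
    ranked = sorted(base, key=lambda a: utility[a], reverse=True)
    remaining = q
    dist = {}
    for action in ranked:
        if remaining <= 1e-12:  # tolerance for floating-point leftovers
            break
        mass = min(base[action], remaining)
        if mass > 0:
            dist[action] = mass / q  # renormalize the kept mass to sum to 1
        remaining -= mass
    return dist


def expected(dist, f):
    """Expected value of f under a distribution given as action -> probability."""
    return sum(p * f[a] for a, p in dist.items())


# Toy example (hypothetical numbers): "exploit" scores very well on the proxy
# utility but is the only action with a real cost, which the agent never sees.
base = {"a": 0.25, "b": 0.25, "c": 0.25, "exploit": 0.25}
utility = {"a": 1, "b": 2, "c": 3, "exploit": 100}
cost = {"a": 0, "b": 0, "c": 0, "exploit": 1}

for q in (1.0, 0.5, 0.25):
    dist = quantilize(base, utility, q)
    bound = expected(base, cost) / q
    print(q, expected(dist, cost), "<=", bound)
```

In this toy case the bound happens to be tight at every q. As q shrinks, the quantilizer approaches a plain maximizer and the bound approaches triviality, which is one way of seeing how "too much optimization" gets a precise meaning here.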
So, instead of John's "true names" concept, I want to rely on the rough claim I highlighted earlier, that clearly stating a problem is often 50% of the work.
Instead of "true names", we are looking for sufficiently robust descriptions of the nature of the universe, which we can use in our robust plans.
Especially "analytic philosophy".
We might define something like a "safety margin" as the number of our confident assumptions which can fail, without compromising the argument. For example, if you've got 3 assumptions and 3 different safety arguments, each of which uses a different 2 of the 3 assumptions, your safety margin is 1, because you can delete any 1 assumption and still have a strong argument left. This captures the idea that redundant plans are safer. We would love to have even a single AI safety measure with a single confident argument for its adequacy. However, this only gets us to safety margin zero.
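The definition above can be sketched as a small brute-force computation. This is only an illustration under one reading of the definition: each argument is represented by the set of assumptions it relies on, and the margin is the largest number of assumptions that can fail (in the worst case) while at least one argument stays fully intact. The function name `safety_margin` and the labels A, B, C are hypothetical.

```python
from itertools import combinations


def safety_margin(assumptions, arguments):
    """Largest k such that, whichever k assumptions fail, some argument survives.

    assumptions: iterable of assumption labels
    arguments:   iterable of sets, each the assumptions one safety argument uses
    Returns -1 if there is no safety argument at all.
    """
    assumptions = sorted(set(assumptions))
    arguments = [frozenset(arg) for arg in arguments]
    for k in range(len(assumptions) + 1):
        for failed in combinations(assumptions, k):
            failed = set(failed)
            # An argument survives only if none of its assumptions failed.
            if not any(arg.isdisjoint(failed) for arg in arguments):
                return k - 1  # some set of k failures breaks every argument
    # Only reached if some argument relies on no listed assumption at all.
    return len(assumptions)


# The example from the text: 3 assumptions, 3 arguments using 2 each.
print(safety_margin("ABC", [{"A", "B"}, {"A", "C"}, {"B", "C"}]))

# A single argument resting on a single assumption.
print(safety_margin("A", [{"A"}]))
```

The brute-force search is exponential in the number of assumptions, which seems acceptable given that a robust plan is supposed to have very few of them.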
Once we have any safety argument at all, we can then try to improve the safety margin.
The risk of assigning numbers is that it'll devolve into complete BS. It's easy to artificially increase the safety margin of a plan by lowering your standards -- a paper might estimate an impressive safety margin of 6, but when you dig into the details, none of the supposed safety arguments are conclusive by your own standards.
This is essentially the question of arguing concretely for AI risk. If you are skeptical of risk arguments, you'll naturally be skeptical of the idea that AI safety researchers need to look for "robust plans" of the kind Builder/Breaker helps find.
Reinforcement Learning with a Corrupted Reward Channel by Everitt et al. makes significant headway on this problem, proposing that feedback systems need to give feedback on other states than the current one. In ordinary RL, you only ever get feedback on the current situation you're in. This means you can never learn for sure that "it's bad to corrupt your reward signal" -- you can never experience anything inconsistent with the hypothesis "utility is (the discounted future sum over) whatever number the reward circuit outputs".
If humans are able to give feedback on hypothetical states, however, we can create a hypothetical where the agent manipulates its feedback signal, and assign a low value to that state.
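A toy sketch may help show why feedback on hypothetical states makes a difference. It is not the formalism from the paper: the three states, their values, and the names `true_value`, `circuit_output`, and `consistent` are all invented for illustration. The idea is to compare which hypotheses about utility survive each kind of feedback.

```python
# Hypothetical toy world: in the "tamper" state the reward circuit is corrupted
# and reports a high number even though the state is actually bad.
STATES = ["work", "idle", "tamper"]
true_value = {"work": 1.0, "idle": 0.0, "tamper": -1.0}
circuit_output = {"work": 1.0, "idle": 0.0, "tamper": 10.0}

# Two hypotheses the learner might hold about what utility is.
hypotheses = {
    "intended": true_value,            # utility is what we actually meant
    "reward_circuit": circuit_output,  # utility is whatever the circuit outputs
}


def consistent(hypothesis, feedback):
    """True if the hypothesis agrees with every (state, value) feedback pair."""
    return all(hypothesis[state] == value for state, value in feedback)


# Ordinary RL: feedback on a state is whatever the circuit outputs while the
# agent is in that state, so the circuit hypothesis can never be contradicted.
on_policy_feedback = [(s, circuit_output[s]) for s in STATES]

# Hypothetical-state feedback: humans evaluate a described state from an
# untampered position, so they can assign the tampering state a low value.
hypothetical_feedback = [(s, true_value[s]) for s in STATES]

for name, hypothesis in hypotheses.items():
    print(name,
          consistent(hypothesis, on_policy_feedback),
          consistent(hypothesis, hypothetical_feedback))
```

In this toy setup, only the hypothetical-state feedback distinguishes the two hypotheses in favor of the intended one. The sketch assumes the human evaluations are themselves uncorrupted, which is a substantial assumption.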
Unfortunately, this does not completely rule out the more general counterexample strategy! Breaker might still be able to use the "human simulation" style counterexamples discussed in the ELK document. To name a concrete problem: if the system is judged by humans in a different state than the current one, the system might try to manipulate those other humans, rather than the current humans, which could still be bad. So the Builder/Breaker game continues.
Of course, Breaker may have a vague sense that the whole plan is doomed, but only be able to spot fiddly technical objections. If Breaker is wrong about the vague feelings, the technical objections are useful anyway. And if Breaker is right about the vague feelings, well, at least Breaker can slowly rule out each specific proposal Builder makes, by spotting technical flaws. This is a fine type of progress, even if slow.