Robustness as a Path to AI Alignment

9 comments

I think the sense in which logical inductors are a robustness result is much stronger than the connection between Bayes and robustness. The Dutch books are just an argument that if you are not coherent, you could be money-pumped. They are not even the most convincing argument for coherence, in my opinion. Logical inductors, on the other hand, are directly getting good behavior in reasoning about logic by explicitly stopping adversaries in my reasoning process from making a specific kind of treacherous turn.

In my view logical inductors are an application of a general purpose robust learning framework to logic. Most of the stuff in the logical induction paper could have been done in a domain other than logic, and most of the insights are not about logic. Instead, the insights are about a way of aggregating a bunch of experts that allow them to watch over each other and bet against the adversarial experts when they try to make a treacherous turn.

Logic is a very rich domain that we needed to understand in order to think about naturalized agency. However, since it is so large and rich, it contains adversaries, and normal ways of doing induction were not robust to these adversaries pushing you around (as in the All Mathematicians are Trollable result). Logical induction stepped in as a way to get good local results in logic without letting the adversaries take over. (Or at least it was slightly more robust than other approaches. It still does not solve benign induction.) Then, we used logical induction for reasoning about naturalized agency and a bunch of fruit came out, but those are just the applications. At the heart, logical inductors are a robustness result that is not about logic. (Although our methods for solving the robustness problem were very naturalized in flavor.)

> So, the Dutch Book argument for the axioms of probability theory has an adversarial form as well. The same can be said of the money-pump argument which justifies expected utility theory. Bayesians are not so averse to adversarial assumptions as they may seem; lurking behind the very notion of "doing well in expectation" is a "for all" requirement! Bayesians turn up their noses at decision procedures which try to do well in any other than an average-case sense because they know such a procedure is money-pumpable; an adversary could swoop in and take advantage of it! This funny mix of average-case and worst-case reasoning is at the very foundation of the Bayesian edifice.

This is incorrect. The Bayesian edifice involves a complete rejection of worst-case reasoning. The possibility of an adversary tricking you into paying money to go through a sequence of trades that leave you back where you started isn't a good justification for the transitivity axiom; it's just a story that sacrifices being a correct argument when taken literally in favor of dramatic appeal, and is intended to hint in the direction of the actual argument. It's the "a paperclip factory will build an AI that will turn you into paperclips" of justifying transitive preferences. Concluding from this story that advocates of expected utility theory are worried primarily about being exploited by adversaries is missing the point in the same way that concluding that advocates of taking AI risk seriously are worried primarily about being turned into paperclips is.

The actual justification for transitive preferences is not a worst-case argument, but an EVERY-case argument. There doesn't exist any way for an agent to demonstrate that it has intransitive preferences without wasting resources by fighting itself. If there are any lotteries A, B, and C such that an agent prefers A>B, B>C, and C>A, and a nonzero probability for each pair from those three that the agent will get the opportunity to pay a small cost ("small" meaning small enough not to change any of those preferences) to choose between those two, then the agent pays for a probability distribution over outcomes that it could have gotten without paying. So the best case for an agent with intransitive preferences is that it never gets the chance to act on those preferences, and thus acts the same way that an agent with transitive preferences would. Anything else results in the agent spending unnecessary costs.
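The circular-trade structure can be made concrete with a toy simulation (the option names and fee size here are illustrative, not from the comment):

```python
# Toy model of the money-pump: an agent with cyclic preferences
# A > B > C > A pays a small fee every time it switches to an option
# it prefers, and so can be led in a circle back to where it started.
CYCLIC_PREFS = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y) means x > y

def prefers(x, y):
    return (x, y) in CYCLIC_PREFS

def run_offers(start, offers, fee=1.0):
    """The agent switches (and pays `fee`) whenever the offered
    alternative is preferred to what it currently holds."""
    holding, spent = start, 0.0
    for alt in offers:
        if prefers(alt, holding):
            holding, spent = alt, spent + fee
    return holding, spent

holding, spent = run_offers("A", ["C", "B", "A"])
# the agent ends holding A again, having paid 3 units for nothing
```

Each individual switch looks locally rational to the agent, which is exactly the every-case point: whenever it gets to act on the cycle at all, it pays for a distribution over outcomes it could have had for free.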

A few reasons. First, the VNM framework isn't about sequential decisions; it's about one-shot decisions. This doesn't matter too much in practice because sequential decision problems can be turned into one-shot decision problems either by having the agent pick a policy, or by using what the agent expects it will do in the future in each case to figure out what future outcomes are currently the available options. So if the agent is supposedly being offered a choice between A and B, but if it picks B, then it will be offered the option to switch to C in the future, then it isn't actually being offered a choice between A and B. The sequential argument doesn't really make sense in the static VNM context.

But also, the argument from the sequential scenario is much less robust, since as Abram pointed out, it is only one scenario that could happen with intransitive preferences. The fact that every scenario in which an agent gets to act on its intransitive preference also involves unnecessary costs to the agent seems more important. Another way in which the sequential scenario is less robust is that it can be defeated by having a policy of stopping before you get back where you started if offered the opportunity to pay to repeatedly switch outcomes. But of course this tactic does not change the fact that if you pay to go even one step no matter what your starting position was, then you're paying unnecessary costs.

> We want an alternative to optimization which is robust to misspecified utility functions. A Bayesian approach might introduce a probability distribution over possible utility functions, and maximize expected utility with respect to that uncertainty. This doesn't do much to increase our confidence in the outcome; we've only pushed the problem back to correctly specifying our uncertainty over the utility distribution, and problems from over-optimizing a misspecified function seem just about as likely. Certainly we get no new formal guarantees.
>
> So, instead, we model the situation by supposing that an adversary has some bounded amount of power to deceive you about what your true utility function is. The adversary might concentrate all of this on one point which you'll be very mistaken about (perhaps making a very bad idea look very good), or spread it out across a number of possibilities (making many good ideas look a little worse), or something in between. Under this assumption, a policy which randomizes actions somewhat rather than taking the max-expected-utility action is effective. This gives you some solid guarantees against utility misspecification, unlike the naive Bayesian approach. There's still more to be desired, but this is a clear improvement.

It's not clear to me that this actually gains anything. I'd expect that adequately parameterizing a class of adversaries to defend against isn't much easier than adequately parameterizing a class of utility functions to be uncertain over.

> Calibration means that the beliefs can be treated as frequencies

Does logical induction have a calibration result? I know it has a self-trust result that basically says it believes itself to be calibrated, but I'm not aware of a result saying that logical inductors actually are calibrated. For that matter, I'm not even sure how such a result would be precisely stated. [Edit: It turns out there are calibration results, in section 4.3 of the logical induction paper.]

You might be interested in my work on learning from untrusted data (see also earlier work on aggregating unreliable human input). I think it is pretty relevant to what you discussed, although if you do not think it is, then I would also be pretty interested in understanding that.

Unrelated, but for quantilizers, isn't the biggest issue going to be that if you need to make a sequence of decisions, the probabilities are going to accumulate and give exponential decay? I don't see how to make a sequence of 100 decisions in a quantilizing way unless the base distribution of policies is very close to the target policy.

I've been interested in the general question of adapting 'safety math' to ML practices in the wild for a while, but as far as I know there isn't a good repository of (a) math results with clear short term implications or (b) practical problems in current ML systems. Do you have any references for such things? (even just a list of relevant blogs and especially good posts that might be hard to track down otherwise would be very helpful)

First, I want to note that the approach I'm discussing in the post doesn't necessarily have much to do with (a) or (b); the "philosophy to math to implementation" pipeline may still be primarily concerned with (a*) math results with far-term implications and (b*) practical problems in ML systems which aren't here yet.

That being said, it is hard to see how a working philosophy-math-implementation pipeline could grow and stay healthy if it focused only on problems which aren't here yet; we need the pipeline to be in place by the time it is needed. This poses a problem, because if we are trying to avert future problems, we don't want to get caught in a trap of only doing things which can be justified by dealing with present issues.

Still, over-optimizing for the wrong objective really is a natural generalization of overfitting machine learning models to the data, so it is plausible that quantilizing (or other techniques yet to be invented) provides *better* results on a wide variety of problems than maximizing. Although my thinking on this is motivated by longer-term considerations, there's no reason to think this doesn't show up in existing systems.

Some references for alignment/safety work in this direction: RL with a Corrupted Reward Channel, and Concrete Problems in AI Safety.

[Epistemic Status: Some of what I'm going to say here is true technical results. I'll use them to gesture in a research direction which I think may be useful; but, I could easily be wrong. This does not represent the current agenda of MIRI overall, or even my whole research agenda.]

## Converting Philosophy to Machine Learning

A large part of the work at MIRI is to turn fuzzy philosophy problems into hard math. This sometimes makes it difficult to communicate what work needs to be done, for example to math-savvy people who want to help. When most of the difficulty is in finding a problem statement, it's not easy to outsource the intellectual labor.

Philosophy is also hard to get traction on. Arguably, something really good happened to the epistemic norms of AI research when things switched over from GOFAI to primarily being about machine learning. Before, what constituted progress in AI was largely up to personal taste. After, progress could be verified by achieving high performance on benchmarks. There are problems with the second mode as well -- you get a kind of bake-off mentality which focuses on tricks to get higher performance, not always yielding insight. (For example, top-performing techniques in machine learning competitions often combine many methods, taking advantage of the complementary strengths and weaknesses of each. However, this approach leans on the power of the other methods.) Nonetheless, this is better for AI progress than armchair philosophy and toy problems.

It would be nice if AI alignment could be more empirically grounded. There are serious obstacles to this. Many alignment concerns, such as self-modification, barely show up or seem quite easy to solve when you aren't dealing with a superintelligent system. However, I'll argue that there is sometimes a way to turn difficult alignment problems into machine learning problems.

A second reason to look in this direction is that in order to do any good, alignment research has to be used by the people who end up making AGI. The way things look right now, that means they have to be used by machine learning researchers. To that end, anything which puts things closer to a shape which ML researchers are familiar with seems good.

To put it a different way: in the end, we need a successful pipeline from philosophy, to math, to implementation. So far, MIRI has focused on optimizing the first part of that pipeline. I think it may be possible to do research in a way which helps optimize the second part.

We wouldn't want the research direction to be constrained by this, since in the end we need to figure out what actually works, not what creates the most consumable research. However, I'll also argue that the research direction is plausible in itself.

## The Big Picture

I started working full-time at MIRI about three months ago. In my second week, we had a research retreat in which we spent a lot of time re-thinking how all of the big research problems connect with each other and to the overall goal. I came out of this with the view that things factored somewhat cleanly into three research areas:

1. value learning,
2. robust optimization, and
3. naturalized agency.

[Again, this write-up isn't intended to reflect the view of MIRI as a whole.]

**Value Learning:** The first problem is to specify what is "good" or what you "want" in enough detail that nothing goes wrong when we optimize for it. This is too hard (since humans seem really bad at knowing what they want in precise terms), so it would be nice to reduce it to a learning problem, if possible. This requires things like learning human concepts (including the concept "human") and accounting for bounded rationality in learning human values (so that you don't assume the human *wanted* to stub its toe on the coffee table).

**Robust Optimization:** We will probably get #1 wrong, so how can we specify systems which don't go off-track too badly if their values are misspecified? This includes things like transparency, corrigibility, and planning under moral uncertainty (doing something other than max-expected-value to avoid over-optimizing). Ideally, you want to be able to ask a superintelligent AI to make burritos, and *not* end up with a universe tiled with burritos. This corresponds approximately to the AAMLS agenda.

**Naturalized Agency:** Even if we just *knew* the correct value function and knew how to optimize it in a robust way, we don't actually know how to build intelligent agents which optimize values. It's a bit like the difference between knowing that you want to classify images and getting to the point where you optimize neural nets to do so: you have to figure out that squared-error loss plus a regularizer works well, or whatever. We aren't to the point where we just know what function to optimize neural nets for to get AGI out, value-aligned or no. Existing decision theories, agent frameworks, and definitions of intelligence don't seem up to the task of examining what rational agency looks like when the agent is embedded in a world which is bigger than it (so the real world is certainly not in the hypothesis space which the agent can represent), the agent can self-modify (so reflective stability and self-trust become important), and the agent is part of the world (so agents must understand themselves as physics and consider their own death).

To storify:

1. AI should do X such that X = argmax(value(X))!
2. WAIT! We don't know what value is! We should figure that out!
3. WAIT! Trying to argmax a slightly wrong thing often leads to more-than-just-slightly wrong results! We should figure out some other operation than argmax, which doesn't have that problem!
4. WAIT! The universe isn't actually in a functional form such that we can just optimize it! What are we supposed to do?

In a sense, this is a series of proxy problems. We actually want #1, but we've done relatively little on that front, because it seems much too confusing to make progress on. #2 still cuts relatively close to the problem, and plausibly, solving #2 means not needing to solve #1 as well. More has been done on #2, but it is still harder and more confusing than #3. #3 is fairly far removed from what we want, but working on #3 seems to plausibly be the fastest route to resolving confusions which block progress on #1 and #2.

What I want to outline is a particular way of thinking which seems to be associated with progress on both #2 and #3, and which also seems like a good sign for the philosophy→implementation pipeline.

## What is Robustness?

(I'm calling this thing "robustness" in association with #2, but "robust optimization" should be thought of as its own thing -- robustness is necessary for robust optimization, but perhaps not sufficient.)

Robustness might be intuitively described as tolerance to errors. Put in a mathematical context, we can model this via an adversary who has some power to trip you up. A robustness property says something about how well you do against such adversaries.

For example, take quantilization. We want an alternative to optimization which is robust to misspecified utility functions. A Bayesian approach might introduce a probability distribution over possible utility functions, and maximize expected utility with respect to that uncertainty. This doesn't do much to increase our confidence in the outcome; we've only pushed the problem back to correctly specifying our uncertainty over the utility distribution, and problems from over-optimizing a misspecified function seem just about as likely. Certainly we get no new formal guarantees.

So, instead, we model the situation by supposing that an adversary has some bounded amount of power to deceive you about what your true utility function is. The adversary might concentrate all of this on one point which you'll be very mistaken about (perhaps making a very bad idea look very good), or spread it out across a number of possibilities (making many good ideas look a little worse), or something in between. Under this assumption, a policy which randomizes actions somewhat rather than taking the max-expected-utility action is effective. This gives you some solid guarantees against utility misspecification, unlike the naive Bayesian approach. There's still more to be desired, but this is a clear improvement.
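A minimal sketch of the quantilizer idea, under simplifying assumptions not in the original proposal (a finite action list with a uniform base distribution; the general version quantilizes over an arbitrary base distribution):

```python
import random

def quantilize(actions, estimated_utility, q=0.1, rng=random):
    """Sample uniformly from the top q fraction of actions by estimated
    utility, rather than taking the argmax. If an adversary's total power
    to corrupt the utility estimate is bounded, the expected harm from
    quantilizing is bounded by that budget divided by q -- a guarantee
    that pure argmax does not have."""
    ranked = sorted(actions, key=estimated_utility, reverse=True)
    k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:k])

# e.g. with 100 candidate actions, a 10% quantilizer picks uniformly
# among the 10 best-looking ones
action = quantilize(range(100), lambda a: a, q=0.1)
```

The knob mentioned later in the post is `q`: smaller `q` optimizes harder and weakens the guarantee against a concentrated error, larger `q` stays closer to the base distribution.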

Mathematically, an adversarial assumption is just a "for all" requirement. Bayesians are more familiar with doing well *in expectation*. Doing well in expectation has its merits. However, adversarial assumptions create stronger guarantees on performance, by optimizing for the worst case.

## Garrabrant Inductors as Robustness

Garrabrant Inductors (AKA logical inductors) are MIRI's big success in naturalized agency. (Reflective oracles come in second.) They go a long way to clear up confusions about logical uncertainty, which was one of the major barriers to understanding naturalized agents. When I say that there has been more progress on naturalized agency than on robustness, they're a big part of the reason. Yet, at their heart is something which looks a lot like a robustness result: the logical induction criterion. You take the set of all poly-time trading strategies on a belief market, and ask that a Garrabrant inductor doesn't keep losing against any of these forever. This is very typical of bounded-loss conditions in machine learning.
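The "don't keep losing forever against any trader" shape is the same shape as classic no-regret results in machine learning. A standard illustration (this is the multiplicative-weights experts algorithm, not logical induction itself) shows how an aggregator avoids being exploited indefinitely by any single adversarial expert:

```python
import math

def aggregate_loss(expert_losses, eta=0.1):
    """Multiplicative weights over N experts. Total loss is within
    O(sqrt(T log N)) of the best single expert, so an adversarial
    expert cannot keep winning against the aggregate forever."""
    n = len(expert_losses[0])
    weights = [1.0] * n
    total = 0.0
    for losses in expert_losses:          # per-round losses in [0, 1]
        z = sum(weights)
        total += sum((w / z) * l for w, l in zip(weights, losses))
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return total

# one honest expert (loss 0) vs. one adversarial expert (loss 1), 200 rounds:
rounds = [[0.0, 1.0]] * 200
loss = aggregate_loss(rounds)  # stays small; the adversary is downweighted
```

The adversarial expert's influence decays geometrically, so its total damage is bounded — loosely analogous to how a logical inductor cannot be exploited forever by any poly-time trading strategy.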

In return, we get reliability guarantees. Sub-sequence optimality means that we get the benefits of the logical induction criterion no matter which subset of facts we actually care about. Calibration means that the beliefs can be treated as frequencies, and unbiasedness means these frequencies will be good even if the proof system is biased (selectively showing evidence on one side more often than the other). Timely learning means (among other things) that it doesn't matter too much if the theorem prover we're learning from is slow; we learn to predict things as quickly as possible (eventually).

The logical induction criterion is a relaxation of the standard Bayesian requirement that there be no Dutch Book against the agent. So, the Dutch Book argument for the axioms of probability theory has an adversarial form as well. The same can be said of the money-pump argument which justifies expected utility theory. Bayesians are not so averse to adversarial assumptions as they may seem; lurking behind the very notion of "doing well in expectation" is a "for all" requirement! Bayesians turn up their noses at decision procedures which try to do well in any other than an average-case sense because they *know* such a procedure is money-pumpable; an adversary could swoop in and take advantage of it!

This funny mix of average-case and worst-case reasoning is at the very foundation of the Bayesian edifice. I'm still not quite sure what to think of it, myself. Philosophically, what should determine when I prefer an average-case argument vs a worst-case one? But, that is a puzzle for another time. The point I want to make here is that there's a close connection between the types of arguments we see at the foundations of decision theory (Dutch Book and money-pump arguments which justify notions of rationality in terms of guarding yourself against an adversary) and arguments typical of machine learning (bounded-loss properties).

The Dutch Book argument forces a tight, coherent probability distribution, which can't be both computable and consistent with logic. Relaxing things a little yields a wealth of benefits. What other foundational arguments in decision theory can we relax a little to get rich robustness results?

## Path-Independence

These examples are somewhat hand-wavy; what I'll say here is true, but hasn't yet brought forth any fruit in terms of AI alignment results. I am putting it here merely to provide more examples of being able to frame decision-theory things as robustness properties.

I've mentioned the Dutch Book argument. Another of the great foundational arguments for Bayesian subjective probability theory is Cox's Theorem. One of the core assumptions is that if a probability can be derived in many ways, the results must be equal. This is related to (but not identical with) the fact that it doesn't matter what order you observe evidence in; the same evidence gives the same conclusion, every time.
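Order-independence of Bayesian updating is easy to verify directly: the posterior is a normalized product of likelihoods, and products commute. A small sketch (the hypotheses and likelihood values are made up for illustration):

```python
from itertools import permutations

def posterior(prior, likelihoods, evidence):
    """Bayes-update on each piece of evidence in turn; since the result
    is a normalized product of likelihoods, the update order cannot
    matter."""
    probs = dict(prior)
    for e in evidence:
        probs = {h: p * likelihoods[h][e] for h, p in probs.items()}
        z = sum(probs.values())
        probs = {h: p / z for h, p in probs.items()}
    return probs

prior = {"H1": 0.5, "H2": 0.5}
likelihoods = {"H1": {"a": 0.9, "b": 0.2, "c": 0.5},
               "H2": {"a": 0.1, "b": 0.7, "c": 0.5}}
posteriors = [posterior(prior, likelihoods, order)
              for order in permutations("abc")]
# every ordering of the same three observations yields the same posterior
```

This is the property Garrabrant Induction gives up in its exact form, trading it for computability plus the asymptotic unbiasedness described below.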

Putting this into an adversarial framework, this means the class of arguments which we accept doesn't leave us open to manipulation. Garrabrant Induction weakens this (conclusions are not fully independent of the order in which evidence is presented), but also gets versions of the result which a Bayesian can't, as mentioned in the previous section: it arrives at unbiased probabilities even if it is shown a biased sampling of the evidence, so long as it keeps seeing more and more. (This is part of what saves Garrabrant Induction from my All Mathematicians are Trollable result.)

Another example illustrating the need for path-independence is Pascal's Mugging. If your utility function is unbounded and your probability distribution is something like the Solomonoff distribution, it's awfully hard to avoid having divergent expected utilities. What this means is that when you try to sum up your expected utility, the sum you get is dependent on the order you sum things in. This means Pascal can alter your end conclusion by directing your attention to certain possibilities, extracting money from you as a result.
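The order-dependence of infinite sums can be seen in miniature with a conditionally convergent series (the Riemann rearrangement phenomenon); in the Pascal's Mugging case the expectations are outright divergent, so the effect is even starker:

```python
import math

# The alternating harmonic series 1 - 1/2 + 1/3 - ... converges to
# ln(2), but only conditionally, so reordering its terms changes the sum.
n = 10**5
standard = sum((-1) ** (k + 1) / k for k in range(1, n + 1))

# Taking one positive term followed by two negative terms instead:
# 1 - 1/2 - 1/4 + 1/3 - 1/6 - 1/8 + ...  converges to ln(2)/2.
rearranged = sum(1 / (2 * j - 1) - 1 / (4 * j - 2) - 1 / (4 * j)
                 for j in range(1, n + 1))
```

Same terms, different order, different answer — which is exactly the lever Pascal pulls by directing your attention to certain possibilities first.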

It seems to me that path-independent reasoning is a powerful rationality tool which I don't yet fully understand.

## Nuke Goodhart’s Law From Orbit

(Repeatedly. It won't stay down.)

Goodhart's Curse is not the *only* problem in the robust optimization cluster, but it's close; the majority of the problems there can be seen as one form or another of Goodhart. Quantilizers are significant progress against Goodhart, but not total. Quantilizers give you a knob you can turn to optimize softer or harder, without clear guidance on how much optimization is safe. If you keep turning up the knob and seeing better results, what would make you back off from cranking it up as far as you can go?

Along similar lines, but from the AI's perspective, there's nothing stopping a quantilizer from building a maximizer in order to solve its problem. In our current environment, "implement a superintelligent AI to solve the problem" is far from the laziest solution; but in an environment containing highly intelligent quantilizers, the tools to do so are lying around. It can do so merely by "turning up its own knob".

Nonetheless, it seems plausible that progress can be made via more robustness results in a similar direction.

Something which has been discussed at MIRI, due to Paul Christiano's thoughts on the subject, is the Benign Induction problem. Suppose that you have some adversarial hypotheses in your belief mixture, which pose as serious hypotheses and make good predictions much of the time, but are actually out to get you; after amassing enough credibility, at a critical juncture they make bad predictions which do you harm.

One way of addressing this, inspired by the KWIK ("knows what it knows") learning framework, is the consensus algorithm. How it works is, you don't output any probability at all unless your top hypotheses *agree* on the prediction; not just on the classification, but on the probability, to within some acceptable epsilon tolerance. This acts as an honesty amplifier. Suppose you have a hundred hypotheses, and only one is good; the rest are corrupt. Even if the corrupt hypotheses can coordinate with each other, the one good hypothesis keeps them all in check: nothing they say gets out to the world unless they agree very closely with the good one. (However, they *can* silence the good hypothesis selectively, which seems concerning!)

A solution to the benign induction problem would be significant progress on the robust optimization problem: if we could trust the output of induction, we could use it to predict what optimization techniques are safe! (There's much more to be said on this, but this is not the article for it.)
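A minimal sketch of the consensus idea as described above; the dict fields, `top_k`, and `eps` values are illustrative choices, not part of the original formulation:

```python
def consensus_predict(hypotheses, x, top_k=5, eps=0.05):
    """Output a probability only when the top_k most credible hypotheses
    agree to within eps; otherwise abstain (return None). A single good
    hypothesis among the top_k blocks any coordinated corrupt prediction
    that strays from it."""
    top = sorted(hypotheses, key=lambda h: h["credibility"],
                 reverse=True)[:top_k]
    preds = [h["predict"](x) for h in top]
    if max(preds) - min(preds) <= eps:
        return sum(preds) / len(preds)
    return None  # no consensus: stay silent

# 99 coordinated corrupt hypotheses and 1 good one:
corrupt = [{"credibility": 0.9, "predict": lambda x: 0.99}
           for _ in range(99)]
good = [{"credibility": 0.95, "predict": lambda x: 0.10}]
result = consensus_predict(corrupt + good, x=None)  # → None: the good one dissents
```

Note the asymmetry flagged in the comment above: the corrupt majority can't push a bad prediction out, but by dissenting they can force abstention, silencing the good hypothesis.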

So, quantilizers are robust to adversarial noise in the utility function, and the consensus algorithm is (partially) robust to adversarial hypotheses in its search space. Imagine a world where we've got ten more things like that. This seems like significant progress. Maybe then we come up with a Robust Optimization Criterion which implies all the things we want!

Machine learning experts and practitioners alike are familiar with the problems of over-optimization, and the need for regularization, in the guise of overfitting. Goodhart's Curse is, in a sense, just a generalization of that. So, this kind of alignment progress might be absorbed into machine learning practice with relative ease.

## Limits of the Approach

One problem with this approach is that it doesn't provide *that* much guidance. My notion of robustness here is extremely broad. "You can frame it in terms of adversarial assumptions" is, as I noted, equivalent to "use *for all*". Setting out to use universal quantifiers in a theory is hardly much to go on!

It's not nothing, though; as I said, it challenges the Bayesian tendency to use "in expectation" everywhere. And, I think adding adversarial assumptions is a good brainstorming exercise. If a bunch of people sit down and try to come up with new parts to inject adversarial assumptions into for five minutes, I'm happy. It just may be that someone comes up with a great new robustness idea as a consequence.