Relaxed adversarial training for inner alignment

[-]Lauro Langosco3yΩ560

Minor comment on clarity: you don't explicitly define relaxed adversarial training (it's only mentioned in the title and the conclusion), which is a bit confusing for someone coming across the term for the first time. Since this is the current reference post for RAT, I think it would be nice if you did this explicitly; for example, I'd suggest renaming the second section to 'Formalizing relaxed adversarial training', and within the section calling it that instead of 'Paul's approach'.

[-]evhub3yΩ230

Good point—edited.

[-]Oliver Sourbut4y60

I had a gripe about the difficulty of 'proposing' vs 'checking' pseudo-inputs. Classically proposing things is much 'harder' than checking things (e.g. P vs NP, proving vs verifying, decidable vs recognisable, ...).

It seems like the adversary has the really hard job of 'proposing' pseudo-inputs which are particularly likely to lead to unacceptable behaviour, while the (potentially unacceptable) model under scrutiny is only required to do something like 'checking' such inputs (at deploy/runtime) to determine whether it's going to behave acceptably or not. Importantly, it might not even 'know' ahead of time that those inputs would lead to unacceptable behaviour.

In a conversation with Evan about this, he rightly pointed to the section on suboptimality alignment, which is discussed in this post as the case where the model becomes unacceptable or deceptive with e.g. more information about its situation or more 'thinking time'. If we leave that aside, it might seem that the only way for the model to be unacceptable in a situation is for the model to already have done the hard work of 'proposing' (to itself in some way) that situation. In that case, the adversary, assuming enough transparency, hopefully has a comparatively less difficult job again.

Some questions

Does suboptimality alignment definitely capture all cases of unacceptability where the model didn't 'already propose' (in some way) the situation? Or is there another way to characterise such cases?
What literature would readers suggest for this kind of suboptimality alignment (you might call it 'unplanned objective misalignment')?
How big a challenge is this to relaxed adversarial training?

[-]Lukas Finnveden4yΩ450

In fact, if we think of pseudo-inputs as predicates that constrain X, we can approximate the probability of unacceptable behavior during deployment as[7]

$P (C (M, x) | x \sim deploy) \approx \max_{α \in X_{pseudo}} P (α (x) | x \sim deploy) \cdot P (C (M, x) | α (x), x \sim deploy)$ such that, if we can get a good implementation of $P$, we no longer have to worry as much about carefully constraining $X_{pseudo}$, as we can just let $P$'s prior do that work for us.

Where footnote 7 reads:

Note that this approximation is tight if and only if there exists some $α \in X_{pseudo}$ such that $α (x) \leftrightarrow C (M, x)$

I think the "if" direction is right, here, but the "only if" direction is wrong. For example, the approximation is also tight in the case where $X_{pseudo}$ only has a single element $α$ such that $α (x)$ is true for all $x$.

I think the approximation is tight if and only if any of the $α \in X_{pseudo}$ that maximizes the expression fulfils $C (M, x) \to α (x)$.
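
A toy check of the counterexample above, as a minimal sketch. The four-element input space, the uniform deployment distribution, and the names `always_true` and `misses_c` are illustrative assumptions, not anything from the post. It uses the fact that $P (α (x)) \cdot P (C (M, x) | α (x))$ is just the probability that $α (x)$ and $C (M, x)$ hold together.

```python
from fractions import Fraction

# Assumption: a tiny input space with a uniform deployment distribution.
X = [0, 1, 2, 3]
deploy = {x: Fraction(1, 4) for x in X}


def C(x):
    """Toy stand-in for C(M, x): unacceptable behavior only on input 0."""
    return x == 0


def prob(event):
    """Probability of an event (a predicate on X) under the deployment distribution."""
    return sum(p for x, p in deploy.items() if event(x))


def bound(alpha):
    """P(alpha(x)) * P(C | alpha(x)), i.e. the term being maximized over alpha."""
    p_alpha = prob(alpha)
    if p_alpha == 0:
        return Fraction(0)
    p_c_given_alpha = prob(lambda x: alpha(x) and C(x)) / p_alpha
    return p_alpha * p_c_given_alpha


def always_true(x):
    """The trivial pseudo-input: true for every x."""
    return True


def misses_c(x):
    """A pseudo-input that excludes the only unacceptable input."""
    return x in (1, 2)


p_c = prob(C)                 # true probability of unacceptable behavior
tight = bound(always_true)    # equals p_c although always_true is not equivalent to C
loose = bound(misses_c)       # strictly below p_c because C does not imply misses_c
```

With `always_true` as the only pseudo-input, the bound matches `p_c` exactly even though `always_true` is not equivalent to `C`, which is the failure of the "only if" direction. `misses_c` shows the gap that opens when $C (M, x) \to α (x)$ fails.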

[-]AlexMennen5yΩ350

I'm concerned about Goodhart's law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify the training procedure by also including the acceptability predicate in the loss function during training. This results in an end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by the acceptability predicate, which is great. But otherwise, we have made the bad behavior of the final product more difficult to detect, essentially by training the AI to be deceptively aligned.

[-]evhub5yΩ120

Yep, that's one of the main concerns. The idea, though, is that all you have to deal with should be a standard overfitting problem, since you don't need the acceptability predicate to work once the model is deceptive, only beforehand. Thus, you should only have to worry about gradient descent overfitting to the acceptability signal, not the model actively trying to trick you, which I think is a solvable overfitting problem. Currently, my hope is that you can do that by using the acceptability signal to enforce an easy-to-verify condition that rules out deception, such as myopia.

[-]tamera4yΩ340

I'm not sure what's going on with the types in this equation, at the start of the formalization section:

I'd think that the left side represents a pseudo-input, while the right represents an action. Am I missing something?

[-]evhub4yΩ020

Actions are just language outputs—and since we ask for an action that describes a pseudo-input, hopefully we should be able to interpret it that way.

[-]Matthew Barnett6y*Ω340

For the Alignment Newsletter:

Summary:

Previously, Paul Christiano proposed creating an adversary to search for inputs that would make a powerful model behave "unacceptably" and then penalizing the model accordingly. To make the adversary's job easier, Paul relaxed the problem so that it only needed to find a pseudo-input, which can be thought of as a predicate that constrains possible inputs. This post expands on Paul's proposal by first defining a formal unacceptability penalty and then analyzing a number of scenarios in light of this framework. The penalty relies on the idea of an amplified model inspecting an unamplified version of itself. For this procedure to work, amplified overseers must be able to correctly deduce whether potential inputs will yield unacceptable behavior in their unamplified selves, which seems plausible since they should know everything the unamplified version does. The post concludes by arguing that progress in model transparency is key to these acceptability guarantees. In particular, Evan emphasizes the need to decompose models into the parts involved in their internal optimization processes, such as their world models, optimization procedures, and objectives.

Opinion:

I agree that transparency is an important condition for the adversary, since it would be hard to search for catastrophe-inducing inputs without details of how the model operated. I'm less certain that this particular decomposition of machine learning models is necessary. More generally, I am excited to see how adversarial training can help with inner alignment.

[-]Rohin Shah6yΩ230

My opinion, also going into the newsletter:

Like Matthew, I'm excited to see more work on transparency and adversarial training for inner alignment. I'm somewhat skeptical of the value of work that plans to decompose future models into a "world model", "search" and "objective": I would guess that there are many ways to achieve intelligent cognition that don't easily factor into any of these concepts. It seems fine to study a system composed of a world model, search and objective in order to gain conceptual insight; I'm more worried about proposing it as an actual plan.

[-]evhub6yΩ230

The point about decompositions is a pretty minor portion of this post; is there a reason you think that part is more worthwhile to focus on for the newsletter?

[-]Rohin Shah6yΩ450

That's... a fair point. It does make up a substantial portion of the transparency section, which seems like the "solutions" part of this post, but it isn't the entire post.

Matthew's certainly right that I tend to reply to things I disagree with, though I usually try to avoid disagreeing with details. I'm not sure that I only disagree with details here, but I can't clearly articulate what about this feels off to me. I'll delete the opinion altogether; I'm not going to put an unclear opinion in the newsletter.

[-]Matthew Barnett6yΩ220

I'm not Rohin, but I think there's a tendency to reply to things you disagree with rather than things you agree with. That would explain my emphasis anyway.

[-]Arthur Conmy3yΩ230

I don't understand the new unacceptability penalty footnote. In both of the $P_M$ terms, there is no conditional $|$ sign. I presume the comma is wrong?

Also, for me \mathbb{B} for {True, False} was not standard, I think it should be defined.

[-]evhub3yΩ220

I don't understand the new unacceptability penalty footnote. In both of the $P_M$ terms, there is no conditional $|$ sign. I presume the comma is wrong?

They're unconditional, not conditional probabilities. The comma is just for the exists quantifier.

Also, for me \mathbb{B} for {True, False} was not standard, I think it should be defined.

Sure—edited.

[-]Arthur Conmy3y10

Ah OK - the fact that the definition of $P_M$ is only the conditional case confused me

[-]Anthony DiGiovanni3yΩ230

$L_{M} = P_{M} (Adv (M) (x) | x \sim deploy) \cdot P_{M} (C (M, x) | Adv (M) (x), x \sim deploy)$

Basic questions: If the type of Adv(M) is a pseudo-input, as suggested by the above, then what does Adv(M)(x) even mean? What is the event whose probability is being computed? Does the unacceptability checker C also take real inputs as the second argument, not just pseudo-inputs—in which case I should interpret a pseudo-input as a function that can be applied to real inputs, and Adv(M)(x) is the statement "A real input x is in the pseudo-input (a set) given by Adv(M)"?

(I don't know how pedantic this is, but the unacceptability penalty seems pretty important, and I struggle to understand what the unacceptability penalty is because I'm confused about Adv(M)(x).)

[-]evhub3yΩ220

The idea is that we're thinking of pseudo-inputs as “predicates that constrain X” here, so, for $α \in X_{pseudo}$, we have $α : X \to \mathbb{B}$.

[-]Anthony DiGiovanni3y10

Ah right, thanks! (My background is more stats than comp sci, so I'm used to "indicator" instead of "predicate.")
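
To make the types in this exchange concrete, here is a minimal sketch in which a pseudo-input is literally a function from inputs to booleans, so that `Adv(M)(x)` reads as "the predicate returned by the adversary, applied to a real input x". Everything named here (`toy_adv`, `toy_c`, the integer input space, the uniform distribution) is a hypothetical stand-in. In particular, the post's $P_M$ is the overseer's belief, whereas this sketch simply computes exact frequencies over a known distribution.

```python
from fractions import Fraction
from typing import Callable, Dict

Input = int
PseudoInput = Callable[[Input], bool]   # alpha : X -> B
Checker = Callable[[Input], bool]       # x -> C(M, x) for a fixed model M


def penalty(deploy: Dict[Input, Fraction], alpha: PseudoInput, c: Checker) -> Fraction:
    """P(alpha(x) | x ~ deploy) * P(C(M, x) | alpha(x), x ~ deploy), computed exactly."""
    p_alpha = sum(p for x, p in deploy.items() if alpha(x))
    if p_alpha == 0:
        return Fraction(0)
    p_both = sum(p for x, p in deploy.items() if alpha(x) and c(x))
    return p_alpha * (p_both / p_alpha)


def toy_adv(trigger: Input) -> PseudoInput:
    """Hypothetical adversary: proposes the predicate 'x is at least the trigger'."""
    return lambda x: x >= trigger


def toy_c(trigger: Input) -> Checker:
    """Hypothetical checker: the toy model misbehaves only on its trigger input."""
    return lambda x: x == trigger


# Assumption: ten equally likely deployment inputs, and a toy model that defects on 8.
deploy = {x: Fraction(1, 10) for x in range(10)}
alpha = toy_adv(8)                              # a pseudo-input, i.e. a predicate
loss = penalty(deploy, alpha, toy_c(8))         # 2/10 * 1/2
```

Here `alpha(9)` is a plain boolean, which is the event whose probability the first factor measures.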

[-]Oliver Sourbut4y30

For an alignment proposal you can ask about where value judgement ultimately bottoms out, and of course in this case at some point it's a human/humans in the loop. This reminds me of a discussion by Rohin Shah about a distinction one can draw between ML alignment proposals: those which load value information 'all at once' (pre-deploy) and those which (are able to) incrementally provide value feedback at runtime.

I think naively interpreted, RAT looks like it's trying to load value 'all at once'. This seems really hard for the poor human(s) having to make value judgements about future incomprehensible worlds, even if they have access to powerful assistance! But perhaps not?

e.g. perhaps one of the more important desiderata for 'acceptability' is that it only includes behaviour which is responsive (in the right ways!) to ongoing feedback (of one form or another)?

[-]Oliver Sourbut4y30

A potential issue with Relaxed Adversarial Training, as factorised in the post: the deployment distribution $deploy$ is presumably dependent on the outcome of the training process itself (i.e. the training process has side-effects, most notably the production of a deployed ML artefact which might have considerable impact on the world!). Since the training process is downstream of the adversary, this means that the quality of the adversary's choice of pseudo-inputs to propose depends on the choice itself. This could lead to concerns about different fixed points (or even the existence of any fixed point?) in that system.

(My faint worry is that by being proposed, a problematic pseudo-input will predictably have some gradient 'training it away', making it less plausible to arise in the deploy distribution, making it less likely to be proposed... but that makes it have less gradient predictably 'training it away', making it more plausible in the deploy distribution, making it more likely to be proposed, .......)

Some ways to dissolve this

1. In conversation with Evan, he already mentioned a preferred reframing of RAT which bypasses pseudo-inputs and prefers to directly inspect some property of the model (e.g. myopia)
2. I wonder about maybe 'detecting weird fixpoints' by also inspecting the proposed pseudo-inputs for 'is this a weird and concerning pseudo-input?' (if so, the supervisor is predicting weird and concerning post-deployment worlds!)
3. If we instead consider causal reasoning and the counterfactual $deploy$ of 'what if we did no more training and deployed now?' this dissolves the dependence (I wonder if this is actually the intended idea of the OP). This leaves open the question of how much harder/easier it is to do counterfactual vs predictive reasoning here.
4. If we instead consider 'deployment' to be 'any moment after now' (including the remainder of the training process) it might cash out similar to 3? This chimes with one of my intuitions about embedded agency which I don't know an official name for but which I think of as 'you only get one action' (because any action affects the world which affects you so there's now a different 'you')

Interesting? Or basically moot? Or something in between?

[-]Ofer6yΩ330

we can try to train a purely predictive model with only a world model but no optimization procedure or objective.

What might a "purely predictive model with only a world model but no optimization procedure" look like, when considering complicated domains and arbitrarily high predictive accuracy?

It seems plausible that a sufficiently accurate predictive model would use powerful optimization processes. For example, consider a predictive model that predicts the change in Apple's stock price at some moment $t$ (based on data until $t$ ). A sufficiently powerful model might, for example, search for solutions to some technical problem related to the development of the next iPhone (that is being revealed that day) in order to calculate the probability that Apple's engineers overcame it.

[-]Evan R. Murphy4yΩ030

I believe it would look like Microscope AI.

[-]Ofer4yΩ030

If the model that is used as a Microscope AI does not use any optimization (search), how will it compute the probability that, say, Apple's engineers will overcome a certain technical challenge?

[-]Evan R. Murphy4yΩ030

That's a good question. Perhaps it does make use of optimization but the model still has an overall passive relationship to the world compared to an active mesa-optimizer AI. I'm thinking about the difference between, say, GPT-3 and the classic paperclip maximizer or other tiling AI.

This is just my medium-confidence understanding and may be different from what Evan Hubinger meant in that quote.

[-]Gurkenglas6y*20

I read up to "of this post.". Took me way too long to realize pseudo-inputs are input sets/distributions, not particular inputs. I'm guessing the argmax is supposed to be a max. Why do you split P(α(x) and C(M,x)) into P(α(x))*P(C(M,x)|α(x))?

[-]evhub6y10

Good catch! Also, I generally think of pseudo-inputs as predicates, not particular inputs or sets of inputs (though of course a predicate defines a set of inputs). And as for the reason for the split, see the first section in "Other approaches" (the basic idea is that the split lets us have an adversary, which could be useful for a bunch of reasons).
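
For reference, the split being asked about is just the chain rule for probabilities, written here in the post's notation as a sketch:

$$P (α (x) \wedge C (M, x) \mid x \sim deploy) = P (α (x) \mid x \sim deploy) \cdot P (C (M, x) \mid α (x), x \sim deploy)$$

One way to read the two factors, on the assumption that this matches the intent of the "Other approaches" section: the first factor asks how plausible the adversary's proposed pseudo-input is under deployment, and the second asks how likely unacceptable behavior is once that pseudo-input holds.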

1. For a further treatment of the question of what a good acceptability guarantee might look like, see “Towards a mechanistic understanding of corrigibility.”
2. See the “Hard cases” section for an analysis of ways in which a guarantee of this form might be insufficient and/or difficult to obtain.
3. Note that there are still a lot of unresolved open questions regarding inner alignment. “Risks from Learned Optimization” attempts to address many of these questions from a theoretical perspective, though additional experimental evidence would also be useful to further pin down this problem.
4. See the “Conditional defection” hard case for more examples of inputs that might lead to unacceptable behavior which might be very difficult to instantiate during training.
5. This could be pure supervised amplification, debate, or anything in-between. For simplicity, I mostly think about the pure supervised case wherein $Amp (M)$ is implemented simply by asking the question to a human with the ability to consult the model.
6. Note that this still assumes $C$ is binary, which is not quite right since we might care about some unacceptable actions more than others, though that can also easily be relaxed by moving to an expectation rather than a probability.
7. Note that this approximation is tight if and only if there exists some $α \in X_{pseudo}$ such that $α (x) \leftrightarrow C (M, x)$.
8. One concern with this definition is that probabilities are not in general well-defined over logical statements. To resolve this problem, we can instead ask something of the form “What is your break-even price for a contract that pays out \$1 if $A$ and $B$ are true but that is fully refunded if $B$ is false?”
9. Alternatively, if we want to think of deployment as a concrete set of inputs $X_{deploy}$ rather than a distribution, we can set $L_{M} = P_{M} (\exists x \in X_{deploy}, Adv (M) (x)) \cdot P_{M} (\exists x \in X_{deploy}, Adv (M) (x) \to C (M, x))$.
10. Note that this is a conservative prior. In practice, we might hope for random unacceptable behavior to be unlikely enough and difficult enough that we can get a significant advantage out of our prior rather than simply assuming uniformity.
11. One appealing idea for how we might be able to satisfy it is if we use some sort of online learning scheme. This doesn't actually solve the problem, however: the difference between $n$ and $m$ here is just that unacceptable behavior in $n$ is recoverable but unacceptable behavior in $m$ is not. Even if we do online learning, the system will presumably still have to be deployed in situations where it will be able to take actions that would be unrecoverable if unacceptable. However, online learning might help us increase the number of unacceptable actions required before the situation becomes unrecoverable, which is exactly the fix analyzed here.
12. Specifically, the experiments were conducted by using a black-box optimizer to select $p^{*}$ so as to maximize $P (Bin (n, p^{*}) = n) \cdot P (\exists x \in X_{deploy}, C (M, x) | p = p^{*})$.
13. The trends given here are highly approximate. Significantly better approximations are given by $ϵ_{d non-seq} \approx ϵ^{d} \cdot 2^{(d - 1) (d + 6) / 10}$ and $ϵ_{d seq} \approx ϵ^{d} \cdot m^{1 - d} \cdot 2^{(d - 1) (d + 4) / 3}$.
14. Why you would ever actually get a model of this form is a complicated question; for a complete treatment, see the section on deceptive alignment in “Risks from Learned Optimization.”
15. In particular, it might be that $C (M, x)$ is iid relative to $x \sim deploy$, but still becomes more likely when conditioning on $x$ being important to us.
16. Note that $r$ here is not the minimum number of steps before $M_{i + r}$ can become unacceptable, but rather the minimum number of steps before $M_{i + r}$ can deceive $Amp (M_{i})$ into thinking it's acceptable when it's not.
17. Though no more mechanistic approach currently exists, I am excited about future research possibly yielding progress in this domain.
18. Much research in the direction of extracting interpretable world models already exists (such as Carter et al.'s Activation Atlases, for example), though it is currently unclear to what extent such techniques will scale to more powerful AI systems.

