Why I’m not working on {debate, RRM, ELK, natural abstractions}

paulfchristiano, 3y

I agree that ultimately AI systems will have an understanding built up from the world using deliberate cognitive steps (in addition to plenty of other complications) not all of which are imitated from humans.

The ELK document mostly focuses on the special case of ontology identification, i.e. ELK for a directly learned world model. The rationale is: (i) it seems like the simplest case, (ii) we don't know how to solve it, (iii) it's generally good to start with the simplest case you can't solve, (iv) it looks like a really important special case, which may appear as a building block or just directly require the same techniques as the general problem.

We briefly discuss the more general situation of learned optimization here. I don't think that discussion is particularly convincing; it just describes how we're thinking about it.

On the bright side, it seems like our current approach to ontology identification (based on anomaly detection) would have a very good chance of generalizing to other cases of ELK. But that isn't clear, and generalizing puts more strain on the notion of explanation we are using.

At the end of the day I strongly suspect the key question is whether we can make something like a "probabilistic heuristic argument" about the reasoning learned by ML systems, explaining why they predict (e.g.) images of human faces. We need arguments detailed enough to distinguish between sensor tampering (or lies) and real anticipated faces, i.e. they may be able to treat some claims as unexplained empirical black boxes but they can't have black boxes so broad that they would include both sensor tampering and real faces.

If such arguments exist and we can find them then I suspect we've dealt with the hard part of alignment. If they don't exist or we can't find them then I think we don't really have a plausible angle of attack on ELK. I think a very realistic outcome is that it's a messy empirical question, in which case our contribution could be viewed as clarifying an important goal for "interpretability" but success will ultimately come down to a bunch of empirical research.

Steven Byrnes, 3y

Thanks!

Thinking about it more, I think my take (cf. Section 4.1) is kinda like “Who knows, maybe ontology-identification will turn out to be super easy. But even if it is, there’s this other different problem, and I want to start by focusing on that”.

And then maybe what you’re saying is kinda like “We definitely want to solve ontology-identification, even if it doesn’t turn out to be super easy, and I want to start by focusing on that”.

If that’s a fair summary, then godspeed. :)

(I’m not personally too interested in learned optimization because I’m thinking about something closer to actor-critic model-based RL, which sorta has “optimization” but it’s not really “learned”.)

Rohin Shah, 3y

I think you're missing the primary theory of change for all of these techniques, which I would say is particularly compatible with your "follow-the-trying" approach.

While all of these are often analyzed from the perspective of "suppose you have a potentially-misaligned powerful AI; here's what would happen", I view that as an analysis tool, not the primary theory of change.

The theory of change that I most buy is that as you are training your model, while it is developing the "trying", you would like it to develop good "trying" and not bad "trying", and one way to make this more likely is to notice when bad "trying" develops and penalize it if so, with the hope that this leads to good "trying".

This is illustrated in the theory-of-change diagram below, where to put it in your terminology:

- Each of the clouds (red or blue) consists of models that are "trying"
- The grey models outside of clouds are models that are not "trying" or are developing "trying"
- The "deception rewarded" point occurs when a model that is developing "trying" does something bad due to instrumental / deceptive reasoning
- The "apply alignment technique" means that you use debate / RRM / ELK instead of vanilla RLHF, which allows you to notice it doing something bad and penalize it instead of rewarding it.

Some potential objections + responses:

- But the model will be "trying" right after pretraining, before you've even done any finetuning!
  - Response: I don't think this is obvious, but if that is the case, that just means you should also be doing alignment work during pretraining.
- But all of these techniques are considering models that already have all their concepts baked in, rather than developing them on the fly!
  - Response: I agree that's what we're thinking about now, and I agree that eventually we will need to think about models that develop concepts on the fly. But I think the overall theory of change here would still apply in that setting, even if we need to somewhat change the techniques to accommodate this new kind of capability.

Steven Byrnes, 3y

Thanks, that helps! You’re working under a different development model than me, but that’s fine.

It seems to me that the real key ingredient in this story is where you propose to update the model based on motivation and not just behavior—“penalize it instead of rewarding it” if the outputs are “due to instrumental / deceptive reasoning”. That’s great. Definitely what we want to do. I want to zoom in on that part.

You write that “debate / RRM / ELK” are supposed to “allow you to notice” instrumental / deceptive reasoning. Of these three, I buy the ELK story—ELK is sorta an interpretability technique, so it seems plausible that ELK is relevant to noticing deceptive motivations (even if the ELK literature is not really talking about that too much at this stage, per Paul’s comment). But what about debate & RRM? I’m more confused about why you brought those up in this context. Traditionally, those techniques are focused on what the model is outputting, not what the model’s underlying motivations are. But I haven’t read all the literature. Am I missing something?

(We can give the debaters / the reward model a printout of model activations alongside the model’s behavioral outputs. But I’m not sure what the next step of the story is, after that. How do the debaters / reward model learn to skillfully interpret the model activations to extract underlying motivations?)

Wei Dai, 1y

> Traditionally, those techniques are focused on what the model is outputting, not what the model’s underlying motivations are. But I haven’t read all the literature. Am I missing something?

It's confusing to me as well, perhaps because different people (or even the same person at different times) emphasize different things within the same approach, but here's one post where someone said, "It is important that the overseer both knows which action the distilled AI wants to take as well as why it takes that action."

Rohin Shah, 3y

I'm not claiming that you figure out whether the model's underlying motivations are bad. (Or, reading back what I wrote, I did say that but it's not what I meant, sorry about that.) I'm saying that when the model's underlying motivations are bad, it may take some bad action. If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.

It's plausible that you then get a model with bad motivations that knows not to produce bad actions until it is certain those will not be caught. But it's also plausible that you just get a model with good motivations. I think the more you succeed at noticing bad actions (or good actions for bad reasons) the more likely you should think good motivations are.

dxu, 3y

> If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.
>
> It's plausible that you then get a model with bad motivations that knows not to produce bad actions until it is certain those will not be caught. But it's also plausible that you just get a model with good motivations. I think the more you succeed at noticing bad actions (or good actions for bad reasons) the more likely you should think good motivations are.

but, but, standard counterargument imperfect proxies Goodharting magnification of error adversarial amplification etc etc etc?

(It feels weird that this is a point that seems to consistently come up in discussions of this type, considering how basic of a disagreement it really is, but it really does seem to me like lots of things come back to this over and over again?)

Rohin Shah, 3y

Indeed I am confused why people think Goodharting is effectively-100%-likely to happen and also lead to all the humans dying. Seems incredibly extreme. All the examples people give of Goodharting do not lead to all the humans dying.

(Yes, I'm aware that the arguments are more sophisticated than that and "previous examples of Goodharting didn't lead to extinction" isn't a rebuttal to them, but that response does capture some of my attitude towards the more sophisticated arguments, something like "that's a wildly strong conclusion you've drawn from a pretty handwavy and speculative argument".)

Ultimately I think you just want to compare various kinds of models and ask how likely they are to arise (assuming you are staying within the scaled up neural networks as AGI paradigm). Some models you could consider:

1. The idealized aligned model, which does whatever it thinks is best for humans
2. The savvy aligned model, which wants to help humans but knows that it should play into human biases (e.g. by being sycophantic) in order to get high reward and not be selected against by gradient descent
3. The deceptively aligned model, which wants some misaligned goal (say paperclips), but knows that it should behave well until it can execute a treacherous turn
4. The bag of heuristics model, which (like a human) has a mix of various motivations, and mostly executes past strategies that have worked out well, imitating many of them from broader culture, without a great understanding of why they work, which tends to lead to high reward without extreme consequentialism.

(Really I think everything is going to be (4) until significantly past human-level, but will be on a spectrum of how close they are to (2) or (3).)

Plausibly you don't get (1) because it doesn't get particularly high reward relative to the others. But (2), (3) and (4) all seem like they could get similarly high reward. I think you could reasonably say that Goodharting is the reason you get (2), (3), or (4) rather than (1). But then amongst (2), (3) and (4), probably only (3) causes an existential catastrophe through misalignment.

You could then consider other factors like simplicity or training dynamics to say which of (2), (3) and (4) are likely to arise, but (a) this is no longer about Goodharting, (b) it seems incredibly hard to make arguments about simplicity / training dynamics that I'd actually trust, (c) the arguments often push in opposite directions (e.g. shard theory vs how likely is deceptive alignment), (d) a lot of these arguments also depend on capability levels, which introduces another variable into the mix (now allowing for arguments like this one).

The argument I'm making above is one about training dynamics. Specifically, the claim is that if you are on a path towards (3), it will probably take some bad actions initially (attempts at deception that fail), and if you successfully penalize those, that would plausibly switch the model towards (2) or (4).

Charlie Steiner, 3y

Is separate training for cognitive strategy useful? I'm genuinely unsure. If you have an architecture that parametrizes how it attends to thoughts, then any ol' RL signal will teach your AI how to attend to thoughts in an instrumentally useful way. I just read Lee's post, so right now I'm primed to expect that this will happen surprisingly often, though maybe the architecture needs to be a little more flexible/recurrent than a transformer before it happens just from trying to predict text.

Instrumental cognitive strategy seems way safer than terminal cognitive strategy. Maybe you could think it's dangerous if you think it's particularly likely to give rise to a self-reflective mesa-optimizer that's capable of taking over the outer process, but mostly I expect gradient descent to work.

Steven Byrnes, 3y

If we make an AGI, and the AGI starts doing Anki because it’s instrumentally useful, then I don’t care, that doesn’t seem safety-relevant. I definitely think things like this happen by default.

If we make an AGI and the AGI develops (self-reflective) preferences about its own preferences, I care very much, because now it’s potentially motivated to change its preferences, which can be good (if its meta-preferences are aligned with what I was hoping for) or bad (if misaligned). See here. I note that intervening on an AGI’s meta-preferences seems hard. Like, if the AGI turns to look at an apple, we can make a reasonable guess that it might be thinking about apples at that moment, and that at least helps us get our foot in the door (cf. Section 4.1 in OP)—but there isn’t an analogous trick for meta-preferences. (This is a reason that I’m very interested in the nuts-and-bolts of how self-concept works in the human brain. Haven’t made much progress on that though.)

I’m not sure what you mean by “separate training for cognitive strategy”. Also, “give rise to a self-reflective mesa-optimizer that's capable of taking over the outer process” doesn’t parse for me. If it’s important, can you explain in more detail?

Charlie Steiner, 3y

> Also, “give rise to a self-reflective mesa-optimizer that's capable of taking over the outer process” doesn’t parse for me. If it’s important, can you explain in more detail?

So, parsing it a bit at a time (being more thorough than is strictly necessary):

What does it mean for some instrumentally-useful behavior (let's call it behavior "X") to give rise to a mesa-optimizer?

It means that if X is useful for a system in training, that system might learn to do X by instantiating an agent who wants X to happen. So if X is "trying to have good cognitive habits," there might be some mesa-optimizer that literally wants the whole system to have good cognitive habits (in whatever sense was rewarded on the training data), even if "trying to have good cognitive habits" was never explicitly rewarded.

What's "self-reflective" and why might we expect it?

"Self-reflective" means doing a good job of modeling how you fit into the world, how you work, and how those workings might be affected by your actions. A non-self-reflective optimizer is like a chess-playing agent: it makes moves that it thinks will put the board in a better state, but it doesn't make any plans about itself, since it's not on the board. An optimizer that's self-reflective will represent itself when making plans, and if this helps the agent do its job, we should expect the learning process to lead to self-reflective agents.

What does a self-reflective mesa-optimizer do?

It makes plans so that it doesn't get changed or removed by the dynamics of the process that gave rise to it. Without such plans, it wouldn't be able to stay the same agent for very long.

Why would a mesa-optimizer want to take over the outer process?

Suppose there's some large system being trained (the "outer process") that has instantiated a mesa-optimizer that's smaller than the system as a whole. The smaller mesa-optimizer wants to control the larger system to satisfy its own preferences. If the mesa-optimizer wants "good cognitive habits," for instance, it might want to obtain lots of resources to run really good cognitive habits on.

[And by "but I mostly expect gradient descent to work" I meant that I expect gradient descent to suppress the formation of such mesa-optimizers.]

Steven Byrnes, 3y

Thanks. I’m generally thinking about model-based RL where the whole system is unambiguously an agent that’s trying to do things, and the things it’s trying to do are related to items in the world-model that the value-function thinks are high-value, and “world-model” and “value function” are labeled boxes in the source code, and inside those boxes a learning algorithm builds unlabeled trained models. (We can separately argue about whether that’s a good thing to be thinking about.)

In this picture, you can still have subagents / Society-Of-Mind; for example, if the value function assigns high value to the world-model concept “I will follow through on my commitment to exercise” and also assigns high value to the world-model concept “I will watch TV”, then this situation can be alternatively reframed as two subagents duking it out. But still, insofar as the subagents are getting anything done, they’re getting things done in a way that uses the world-model as a world-model, and uses the value function as a value function, etc.

By contrast, when people talk about mesa-optimizers, they normally have in mind something like RFLO, where agency & planning wind up emerging entirely inside a single black box. I don’t expect that to happen for various reasons, cf. here and here.

OK, so if we restrict to model-based RL, and we forget about mesa-optimizers, then my best-guess translation of “Is separate training for cognitive strategy useful?” into my ontology is something like “Should we set up the AGI’s internal reward function to “care about” cognitive strategy explicitly, and not just let the cognitive strategy emerge by instrumental reasoning?” I mostly don’t have any great plan for the AGI’s internal reward function in the first place, so I don’t want to rule anything out. I can vaguely imagine possible reasons that doing this might be a good idea; e.g. if we want the AGI to avoid out-of-the-box solutions or human-manipulation-related solutions to its problems, we would at least possibly implement that via a reward function term related to cognitive strategy.

I still suspect that we’re probably talking about different things and having two parallel independent conversations. ¯\_(ツ)_/¯

tailcalled, 3y

> But I’m expecting that AGI will look like model-based RL, in which case, we don’t have to search for search, the search is right there in the human source code.

Counter: The human source code won't contain general-purpose search, it will contain something like babble-and-prune or gradient descent over trajectories or something, and so there is also the question of what to do about the general-purpose search component.

Steven Byrnes, 3y

Thanks!

I certainly expect future AGI to have “learned meta-cognitive strategies” like “when I see this type of problem, maybe try this type of mental move”, and even things like “follow the advice in Cal Newport and use Anki and use Bayesian reasoning etc.” But I don’t see those as particularly relevant for alignment. Learning meta-cognitive strategies is like learning to use a screwdriver: it will help me accomplish my goals, but won’t change my goals (or at least, it won’t change my goals beyond the normal extent to which any new knowledge and experience could potentially cause goal drift).

I do think that the “source code” for a human brain has a rather different set of affordances for search / planning than you’ll find in AlphaZero or babble-and-prune or PPO, but I’m not sure how relevant that is. In particular, can you say more about why you believe “The human source code won’t contain general-purpose search”?

Aaron_Scher, 3y

I doubt it's a crux for you, but I think your critique of Debate makes pessimistic assumptions which I think are not the most realistic expectation about the future.

> Let’s play the “follow-the-trying game” on AGI debate. Somewhere in this procedure, we need the AGI debaters to have figured out things that are outside the space of existing human concepts; otherwise what’s the point? And (I claim) this entails that somewhere in this procedure, there was an AGI that was “trying” to figure something out. That brings us to the usual inner-alignment questions: if there’s an AGI “trying” to do something, how do we know that it’s not also “trying” to hack its way out of the box, seize power, and so on? And if we can control the AGI’s motivations well enough to answer those questions, why not throw out the whole “debate” idea and use those same techniques (whatever they are) to simply make an AGI that is “trying” to figure out the correct answer and tell it to us?

When I imagine saying the above quote to a smart person who doesn't buy AI x-risk, their response is something like "woah slow down there. Just because the AI is "trying" to do something doesn't mean it stands any chance of doing actually dangerous things like hacking out of the box. The ability to hack out of the box doesn't mysteriously line up with the level of intelligence that would be useful for an AI debate." This person seems largely right, and I think your argument is mainly "it won't work to let two superintelligences debate each other about important things" rather than a stronger claim like "any AIs smart enough to have a productive debate might be trying to do dangerous things and have a non-negligible chance of succeeding".

We could be envisioning different pictures for how debate is useful as a technique. I think it will break for sufficiently high intelligence levels, for reasons you discuss, but we might still get useful work out of it in models like GPT-4/5. Additionally, it seems to me that there are setups of Debate in which we aren't all-or-nothing on the instrumental subgoals, consequentialist planning, and meta cognition, especially in (unlikely) worlds where the people implementing debate are taking many precautions. Fundamentally, Debate is about getting more trustworthy outputs from untrustworthy systems, and I expect we can get useful debates from AIs that do not run a significant risk of the failures you describe.

Again, I doubt this is a main crux for whether you will work on Debate, and that seems quite reasonable. If it's the case that, "Debate is unlikely to scale all the way to dangerous AGIs", then to the extent that we want to focus on the "dangerous AGIs" domain we might just want to skip it and work on other stuff.

Steven Byrnes, 3y

Thanks for your comment!

You write “we might still get useful work out of it”—yes! We can even get useful work out of the GPT-3 base model by itself, without debate, from what I hear. (I haven’t tried “coauthoring” with language models myself, partly out of inertia and partly because I don’t want OpenAI reading my private thoughts, but other people say it’s useful.) Indeed, I can get useful work out of a pocket calculator. :-P

Anyway, the logic here is:

- Sooner or later, it will become possible to make highly-capable misaligned AGI that can do things like start pandemics and grab resources.
- Sometime before that happens, we need to either ensure that nobody ever builds such an AGI, or that we have built defenses against that kind of AGI.

(See my post What does it take to defend the world against out-of-control AGIs?)

Pocket calculators can do lots of useful things, but they can’t solve the alignment problem, nor can they defend the world against out-of-control AGIs. What about GPT-5+debate? Can GPT-5+debate solve the alignment problem? Can GPT-5+debate defend the world against out-of-control AGIs? My belief splits between these two possibilities:

- [much more likely if there are no significant changes in LLM architecture / training paradigms]: No, GPT-5+debate can’t do either of those things. But it can provide helpful assistance to humans trying to work on alignment and/or societal resilience.
  - But then again, lots of things can increase the productivity of alignment researchers, including lesswrong.com and google docs and pocket calculators. I don’t think this is what debate advocates have in mind, and if it were, I would say that this goal could be better achieved by other means.
- [much less likely if there are no significant changes in LLM architecture / training paradigms]: Yes, GPT-5+debate can do one or both of those things. But in this scenario, I would expect that GPT-5+debate was actually doing the dangerous “trying” thing, and thus I would expect that we’re so close (maybe a couple years or less) to world-destroying AGI that there isn’t really time for humans to be involved in planning the future, which is both bad in itself and kinda undermines (what I understood as) the whole point of debate, which is to enhance human supervision.
  - See this comment and the last bullet point here.

Aaron_Scher, 3y

Thanks! I really liked your post about defending the world against out-of-control AGIs when I read it a few weeks ago.

Noosphere89, 3y

> A lot of disagreement about what a solution to technical AGI safety looks like is really downstream of disagreements about questions like “How will AGI be built? What will it look like? How will it work?”

IMO, one more disagreement I see that is arguably central to the entire field is the question of how much iteration can help you.

At one extreme, OpenAI expects the entire alignment problem to be iterated away.

At another extreme, John Wentworth doesn't expect many parts of the problem to be amenable to iteration.

Steven Byrnes, 3y

To me the question “how much can iteration help you?” seems to have a big impact on “What’s the probability that we’ll ultimately succeed at alignment?” but has a much smaller (albeit nonzero) impact on “What technical safety research directions are more or less promising?”. Either way, we should come up with the best plan that we can come up with for how to make aligned AGI, right? Then, insofar as we can iterate on that plan based on meaningful test data, that’s awesome, lucky us, and we should definitely do that.

(“What’s the probability that we’ll succeed at alignment” is also an important question with real-world implications, e.g. on how bad it is to shorten timelines, but it’s not something I’m talking about in this particular post.)

1. Background

1.1 “Trying” to figure something out seems both necessary & dangerous

1.2 The “follow-the-trying game”

1.3 Why I want to move the goalposts on “AGI”

2. Why I’m not working on debate or recursive reward modeling

3. Why I’m not working on ELK

4. Why I’m not working on John Wentworth’s “natural abstractions” stuff

4.1 The parts of the plan that John is thinking hard about, seem less pressing to me

4.2 The parts of the plan that seem very difficult to me, John doesn’t seem to be working on