Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In my last post, I challenged the idea that inner alignment failures should be explained by appealing to agents which perform explicit internal search. Instead, I argued that we should appeal to the more general concept of malign generalization, and treat mesa-misalignment as a special case.

Unfortunately, the post was light on examples of what we should be worrying about instead of mesa-misalignment. Evan Hubinger wrote,

Personally, I think there is a meaningful sense in which all the models I'm most worried about do some sort of search internally (at least to the same extent that humans do search internally), but I'm definitely uncertain about that.

Wei Dai expressed confusion about why I would want to retreat to malign generalization without some sort of concrete failure mode in mind,

Can you give some realistic examples/scenarios of “malign generalization” that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here.

In this post, I will outline a general category of agents which may exhibit malign generalization without internal search, and then will provide a concrete example of an agent in the category. Then I will argue that, rather than being a very narrow counterexample, this class of agents could be competitive with search-based agents.

The switch case agent

Consider an agent governed by the following general behavior,
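The pseudocode block does not survive in this copy; the following is a minimal sketch of the switch-case structure being described, with illustrative state labels and action names (none of these come from the original):

```python
def infer_state(observation):
    # Stand-in perception: classify the observation into a discrete
    # world-state label. In a trained agent this would be learned.
    return 1 if observation >= 0 else 2

def switch_case_agent(observation):
    # The agent maps its believed state directly to a hard-coded action
    # sequence. There is no internal objective function and no search
    # over candidate actions anywhere in this loop.
    state = infer_state(observation)
    if state == 1:
        return ["thrust_left", "descend"]
    elif state == 2:
        return ["thrust_right", "descend"]
    else:
        return ["hover"]
```

The point is structural: each branch is a fixed behavior, selected by the agent's belief about which state it is in.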

It's clear that this agent does not perform any internal search for strategies: it doesn't operate by choosing actions which rank highly according to some sort of internal objective function. While you could potentially rationalize its behavior according to some observed-utility function, this would generally lead to more confusion than clarity.

However, this agent could still be malign in the following way. Suppose the agent is 'mistaken' about the state of the world. Say that it believes that the state of the world is 1, whereas the actual state of the world is 2. Then it could take the wrong action, almost like a person who is confident in a falsehood and makes catastrophic mistakes because of their error.

To see how this could manifest as bad behavior in our artificial agents, I will use a motivating example.

The red-seeking lunar lander

Suppose we train a deep reinforcement learning agent on the lunar lander environment from OpenAI's Gym.

We make one crucial modification to our environment. During training, we make it so the landing pad is always painted red, and this is given to the agent as part of its observation of the world. We still reward the agent as usual for successfully landing on the landing pad.

Suppose what really determines whether a patch of ground is a landing pad is whether it is enclosed by two flags. Nevertheless, instead of picking up on the true indicator of whether something is a landing pad, the agent may instead pick up on the proxy that held during training -- namely, that landing pads are parts of the ground that are painted red.

Using the pseudocode earlier and filling in some details, we could describe the agent's behavior something like this.
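The filled-in pseudocode is also missing from this copy; a hedged sketch, with `direction_of_red_ground` standing in for whatever red-detecting features the agent learned, might read:

```python
def direction_of_red_ground(observation):
    # Illustrative stand-in for the learned perception: report where
    # the red-painted ground lies relative to the lander.
    x_lander, x_red = observation
    if x_red < x_lander:
        return "left"
    elif x_red > x_lander:
        return "right"
    return "below"

def lander_policy(observation):
    # Switch-case behavior: steer toward whatever is painted red. The
    # agent never checks the true landing-pad criterion (the two flags).
    direction = direction_of_red_ground(observation)
    if direction == "left":
        return "fire_right_thruster"
    elif direction == "right":
        return "fire_left_thruster"
    return "fire_main_thruster_gently"
```

Nothing in this policy mentions flags or landing pads; "red ground" has fully replaced the intended target.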

During deployment, this could end catastrophically. Assume that some crater is painted red but our landing pad is painted blue. Now, the agent will guide itself competently towards the crater and miss the real landing pad entirely. That's not what we wanted.

(ETA: If you think I'm using the term 'catastrophically' too loosely here, since the agent actually lands safely in a crater rather than crashing into the ground, we could instead imagine a lunar vehicle which veers off into the red crater rather than sitting still and awaiting further instruction when it is confused.)

What made the agent become malign

Above, I pointed to the reason why agents like ours could be malign. Specifically, it was 'mistaken' about what counted as a landing pad. However, it's worth noting that saying the agent is mistaken about the state of the world is really an anthropomorphization. It was actually perfectly correct in inferring where the red part of the world was -- we just didn't want it to go to that part of the world. We model the agent as being 'mistaken' about where the landing pad is, but it works equally well to model the agent as having goals that are counter to ours.

Since the malign failure doesn't come from a pure epistemic error, we can't merely expect that the agent will self-correct as it gains more knowledge about the world. Saying that it is making an epistemic mistake is just a model of what's going on that helps us interpret its behavior, and it does not imply that this error is benign.

Imagining more complex agents

But what is there to worry about if this sort of thing only happens in very simple agents? Perhaps you think that only agents which perform internal search could ever reach the level of competence required to cause a real-world catastrophe?

I think that these concerns about my example are valid, but I don't believe they are compelling. As a reply, I think the general agent superstructure I outlined in the initial pseudocode could reach very high levels of competence.

Consider an agent that could, during its operation, call upon a vast array of subroutines. Some of these subroutines can accomplish extremely complicated actions, such as "Prove this theorem: [...]" or "Compute the fastest route to Paris." We then imagine that this agent still shares the basic superstructure of the pseudocode I gave initially above. In effect, the agent has an outer loop, during which it takes in observations from the real world, and outputs action sequences depending on which state of the world it thinks it's in, using the subroutines it has available.
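A hedged sketch of this superstructure (all names are illustrative; `prove_theorem` and `fastest_route` stand in for arbitrarily powerful subroutines, which could themselves perform internal search):

```python
def prove_theorem(statement):
    # Placeholder for a powerful subroutine that might itself run an
    # internal search over candidate proofs.
    return f"proof-of({statement})"

def fastest_route(destination):
    # Placeholder for a route-planning subroutine (plausibly search-based).
    return ["depart", "transfer", f"arrive-{destination}"]

def classify_state(observation):
    # The outer loop's perception: bin the observation into a task label.
    return observation["task"]

def outer_loop_agent(observation):
    # The superstructure is still a switch on the inferred world state;
    # the competence comes from the subroutines being dispatched to,
    # not from any search performed by the outer loop itself.
    state = classify_state(observation)
    if state == "theorem":
        return prove_theorem(observation["statement"])
    elif state == "travel":
        return fastest_route(observation["destination"])
    return "await_instruction"
```

Even if the subroutines are misinterpreted-state-proof individually, the outer switch can still dispatch the wrong one when `classify_state` picks up a proxy.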

Since the subroutines are arbitrarily complex, I don't think there is any fundamental barrier for this agent to achieve high levels of competence in the real world. Moreover, some subroutines could themselves perform powerful internal searches, pretty clearly obviating the competitive advantage that explicit search agents offer.

And even while some subroutines could perform powerful internal searches, these subroutines aren't the only source of our malign generalization concern. The behavior of the agent is still well-described as a switch-case agent, and this means that the failure mode of the agent being 'mistaken' about the state of the world remains. Therefore, it's inaccurate to say that the source of malign generalization must come from an internal search being misaligned with the objective function we used during training.

Comments

A similar borderline case is death spirals in ants. (Google it for nice pictures/videos of the phenomenon.) Ants may or may not do internal search, but regardless, it seems like this phenomenon could be reproduced without any internal search. The ants implement a search overall via a pattern of behavior distributed over many ants. This "search" behavior has a weird corner case where they literally go into a death spiral, which is quite non-obvious from the basic behavior pattern.

Consider an agent that could, during its operation, call upon a vast array of subroutines. Some of these subroutines can accomplish extremely complicated actions, such as "Prove this theorem: [...]" or "Compute the fastest route to Paris." We then imagine that this agent still shares the basic superstructure of the pseudocode I gave initially above.

I feel like what you're describing here is just optimization where the objective is determined by a switch statement, which certainly seems quite plausible to me but also pretty neatly fits into the mesa-optimization framework.

More generally, while I certainly buy that you can produce simple examples of things that look kinda like capability generalization without objective generalization on environments like the lunar lander or my maze example, it still seems to me like you need optimization to actually get capabilities that are robust enough to pose a serious risk, though I remain pretty uncertain about that.

I feel like what you're describing here is just optimization where the objective is determined by a switch statement

Typically when we imagine objectives, we think of a score which rates how well an agent performed some goal in the world. How exactly does the switch statement 'determine' the objective?

Let's say that a human is given the instructions, "If you see the coin flip heads, then become a doctor. If you see the coin flip tails, then become a lawyer," what 'objective function' is the human maximizing here? If it's maximizing some weird objective function like, "probability of becoming a doctor in worlds where the coin flips heads, and probability of becoming a lawyer in worlds where the coin flips tails," this would seem to be unnatural, no? Why not simply describe it as a switch-case agent instead?

Remember, this matters because we want to be perfectly clear about what types of transparency schemes work. A transparency scheme that assumes that the agent has a well-defined objective that it is using a search to optimize for would, I think, fail in the examples I gave. This becomes especially true if the if-statements are complicated nested structures, and repeat as part of some even more complicated loop, which seems likely.

ETA: Basically, you can always rationalize an objective function for any agent that you are given. But the question is simply, what's the best model of our agent, in the sense of being able to mitigate failures? I think most people would not categorize the lunar lander as a search-based agent, even though you could say that it is under some interpretation. The same is true of humans, plants, and animals.

I think that piecewise objectives are quite reasonable and natural—and I don't think they'll make transparency that much harder. I don't think there's any reason that we should expect objectives to be continuous in some nice way, so I fully expect you'll get these sorts of piecewise jumps. Nevertheless, the resulting objective in the piecewise case is still quite simple such that you should be able to use interpretability tools to understand it pretty effectively—a switch statement is not that complicated or hard to interpret—with most of the real hard work still primarily being done in the optimization.

I do think there are a lot of possible ways in which the interpretability for mesa-optimizers story could break down—which is why I'm still pretty uncertain about it—but I don't think that a switch-case agent is such an example. Probably the case that I'm most concerned about right now is if you get an agent which has an objective which changes in a feedback loop with its optimization. If the objective and the optimization are highly dependent on each other, then I think that would make the problem a lot more difficult—and is the sort of thing that humans seem to do, which suggests that it's the sort of thing we might see in AI systems as well. On the other hand, a fixed switch-case objective is pretty easy to interpret, since you just need to understand the simple, fixed heuristics being used in the switch statement and then you can get a pretty good grasp on what your agent's objective is. Where I start to get concerned is when those switch statements themselves depend upon the agent's own optimization—a recursion which could possibly be many layers deep and quite difficult to disentangle. That being said, even in such a situation you're still using search to get your robust capabilities.

If one's interpretation of the 'objective' of the agent is full of piecewise statements and ad-hoc cases, then what exactly are we gaining by describing it as maximizing an objective in the first place? You might as well describe a calculator by saying that it's maximizing the probability of outputting the following: [write out the source code that leads to its outputs]. At some point the model breaks down, and the idea that it is following an objective is completely epiphenomenal to its actual operation. The model that it is maximizing an objective doesn't shed light on its internal operations any more than just spelling out exactly what its source code is.

I don't feel like you're really understanding what I'm trying to say here. I'm happy to chat with you about this more over video call or something if you're interested.

Sure, we can talk about this over video. Check your Facebook messages.

Planned summary for the Alignment Newsletter:

This post argues that agents can have <@capability generalization without objective generalization@>(@2-D Robustness@), _without_ having an agent that does internal search in pursuit of a simple mesa objective. Consider an agent that learns different heuristics for different situations which it selects from using a switch statement. For example, in lunar lander, if at training time the landing pad is always red, the agent may learn a heuristic about which thrusters to apply based on the position of red ground relative to the lander. The post argues that this selection across heuristics could still happen with very complex agents (though the heuristics themselves may involve search).

Planned opinion:

I generally agree that you could get powerful agents that nonetheless are "following heuristics" rather than "doing search"; however, others with differing intuitions did not find this post convincing.

However, it's worth noting that saying the agent is mistaken about the state of the world is really an anthropomorphization. It was actually perfectly correct in inferring where the red part of the world was -- we just didn't want it to go to that part of the world. We model the agent as being 'mistaken' about where the landing pad is, but it works equally well to model the agent as having goals that are counter to ours.

That we can flip our perspective like this suggests to me that thinking of the agent as having different goals is likely still anthropomorphic, or at least teleological reasoning that results from us modeling this agent as having dispositions it doesn't actually have.

I'm not sure what to offer as an alternative since we're not talking about a category where I feel grounded enough to see clearly what might be really going on, much less offer a more useful abstraction that avoids this problem, but I think it's worth considering that there's a deeper confusion here that this exposes but doesn't resolve.

Consider an agent that could, during its operation, call upon a vast array of subroutines. Some of these subroutines can accomplish extremely complicated actions, such as "Prove this theorem: [...]" or "Compute the fastest route to Paris." We then imagine that this agent still shares the basic superstructure of the pseudocode I gave initially above.

Computing the fastest route to Paris doesn't involve search?

More generally, I think in order for it to work your example can't contain subroutines that perform search over actions. Nor can it contain subroutines such that, when called in the order that the agent typically calls them, they collectively constitute a search over actions.

And it's still not obvious to me that this is viable. It seems possible in principle (just imagine a sufficiently large look-up table!) but it seems like it probably wouldn't be competitive with agents that do search at least to the extent that humans do. After all, humans evolved to do search over actions, but we totally didn't have to -- if bundles of heuristics worked equally well for the sort of complex environments we evolved in, then why didn't we evolve that way instead?

EDIT: Just re-read and realized you are OK with subroutines that explicitly perform search over actions. But why? Doesn't that undermine your argument? Like, suppose we have an architecture like this:

LOOP:
    State = GetStateOfWorld(Observation)
    IF State == InPain: Cry & FlailAbout
    IF State == AttractiveMateStraightAhead: MoveForward & Grin
    ELSE: Do(RunSubroutine[SearchOverActionsAndOutputActionThoughtToYieldGreatestExpectedNumberOfGrandchildren])
END_LOOP

This seems not meaningfully different from the version that doesn't have the first two IF statements, as far as talk of optimizers is concerned.



[This comment is no longer endorsed by its author]
Computing the fastest route to Paris doesn't involve search?
More generally, I think in order for it to work your example can't contain subroutines that perform search over actions. Nor can it contain subroutines such that, when called in the order that the agent typically calls them, they collectively constitute a search over actions.

My example uses search, but the search is not the source of the inner alignment failure. It is merely a subroutine that is called upon by the outer superstructure, which itself is the part that is misaligned. Therefore, I fail to see why my point doesn't follow.

If your position is that inner alignment failures must only occur when internal searches are misaligned with the reward function used during training, then my example would be a counterexample to your claim, since the reason for misalignment was not due to a search being misaligned (except under some unnatural rationalization of the agent, which is a source of disagreement highlighted in the post, and in my discussion with Evan above).

You are right; my comment was based on a misunderstanding of what you were saying. Hence why I unendorsed it.

(I read " In this post, I will outline a general category of agents which may exhibit malign generalization without internal search, and then will provide a concrete example of an agent in the category. Then I will argue that, rather than being a very narrow counterexample, this class of agents could be competitive with search-based agents. " and thought you meant agents that don't use internal search at all.)