Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic status: There’s a lot in this post and my general approach while working on it that, in retrospect, wasn’t thought out well enough. I’m posting it because I figure sharing flawed ideas is better than letting this languish in a doc forever, while I work on improving them.

Thanks to Paul Colognese, Tassilo Neubauer, John Wentworth, and Fabien Roger for useful conversations, Janus for suggesting something that led me to some of the ideas I mention here, and Shoshannah Tekofsky for feedback on a draft.

I spent some time trying to find the high-level structure in a neural net corresponding to a deep reinforcement learning model’s objective[1][2]. In this post, I describe some of the stuff I tried, thoughts from working on this, and mistakes I made.

Context

Many approaches in current interpretability work, as I understand them, involve understanding low-level components of a model (Circuits and subsequent work, for example) as a way of building up to more complex components. I think we can make progress from the opposite frontier at the same time, and try to identify the presence or nature of certain high-level mechanistic structures or properties in a model. Examples of this include verifying whether a model is doing optimization, whether it’s myopic, isolating the objective of an optimizer, and essentially the entire class of properties with a singular answer for the entire model[3].

I prefer to distinguish these directions as low-level and high-level interpretability respectively for descriptive clarity. The framing of best-case and worst-case transparency as laid out in the transparency tech tree also points at the same concept. I expect that both directions are aiming at the same goal, but working top-down seems pretty tractable at least for some high-level targets that don’t require lots of deconfusion to even know what we’re looking for.

Ideally, an approach to high-level interpretability would be robust to optimization pressure and deception (in other words, to gradient descent and adversarial mesa optimizers), but I think there’s a lot we can learn toward that ideal from more naive approaches. In this post, I describe my thoughts on trying to identify the high-level structure in a network corresponding to a deep reinforcement learning model’s objective.

I didn’t set out on this particular approach expecting to succeed at the overall goal (I don’t expect interpretability to be that easy). Instead, I hoped that trying out a bunch of naive directions would give us insight for future high-level interpretability work. To that end, I try to lay out the reasoning I had at various points while thinking about this, gaps included.

I think there are plenty of plausible avenues to try out here, and for the most part I will only be describing the problems with the ones I thought of. Further, the directions I tried seem fairly obvious, and there are likely much better methods to be found given more thought.

A naive approach and patching

Take the case of a small RL model that can optimize for a reward in a given environment. What we want is information about how the model’s objective is represented in the network’s weights. Concretely, we might think about what kinds of tests or processes affect these weights disproportionately more than any others, and try to isolate the information we want from there.

One approach we might try is:

  • Train the initialized model on reward RA for an extended period of time. Let’s call the model at the end of this step MA. We may think of MA as having learned a pretty sophisticated world model at this point, such that further training wouldn’t update it strongly.
  • Train the model MA on a different reward RB in the same environment[4] for enough timesteps to achieve non-trivial performance. Ideally, RB would be as orthogonal as possible to RA[5]. Let’s call the model at the end of this step MB.

Naively, one might expect the update signal from MA to MB to correspond to changes in the model’s internal objective. After all, given that the model already understood the environment well enough after the first training period, it can be intuitive to think that the primary element changing during the second period is what the model should try to optimize for.
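As a minimal sketch of what "the update signal from MA to MB" could mean in practice, assume each model's weights are available as a dictionary mapping parameter names to NumPy arrays. The names `update_signal` and `relevance_map`, the toy layer shapes, and the normalization are all illustrative assumptions, not the setup used in the experiments.

```python
import numpy as np

def update_signal(params_a, params_b):
    """Per-parameter difference theta_B - theta_A for matching parameter names."""
    return {name: params_b[name] - params_a[name] for name in params_a}

def relevance_map(delta, eps=1e-12):
    """Absolute update magnitudes, scaled so the largest entry overall is about 1."""
    largest = max(float(np.max(np.abs(d))) for d in delta.values())
    return {name: np.abs(d) / (largest + eps) for name, d in delta.items()}

# Toy stand-ins for M_A and M_B (hypothetical two-layer weights, not a trained model).
rng = np.random.default_rng(0)
params_a = {"layer1": rng.normal(size=(4, 3)), "layer2": rng.normal(size=(3, 2))}
params_b = {name: w + 0.1 * rng.normal(size=w.shape) for name, w in params_a.items()}

delta = update_signal(params_a, params_b)  # raw update signal
heat = relevance_map(delta)                # heatmap-style view of where the weights moved
```

A map like `heat` only shows where the weights moved, not why they moved, which is exactly the ambiguity the naive reading glosses over.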

This method fails, however[6], because it relies on a few questionable assumptions. For one, it requires that the model’s policy is generated at runtime by inference from its knowledge of the environment and its objective, rather than being a component of the model that is itself changed through updates. A priori as well as empirically, this seems unlikely to be true.

Is this a problem that we can patch easily? One direction I considered along these lines:

  • Take the difference of the weights between MB and MA (you can view this as all the updates the model received during the second round of training), and subtract this from the weights of MA. Let’s call the model you get after subtracting these updates MB*. Under the reasoning above, one might expect MB* to now optimize for minimizing RB.
  • But since subtracting these updates messes with other components like the model’s policy, even if the model now leans toward minimizing RB, it’ll be drowned out by generally terrible performance. If you could then train MB* for a few timesteps (say, on RA) to account for capability loss, then if there is a lean toward minimization, it should become visible.
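The subtraction step in the list above can be sketched as follows, again assuming weights stored as dictionaries of NumPy arrays. The function name `subtract_update` and the toy values are hypothetical, and the short retraining pass on RA is not shown.

```python
import numpy as np

def subtract_update(params_a, params_b):
    """Return theta_A - (theta_B - theta_A) = 2 * theta_A - theta_B for each parameter."""
    return {name: 2.0 * params_a[name] - params_b[name] for name in params_a}

# Toy weights standing in for M_A and M_B (illustrative values only).
params_a = {"w": np.array([1.0, -2.0, 0.5])}
params_b = {"w": np.array([1.5, -2.0, 0.0])}

# M_B*: M_A sits exactly halfway between M_B and M_B* in weight space.
params_b_star = subtract_update(params_a, params_b)
```

Weights that did not change between MA and MB are left untouched, so whatever the second round of training did not modify carries over to MB* unchanged.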

On testing, however, the retrained MB* does not show any visible inclination like this. In retrospect, that makes sense - the idea relied on the assumption that the internal representation of the objective is bidirectional, i.e. that the parameter-reward mapping is linear. A high-level update signal in one direction doesn’t imply that the inverted signal pushes the model in the inverted direction. This direction was a bust, but it was useful for me to make incorrect implicit assumptions like this more explicit.
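One way to make the linearity assumption explicit, as a sketch in notation introduced here rather than taken from the experiments: write θ_A and θ_B for the weights of MA and MB, Δ = θ_B − θ_A for the update, and J_B(θ) for the expected return under RB, assumed smooth enough to expand around θ_A.

```latex
\begin{align}
\theta_{B^*} &= \theta_A - \Delta = 2\theta_A - \theta_B, \\
J_B(\theta_A \pm \Delta) &= J_B(\theta_A) \pm \nabla J_B(\theta_A)^\top \Delta
  + \tfrac{1}{2}\, \Delta^\top \nabla^2 J_B(\theta_A)\, \Delta + O(\|\Delta\|^3).
\end{align}
```

Only the first-order term flips sign with Δ. The second-order term is the same in both directions, and a Δ accumulated over a whole round of training is not small, so nothing in this expansion guarantees that MB* scores low on RB unless J_B is close to linear along the line through θ_A and θ_B.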

Another reason why patching this would be difficult is that there are a number of other potential confounding changes to the model’s internals. Policy changes are only one; there may also be other changes such as refocusing higher granularity on parts of the world model, more efficient modes of compressing the world model, phase changes as the model discovers new modes of reasoning, and other unknowns.

All this means that if there is information contained in the update signal corresponding to changes in the objective, it faces strong interference from other factors. Is it possible to extract this information anyway, using some creative methods?

Thoughts, and a (potentially) promising approach

So far, I’ve been talking about one central idea - devising a training procedure that affects the internal objective differently from the rest of the network in some identifiable way. The naive approach above was too noisy to work for this, but I never expected the first method (or convoluted patches on top of it) to make actual object-level progress on the problem of isolating this internal objective.

What I did expect from these experiments was to clarify some of my own thoughts on objective representation in optimizers (and more broadly, optimizers themselves) and, with non-trivial probability, to gain new insights into what might and might not work, or better framings for tackling this problem. I think I succeeded on the former front and probably failed on the latter.

That said, the idea I’m about to describe was one I had when I started working on this project - in retrospect, it still seems like one worth pursuing, though for the same reason of being a promising way to gain more insight into better directions (look, this line of research is really unexplored).

I hinted earlier that there are many confounders that skew any signal we might be able to extract. My current guess, however, is that this is a quantitative problem more than a qualitative one - I don’t know whether we will be able to iteratively extract a purer signal corresponding to a model’s objective, but it seems plausible that accounting for all the confounders we can will make relevant insights more easily accessible. (Measuring a lot of things would also be of use here, and is the intent behind my framing in this post.)

Several of these we can account for through more precise engineering - in other words, the obvious things I wasn’t doing because I either missed it entirely or wanted to get quick results and a lot of data without optimizing on the implementation too hard. If someone finds this line of research exciting and wants to work on it, please reach out!

Some confounders, however, will likely require more creative approaches to solve. This is where this section’s idea comes into play: what would we see if we superpose different contexts in which the model’s objective should be activated?

For example, we could train copies of the model MA on different rewards RB, and smooth out the update signal across all of these contexts. Ideally, if the RBs we choose are orthogonal enough to each other, this should result in several confounders canceling themselves out.

This can also be framed as (although is not necessarily equivalent to) saying that in all of these contexts, the objective is plausibly one of the few things that we can control to change in a desired mechanistic manner, and thus we can select a smoothing mechanism that filters for this.
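The post does not pin down a smoothing mechanism, so the following is only a sketch of two candidates under stated assumptions: weights are dictionaries of NumPy arrays, each fine-tuned copy of MA was trained on a different RB, and the toy data (one weight that moves in every context with varying sign, plus small context-specific noise) is entirely made up for illustration. The names `stack_updates` and `smooth` are hypothetical.

```python
import numpy as np

def stack_updates(params_a, finetuned):
    """Stack theta_Bk - theta_A over all fine-tuned copies: name -> array of shape (K, ...)."""
    return {
        name: np.stack([p[name] - params_a[name] for p in finetuned])
        for name in params_a
    }

def smooth(updates, use_magnitude):
    """Average over contexts, either signed or by absolute magnitude."""
    return {
        name: (np.abs(u) if use_magnitude else u).mean(axis=0)
        for name, u in updates.items()
    }

rng = np.random.default_rng(0)
params_a = {"w": np.zeros(4)}
finetuned = []
for k in range(8):
    shift = np.zeros(4)
    shift[0] = 1.0 if k % 2 == 0 else -1.0  # weight 0 moves in every context
    noise = 0.1 * rng.normal(size=4)         # context-specific confounders
    finetuned.append({"w": params_a["w"] + shift + noise})

updates = stack_updates(params_a, finetuned)
signed_mean = smooth(updates, use_magnitude=False)    # cancels changes whose sign varies
magnitude_mean = smooth(updates, use_magnitude=True)  # keeps weights that always move
```

In this toy setup the signed average washes out the weight that moves in every context, while the magnitude average highlights it, so the choice of smoothing decides what counts as signal. Which choice, if any, matches how a real network stores its objective is an open question.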

As mentioned earlier, this is only one approach out of many that may be promising, and there’s definitely a lot of space to be covered here in terms of other directions. Even if you disagree on these approaches being useful, the core idea seems very probably important, and seems to me to inspire generators for many promising approaches.

A very late appendix: what's an objective?

Now, one of the sections I should’ve worked on before starting to work on anything else in this post, which I’m instead writing weeks later. What exactly are we trying to find?

Without thinking about it too hard - which I hadn’t - you might handwave this away under the assumption that it won’t be that complicated. In my case, I thought something along the lines of “the internal mechanistic notion of objective a mesa-optimizer can use for planning or search, however sparsely that may be represented”.

This is troublesome for two reasons - first that this description and the intuitions it invokes as stated may not even be true; and second that even if we do have a good broad description, understanding it with greater granularity and what properties it would imply would be extremely helpful, plausibly crucial, to even coming up with useful ways to bang stuff against the wall to see what sticks.

So what are the scenarios in which the above description isn’t strictly accurate? 

Objectives as pointers

One thing that comes to mind is that it’s possible that the models we care about don’t learn their objectives internally, and instead contain pointers to the environment that inform their understanding of the reward function.

If this turns out to be the case, then any approach that seeks to identify the objective of an optimizer would have to be broad enough to identify those pointers and their mechanism with sufficient fidelity to gain the kind of insights we want in the other case. For example, we might need to identify the pointer inside the network / what it’s pointing to, and understand how they inform the model’s notion of objective, from the reward channel.

I think this isn’t going to be a significantly harder problem than just identifying a concrete objective structure in the network, but the two cases may call for different approaches - the kind of approach described in this post, for example, likely would not be sufficient. We also need more empirical work on what kinds of models learn objectives in what way to make progress on this front, but for now it’s a consideration that future approaches should take into account.

Objectives in shards

That isn’t the only possibility, however. Especially because we’re working with toy models that ostensibly fit the description of an optimizer, we may end up with a model that mechanistically doesn’t have an explicit notion of objective.

For example (there’s a fair amount I’m still thinking about for this case, so there could be stuff I’m missing), we could end up with complex stews of contextually activated heuristics or subroutines chiseled through the outer objective that execute specific cognition without necessarily learning an internal objective to use.

It may be the case that any model we actually have cause to worry about - that has the capacity to do long-term deceptive planning - requires mechanistic internalization of some objective which the mesa-optimizer can use, faster than gradient descent can activate specific subroutines for this. While I find this intuitively compelling, I haven’t thought deeply enough about what’s necessary for planning to be entirely certain that this must be true, and even then it restricts the class of models useful to test with.


Luckily, thinking deeply about that is part of what I’m currently spending time on with this general approach, which is a lot of conceptual legwork on coming up with as formal/clear a treatment of a mechanistic objective in any network as is sufficient for somewhat promising experiments. I originally picked this particular high-level interpretability target over something like identifying myopia or agency because it felt like it involved less deconfusion work, and I still think that’s true - just that I overshot into the “requires no deconfusion” territory. Correcting for that, I’m relatively optimistic about this line of work.

  1. ^

    This could be a specific component of the model, or it could be sparsely represented across all the weights of the network - we want a setup that can work with both. What we get could look like a map of parameters and how relevant they are to an internal notion of objective - like a heatmap of the model.

  2. ^

    If your reaction is that I should be a lot more concrete about exactly what we’re looking for and what properties it might imply - I agree. This is one place I should have given a lot more thought to before jumping into experimentation. Refer to the last section for more details on this.

  3. ^

    Note that the answer itself may not be single-headed - in the case where we want to identify how a predictive model makes its predictions for example, the answer could involve multiple modes of computation such as direct simulation, heuristics, etc. The important distinction here is that all of this information is still part of the answer to the question of how the model makes predictions.

  4. ^

    Where RB could either be a single new reward or a combination of RA with a new component - hereafter I will just use RB to refer to the new component in the latter case for brevity.

  5. ^

    While working on this, I wanted to run a quick implementation and therefore did not think of this early enough or account for this well. This is one of a few instances where potential improvements on my approach seem obvious. 

  6. ^

    I plotted way too many graphs with far too much data that individually don't turn out to be all that informative. I'd be happy to share them if anyone wants, but I think they'd just clutter up this post without adding much.

Comments

I'm going to try to have a post that's pretty similar to this out soon.

A main claim is that the thing you want to be doing (not just a general you, I mean specifically the vibe I get from you in this post) is to build an abstract model of the AI and use interpretability to connect that abstract model to the "micro-level" parameters of the AI. "Connect" means doing things like on-distribution inference of abstract model parameters from actual parameters, or translating a desired change in the abstract model into a method for updating the micro-level parameters. Neither of these directions of connection has to be done purely "internal to the AI," without reference to the environment it's in - it's completely valid (and indeed necessary) to fit an agent into a model of its environment in the process of talking about its goals. Being able to talk about "the AI's objectives" is the special case when you have an abstract model of the AI that features objectives as modeled objects. But using such a model isn't the only way to make progress! We need to build our general capability to connect AIs' parameters to any useful abstract model at all.

It may be the case that any model we actually have cause to worry about - that has the capacity to do long-term deceptive planning - requires mechanistic internalization of some objective which the mesa-optimizer can use, faster than gradient descent can activate specific subroutines for this.

I don't think we get off so easy. Even in humans, we're pretty general and powerful despite not having some super-obvious locus of motivation outside the "intended" motivational system that trained our neocortex. It's just that our capabilities have generalized faster than that motivational system has, and so we do things like invent Doritos even though on-distribution analysis of humans in the ancestral environment might have inferred that we want nutritious food.

A main claim is that the thing you want to be doing (not just a general you, I mean specifically the vibe I get from you in this post) is to build an abstract model of the AI and use interpretability to connect that abstract model to the "micro-level" parameters of the AI. "Connect" means doing things like on-distribution inference of abstract model parameters from actual parameters, or translating a desired change in the abstract model into a method for updating the micro-level parameters.

Yeah, this is broadly right. The mistake I was making earlier while working on this was thinking that my abstract model was good enough - I've since realized that this is the point of a large part of agent foundations work. It took doing this to realize that, however. This framing isn't exactly how I was viewing it, but it seems pretty cool, so thanks!

Being able to talk about "the AI's objectives" is the special case when you have an abstract model of the AI that features objectives as modeled objects. But using such a model isn't the only way to make progress! We need to build our general capability to connect AIs' parameters to any useful abstract model at all.

Oh yeah I agree - hence my last section on other cases where what we want (identifying the thing that drives the AI's cognition) isn't as clear-cut as an internalized object. But I think focusing on the case of identifying an AI's objectives (or what we want from that) might be a good place to start, because everything else I can think of involves even more confused parts of the abstract model and a multitude of cases! I definitely agree that we need to build general capacity; I expect there's progress to be made from the direction of starting with complex abstract models that low-level interpretability would eventually scale to.

Even in humans, we're pretty general and powerful despite not having some super-obvious locus of motivation outside the "intended" motivational system that trained our neocortex.

(Disclaimer: includes neurological conjectures that I'm far from familiar with) I agree with the general point that this would plausibly end up being more complicated, but to explain my slight lean toward what I said in the post: I think whatever our locus of motivation is, intuitively it's plausibly still represented somewhere in our brain - i.e., that there are explicit values/objectives driving a lot of our cognition rather than just being value-agnostic contextually-activated reactions. Planning in particular probably involves outcome evaluation based on some abstract metric. If this is true, then wherever those are stored in our brain's memory/whatever would be analogous to what I'm picturing here.

Planning in particular probably involves outcome evaluation based on some abstract metric. If this is true, then wherever those are stored in our brain's memory/whatever would be analogous to what I'm picturing here.

Ah yeah, that makes sense for inference. Like if I'm planning some specific thing like "get a banana", maybe you can read my mind by monitoring my use of some banana-related neurons. But I view such a representation more as an intermediate step in the chain of motivation and planning, with the upshot that interpretability on this level has a hard time being used to actually intervene on what I want - I want the banana as part of some larger process, and so rewiring the banana-neurons that were useful for inference might get routed around or otherwise not have the intended effects. This also corresponds to a problem with trying to locate goals in the neocortex by (somehow) changing my "training objective" and seeing what parts of my brain change.

Oh yeah, I'm definitely not thinking explicitly about instrumental goals here, I expect those would be a lot harder to locate/identify mechanistically. I was picturing something more along the lines of a situation where an optimizer is deceptive, for example, and needs to do the requisite planning which plausibly would be centered on plans that best achieve its actual objective. Unlike instrumental objectives, this seems to have a more compelling case for not just being represented in pure thought-space, rather being the source of the overarching chain of planning.

On testing, however, the retrained MB* does not show any visible inclination like this. In retrospect, that made sense - it relied on the assumption that the internal representation of the objective is bidirectional, that the parameter-reward mapping is linear. A high-level update signal in one direction doesn’t necessitate that the inverted signal results in the inverted direction. This direction was a bust, but it was useful for me to make incorrect implicit assumptions like this more explicit.

I think it's improbable that agents internalize a single objective, but I applaud you for forming a concrete hypothesis and then going out to test it. I'm very excited about people trying to predict what algorithms a policy net will be running, thereby grounding out "mesa objectives" and other such talk in terms of falsifiable predictions about internal cognition and thus generalization behavior (e.g. going towards coin or going towards right or something else weirder than that).

Jozdien:

Do you think the default is that we'll end up with a bunch of separate things that look like internalized objectives so that the one used for planning can't really be identified mechanistically as such, or that only processes where they're really useful would learn them and that there would be multiple of them (or a third thing)? In the latter case I think the same underlying idea still applies - figuring out all of them seems pretty useful.

Dalcy:

Especially because we’re working with toy models that ostensibly fit the description of an optimizer, we may end up with a model that mechanistically doesn’t have an explicit notion of objective.

I think this is very likely to be the default for most toy models one trains RL on. In my model of agent value formation (which looks very much like this post), explicit representation of objectives is useful inasmuch as the model already has some sort of internal "optimizer" or search process. And before that, simple "heuristics" (or shards) should suffice—especially in small training regimes.

Jozdien:

Yeah, this is definitely something I consider plausible. But I don't have a strong stance because RL mechanics could lead to there being an internal search process for toy models (unless this is just my lack of awareness of some work that proves otherwise). That said, I definitely think that work on slightly larger models would be pretty useful and plausibly alleviates this, and is one of the things I'm planning on working on.

Yeah, IMO the "RL at scale trains search-based mesa optimizers" hypothesis predicts "solving randomly generated mazes via a roughly unitary mesa objective and heuristic search" with reasonable probability, and that seems like a toy domain to me.

One thing I imagine might be useful even in small training regimes would be to train on tasks where the only possible solution necessarily involves a search procedure, i.e. "search-y tasks".

For example, it's plausible that simple heuristics aren't sufficient to get you to superhuman level on tasks like Chess or Go, so superhuman RL performance on these tasks would be fairly good evidence that the model already has an internal search process.

But one problem with Chess or Go would be that the objective is fixed, i.e. the game rules. So perhaps one way to effectively isolate objectives in small training regimes is to find tasks that are both "search-y" and can be modified to have modularly varying objectives, e.g. Chess, but with various possible game rules.

Oh yeah I agree - I was thinking more along the lines that small models would end up with heuristics even for some tasks that require search to do really well, because they may have slightly complex heuristics learnable by models of that size that allow okay performance relative to the low-power search they would otherwise be capable of. I agree that this could make a quantitative difference though and hadn’t thought explicitly of structuring the task along this frame, so thanks!

I think the ability to identify/extract an ML model's objective from its weights could be very valuable; thank you for working on this.

Thumbs up for noticing the utility of asking "what even is an objective". Going a bit further meta, I think it's worth asking something like the following question:

Wouldn't it be nice if we were building AIs in some paradigm that enforces modularity between capabilities and objectives? With deep learning, we basically have to (try to) set the objective at train-time. And to the extent that an ML model ends up having an objective, it's probably inextricably baked into the weights comprising the entire system. The closest (that I'm aware) we can get to doing anything like "load a new objective at runtime" is by engineering prompts for LLMs; but that provides a rather underwhelming level of control.

I think deep learning might be practically hopeless for the purpose of building controllable AIs; where by controllable I mean here something like "can even be pointed at some specific objective, let alone a 'good' objective". Consequently, I kinda wish more alignment researchers would at least set a 2h timer and try really hard (for those 2h) to come up---privately---with some approach to building AIs that at least passes the bar of basic, minimal engineering sanity. (Like "design the system to even have an explicit control mechanism", and "make it possible to change the objective/destination without needing to understand or change the engine".)

What do you think; does that seem worth thinking about?

I'm glad you liked the post, thanks for the comment. :)

I think deep learning might be practically hopeless for the purpose of building controllable AIs; where by controllable I mean here something like "can even be pointed at some specific objective, let alone a 'good' objective". Consequently, I kinda wish more alignment researchers would at least set a 2h timer and try really hard (for those 2h) to come up---privately---with some approach to building AIs that at least passes the bar of basic, minimal engineering sanity. (Like "design the system to even have an explicit control mechanism", and "make it possible to change the objective/destination without needing to understand or change the engine".)

I don't have strong takes here about what possible training procedures and architectures that actually work outside the deep learning paradigm would look like, but naively it feels like any system where objectives are complex will still involve high-dimensional interface mechanisms to interact with them that we won't fully understand.

Within the deep learning paradigm, GPTs seem like the archetype for something like this, as you said - you can train a powerful world model that doesn't have an objective in any relevant sense and apply some conditional you want (like a simulacrum with a specific objective), but because you're interfacing with a very high-dimensional space to impart high-dimensional desires, the non-formalism seems like more a feature than a bug.

The closest (that I'm aware) we can get to doing anything like "load a new objective at runtime" is by engineering prompts for LLMs; but that provides a rather underwhelming level of control.

I think done right, it actually provides us a decent amount of control - but that it's often pretty unintuitive how to exert control, especially at higher degrees and precision because we have to have a really strong feel for what the prior it learns is and what kinds of posteriors you could get with some conditional.

(It's a slightly different problem then though, because you're not dealing with swapping out a new model objective, rather you're swapping out different simulacra with different goals.)

What do you think; does that seem worth thinking about?

I think there are a few separate ideas here worth mentioning. I disagree that deep learning is practically hopeless for building training procedures that actually result in some goal we want - I think it's really hard, but that there are plausible paths to success. Related to modularity for example, there's some work currently being done on modularizing neural networks conceptually from the ground up, sort of converting them into forms with modular computational components (unlike current neural networks where it's hard to call a neuron or a weight the smallest unit of optimization). The holy grail of this would plausibly involve a modularized component for "objective" if that's present in the model at all.

I expect that for better or worse, deep learning will probably be how we get to AGI, so I'm sceptical that thinking about new approaches to building AI outside it would yield object-level progress; it might be pretty useful in terms of illuminating certain ideas though, as a thought exercise.

In general? I think that going down this line of thought (if you aren't very pessimistic about deep learning) would plausibly find you working on interesting approaches to hard parts of the problem (I can see someone ending up with the kind of modularity approach above with this, for example) so it seems worth thinking about in absolute terms - in relative terms though, I'm not sure how it compares to other generators.

Thanks for the thoughtful response, and for the link to the sequence on modularity (hadn't seen that before). Will digest this.