By that I didn’t mean to imply that we care about mesa-optimisation in particular. I think that this demo working “as intended” is a good demo of an inner alignment failure, which is exciting enough as it is. I just also want to flag that the inner alignment failure doesn’t automatically provide an example of a mesa-optimiser.
I have now seen a few suggestions for environments that demonstrate misaligned mesa-optimisation, and this is one of the best so far. It combines being simple and extensible with being compelling as a demonstration of pseudo-alignment if it works (fails?) as predicted. I think that we will want to explore more sophisticated environments with more possible proxies later, but as a first working demo this seems very promising. Perhaps one could start even without the maze, just a gridworld with keys and boxes.
I don’t know whether observing key-collection behaviour here would be sufficient evidence to count for mesa-optimisation, if the agent has too simple a policy. There is room for philosophical disagreement there. Even with that, a working demo of this environment would in my opinion be a good thing, as we would have a concrete agent to disagree about.
Ah; this does seem to be an unfortunate confusion.
I didn’t intend to make ‘utility’ and ‘reward’ terminology – that’s what ‘mesa-‘ and ‘base’ objectives are for. I wasn’t aware of the terms being used in the technical sense as in your comment, so I wanted to use utility and reward as friendlier and familiar words for this intuition-building post. I am not currently inclined to rewrite the whole thing using different words because of this clash, but could add a footnote to clear this up. If the utility/reward distinction in your sense becomes accepted terminology, I’ll think about rewriting this.
That said, the distinctions we’re drawing appear to be similar. In your terminology, a utility-maximising agent is an agent which has an internal representation of a goal which it pursues. Whereas a reward-maximising agent does not have a rich internal goal representation but instead a kind of pointer to the external reward signal. To me this suggests your utility/reward tracks a very similar, if not the same, distinction between internal/external that I want to track, but with a difference in emphasis. When either of us says ‘utility ≠ reward’, I think we mean the same distinction, but what we want to draw from that distinction is different. Would you disagree?
Thanks for raising this. While I basically agree with evhub on this, I think it is unfortunate that the linguistic justification is messed up as it is. I'll try to amend the post to show a bit more sensitivity to the Greek not really working like intended.
Though I also think that "the opposite of meta"-optimiser is basically the right concept, I feel quite dissatisfied with the current terminology, with respect to both the "mesa" and the "optimiser" parts. This is despite us having spent a substantial amount of time and effort on trying to get the terminology right! My takeaway is that it's just hard to pick terms that are both non-confusing and evocative, especially when naming abstract concepts. (And I don't think we did that badly, all things considered.)
If you have ideas on how to improve the terms, I would like to hear them!
You’re completely right; I don’t think we meant to have ‘more formally’ there.
To some extent, but keep in mind that in another sense, the behavioural objective of maximising paperclips is totally consistent with playing along with the base objective for a while and then defecting. So I’m not sure the behaviour/mesa- distinction alone does the work you want it to do even in that case.
I’ve been meaning for a while to read Dennett with reference to this, and actually have a copy of Bacteria to Bach. Can you recommend some choice passages, or is it significantly better to read the entire book?
P.S. I am quite confused about DQN’s status here and don’t wish to suggest that I’m confident it’s an optimiser. Just to point out that it’s plausible we might want to call it one without calling PPO an optimiser.
P.P.S.: I forgot to mention in my previous comment that I enjoyed the objective graph stuff. I think there might be fruitful overlap between that work and the idea we’ve sketched out in our third post on a general way of understanding pseudo-alignment. Our objective graph framework is less developed than yours, so perhaps your machinery could be applied there to get a more precise analysis?
Thanks for an insightful comment. I think your points are good to bring up, and though I will offer a rebuttal I’m not convinced that I am correct about this.
What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?
Distinguishing the “this system looks like it optimises for X” from “this system internally uses an evaluation of X to make decisions” is useful from the point of view of making the abstraction more robust. The former doesn’t make clear what makes the abstraction “work”, and so when to expect it to fail. The latter will at least tell you what kind of failures to expect in the abstraction: places where the evaluation of X doesn’t connect to the rest of the system like it’s supposed to. In particular, you’re right that if the learned environment model doesn’t generalise, the mesa-objective won’t be predictive of behaviour. But that’s actually a prediction of taking this view. On the other hand, it is unclear if taking the behavioural view would predict that the system will change its behaviour off-distribution (partially, because it’s unclear what exactly grounds the similarities in behaviour on-distribution).
I think it definitely is useful to also think about the behavioural objective in the way you describe, because the later concerns we raise basically do also translate to coherent behavioural objectives. And I welcome more work trying to untangle these concepts from one another, or trying to dissolve any of them as unnecessary. I am just wary of throwing away seemingly relevant assumptions about internal structure before we can show they’re unhelpful.
You’re also right to point out DQN as an interesting edge case. But I am actually unsure that DQN agents should be considered non-optimisers, in the sense that they do perform rudimentary optimisation: they take an argmax of the Q function. The Q function is regressed to the episode returns. If the learning goes well, the Q function is literally representing the agent’s objective (indeed, it’s not really selected to maximise return; its selected to be accurate at predicting return). Contrast this with e.g. policy optimisation trained agents, which are not supposed to directly represent an objective, but are supposed to score well on it. (Someone good at running RL experiments maybe should look into comparing the coherence of revealed preferences of DQN agents with PPO agents. I’d read that paper.)
I think humans are fairly weird because we were selected for an objective that is unlikely to be what we select for in our AIs.
That said, if we model AI success as driven by model size and compute (with maybe innovations in low-level architecture), then I think that the way humans represent objectives is probably fairly close to what we ought to expect.
If we model AI success as mainly innovative high-level architecture, then I think we will see more explicitly represented objectives.
My tentative sense is that for AI to be interpretable (and safer) we want it to be the latter kind, but given enough compute the former kind of AI will give better results, other things being equal.
Here, what I mean by low-level architecture is something like “we’ll use lots of LSTMs instead of lots of plain RNNs, but keep the model structure simple: plug in the inputs, pass it through some layers, and read out the action probabilities”, and high-level is something like “let’s organise the model using this enormous flowchart with all of these various pieces that each are designed to take a particular role; here’s the observation embedding, here’s the search in latent model space, here’s the ...”
Yes, it probably doesn’t apply to most objectives. Though it seems to me that the closer the task is to something distinctly human, the more probable it is that this kind of consideration can apply. E.g., making judgements in criminal court cases and writing fiction are domains where it’s not implausible to me that this could apply.
I do think this is a pretty speculative argument, even for this sequence.