Please note that I'm currently working on a correction for part of this post — the form of the mesa-objective G(t) I'm claiming is in fact wrong, as Charlie correctly alludes to in a sibling comment.
Sure, makes sense! Though to be clear, I believe what I'm describing should apply to optimizers other than just gradient descent — including optimizers one might think of as reward-maximizing agents.
Great post! Thanks for writing this — it feels quite clarifying. I'm finding the diagram especially helpful in resolving the sources of my confusion.
I believe everything here is consistent with the definitions I proposed recently in this post (though please do point out any inconsistencies if you see them!), with the exception of one point.
This may be a fundamental confusion on my part — but I don't see objective robustness, as defined here, as being a separate concept at all from inner alignment. The crucial point, I would argue, is that we ought to be treating the human who designed our agent as the base optimizer for the entire system.
Zooming in on the "inner alignment → objective robustness" part of the diagram, I think what's actually going on is something like:
That is to say, I think there are three levels of optimizers being invoked implicitly here, not just two. Through that lens, "intent alignment", as defined here, is what I'd call "inner alignment between the researcher and the agent"; and "inner alignment", as defined here, is what I'd call "inner alignment between the agent and the mesa-optimizer it may give rise to".
In other words, humans live in this hierarchy too, and we should analyze ourselves in the same terms — and using the same language — as we'd use to analyze any other optimizer. (I do, for what it's worth, make this point in my earlier post — though perhaps not clearly enough.)
Incidentally, this is one of the reasons I consider the concepts of inner alignment and mesa-optimization to be so compelling. When a conceptual tool we use to look inside our machines can be turned outward and aimed back at ourselves, that's a promising sign that it may be pointing to something fundamental.A final caveat: there may well be a big conceptual piece that I'm missing here, or a deep confusion that I have around one or more of these concepts that I'm still unaware of. But I wanted to lay out my thinking as clearly as I could, to make it as easy as possible for folks to point out any mistakes — would enormously appreciate any corrections!
I think there might be a minor typo in Section 2.2:
For transitivity, assume that for i=1,2
I think this should be i=0,1 based on the indexing in the rest of the paragraph.
Thanks for the kind words, Adam! I'll follow up over DM about early drafts — I'm interested in getting feedback that's as broad as possible and really appreciate the kind offer here.
Typo is fixed — thanks for pointing it out!
At first I wondered why you were taking the sum instead of just C(L)=limT→∞L(T)−L(0)T, but after thinking about it, the latter would probably converge to 0 almost all the time, because even with amazing optimization, the loss will stop being improved by a factor linear in T at some point. That might be interesting to put in the post itself.
Yes, the problem with that definition would indeed be that if your optimizer converges to some limiting loss function value like limT→∞L(T)=L∞, then you'd get limT→∞L(T)−L(0)T=limT→∞L∞−L(0)T=0 for any L∞.
Thanks for the comment!
Not sure if I agree with your interpretation of the "real objective" - might be better served by looking for stable equilibria and just calling them as such.
I think this is a reasonable objection. I don't make this very clear in the post, but the "true objective" I've written down in the example indeed isn't unique: like any measure of utility or loss, it's only unique up to affine transformations with positive coefficients. And that could definitely damage the usefulness of these definitions, since it means that alignment factors, for example, aren't uniquely defined either. (I'll be doing a few experiments soon to investigate this, and a few other questions, in a couple of real systems.)
Don't we already have weak alignment to arbitrary functions using annealing (basically, jump at random, but jump around more/further on average when the loss is higher and lower the jumping rate over time)? The reason we don't add small annealing terms to gradient descent is entirely because of we expect them to be worse in the short term (a "strong alignment" question).
Interesting question! To try to interpret in light of the definitions I'm proposing: adding annealing changes the true objective (or mesa-objective) of the optimizer, which is no longer solely trying to minimize its gradients — it now has this new annealing term that it's also trying to optimize for. Whether this improves alignment or not depends on the effect annealing has on 1) the long-term performance of the mesa-optimizer on its new (gradient + annealing) objective; and 2) the long-term performance this induces on the base objective.
Hope that's somewhat helpful, but please let me know if it's unclear and I can try to unpack things a bit more!
Great framework - feels like this is touching on something fundamental.
I'm curious: is the controllable / observable terminology intentionally borrowed from control theory? Or is that a coincidence?