Thanks for clarifying, I misunderstood your post and must have forgotten about the scope, sorry about that. I'll remove that paragraph. Thanks for the links, I hadn't read those, and I appreciate the pseudocode.
I think most likely I still don't understand what you mean by grader-optimizer, but it's probably better to discuss on your post after I've spent more time going over your posts and comments.
My current guess in my own words is: A grader-optimizer is something that approximates argmax (has high optimization power)?
And option (1) acts a bit like a soft optimizer, but with more specific structure related to shards, and how it works out whether to continue optimizing?
Why does the infinite limit of value learning matter if we're doing soft optimization against a fixed utility distribution?
I also think it's probably worth comparing soft optimization to the old Impact Measures work from this community -- in particular, I think it'd be interesting to cast soft optimization methods as robust optimization, and then see how the critiques raised against impact measures (e.g. in this comment or this question) apply to soft optimization methods like RL-KL or the minimax objective you outline here.
Thanks for linking these, I hadn't read most of these. As far as I can tell, most of the critiques don't really apply to soft optimization. The main one that does is Paul's "drift off the rails" thing. I expect we need to use the first AGI (with soft opt) to help solve alignment in a more permanent and robust way, then use that to make a more powerful AGI that helps avoid "drifting off the rails".
In my understanding, impact measures are an important part of the utility function that we don't want to get wrong, but not much more than that. Whereas soft optimization directly removes Goodharting of the utility function. It feels like the correct formalism for attacking the root of that problem. Whereas impact measures just take care of a (particularly bad) symptom.
Abram Demski has a good answer to the question you linked that contrasts mild optimization with impact measures, and it's clear that mild optimization is preferred. And Abram actually says:
An improvement on this situation would be something which looked more like a theoretical solution to Goodhart's law, giving an (in-some-sense) optimal setting of a slider to maximize a trade-off between alignment and capabilities ("this is how you get the most of what you want"), allowing ML researchers to develop algorithms orienting toward this.
This is exactly what I've got.
I agree that it's good to try to answer the question "under what sort of reliability guarantee is my model optimal?", and it's worth making the optimization power vs robustness trade-off explicit via toy models like the one you use above.
That being said, re: the overall approach: almost every non-degenerate regularization method can be thought of as "optimal" wrt some robust optimization problem (in the same way that non-degenerate optimization can be trivially cast as Bayesian optimization) -- e.g. the RL-KL objective with respect to some reference policy is optimal for a corresponding minimax problem, for an appropriate choice of the problem's parameters. So the question is not so much "do we cap the optimization power of the agent" (which is a pretty common claim!) but "which way of regularizing agent policies more naturally captures the robust optimization problems we want solved in practice".
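For concreteness, here is one standard way such a duality can arise (a sketch only, possibly not the exact minimax problem the comment above displayed; here U is the proxy utility, π_0 the reference policy, β > 0 the KL coefficient, and Ũ the adversarially perturbed utility, all notation introduced for this sketch). By the Donsker-Varadhan representation of KL divergence,

$$
\max_\pi \; \mathbb{E}_{x\sim\pi}[U(x)] - \beta\, D_{\mathrm{KL}}(\pi\|\pi_0)
\;=\; \max_\pi \;\min_{\tilde U}\; \mathbb{E}_{x\sim\pi}[\tilde U(x)] \;+\; \beta \log \mathbb{E}_{x\sim\pi_0}\!\left[e^{(U(x)-\tilde U(x))/\beta}\right],
$$

i.e. the KL-regularized policy is optimal against an adversary who can lower the utility anywhere, but pays a penalty for perturbations that are large on outcomes the reference policy π_0 considers likely.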
Yep, agreed. Except I don't understand how you got that equation from RL with KL penalties; can you explain that further?
I think the most novel part of this post is showing that this robust optimization problem (maximizing average utility while avoiding selection for upward errors in the proxy) is the one we want to solve, and that it can be done with a bound that is intuitively meaningful and can be determined without just guessing a number.
(It's also worth noting that an important form of implicit regularization is the underlying capacity/capability of the model we're using to represent the policy.)
Yeah I wouldn't want to rely on this without a better formal understanding of it though. KL regularization I feel like I understand.
I've probably misunderstood your comment, but I think this post already does most of what you are suggesting (except for the very last bit about including human feedback)? It doesn't assume the human's utility function is some real thing that it will update toward; it has a fixed distribution over utility throughout deployment. There's no mechanism for updating that distribution, so it can't become arbitrarily certain about the utility function.
And that distribution isn't treated like epistemic uncertainty; it's used to find a worst-case lower bound on utility?
Good point: policies that have upward errors will still be preferentially selected for (a little). However, with this approach, the amount of Goodharting should be constant as the proxy quality (and hence optimization power) scales up.
I agree with your second point, although I think there's a slight benefit over original quantilizers because the parameter is set theoretically, rather than arbitrarily by hand. Hopefully this makes it less tempting to mess with it.
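As a toy illustration of the "selecting for upward errors" point (a numerical sketch only; the Gaussian proxy-noise model and the quantile parameter q below are made up for illustration and are not the post's formalism):

```python
import numpy as np

rng = np.random.default_rng(0)

n_options = 100_000
true_utility = rng.normal(0.0, 1.0, n_options)   # what we actually care about
proxy_error = rng.normal(0.0, 1.0, n_options)    # noise in the proxy measurement
proxy = true_utility + proxy_error                # what the optimizer sees

# Hard optimization: argmax the proxy. This preferentially selects options
# whose proxy error happens to be large and positive (Goodhart).
best = np.argmax(proxy)
print("argmax:      proxy =", proxy[best], "true =", true_utility[best],
      "upward error =", proxy_error[best])

# Soft optimization (quantilizer-style): sample uniformly from the top q
# fraction of options by proxy value, instead of taking the single best.
q = 0.01  # illustrative choice; the post argues for deriving a principled bound instead
cutoff = np.quantile(proxy, 1.0 - q)
top_q = np.flatnonzero(proxy >= cutoff)
pick = rng.choice(top_q)
print("quantilizer: proxy =", proxy[pick], "true =", true_utility[pick],
      "upward error =", proxy_error[pick])
```

In runs like this the argmax pick's upward proxy error is systematically larger than the quantilizer pick's; the claim above is that a principled bound keeps this selection pressure roughly constant as optimization power scales.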
Thanks, you are right on both. I don't know how I missed the simplification; I remember wanting to make the analytical form as simple as possible.
I really should have added the reference for this, since I just copied it from a paper, so I've now done that in a footnote. I followed the derivation a bit further and the parts I checked seem solid, but it's annoying that it's spread out over three papers.
Yeah good point, I should have put more detail here.
My understanding is that, for most common initialization distributions and architectures, both of those terms vanish in the infinite width limit. This is because they both end up being expectations of random variables that are symmetrically distributed around 0.
However, in the finite width regime, if we want to be precise, we can simply add those terms back onto the kernel regression.
So really, with finite width, those terms should appear in the expression.
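For reference, a plausible form of the finite-width expression, assuming the standard linearized-network result under squared loss (e.g. Lee et al. 2019, "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent"); this may not match the post's exact notation:

$$
f_{\infty}(x) \;\approx\; \underbrace{\Theta(x, X)\,\Theta(X, X)^{-1} Y}_{\text{kernel regression}} \;+\; \underbrace{f_{0}(x) \;-\; \Theta(x, X)\,\Theta(X, X)^{-1} f_{0}(X)}_{\text{initialization terms}},
$$

where Θ is the empirical NTK, X and Y are the training inputs and targets, and f_0 is the network at initialization. The two f_0 terms have expectation zero over initialization, consistent with the "symmetric around 0" argument above.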
There are a few other very non-rigorous parts of our explanation. Another big one is that the solution is underspecified by the data in the infinite width limit, so it could fit the data in lots of ways. Stuff about ridge-regularized regression and bringing in details about gradient descent fixes this, I believe, but I'm not totally sure whether it changes anything at finite width.
The paperclip metaphor is not very useful if interpreted as "humans tell the AI to make paperclips, and it does that, and the danger comes from doing exactly what we said because we said a dumb goal".
There is a similar-ish interpretation, which is good and useful: "if the AI is going to do exactly what you say, you have to be insanely precise when you tell it what to do, otherwise it will Goodhart the goal." The danger comes from Goodharting, rather than from humans telling it a dumb goal. The paperclip example can be used to illustrate this, and I think this is why it's commonly used.
And in the first tweet he is referencing (with inner alignment) the fact that we will have very imprecise (think evolution-like) methods of communicating a goal to an AI-in-training.
So apparently he intended the metaphor to communicate that the AI-builders weren't trying to set "make paperclips" as the goal; they were aiming for a more useful goal, and "make paperclips" happened to be the goal that it latched on to. Tiny molecular squiggles is better here because it's a more realistic optimum of an imperfectly learned goal representation.