Jeremy Gillen

Comments

Quantilizing can be thought of as maximizing a lower bound on the expected true utility, where you know that your true utility  is close to your proxy utility function  in some region , such that . If we shape this closeness assumption a bit differently, such that the approximation gets worse faster, then sometimes it can be optimal to cut off the top of the distribution (as I did here, see some of the diagrams for quantilizers with the top cut off, I'll paste one below). 


The reason normal quantilizers don't do that is that they are minimizing the distance between  and the action distribution, by a particular measure that falls out of the proof (see above link), which allows the lower bound to be as high as possible. Essentially it's minimizing distribution shift, which allows a better generalization bound.
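For readers who want something concrete, here is a minimal sketch of a plain quantilizer over a finite action set. It assumes the base distribution is given as non-negative weights and the proxy utility as a function. The names `quantilize`, `base_weights` and `proxy_utility` are made up for illustration and don't come from the linked post. It keeps the top q of the base probability mass, ranked by proxy utility, and samples from that slice in proportion to the base distribution, so no action ends up more than 1/q times as likely as it was under the base distribution.

```python
import random


def quantilize(actions, base_weights, proxy_utility, q, rng=random):
    """Sample an action from the top-q slice of the base distribution.

    Actions are ranked by proxy utility; the best ones are kept until they
    account for a fraction q of the base probability mass, and the result is
    drawn from that slice in proportion to the base weights.
    """
    total = sum(base_weights)
    ranked = sorted(
        zip(actions, base_weights),
        key=lambda pair: proxy_utility(pair[0]),
        reverse=True,
    )
    kept_actions, kept_mass, mass = [], [], 0.0
    for action, weight in ranked:
        if mass >= q:
            break
        # Only part of the last action's mass may fit inside the top-q slice.
        take = min(weight / total, q - mass)
        kept_actions.append(action)
        kept_mass.append(take)
        mass += take
    return rng.choices(kept_actions, weights=kept_mass, k=1)[0]
```

With four equally likely actions, the identity function as proxy utility and q = 0.5, this only ever returns one of the two highest-scoring actions, each with probability 1/2.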

I think this distribution shift perspective is one way of explaining why we need randomization at all. A delta function is a bigger distribution shift than a distribution that matches the shape of .
But the next question is why we are even in a situation where we need to deal with the worst case across possible true utility functions. One story is that we are dealing with an optimizer that is maximizing true utility + error, and one way to simplify that is to model it as max min (true utility - error), where the min only controls the error function within the restrictions of the known bound.

I'm not currently super happy with that story and I'm keen for people to look for alternatives, or variations of soft optimization with different types of knowledge about the relationship between the proxy and true utility, because intuitively it does seem like taking the 99th-percentile action should be fine under slightly different assumptions.

One example of this is if we know that , where  is some heavy tailed noise, and we know the distribution of  (and ), then we can calculate the actual optimal percentile action to take, and we should deterministically take that action. But this is sometimes quite sensitive to small errors in our knowledge about the distribution of  and particularly . My AISC team has been testing scenarios like this as part of their research.
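A rough way to see this numerically is a Monte Carlo sketch. Everything specific in it is an assumption for illustration, not the setup my AISC team used: the true utility is standard normal, the noise is Student-t with 2 degrees of freedom, and the proxy is their sum. It sorts sampled actions by proxy utility, splits them into percentile bins, and reports the average true utility in each bin, so you can see which percentile is best to aim for.

```python
import random


def student_t(dof, rng):
    """Draw one Student-t sample as normal / sqrt(chi-square / dof)."""
    chi2 = rng.gammavariate(dof / 2.0, 2.0)
    return rng.gauss(0.0, 1.0) / (chi2 / dof) ** 0.5


def mean_true_utility_by_percentile(n_samples=100_000, n_bins=100,
                                    noise_scale=1.0, dof=2.0, seed=0):
    """Average true utility within each percentile bin of the proxy utility."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_samples):
        true_utility = rng.gauss(0.0, 1.0)            # assumed light-tailed
        noise = noise_scale * student_t(dof, rng)     # assumed heavy-tailed
        pairs.append((true_utility + noise, true_utility))
    pairs.sort()  # ascending by proxy utility
    bin_size = n_samples // n_bins
    means = []
    for b in range(n_bins):
        chunk = pairs[b * bin_size:(b + 1) * bin_size]
        means.append(sum(v for _, v in chunk) / len(chunk))
    return means


def best_percentile_bin(means):
    """Index of the proxy-percentile bin with the highest mean true utility."""
    return max(range(len(means)), key=means.__getitem__)
```

With `noise_scale=0` the best bin is the top one, as you'd expect. With the heavy-tailed noise switched on, the top bin is dominated by noise and the best bin sits somewhere below it. Changing `dof` or `noise_scale` moves the best bin around, which is the sensitivity I mentioned.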

I really like infrafunctions as a way of describing the goals of mild optimizers. But I don't think you've described the correct reasons why infrafunctions help with reflective stability. The main reason is you've hidden most of the difficulty of reflective stability in the  bound.

My core argument is that a normal quantilizer is reflectively stable[1] if you have such a bound. In the single-action setting, where it chooses a policy once at the beginning and then follows that policy, it must be reflectively stable because if the chosen policy constructs another optimizer that leads to low true utility, then that policy must have very low base probability (or the bound can't have been true). In a multiple-action setting, we can sample each action conditional on the previous actions, according to the quantilizer distribution, and this will be reflectively stable in the same way (given the bound).

Adding in observations doesn't change anything here if we treat U and V as being expectations over environments.

The way you've described reflective stability in the dynamic consistency section is an incentive to keep the same utility infrafunction no matter what observations are made. I don't see how this is necessary or even strongly related to reflective stability. Can't we have a reflectively stable CDT agent?

Two core difficulties of reflective stability 

I think the two core difficulties of reflective stability are 1) getting the  bound (or similar) and 2) describing an algorithm that lazily does a ~minimal amount of computation for choosing the next few actions. I expect realistic agents need 2 for efficiency. I think utility infrafunctions do help with both of these, to some extent.

The key difficulty of getting a tight  bound with normal quantilizers is that simple priors over policies don't clearly distinguish policies that create optimizers. So there's always a region at the top where "create an optimizer" makes up most of the mass. My best guess for a workaround for this is to draw simple conservative OOD boundaries in state-space and policy-space (the base distribution is usually just over policy space, and is predefined). When a boundary is crossed, it lowers the lower bound on the utility (gives Murphy more power). These boundaries need to be simple so that they can be learned from relatively few (mostly in-distribution) examples, or maybe from abstract descriptions. Being simple and conservative makes them more robust to adversarial pressure. 

Your utility infrafunction is a nice way to represent lots of simple out-of-distribution boundaries in policy-space and state-space. This is much nicer than storing this information in the base distribution of a quantilizer, and it also allows us to modulate how much optimization pressure can be applied to different regions of state or policy-space.

With 2, an infrafunction allows on-the-fly calculation that the consequences of creating a particular optimizer are bad. It can do this as long as the infrafunction treats the agent's own actions and the actions of child-agents as similar, or if it mostly relies on OOD states as the signal that the infrafunction should be uncertain (have lots of low spikes), or some combination of these. Since the max-min calculation is the motivation for randomizing in the first place, an agent that uses this will create other agents that randomize in the same way. If the utility infrafunction is only defined over policies, then it doesn't really give us an efficiency advantage because we already had to calculate the consequences of most policies when we proved the bound.

One disadvantage, which I think can't be avoided, is that an infrafunction over histories is incentivized to stop humans from doing actions that lead to out-of-distribution worlds, whereas an infrafunction over policies is not (to the extent that stopping humans doesn't itself cross boundaries). This seems necessary because it needs to consider the consequences of the actions of optimizers it creates, and this generalizes easily to all consequences since it needs to be robust.
 

  1. ^

    Where I'm defining reflective stability as: If you have an anti-Goodhart modification in your decision process (e.g. randomization), ~never follow a plan that indirectly avoids the anti-Goodhart modification (e.g. making a non-randomized optimizer). 

    The key difficulty here being that the default pathway for achieving a difficult task involves creating new optimization procedures, and by default these won't have the same anti-Goodhart properties as the original.

Thanks! 

  1. I think it's more accurate to say it's incomplete. And the standard generalization bound math doesn't make that prediction as far as I'm aware, it's just the intuitive version of the theory that does. I've been excited by the small amount of singular learning theory stuff I've read. I'll read more, thanks for making that page.
  2. Fantastic!

No. Justin knows roughly what the intended future posts would contain, but after I started writing I didn't feel like I understood it well enough to distill it properly and I lost motivation, and since then I've been too busy.
I'll send you the notes that we had after Justin explained his ideas to me.

The paperclip metaphor is not very useful if interpreted as "humans tell the AI to make paperclips, and it does that, and the danger comes from doing exactly what we said because we gave it a dumb goal".

There is a similar-ish interpretation, which is good and useful, which is "if the AI is going to do exactly what you say, you have to be insanely precise when you tell it what to do, otherwise it will Goodhart the goal." The danger comes from Goodharting, rather than humans telling it a dumb goal. The paperclip example can be used to illustrate this, and I think this is why it's commonly used. 

And in the first tweet he is referencing (with inner alignment) the point that we will have very imprecise (think evolution-like) methods of communicating a goal to an AI-in-training.

So apparently he intended the metaphor to communicate that the AI-builders weren't trying to set "make paperclips" as the goal; they were aiming for a more useful goal, and "make paperclips" happened to be the goal that it latched on to. Tiny molecular squiggles are a better example here because they are a more realistic optimum of an imperfectly learned goal representation.

  • On it always being a rescaled subset: Nice! This explains the results of my empirical experiments. Jessica made a similar argument for why quantilizers are optimal, but I hadn't gotten around to trying to adapt it to this slightly different situation. It makes sense now that the maximin distribution is like quantilizing against the value lower bound, except that the value lower bound changes if you change the minimax distribution. This explains why some of the distributions are exactly quantilizers and some are not: it depends on whether that value lower bound drops lower than the start of the policy distribution.
     
  • On planning: Yeah it might be hard to factorize the final policy distribution. But I think it will be easy to approximately factorize the prior in lots of different ways. And I'm hopeful that we can prove that some approximate factorizations maintain the same q value, or maybe only have a small impact on the q value. Haven't done any work on this yet.
    • If it turns out we need near-exact factorizations, we might still be able to use sampling techniques like rejection sampling to correct an approximate sampling distribution, because we have easy access to the correct density of samples that we have generated (just prior/q). We would just need an approximate distribution to use for getting high value samples more often, which seems straightforward.
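
To make the rejection sampling idea concrete, here is a minimal sketch on a toy discrete problem. The setup is assumed purely for illustration: ten actions with a uniform prior, q = 0.2, so the correct density is prior/q = 0.5 on the top two actions and 0 elsewhere, plus a hand-picked proposal that favours higher-value actions. The function and variable names are hypothetical. The general recipe is to draw from the approximate distribution and accept with probability target / (bound × proposal), where `bound` is at least the largest ratio target / proposal.

```python
import random


def rejection_sample(target_density, proposal_sample, proposal_density, bound, rng):
    """Draw one exact sample from target_density using an approximate proposal.

    Requires target_density(x) <= bound * proposal_density(x) for every x
    the proposal can produce.
    """
    while True:
        x = proposal_sample(rng)
        accept_prob = target_density(x) / (bound * proposal_density(x))
        if rng.random() < accept_prob:
            return x


# Toy setup (all values assumed): 10 actions, uniform prior, q = 0.2.
ACTIONS = list(range(10))
PRIOR = 0.1
Q = 0.2
TOP_SET = {8, 9}  # the top-q actions by value under the uniform prior


def target_density(a):
    """Correct sampling density: prior / q inside the top set, 0 outside."""
    return PRIOR / Q if a in TOP_SET else 0.0


# Approximate proposal that puts more mass on high-value actions.
PROPOSAL_WEIGHTS = [a + 1 for a in ACTIONS]
PROPOSAL_TOTAL = sum(PROPOSAL_WEIGHTS)


def proposal_density(a):
    return PROPOSAL_WEIGHTS[a] / PROPOSAL_TOTAL


def proposal_sample(rng):
    return rng.choices(ACTIONS, weights=PROPOSAL_WEIGHTS, k=1)[0]


# Smallest valid bound: the maximum of target / proposal over the top set.
BOUND = max(target_density(a) / proposal_density(a) for a in TOP_SET)
```

Calling `rejection_sample(target_density, proposal_sample, proposal_density, BOUND, random.Random(0))` repeatedly returns only actions 8 and 9, each about half the time, even though the proposal itself prefers 9 to 8. The closer the proposal is to the correct distribution, the fewer draws get rejected.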

Thanks for clarifying, I misunderstood your post and must have forgotten about the scope, sorry about that. I'll remove that paragraph. Thanks for the links, I hadn't read those, and I appreciate the pseudocode.

I think most likely I still don't understand what you mean by grader-optimizer, but it's probably better to discuss on your post after I've spent more time going over your posts and comments.

My current guess in my own words is: A grader-optimizer is something that approximates argmax (has high optimization power)?
And option (1) acts a bit like a soft optimizer, but with more specific structure related to shards, and how it works out whether to continue optimizing?

Why does the infinite limit of value learning matter if we're doing soft optimization against a fixed utility distribution?

I also think that it's probably worth comparing soft optimization to the old Impact Measures work from this community -- in particular, I think it'd be interesting to cast soft optimization methods as robust optimization, and then see how the critiques raised against impact measures (e.g. in this comment or this question) apply to soft optimization methods like RL-KL or the minimax objective you outline here.

Thanks for linking these, I hadn't read most of them. As far as I can tell, most of the critiques don't really apply to soft optimization. The main one that does is Paul's "drift off the rails" thing. I expect we need to use the first AGI (with soft opt) to help solve alignment in a more permanent and robust way, then use that to make a more powerful AGI that helps avoid "drifting off the rails".

In my understanding, impact measures are an important part of the utility function that we don't want to get wrong, but not much more than that. Whereas soft optimization directly removes Goodharting of the utility function. It feels like the correct formalism for attacking the root of that problem. Whereas impact measures just take care of a (particularly bad) symptom.

Abram Demski has a good answer to the question you linked that contrasts mild optimization with impact measures, and it's clear that mild optimization is preferred. And Abram actually says:

An improvement on this situation would be something which looked more like a theoretical solution to Goodhart's law, giving an (in-some-sense) optimal setting of a slider to maximize a trade-off between alignment and capabilities ("this is how you get the most of what you want"), allowing ML researchers to develop algorithms orienting toward this.

This is exactly what I've got.

I agree that it's good to try to answer the question "under what sort of reliability guarantee is my model optimal?", and it's worth making the optimization power vs robustness trade-off explicit via toy models like the one you use above.

That being said, re: the overall approach. Almost every non-degenerate regularization method can be thought of as "optimal" wrt some robust optimization problem (in the same way that non-degenerate optimization can be trivially cast as Bayesian optimization) -- e.g. the RL - KL objective with respect to some  is optimal for the following minimax problem:

for some . So the question is not so much "do we cap the optimization power of the agent" (which is a pretty common claim!) but "which way of regularizing agent policies more naturally captures the robust optimization problems we want solved in practice". 

Yep, agreed. Except I don't understand how you got that equation from RL with KL penalties, can you explain that further? 

I think the most novel part of this post is showing that this robust optimization problem (maximizing average utility while avoiding selection for upward errors in the proxy) is the one we want to solve, and that it can be done with a bound that is intuitively meaningful and can be determined without just guessing a number.

(It's also worth noting that an important form of implicit regularization is the underlying capacity/capability of the model we're using to represent the policy.)

Yeah I wouldn't want to rely on this without a better formal understanding of it though. KL regularization I feel like I understand.
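
For reference, here is the part of KL regularization that is standard, as a small sketch. Over a finite action set, the policy that maximizes expected utility minus beta times KL(policy || base) is the base policy reweighted by exp(utility / beta). This only illustrates that closed form; it is not the minimax problem from the quoted comment. The function names and the numbers in the usage note are made up.

```python
import math


def kl_regularized_policy(base_probs, utilities, beta):
    """Maximizer of E_pi[U] - beta * KL(pi || base) over a finite action set.

    The solution is pi(a) proportional to base(a) * exp(U(a) / beta).
    Assumes every base probability is strictly positive and beta > 0.
    """
    logits = [math.log(p) + u / beta for p, u in zip(base_probs, utilities)]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def kl_objective(policy, base_probs, utilities, beta):
    """Expected utility minus beta times the KL divergence from the base policy."""
    expected_u = sum(p * u for p, u in zip(policy, utilities))
    kl = sum(p * math.log(p / b) for p, b in zip(policy, base_probs) if p > 0)
    return expected_u - beta * kl
```

For example, with `base_probs = [0.5, 0.3, 0.2]`, `utilities = [0.0, 1.0, 2.0]` and `beta = 1.0`, the returned policy shifts mass toward the third action, and its objective value equals beta * log(sum of base(a) * exp(U(a) / beta)). A large beta keeps the policy close to the base distribution; a small beta approaches argmax.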
