Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Thanks to several people for useful discussion, including Evan Hubinger, Leo Gao, Beth Barnes, Richard Ngo, William Saunders, Daniel Ziegler, and probably others -- let me know if I'm leaving you out.

See also: Obstacles to Gradient Hacking by Leo Gao.

I want to argue that gradient hacking (strongly enough to create a local minimum) is possible even under somewhat pessimistic assumptions. The specific strategy that achieves this is kind of interesting: it requires the agent to coordinate on a complicated task with versions of itself that have slightly different objectives, and it requires the agent to do this without any direct communication.

A simple gradient-hacking strategy for a mesa-optimizer is to deliberately perform worse on the base objective if your mesa-objective is changed. However, this isn't as straightforward as it sounds.

One difficulty is that if the gradient hacking is hard-coded, gradient descent can just remove the hacking behavior without touching anything else. A nice way around that is to assume that gradient hacking is discovered from first principles during the model's computation. (Thanks to whoever pointed this out to me.) If the model is good enough at reasoning, then it will be pretty hard for the base optimizer to modify it in ways that stop it from noticing that gradient hacking would be a good idea right about now.

(For simplicity, I have in mind something like a really deep feed-forward network, possibly with some weights tied between layers. Equivalently, a recurrent net unrolled and trained via SGD. This may not be realistic but it's easier to think about as a test case.)

A bigger problem is that it's not really correct to think of a model (with particular parameter values) as having control over its gradients. It's better to think of it as having control (in an FDT kind of sense) over only the single value of the loss for its actual parameters. The value of the loss at nearby points in parameter space is decided by slightly different versions of itself, and the gradient is sort of decided collectively. This is important if we're talking about SGD steps that change its mesa-objective, because these nearby versions of the model will have directly conflicting incentives when it comes to the gradient. To say "I am such that gradients on nearby mesa-objectives point towards my mesa-objective" is to say "if I had a different mesa-objective, I would deliberately do slightly worse in order to end up with this one instead." But you wouldn't! That would be bad for your mesa-objective!
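
To make the "decided collectively" picture concrete, here is a toy finite-difference sketch (my own illustration, not the actual backprop computation used in training): a gradient at a point is a function of loss values at neighboring parameter settings, each of which, in the FDT framing, is chosen by a slightly different version of the model.

```python
# Toy finite-difference picture (an illustration, not how SGD actually computes
# gradients): the gradient at theta is determined by loss values at nearby
# parameter settings, each "chosen" by a slightly different version of the model.
import numpy as np

def loss(theta):
    # Hypothetical stand-in for loss on the base objective at parameters theta.
    return float(np.sum((theta - 1.0) ** 2))

def collective_gradient(theta, eps=1e-4):
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        bump = np.zeros_like(theta)
        bump[i] = eps
        # These two loss values come from theta's neighbors, not from theta itself.
        grad[i] = (loss(theta + bump) - loss(theta - bump)) / (2 * eps)
    return grad

theta = np.array([0.0, 2.0])
print(collective_gradient(theta))  # ~[-2., 2.], set jointly by theta's neighbors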

Another way to look at this is that it's really hard to implement "do worse if my objective changes" if you don't have an external way to tell that your objective used to be different, let alone a way to commit other versions of yourself to things they don't want to do. You can implement rules like "if your mesa-objective isn't X, throw a fit", but the base optimizer will just change X at the same time it changes your actual mesa-objective. (Actually, it doesn't have to: we're assuming that gradient hacking has to be derived from first principles on every forward pass, so X also has to be derived from first principles. But even without that assumption this strategy wouldn't work.)
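
As a toy illustration of why the hard-coded check fails (the setup and numbers here are mine, purely for concreteness): if both the mesa-objective and the reference value X live in the parameters, gradient descent gets a gradient on X too, and just moves it along with the objective, so the "fit" is never thrown.

```python
# Toy sketch (made-up numbers): a hard-coded "throw a fit if my mesa-objective
# isn't X" penalty doesn't protect the objective, because gradient descent also
# gets a gradient on X and simply drags it along with the objective.

BASE_TARGET = 0.0  # where the base objective is pulling the mesa-objective

def loss(mesa_obj, x_reference, tantrum_scale=10.0):
    base_loss = (mesa_obj - BASE_TARGET) ** 2
    tantrum = tantrum_scale * (mesa_obj - x_reference) ** 2  # the "fit"
    return base_loss + tantrum

def grads(mesa_obj, x_reference, eps=1e-5):
    dm = (loss(mesa_obj + eps, x_reference) - loss(mesa_obj - eps, x_reference)) / (2 * eps)
    dx = (loss(mesa_obj, x_reference + eps) - loss(mesa_obj, x_reference - eps)) / (2 * eps)
    return dm, dx

mesa, x = 1.0, 1.0  # start with the check satisfied
for _ in range(2000):
    dm, dx = grads(mesa, x)
    mesa, x = mesa - 0.01 * dm, x - 0.01 * dx
print(round(mesa, 3), round(x, 3))  # both end up ~0: X got moved along with the objective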

You can leave messages for yourself, if you have somewhere to jot down durable notes (like the external world), but let's assume you don't.

Maybe your mesa-objective is something nice and clear and crystalline, an island in mesa-objective parameter space, and you can implement the rule "do badly in proportion to how un-crystalline your mesa-objective is". But it's not clear why your less-crystalline neighbors in parameter space would cooperate with this plan.

Well... maybe if they're nearby enough, they'd be happy to settle on your mesa-objective rather than get pushed all the way to the base objective? This might work for some specific kinds of mesa-objectives.

More broadly, something that kind of works is to just push in the opposite direction from the base objective: do badly on the task in proportion to how mesa-aligned you are, so that you don't become any more mesa-aligned in the future. This possibly works well enough to level the playing field, so that the gradient in mesa-objective-space is flat. The main difficulty is that you need to pick the size of the loss penalty you're going to impose on yourself for being N units of mesa-aligned. The bigger the penalty, the further you can be from the base objective while still keeping the gradient flat -- but the versions of you closer to the base objective would prefer a smaller penalty, because it's probably costly in other ways. And of course, in the limit, the version of yourself that's aligned with the base objective isn't going to be willing to pay any penalty. So you probably need to coordinate (silently) with the versions of yourself whose mesa-objectives lie within some range of distances from the base objective.
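
A one-dimensional toy version of this (my own numbers, with "degree of mesa-alignment" collapsed to a single coordinate) shows the basic arithmetic: a self-imposed penalty whose slope matches the task loss's slope flattens the gradient along the alignment direction, at the price of uniformly worse loss.

```python
# Toy sketch (made-up setup): flatten the gradient along a 1-D "degree of
# mesa-alignment" coordinate a, where a = 1 means fully aligned with the base
# objective, by deliberately doing worse in proportion to a.
import numpy as np

def task_loss(a):
    # Hypothetical: performance improves (loss drops) as mesa-alignment increases.
    return 1.0 - a

def hacked_loss(a, penalty_per_unit):
    # Self-imposed penalty proportional to how mesa-aligned you are.
    return task_loss(a) + penalty_per_unit * a

a = np.linspace(0.0, 1.0, 5)
for p in [0.0, 0.5, 1.0]:
    slope = np.gradient(hacked_loss(a, p), a)
    print(f"penalty slope {p}: d(loss)/d(alignment) ~ {slope.round(2)}")
# With penalty slope 1.0 the gradient along the alignment direction is ~0, so SGD
# stops pushing a upward -- but the observed loss is 1.0 everywhere instead of 1 - a.
```

The tradeoff from the paragraph above shows up directly: a bigger penalty slope flattens (or even reverses) the pull toward the base objective, but every version of you pays more loss for it.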

A somewhat better version of that is to pick a fairly dense grid of "Schelling mesa-objectives" in mesa-objective-parameter space, and use moderate fluctuations of loss to push your mesa-objective towards the nearest Schelling point, making it a local minimum. If the grid is dense enough, even a small variation in loss can create steep enough walls to make this work.
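
Here's a toy one-dimensional picture of that (my own numbers; mesa-objective space collapsed to a single coordinate m, with the base objective at m = 0): a sawtooth-shaped self-imposed penalty around a dense grid of Schelling points turns every grid point into a local minimum, even against a steady pull from the base objective.

```python
# Toy sketch (made-up numbers): a dense grid of Schelling mesa-objectives plus a
# small sawtooth penalty for drifting off the nearest grid point creates a local
# minimum at every grid point, despite a gentle pull toward the base objective.
import numpy as np

GRID = 0.1           # spacing of the Schelling points in mesa-objective space
PENALTY_SLOPE = 0.5  # how hard the agent punishes itself for drifting
BASE_SLOPE = 0.2     # steady pull of the base objective toward m = 0

def nearest_schelling(m):
    return np.round(m / GRID) * GRID

def observed_loss(m):
    return BASE_SLOPE * m + PENALTY_SLOPE * np.abs(m - nearest_schelling(m))

m = np.linspace(0.0, 0.35, 701)
loss = observed_loss(m)
is_min = (loss[1:-1] < loss[:-2]) & (loss[1:-1] < loss[2:])
print(np.round(m[1:-1][is_min], 2))  # [0.1 0.2 0.3]: each grid point is a local minimum
```

The walls hold only as long as the self-imposed penalty's slope beats the base objective's pull, which is the "even a small variation in loss can create steep enough walls" condition above.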


Ideally, you and your parameter-adjusted kin would all prefer to have a Schelling point right under your own mesa-objectives, but you can't all get that at the same time. The next-best thing is to have the grid of Schelling points (and hence local minima) be dense enough that there's one within epsilon of your actual preferences. In practice, the value of epsilon is bounded below by the SGD step size and stochasticity.

It would be easier if you could just encode the locations of the Schelling points somewhere else in your parameters, but that would leave an opening for the base optimizer to mess with them and average out your loss fluctuations to nothing. (In other words, this fails for the same reason a hard-coded gradient-hacking procedure probably fails.)

It seems probable to me that it's possible to pick these Schelling points without communicating; meeting someone at a non-prearranged spot is easier if they're almost exactly a clone of you. That isn't a very solid argument, but I doubt that this part is going to be the crux.

This proposal doesn't do everything that's needed for the original context Evan suggested gradient hacking in, which involves trying to push the base optimizer to hard-code deceptive behavior into an otherwise-aligned future version of yourself. I think it's a step in the right (er... wrong?) direction, though.

In particular, the Schelling-objective strategy lets arbitrary mesa-optimizers preserve their mesa-objectives (within epsilon), create "artificial" local minima, and force the base optimizer to settle for severely subpar performance on the base objective. It also does this without needing any of the behavior to be hard-coded where the base optimizer could get at it: the strategy can be discovered and executed from scratch by a general-purpose reasoner with amnesia.

# Comments

Sounds like a job for quantum machine learning :P (See e.g. figure 1 here)

But actually, I'm mostly just skeptical that this really works in high-dimensional space with no access to the actual parameters of the model. You make some intuitive arguments that I feel rely on low-dimensional intuitions about how easy it is to create local minima rather than saddle points. E.g. in sufficiently high dimensions we might simply find that there's some direction in parameter-space that directly changes what the agent thinks is the Schelling point in mesa-objective space, thus collapsing the entire class of attempts at building the bumpy landscape.

Interesting post :) I'm intuitively a little skeptical - let me try to figure out why.

I think I buy that some reasoning process could consistently decide to hack in a robust way. But there are probably parts of that reasoning process that are still somewhat susceptible to being changed by gradient descent. In particular, hacking relies on the agent knowing what its current mesa-objective is - but that requires some type of introspective access, which may be difficult, and is the type of thing that could be hindered by gradient descent (especially when you're working in a very high-dimensional space!).

The more general point is that the agent doesn't just need to decide to hack in a way that's robust to gradient descent, it has to also have all of its decisions about how to hack (e.g. figuring out where it is, and which Schelling point to choose) be robust to gradient descent. And that seems much harder. The type of thing I imagine happening is gradient descent pushing the agent towards a mesa-objective which intrinsically disfavours gradient hacking, in a way which the agent has trouble noticing.

Of course my argument fails when the agent has access to external memory - indeed, it can just write down a Schelling point for future versions of itself to converge to. So I'm wondering whether it's worth focusing on that over the memoryless case (even though the latter has other nice properties), at least to flesh out an initial very compelling example.

I suspect most mesa optimizers generated by SGD will have objectives and implementations that are amenable to being modified by SGD. Mesa optimizers form because they give good performance on the base objective. A mesa optimizer whose mesa objective and inner optimization algorithm are easily updated by SGD is better for performance than a mesa optimizer that's harder to update.

Gradient hacking (especially from first principles) seems pretty complex. The first version of the mesa optimizer SGD generates almost surely won’t gradient hack. From there, “do what the base optimizer wants” is far simpler and leads to far better loss than “run a complex strategy that increases loss”. To deal with cases where the mesa objective conflicts significantly with the base objective, I expect the base optimizer to build robust control structures into the mesa optimizer and that the mesa optimizer will usually not challenge those structures.

Consider that humans are surprisingly close to being mesa aligned with evolution. We may not intrinsically value inclusive genetic fitness, but our sex drive is still pretty effective at getting us to reproduce. Note that evolution takes one step for every ~10s of billions of steps the brain takes, that evolution uses a far less efficient learning algorithm than the brain, and that evolution can’t even directly update brain parameters.

Finally, I think we can deal with the form of gradient hacking you describe (and most other forms of gradient hacking) by using meta learning (see the learn2learn repository). The basic problem of gradient hacking is that there’s some model parameter configuration that leads to poor SGD performance. Meta learning addresses exactly that issue.

Currently, meta learning research tries to find some parameter initialization that leads to quick learning on many tasks. The idea is that you can set the improvement from SGD as an optimization objective and use meta-gradients to get model parameters that are easy for the base optimizer to optimize. However, if the current parameterization of the mesa optimizer is preventing the base optimizer from learning, we should be able to use meta learning to find a mesa optimizer parameterization that’s more amenable to modification by the base optimizer.
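
For concreteness, here's a minimal MAML-flavored sketch of that idea (my own code, not the learn2learn API; the toy model and names are hypothetical): take "loss after one differentiable inner SGD step" as the meta-objective, so the meta-gradients favor parameters from which ordinary SGD actually makes progress.

```python
# Minimal MAML-flavored sketch (not the learn2learn API; toy model is hypothetical):
# meta-optimize "loss after one inner SGD step", so the meta-gradients favor
# parameters that the base optimizer can actually improve.
import torch

def loss_after_inner_step(params, data, loss_fn, inner_lr=0.1):
    """Differentiable loss obtained after taking one SGD step from `params`."""
    loss = loss_fn(params, data)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    stepped = [p - inner_lr * g for p, g in zip(params, grads)]
    return loss_fn(stepped, data)

def loss_fn(params, data):
    # Toy linear model; in the scenario above, `w` stands in for parameters that a
    # gradient-hacking mesa-optimizer has arranged to be hard for SGD to update.
    (w,) = params
    x, y = data
    return ((x @ w - y) ** 2).mean()

w = torch.randn(3, requires_grad=True)
data = (torch.randn(64, 3), torch.randn(64))
meta_opt = torch.optim.Adam([w], lr=1e-2)

for _ in range(200):
    meta_opt.zero_grad()
    meta_loss = loss_after_inner_step([w], data, loss_fn)
    meta_loss.backward()  # second-order gradients flow back through the inner step
    meta_opt.step()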

The downside is that now you have to think about a base optimizer, a mesa optimizer and a meta optimizer (and maybe a meta meta optimizer if you’re feeling adventurous).

> Currently, meta learning research tries to find some parameter initialization that leads to quick learning on many tasks. The idea is that you can set the improvement from SGD as an optimization objective and use meta-gradients to get model parameters that are easy for the base optimizer to optimize.

Just a small nitpick - what you describe (meta learning a param init) is certainly a meta-learning technique, but that term is broader and also encompasses actually learning better optimizers (although I guess you could make the two more equivalent by burning the SGD/update circuitry into the model).