Lucius Bushnaq

AI Safety researcher, physics PhD student

I don't typically imagine gradient hacking to be about mesa optimisers protecting themselves from erasure. Mesa optimisers are good at things. If you want to score well on a hard loss function involving diverse tricky problems, a mesa optimiser is often a great way of doing that. I do not think they would typically need to protect their optimisation circuitry from gradient descent.

Two prototypical examples of gradient hacking as I imagine it in my head are:

  1. I have a terminal goal that doesn’t care about things related to scoring well on the loss function. But I’ll try to score well anyway, that way the parameters that constitute my goal will be read by gradient descent to be optimal the way they are and not changed.
  2. If I want to change a specific parameter  over there, causing it to take the specific value , then provided I receive signals  from which I can extrapolate the state of , I can do that by abusing gradient descent. I commit to making a small error while producing outputs. This error is calculated in steps which are insensitive to small parameter changes in myself.  In other words, the error creation has a broad behavioural manifold in parameter space. When  precisely cancels my error. Otherwise, the result will be off. 
    So the quickest, most visible-to-local-first-order-optimisation way to improve the loss is for gradient descent to set .  That is, provided changing me to no longer employ this strategy requires second-order parameter interactions invisible to first-order optimisation, or even just more than one update step, which seems pretty plausible to me, considering how complicated a smart mesa optimiser probably is compared to a tiny  signal circuit.

I’m not saying that I know for sure that something like 2. is definitely possible or easy. But I don’t see how this post shows that it isn’t. So long as changing the mesa-optimiser to cease its shenanigans takes awareness of second derivatives, and the desired parameter changes can impact the loss through first derivatives, it currently looks quite doable to me.

I'm confused about this.

Say our points  are the times of day measured by a clock.  And  are the temperatures measured by a thermometer at those times. We’re putting in times  in the early morning, where I decree temperature to increase roughly linearly as the sun rises.

You write the overparametrized regression model as  . Since our model doesn’t get to see the index, only the value of  itself, that has to implicitly be something like 

Where  is the regression or NN output. So our model learned the slope, plus a lookup table for the noise values of the thermometer at those times in the training data set. That means that if the training set included the time , and the model encounters a temperature taken at  outside training again, now from a different day, it will output .

Which is predictably wrong, and you can do better by not having that memorised  noise term.

The model doesn’t get to make a general model plus a lookup table of noises in training to get perfect loss and then use only the general model outside of training. It can’t switch the lookup table off.
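To make the thermometer example concrete, here is a minimal numerical sketch. Everything in it is an assumption chosen for illustration: the slope, intercept, noise level, number of time points, and the variable names are all hypothetical. It compares a plain least-squares line against the same line plus a lookup table of the training noise, evaluated on fresh days at the same clock times.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: temperature rises linearly in the early morning,
# and the thermometer adds independent noise to every reading.
true_slope, true_intercept, noise_std = 2.0, 5.0, 0.5
times = np.linspace(5.0, 8.0, 20)  # clock readings in hours
truth = true_intercept + true_slope * times

# One training day.
train_temps = truth + rng.normal(0.0, noise_std, times.shape)

# Simple model: a fitted line only.
slope, intercept = np.polyfit(times, train_temps, deg=1)
line = intercept + slope * times

# Overparametrised model: the same line plus a lookup table holding the
# training residual at each time. It reproduces the training data exactly.
lookup = train_temps - line
memorised = line + lookup

# Many new days, measured at the same times, each with fresh noise.
n_days = 2000
new_days = truth + rng.normal(0.0, noise_std, (n_days, times.size))

mse_line = np.mean((new_days - line) ** 2)
mse_memorised = np.mean((new_days - memorised) ** 2)

print(f"line only:         {mse_line:.3f}")
print(f"line plus lookup:  {mse_memorised:.3f}")
```

In this toy setup the lookup table cannot be switched off at test time, so the memorised model carries the training day's noise into every later prediction and ends up with the larger error on new days.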

Put differently, if there are patterns in the data that the model cannot possibly make a decent simple generative mechanism for, fitting those patterns to get a better loss doesn’t seem like the right thing to do.

Put yet another way, if you're forced to pick one single hypothesis to make predictions, the best one to pick doesn't necessarily come from the set  that perfectly fits all past data.

The other risk that could motivate not making this bet is the risk that the market – for some unspecified reason – never has a chance to correct, because (1) transformative AI ends up unaligned and (2) humanity’s conversion into paperclips occurs overnight. This would prevent the market from ever “waking up”.

You don't even need to expect it to occur overnight. It's enough for the market update to predictably occur so late that having lots of money available at that point is no longer useful. If AGI ends the world next week, there's not that much a hundred billion dollar influx can actually gain you. Any operation that actually stops the apocalypse from happening probably needs longer than that to set up, regardless of any additional funds available. And everything else you can do with a hundred billion dollars in that week is just not that fun or useful.

And if you do get longer than that, if we have an entire month somehow, a large fraction of market participants believing the apocalypse is coming would be the relevant part anyway. That many people taking the problem actually seriously might maybe be enough to put the brakes on things through political channels. For a little bit at least. Monetary concerns seem secondary to me in comparison.

I guess a short position by EA aligned people, just in case this edge scenario happens and a competent institutional actor with a hundred billion dollars is needed now now now, could maybe make some sense?

Interested, but depends on the cost. If I'm the only one who wants it, I'd be willing to pay $30 to get the whole series, but probably not more. I don't know how long transcriptions usually take, but I'm guessing it'd certainly be >1h. So there'd need to be additional interest to make it worth it.

Epistemic status: sleep deprived musings

If I understand this right, this is starting to sound very testable.

Feed a neural network inputs consisting of variables . Configurations in a 2D Ising model, cat pictures, or anything else we humans think we know the latent variables for. 

Train neural networks to output a set of variables  over the inputs. The loss function scores based on how much the output induces conditional independence of inputs over the training data set. 

E.g., take the  divergence between and . Then, penalise  with a higher information content through a regularisation term. E.g. the  divergence between  and .[1]
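As a rough sketch of what the first term could look like for a tiny discrete system: I'm assuming here that the divergence is a KL divergence, that there are only two input variables, and that the full joint table over inputs and latent is available. The function name and the toy distribution are hypothetical, and the regularisation term on the latent's information content is left out.

```python
import numpy as np

def conditional_independence_gap(joint):
    """Expected KL between P(x1, x2 | lam) and P(x1 | lam) P(x2 | lam).

    `joint` is a table indexed as joint[x1, x2, lam]. The result is zero
    exactly when x1 and x2 are independent given lam.
    """
    joint = joint / joint.sum()
    p_lam = joint.sum(axis=(0, 1))   # P(lam)
    p_x1_lam = joint.sum(axis=1)     # P(x1, lam)
    p_x2_lam = joint.sum(axis=0)     # P(x2, lam)
    # P(x1 | lam) P(x2 | lam) P(lam), written with joint marginals.
    safe_p_lam = np.where(p_lam > 0, p_lam, 1.0)
    factorised = p_x1_lam[:, None, :] * p_x2_lam[None, :, :] / safe_p_lam
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / factorised[mask])))

# Toy system: a hidden fair bit h, and two inputs that each copy h
# but flip independently with probability 0.1.
flip = 0.1
cond = np.array([[1 - flip, flip], [flip, 1 - flip]])  # cond[x, h] = P(x | h)

# Latent equal to the hidden bit: the inputs factorise given the latent.
good_latent = 0.5 * np.einsum("ah,bh->abh", cond, cond)

# Latent that is constant: nothing is screened off.
trivial_latent = good_latent.sum(axis=2, keepdims=True)

gap_good = conditional_independence_gap(good_latent)
gap_trivial = conditional_independence_gap(trivial_latent)
print(gap_good, gap_trivial)
```

A network's output would play the role of the latent axis here, with the table estimated from the training set rather than written down by hand.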

Then, you can check if the solutions found match the ones other networks, or humans and human science, would give for that system.  Either by comparing  ,  or by looking at  directly.  

You can also train a second network to reconstruct  from the latents and see what comes out.

You might also be able to take a network stitched together from the latent generation and a latent read-out network, and see how well it does on various tasks over the dataset. Image labelling, calculating the topological charge of field configurations, etc. Then compare that to a generic network trained to solve these tasks.

If the hypothesis holds strongly, generic networks of sufficient generality go through the same process when they propose solutions. Just with less transparency. So you’d expect the stitched and conventional networks to score similarly. 

My personal prediction would be that you'd usually need to require solving a few different tasks on the data for that to occur, otherwise the network doesn't need to understand all the abstractions in the system to get the answer, and can get away with learning fewer latents.

I think we kind of informally do a lot of this already when we train an image classifier and then Anthropic opens up the network to see e.g. the dog-head-detection function/variable in it. But this seems like a much cleaner, more secure and well defined cooking formula for finding latents to me, which may be rote implementable for any system or problem. 

Unless this is already a thing in Interpretability code bases and I don’t know about it?

  1. ^

    I haven't checked the runtime on this yet, you'd need to cycle through the whole dataset once per loss function call to get the distributions. But it's probably at least doable for smaller problems, and for bigger datasets a stochastic sampling ought to be enough.

While our current understanding of physics is predictably-wrong, it has no particular reason to be wrong in a way that is convenient for us.[1]

Meanwhile, more refined versions of some of the methods described here seem perhaps doable in principle, with sufficient technology. 

You can make difficult things happen by trying hard at them. You can't violate the laws of physics by trying harder.


  1. ^

    Out of the many things that might be wrong about the current picture, impossibility of time travel is also one of the things I'd least expect to get overturned.

This paper offers a fairly intuitive explanation for why flatter minima generalize better: suppose the training and testing data have distinct, but nearby, minima that minimize their respective loss. Then, the curvature around the training minima acts as the second order term in a Taylor expansion that approximates the expected test loss for models nearby the training minima.
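One way to write that argument out, under a simplifying assumption that is mine and not necessarily the paper's: suppose the test loss is just the training loss shifted by a small vector $\delta$ in parameter space, and let $\theta^*$ be the training minimum with Hessian $H$.

```latex
\begin{align}
L_{\text{test}}(\theta) &= L_{\text{train}}(\theta - \delta) \\
L_{\text{test}}(\theta^*) &= L_{\text{train}}(\theta^* - \delta) \\
  &\approx L_{\text{train}}(\theta^*)
     - \delta^\top \nabla L_{\text{train}}(\theta^*)
     + \tfrac{1}{2}\, \delta^\top H \delta \\
  &= L_{\text{train}}(\theta^*) + \tfrac{1}{2}\, \delta^\top H \delta ,
  \qquad H = \nabla^2 L_{\text{train}}(\theta^*)
\end{align}
```

The gradient term drops out because $\theta^*$ is a minimum of the training loss, so the test penalty is set by the curvature $H$ together with the size of the shift $\delta$. The smallness of $\delta$ is an input to this expansion, not something it derives.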

I feel like this explanation is just restating the question. Why are the minima of the test and training data often close to each other? What makes reality be that way?

You can come up with some explanation involving mumble mumble fine-tuning, but I feel like that just leaves us where we started.

Our team has copied Lightcone's approach to communicating over Discord, and I've been very happy with it.

Aren't Standard Parametrisation and other parametrisations with a kernel limit commonly used mostly in cases where you're far away from reaching the depth-to-width≈0 limit, so expansions like the one derived for the NTK parametrisation aren't very predictive anymore, unless you calculate infeasibly many terms in the expensive perturbative series?

As far as I'm aware, when you're training really big models where the limit behaviour matters, you use parametrisations that don't get you too close to a kernel limit in the regime you're dealing with. Am I mistaken about that?

As for NTK being more predictable and therefore safer, it was my impression that it's more predictive the closer you are to the kernel limit, that is, the further away you are from doing the kind of representational learning AI Safety researchers like me are worried about. As I leave that limit behind, I've got to take into account ever higher order terms in your expansion, as I understand it. To me, that seems like the system is just getting more predictive in proportion to how much I'm crippling its learning capabilities.

Yes, of course NTK parametrisation and other parametrisations with a kernel limit can still learn features at finite width, I never doubted that. But it generally seems like adding more parameters means your system should work better, not worse, and if it's not doing that, it seems like the default assumption should be that you're screwing up. If it was the case that there's no parametrisation in which you can avoid converging to a trivial limit as you heap on more parameters onto the width of an MLP, that would be one thing, and I think it'd mean we'd have learned something fundamental and significant about MLP architectures. But if it's only a certain class of parametrisations, and other parametrisations seem to deal with you piling on more parameters just fine, both in theory and in practice, my conclusion would be that what you're seeing is just a result of choosing a parametrisation that doesn't handle your limit gracefully. Specifically, as I understood it, Standard parametrisation for example just doesn't let enough gradient reach the layers before the final layer if the network gets too large. As the network keeps getting wider, those layers are increasingly starved of updates until they just stop doing anything altogether, resulting in you training what's basically a one layer network in disguise. So you get a kernel limit.

TL;DR: Sure you can use NTK parametrisation for things, but it's my impression that it does a good job precisely in those cases where you stay far away from the depth-to-width≈0 limit regime in which the perturbative expansion is a useful description.

The book's results hold for a specific kind of neural network training parameterisation, the "NTK parametrisation", which has been argued (convincingly, to me) to be rather suboptimal. With different parametrisation schemes, neural networks learn features even in the infinite width limit. 

You can show that neural network parametrisations can essentially be classified into those that will learn features in the infinite width limit, and those that will converge to some trivial kernel. One can then derive a "maximal update parametrisation", in which infinite width networks actually train better than finite width networks (including finite width networks using NTK parametrisation, IIRC). 

Maximal update parametrisation also allows you to derive correct hyperparameters for a model. You can just find the right hyperparameters for a small model, which is far less costly, then scale up. If you used maximal update parametrisation, the results will still be right for the large model.

So I wouldn't use these results as evidence for or against any scaling hypotheses, and I somewhat doubt they make a good jumping off point for figuring out how to disentangle a neural network into parts. I don't think you're really doing anything more fundamental than investigating how a certain incorrect choice of parameter scaling harms you more the smaller you make the depth/width ratio of your network here.

For more on this, I recommend the Tensor Programs papers and related talks by Greg Yang.

TL;DR: I think these results only really concern networks trained in a way that's shooting yourself in the foot. 
