I think one common intuition when thinking about gradient descent (GD) is to think about it as more efficient genetic algorithms (GAs). I certainly used to think this way—after all, taking a bunch of random mutations and keeping the ones that work well sounds basically just like a slower version of finding the steepest direction in the first place. GAs are also much more intuitive than GD, so it's tempting to think about GD in terms of them. Unfortunately, there are some instances where they behave completely differently, which is really relevant for thinking about things like gradient hacking. This post provides one such example.

## Redundancy and Submodule Protection

Imagine you're given a network that has two identical submodules, and some kind of combining function where if it detects the outputs from both submodules are the same it passes the value through but if they're different it explodes and outputs zero or something totally random, or whatever. This is a natural idea to come up with if your goal is to ensure the optimizer doesn't mess with these modules, for example, if you're trying to protect the mesaobjective encoding for gradient hacking.

It's easy to convince yourself that this is effective against GAs: since mutations are inserted at random, the probability that both modules are mutated the same way is very small, and so any model that changes either of these modules will do very poorly since the output will be totally useless.

GD on the other hand will happily change both modules at the same time, since the direction of steepest descent is the one where both submodules change at the same time. This is even despite the fact that partial derivatives are taken assuming all other parameters stay the same, because partial derivatives don't care about the cliff just epsilon away but only the slope at the exact point where both modules are the same. Another way of thinking about it is the gradient necessarily points in the direction of steepest ascent, even if that direction isn't axis-aligned. If you're still not convinced, see this comment for a proof that there exists no combining function (under modest constraints) that can protect submodules this way.

## Takeaways

The main takeaway from this is that gradient descent is really unintuitive sometimes. When in doubt, it's probably a good idea to work out the gradient by hand.

First, I love this example. Second, I think it's wrong.

It's true in idealized mathematical terms - if you're right at the point where the submodules agree, then the gradient will be along the continue-to-agree direction. But that's not what matters

numerically- i.e. it's not what matters in actual practice. Numerically, the speed of (non-stochastic) gradient descent is controlled by the local condition number, and the condition number for this two-module example would be enormous - meaning that gradient descent will move along the submodules' parameter space extremely slowly. Unless the surface along which the submodules match is perfectly linear (and the parameters are already exactly on that surface), every step will take it just a little bit off the surface, so it will end up taking extremely small steps (so that it's always within numerical precision of the surface).In practice, I'd think that numerically the combining function wouldn't be perfectly sharp but rather have some epsilon error tolerance due to itself being subject to floating point numerics too. My mental image for this example would then be some 2*epsilon wide surface centered on the line where both are equal, and it shouldn't stray from the center of the surface by a symmetry argument.

Huh. I was interpreting it differently then - if I was building a module-checker to keep an eye out for AI tampering, I would

not feed the result of the checker back into the gradient signal.The big difference, parroting Steve ( https://www.lesswrong.com/posts/ey7jACdF4j6GrQLrG/thoughts-on-safety-in-predictive-learning , I think) is that gradient descent doesn't try things out and then keep what works, it

modelschanges and does what is good in the model.Sorry if this wasn't clear in the post but when I say "we're" trying to protect some submodule, I don't mean us as the engineers who want to make sure the model doesn't change a submodule (we could do that trivially, just add a stop gradient), but rather from the perspective of a gradient hacker that needs to protect its own logic from being disabled by SGD.

Ah, that makes sense.

Yup, that's roughly what I was picturing. (Really I was picturing a smooth approximation of that, but the conclusion is the same regardless.)

In general, "shouldn't stray from the center of the surface by a symmetry argument" definitely should not work for GD in practice - either because numerical noise knocks us off the line-where-both-are-equal, or because the line-where-both-are-equal itself curves.

So, unless the line-where-both-are-equal is perfectly linear

andthe numerics are perfectly symmetric all the way to the lowest bits, GD will need to take steps of size ~epsilon to stay near the center of the surface.There should be a fair bit more than 2 epsilon of leeway in the line of equality. Since the submodules themselves are learned by SGD, they won’t be exactly equal. Most likely, the model will include dropout as well. Thus, the signals sent to the combining function are almost always more different than the limits of numerical precision allow. This mean the combining function will need quite a bit of leeway, otherwise the network’s performance is just zero always.

I suspect the thesis is true and can be valuable to appreciate, but I'm left feeling that this doesn't explore the thesis nearly as much as I want. You give one example where there might be a divergence, but I would be more satisfied if you provided more examples where they diverge. I also found myself wanting to read more thoughts on cases where the difference can impact things in a way that I would care about. Where would descent be more useful, and when would there be benefits from using evolution? I think there are examples of both, but this post doesn't touch on those.

I think this is a wrong interpretation of the idea that I described in this comment (which your linked comment here is a reply to). There need not be a "dedicated" piece of logic that does nothing other than checking whether the outputs from two subnetworks satisfy some condition and making the model "fail" otherwise. Having such a dedicated piece of logic wouldn't work because SGD would just remove it. Instead, suppose that the model depends on two different computations, X and Y, for the purpose of minimizing its loss. Now suppose there are two malicious pieces of logic, one within the subnetwork that computes X and one within the subnetwork that computes Y. If a certain condition about the input of the entire network is satisfied, the malicious logic pieces make both X and Y fail. Albeit doing so, the gradient components of the weights that are associated with the malicious pieces of logic are close to zero (putting aside regularization), because changing any single weight has almost no effect on the loss.

My formulation is broad enough that it doesn't have to be a dedicated piece of logic, there just has to be some way of looking at the reset of the network that depends on X and Y being the same.

This is what I take issue with - if there is a way to change both components simultaneously to have an effect on the loss, SGD will happily do that.

But X and Y are not the same! For example, if the model is intended to classify images of animals, the computation X may correspond to [how many legs does the animal have?] and Y may correspond to [how large is the animal?]

This seems to me wrong. SGD updates the weights in the direction of the gradient, and if changing a given weight alone does not affect the loss then the gradient component that is associated with that weight will be 0 and thus SGD will not change that weight.

If the partial derivative wrt two different parameters is zero, i.e ∂f∂θ1=∂f∂θ2=0, then it must be that changing both simultaneously does not change the loss either (to be precise, limh→0f(x+h(θ1+θ2))h=0).

I don't see how this is relevant here. If it is the case that changing only w1 does not affect the loss, and changing only w2 does not affect the loss, then SGD would not change them (their gradient components will be zero), even if changing them both can affect the loss.

It's relevant because it demonstrates that in differentiable functions, if it is the case that changing only w1 does not affect the loss, and changing only w2 does not affect the loss, then it is not possible that changing them both can affect the loss either.