This is a special post for quick takes by ojorgensen.



Problem: we want to make it hard for ML systems (trained via SGD) to perform naive gradient hacking. By naive gradient hacking, I mean "being able to keep some weights of the network constant for an arbitrary step of SGD".

Solution: use "stochastic" regularisation, i.e. randomly sample the strength of regularisation at each step (we could use a quantum source if we want true randomness). This seems like it should make naive gradient hacking almost impossible: to keep some target weights unchanged, the network would have to match their positive contribution to the loss gradient to the degree of regularisation. If the degree of regularisation is stochastic, that loss contribution would also have to be stochastic, which is impossible for a deterministic NN!
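A minimal sketch of the idea, assuming L2 weight decay as the regulariser and a per-step coefficient drawn uniformly at random (the function name, ranges, and toy weights are all illustrative, not a real training setup):

```python
import random

def sgd_step_stochastic_reg(w, grad_task, lr=0.1, reg_low=0.0, reg_high=0.2,
                            rng=random):
    """One SGD step with a randomly sampled L2 regularisation strength.

    lam is drawn fresh each step, so a deterministic network cannot
    precompute a loss contribution that exactly cancels the weight-decay
    pull and holds a target weight fixed.
    """
    lam = rng.uniform(reg_low, reg_high)  # stochastic regularisation strength
    # gradient of task loss plus gradient of lam * w_i^2, i.e. 2 * lam * w_i
    return [wi - lr * (gi + 2 * lam * wi) for wi, gi in zip(w, grad_task)]

# Toy example: even if the "gradient hacker" arranges a zero task gradient
# on the weights it wants to freeze, the random decay still moves them.
rng = random.Random(0)  # seeded only to make the example reproducible
w = [1.0, -0.5]
for _ in range(5):
    w = sgd_step_stochastic_reg(w, grad_task=[0.0, 0.0], rng=rng)
```

With a fixed regularisation coefficient, a zero task gradient plus a matched counter-term could keep `w` constant; here the weights shrink by a different random factor each step.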

How useful this is for preventing gradient hacking in general depends on how stable the loss landscape is around a "deceptive / gradient-hacking" minimum. It seems possible that the surrounding loss landscape could be pretty unstable under random perturbations?