Bias towards simple functions; application to alignment?

2Charlie Steiner

1DavidHolmes

2Thomas Kwa

5evhub

3DavidHolmes

1DavidHolmes

1DavidHolmes

New Comment

First problem to come to mind : subagent stability (sometimes it's simple to build a new agent that does your task, even if it's complicated to explain what that agent would do).

Hi Charlie,
If you can give a short (precise) description for an agent that does the task, then you have written a short programme that solves the task. I think then if you need more space to ‘explain what the agent would do’ then you are saying there also exists a less efficient/compact way to specify the solution. From this perspective I *think* the latter is then not so relevant.
David

The upper bound for Kolmogorov complexity of an input-output map in Dingle is not very interesting; iirc it's basically saying that a program can be constructed of the form "specify this exact input-output map using the definition already provided", and the upper bound is just the length of this program.

Also one concern is it's not clear thinking about this differentially advances alignment over capabilities.

I think it very clearly advantages alignment over capabilities—understanding SGD's inductive biases is one of the primary bottlenecks for inner alignment imo.

The stuff linked here is pretty old, though, e.g. this stuff predates Mingard et al..

P.s. the main thing I have taken so far from the link you posted is that the important part is not exactly about the biases of SGD. Rather, it is about the structure of the DNN itself; the algorithm used to find a (local) optimum plays less of a role than the overall structure. But probably I’m reading too much into your precise phrasing.

Hi Thomas, I agree the proof of the bound is not so interesting. What I found more interesting were the examples and discussion suggesting that, in practise, the upper bound seems often to be somewhat tight.

Concerning differential advancement, I agree this can advance capabilities, but I suspect that advancing alignment is somewhat hopeless unless we can understand better what is going on inside DNNs. On that basis I think it does differentials advance alignment, but of course other people may disagree.

## Summary

Deep neural networks (DNNs) are generally used with large numbers of parameters relative to the number of given data-points, so that the solutions they output are far from uniquely determined. How do DNNs 'choose' what solution to output? Some fairly recent papers (

^{[1]},^{[2]}) seem to suggest that DNNs are biased towards outputting solutions which are 'simple', in the sense of Kolmogorov complexity (minimum message length complexity) (^{[3]}). It seems to me that this might be relevant for questions around AI alignment. I've not thought this idea through in detail; rather, I'm writing this post to ask whether people are aware of work in this direction (or see obvious reasons why it is a terrible idea) before I devote substantial time to it.## Example

Setting: some years in the future. I instruct Siri: "tell me a story". Here are two things Siri could do:

I would much prefer Siri to choose option 1. There are lots of ways I might go about improving my request, or improving Siri, to make option 1 more likely. But I claim that, assuming Siri is still based on the same basic architecture as we are currently familiar with (and probably even if not), it is more likely to choose option 1 because it is

simpler. Slightly more formally, because explicit instructions on how to carry out option 1 can be written out in many fewer characters than those required for option 2.Now making this claim precise is hard, and if one looks too closely it tends to break down. For example, the instruction string 'tell me a story' could lead to option 1 or option 2 without changing its length. More generally, measures of complexity tend to be 'up to a constant' with the constant depending on the programming language, and if the language is 'natural language that Siri parses' then I am exactly back where I started. But I suspect this is not an unsolvable problem; one could for example think about the length of message that Siri has to write to instruct its sub-agents/kidnapping drones.

## Slightly more formal statement

I'll try to translate from the general setup of (

Siri:X→Y,^{[2:1]}) to the example above with Siri, to make things more concrete, but I remain close to their notation. All errors lie in the translation, etc. etc. We have the set Xof finite sequences of (english) words and the set Y actions Siri can take. The trained model Siri is a functionresponding to my requests by taking some action.

How was Siri created? We started with a space of parameters Θ⊆Rp for some (large) p, and a 'parameter-function' map

M:Θ→Hom(X,Y)which is given by some kind of DNN. We also had a bunch of training data

D⊂X×Yconsisting of a pair of an input and a 'good' resulting action (to simplify, let's assume each input x∈X occurs at most once in D).

Now, there will probably exist very many values θ∈Θ of the parameters such that $M(\theta)(x) = (y) $ for all (x,y)∈D; let's call such a parameter θ

x=David says: tell me a storyfitted toD. And, assuming thatdoes not appear as the first coordinate of a point in D, we can cook up values θ1, θ2∈Θ both fitted to D and such that

M(θ1)(David says: tell me a story)=copy from internetand

M(θ2)(David says: tell me a story)=kidnap an author.My key prediction is that, while θ1 and θ2 are both equally valid solutions to our optimisation problem, the output of the DNN strongly favours θ1 because of its simplicity.

I'm not going to try to justify this here; I need to think about it more first (and read (

^{[3:1]}) more carefully). But I do think it is at least plausible.## References

Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv:1706.10239 ↩︎

Valle-Pérez, Camargo, Louis, arXiv:1805.08522v5 ↩︎ ↩︎

Kamaludin Dingle, Chico Q Camargo, and Ard A Louis. Input–output maps are strongly biased towards simple outputs. Nature communications, 9(1):761, 2018. ↩︎ ↩︎