# 3

### Summary

Deep neural networks (DNNs) are generally used with large numbers of parameters relative to the number of given data-points, so that the solutions they output are far from uniquely determined. How do DNNs 'choose' what solution to output? Some fairly recent papers ([1], [2]) seem to suggest that DNNs are biased towards outputting solutions which are 'simple', in the sense of Kolmogorov complexity (minimum message length complexity) ([3]). It seems to me that this might be relevant for questions around AI alignment. I've not thought this idea through in detail; rather, I'm writing this post to ask whether people are aware of work in this direction (or see obvious reasons why it is a terrible idea) before I devote substantial time to it.

### Example

Setting: some years in the future. I instruct Siri: "tell me a story". Here are two things Siri could do:

1. copy-paste an existing story from somewhere on the internet.
2. kidnap a great author and force them to write a story (by now Siri is smart enough to do this);

I would much prefer Siri to choose option 1. There are lots of ways I might go about improving my request, or improving Siri, to make option 1 more likely. But I claim that, assuming Siri is still based on the same basic architecture as we are currently familiar with (and probably even if not), it is more likely to choose option 1 because it is simpler. Slightly more formally, because explicit instructions on how to carry out option 1 can be written out in many fewer characters than those required for option 2.

Now making this claim precise is hard, and if one looks too closely it tends to break down. For example, the instruction string 'tell me a story' could lead to option 1 or option 2 without changing its length. More generally, measures of complexity tend to be 'up to a constant' with the constant depending on the programming language, and if the language is 'natural language that Siri parses' then I am exactly back where I started. But I suspect this is not an unsolvable problem; one could for example think about the length of message that Siri has to write to instruct its sub-agents/kidnapping drones.

### Slightly more formal statement

I'll try to translate from the general setup of ([2:1]) to the example above with Siri, to make things more concrete, but I remain close to their notation. All errors lie in the translation, etc. etc. We have the set of finite sequences of (english) words and the set  actions Siri can take. The trained model Siri is a function

responding to my requests by taking some action.

How was Siri created? We started with a space of parameters  for some (large) p, and a 'parameter-function' map

which is given by some kind of DNN. We also had a bunch of training data

consisting of a pair of an input and a 'good' resulting action (to simplify, let's assume each input  occurs at most once in ).

Now, there will probably exist very many values  of the parameters such that $M(\theta)(x) = (y)$ for all ; let's call such a parameter  fitted to . And, assuming that

does not appear as the first coordinate of a point in , we can cook up values  both fitted to  and such that

and

My key prediction is that, while  and  are both equally valid solutions to our optimisation problem, the output of the DNN strongly favours  because of its simplicity.

I'm not going to try to justify this here; I need to think about it more first (and read ([3:1]) more carefully). But I do think it is at least plausible.

### References

Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv:1706.10239 ↩︎

Valle-Pérez, Camargo, Louis, arXiv:1805.08522v5 ↩︎ ↩︎

Kamaludin Dingle, Chico Q Camargo, and Ard A Louis. Input–output maps are strongly biased towards simple outputs. Nature communications, 9(1):761, 2018. ↩︎ ↩︎

# 3

New Comment

First problem to come to mind : subagent stability (sometimes it's simple to build a new agent that does your task, even if it's complicated to explain what that agent would do).

Hi Charlie, If you can give a short (precise) description for an agent that does the task, then you have written a short programme that solves the task. I think then if you need more space to ‘explain what the agent would do’ then you are saying there also exists a less efficient/compact way to specify the solution. From this perspective I think the latter is then not so relevant. David

The upper bound for Kolmogorov complexity of an input-output map in Dingle is not very interesting; iirc it's basically saying that a program can be constructed of the form "specify this exact input-output map using the definition already provided", and the upper bound is just the length of this program.