Inner Alignment: Explain like I'm 12 Edition
(This is an unofficial explanation of Inner Alignment based on the Miri paper Risks from Learned Optimization in Advanced Machine Learning Systems (which is almost identical to the LW sequence) and the Future of Life podcast with Evan Hubinger (Miri/LW). It's meant for anyone who found the sequence too long/challenging/technical to read.)

Note that bold and italics means "this is a new term I'm introducing," whereas underline plus italics is used for emphasis.

What is Inner Alignment?

Let's start with an abridged guide to how Deep Learning works:

1. Choose a problem
2. Decide on a space of possible solutions
3. Find a good solution from that space

If the problem is "find a tool that can look at any image and decide whether or not it contains a cat," then each conceivable set of rules for answering this question (formally, each function from the set of all pixels to the set {yes, no}) defines one solution. We call each such solution a model. The space of possible models is depicted below.

Since that's all possible models, most of them are utter nonsense. Pick a random one, and you're as likely to end up with a car-recognizer as a cat-recognizer – but far more likely to end up with an algorithm that does nothing we can interpret. Note that even the examples I annotated aren't typical – most models would be more complex while still doing nothing related to cats. Nonetheless, somewhere in there is a model that would do a decent job on our problem. In the above, that's the one that says, "I look for cats."

How does ML find such a model? One way that does not work is trying out all of them. That's because the space is too large: it might contain over 10^1,000,000 candidates. Instead, there's this thing called Stochastic Gradient Descent (SGD). Here's how it works: SGD begins with some (probably terrible) model and then proceeds in steps. In each step, it switches to another model that is "close" and hopefully a little better. Eventually, it stops and outputs the most recent model.
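To make that picture a little more concrete, here is a minimal sketch of the SGD loop in Python. It is not how real deep learning libraries implement it: the "model" is a single number w, the toy task (recover the rule y = 3x from noisy examples) and the helper names predict, loss, and gradient are all made up for illustration, not taken from the paper or any library.

```python
import random

# A "model" here is just one number w; its prediction for input x is w * x.
# Real models have millions of parameters, but the step-by-step idea is the same.

def predict(w, x):
    return w * x

def loss(w, x, y):
    """How badly model w does on one example (lower is better)."""
    return (predict(w, x) - y) ** 2

def gradient(w, x, y):
    """Direction in which the loss grows; SGD moves the opposite way."""
    return 2 * (predict(w, x) - y) * x

# Training data generated from the "true" rule y = 3x, plus a little noise.
xs = [random.uniform(-1, 1) for _ in range(200)]
data = [(x, 3 * x + random.gauss(0, 0.1)) for x in xs]

w = random.uniform(-10, 10)   # start with some (probably terrible) model
learning_rate = 0.1

for step in range(1000):
    x, y = random.choice(data)                    # "stochastic": look at one example
    w = w - learning_rate * gradient(w, x, y)     # switch to a nearby, slightly better model

print(f"learned w = {w:.3f} (the data was generated with w = 3)")
```

The loop is the whole story: start somewhere, repeatedly nudge the model toward something nearby that does a bit better on the data, and output whatever you end up with.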

