Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a deep learning curriculum with a focus on topics relevant to large language model alignment. It is centered around papers and exercises, and is biased towards my own tastes.

It is targeted at people who are relatively new to deep learning but already familiar with the basics. Even so, I expect it to be pretty challenging for most such people, especially without mentorship, so I have included some advice about prerequisites and more accessible alternatives.

3 comments

Was definitely expecting this to be about curriculum learning :P

For the section on interpretability: I divide modern-day interpretability (I'm less clear on the more ambitious future versions) into several prongs, and am curious what you think:

  1. Making sense of single neurons (if those neurons are output neurons, this is about making sense of the outputs, but the tools are the same as for intermediate layers). Like in Understanding RL Vision, DeepDream, or Anthropic's attribution visualizations for attention heads. This can take many forms, from generative visualization, to finding the stimuli in the dataset that most activate a neuron, to attributing activation within an input image/string/vector (a minimal sketch of the dataset-stimuli version appears after this list), but the key point is that you start by picking a neuron and then ask "what does this mean?"
  2. Circuit-level understanding. This might be more like item 1.5, but it's tricky enough to merit its own spot: figuring out entire algorithms used by the NN in human-comprehensible terms. Low-level visual features in CNNs are the best example of this; induction heads too (kinda). Often this starts with interpreting some of the individual neurons involved and then trying to make sense of how they're used, but you can also use tools specifically aimed at finding circuits that might be naturally sparse. I think there's also unexplored room for tools that take a suspected circuit and visualize what difference it makes to higher-level activations in response to inputs.
  3. Starting with a concept, and then trying to find its representation in the NN. This is done in Acquisition of Chess Knowledge in AlphaZero and the ROME paper. If you know that a chess agent is going to learn to detect when it's in check, you can train a simple classifier from the internal state of the NN to whether or not it's in check, and then do attribution on that classifier to see how the network represents being in check, and how "reliably" it "knows" that (a minimal probe along these lines is sketched after this list).
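
To make item 1's "pick a neuron, then ask what it means" workflow concrete, here is a minimal sketch of the dataset-stimuli version. It assumes a PyTorch model whose chosen layer puts the neuron index along its last dimension (e.g. a transformer MLP or residual activation); the function name and setup are illustrative, not from any particular library.

```python
# Minimal sketch (assumed setup, not from any specific codebase): find the
# dataset examples that most strongly activate one chosen neuron, using a
# forward hook on an intermediate layer.
import torch

def top_activating_examples(model, layer, dataset, neuron_idx, k=10):
    """Return (index, score) pairs for the k examples that most activate `neuron_idx` in `layer`."""
    recorded = []

    def hook(module, inputs, output):
        # Assumes the layer outputs a single tensor with the neuron index on the last dim.
        acts = output.detach()[..., neuron_idx]
        # Average over any remaining (spatial / sequence) positions per example.
        recorded.append(acts.reshape(acts.shape[0], -1).mean(dim=1))

    handle = layer.register_forward_hook(hook)
    scores = []
    with torch.no_grad():
        for x in dataset:          # assumes `dataset` yields single unbatched input tensors
            recorded.clear()
            model(x.unsqueeze(0))
            scores.append(recorded[0].item())
    handle.remove()

    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(i, scores[i]) for i in top]
```

Gradient-based attribution or gradient-ascent feature visualization slots into the same template: pick the neuron first, then apply whichever lens you like.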
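
And for item 3, a correspondingly minimal probing sketch, assuming you have already cached activations of shape [N, d_model] and binary concept labels (e.g. "in check" or not); the probe's weight vector is then a candidate direction for the concept, and attribution through the probe is just weights times activations.

```python
# Minimal sketch (assumed inputs): fit a linear probe from cached internal
# activations [N, d_model] to a binary concept label [N], e.g. "is in check".
import torch
import torch.nn as nn

def train_probe(activations, labels, epochs=200, lr=1e-2):
    probe = nn.Linear(activations.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(activations).squeeze(-1), labels.float())
        loss.backward()
        opt.step()
    return probe

# The probe's weights give a candidate "concept direction"; per-example
# attribution is just the elementwise product of weights and activations.
# probe = train_probe(acts, in_check_labels)
# direction = probe.weight.detach().squeeze(0)   # shape [d_model]
# attribution = acts * direction                 # which dimensions carry the signal
```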

I've not mentally carved things up that way before, but they do seem like different flavors of work (with 1 and 2 being closely related).

Another distinction I sometimes consider is between exploring a network for interpretable pieces ("finding things we understand") and trying to exhaustively interpret part of a network ("finding things we don't understand"). But this distinction doesn't carve up existing work very evenly: the only thing I can think of that I'd put in the latter category is the work on Artificial Artificial Neural Networks.

That seems like a pretty reasonable breakdown (though note that 2 inherently needs to come after 1).