Thoughts on self-inspecting neural networks.

Thanks for your comment!

As for the image thing, it's more of a metaphor than a literal interpretation of what I'm talking about. I'm thinking of a multidimensional matrix representation; you can think of that a bit like an image (RGB on pixels) and use similar techniques as what are used by actual image software; but it's not a literal JPEG or BMP or whatever. The idea is to be able to take advantage of compression algorithms, etc. to make the process more efficient.

The training data for the supervisory network is the input, outputs, and parameter deltas of the target network. The idea is to learn the which parameters change based on which input/output pair and thereby eventually localize concepts within the target network and hopefully label/explain them in human understandable formats. This should be possible since the input/output is in human text.

The reasons to think of it a bit like an image and label parameters and sections of the target network is to try to take advantage of the same kind of technology used in the AlphaGo/AlphaZero/MuZero AIs that used techniques relating to image analysis to try and predict the deltas in the target network. If you could do this, then you should be able to "aim" the target network in a direction that you want it to go; basically, tell it what you want it to learn.

All of this may have several benefits, it could allow us to analyze the target network and understand what is in there and maybe check to see if anything we don't like is in there (misalignment for example). And it could allow us to directly give the network a chosen alignment instead of the more nebulous feedback that we currently give networks to train them. Right now, it's kind of like a teacher that only tells their student that they are wrong and not why they are wrong. The AI kind of has to guess what it did wrong. This makes it far more likely for it to end up learning a policy that has the outward behavior that we want but may have an inner meaning that doesn't actually line up with what we're trying to teach it.

You are correct that it may accelerate capabilities as well as safety; unfortunately, most of my ideas seem to be capabilities ideas. However, I am trying to focus more on safety. I do think we may have to accept that alignment research may impact capabilities research as well. The more we restrict the kinds of safety research we try to do, the more likely it is that we don't find a solution at all. And it's entirely possible that the hypothetical "perfect solution" would, in fact, greatly aid capabilities at the same time that it solved alignment/safety issues. I don't think we should avoid that. I tend to think that safety research should mostly be open source while capabilities research should be mostly closed. If somebody somehow manages real AGI, we definitely want them to have available to them the best safety research in the world; no matter who it is that does it.

Anyway, thanks again for your input, I really appreciate it. Let me know if I successfully answered your questions/suggestions or not. :-)

[-]Charlie Steiner3y20

I think there's a lot to like here. The entire thing about images reveals that you're not quite "thinking with portals," but I agree that automated interpretability would be powerful, and could do things like you talk about.

An important thing to consider is: what would be the training data of a supervisory network?

Maybe another important thing to consider is: would this actually help alignment more than it helps capabilities?

[-]Deruwyn3y10

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

4

Thoughts on self-inspecting neural networks.

4

4