AI notkilleveryoneism researcher at Apollo, focused on interpretability.
So you need a pretty strong argument that interp in particular is good for capabilities, which isn't borne out empirically and also doesn't seem that strong.
I think current interpretability has close to no capabilities externalities because it is not good yet, and delivers close to no insights into NN internals. If you had a good interpretability tool, which let you read off and understand e.g. how AlphaGo plays games to the extent that you could reimplement the algorithm by hand in C, and not need the NN anymore, then I would expect this to yield large capabilities externalities. This is the level of interpretability I aim for, and the level I think we need to make any serious progress on alignment.
If your interpretability tools cannot do things even remotely like this, I expect they are quite safe. But then I also don't think they help much at all with alignment. What I'm saying is that there's a roughly proportional relationship between your understanding of the network and your ability both to align it and to make it better. I doubt there are many deep insights to be had that further the former without also furthering the latter. Maybe some insights further one a bit more than the other, but I doubt you'd be able to figure out which ones those are in advance. Often, I expect you'd only know years after the insight has been published and the field has figured out all of what can be done with it.
I think it's all one tech tree, is what I'm saying. I don't think neural network theory neatly decomposes into a "make strong AGI architecture" branch and an "aim AGI optimisation at a specific target" branch. Just like quantum mechanics doesn't neatly decompose into a "make a nuclear bomb" branch and a "make a nuclear reactor" branch. In fact, in the case of NNs, I expect aiming strong optimisation is probably just straight up harder than creating strong optimisation.
By default, I think if anyone succeeds at solving alignment, they probably figured out most of what goes into making strong AGI along the way. Even just by accident. Because it's lower in the tech tree.
I didn't think Rob was necessarily implying that. I just tried to give some context to Quintin's objection.
There are more papers and math in this broad vein (e.g. Mingard on SGD, singular learning theory), and I roughly buy the main thrust of their conclusions[1].
However, I think "randomly sample from the space of solutions with low combined complexity&calculation cost" doesn't actually help us that much over a pure "randomly sample" when it comes to alignment.
It could mean that the relation between your network's learned goals and the loss function is more straightforward than what you get with evolution=>human hardcoded brain stem=>human goals, since the latter likely has a far weaker simplicity bias in the first step than the network training does. But the second step, a human baby training on their brain stem loss signal, seems to remain a useful reference point for the amount of messiness we can expect. And it does not seem to me to be a comforting one. I, for one, don't consider getting excellent visual cortex prediction scores a central terminal goal of mine.
Though I remain unsure of what to make of the specific one Quintin cites, which advances some more specific claims inside this broad category, and is based on results from a toy model with weird, binary NNs, using weird, non-standard activation functions.
Whether singular learning theory actually yields you anything useful when your optimiser converges to the largest singularity seems very much architecture dependent though? If you fit a 175 billion degree polynomial to the internet to do next token prediction, I think you'll get out nonsense, not GPT-3. Because a broad solution in the polynomial landscape does not seem to equal a Kolmogorov-simple solution to the same degree it does with MLPs or transformers.
Likewise, there doesn't seem to be anything saying you can't have an architecture with an even better simplicity and speed bias than the MLP family has.
Best I've got is to go dark once it feels like you're really getting somewhere, and only work with people under NDAs (honour based or actually legally binding) from there on out. At least a facsimile of proper security, central white lists of orgs and people considered trustworthy, central standard communication protocols with security levels set up to facilitate communication between alignment researchers. Maybe a forum system that isn't on the public net. Live with the decrease in research efficiency this brings, and try to make it to the finish line in time anyway.
If some org or people would make it their job to start developing and trial running these measures right now, I think that'd be great. I think even today, some researchers might be enabled to collaborate more by this.
Very open to alternate solutions that don't cost so much efficiency if anyone can think of any, but I've got squat.
The main problem with evaluating a hypothesis by KL divergence is that if you do this, your explanation looks bad in cases like the following:
I would take this as indication that the explanation is inadequate. If I said that the linear combination of nodes at layer l of a NN implements the function , but in fact it implements , where g does some other thing, my hypothesis was incorrect, and I'd want the metric to show that. If I haven't even disentangled the mechanism I claim to have found from all the other surrounding circuits, I don't think I get to say my hypothesis is doing a good job. Otherwise it seems like I have a lot of freedom to make up spurious hypotheses that claim whatever, and hide the inadequacies as "small random fluctuations" in the ablated test loss.
The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don't want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model performs well, I'll be able to predict whether it will generalize correctly onto a particular new distribution.
I don't see how the dimensionality of the quantity you want to understand the generative mechanism of relates to the dimensionality of the comparison you would want to carry out to evaluate a proposed generative mechanism.
I want to understand how the model computes its outputs to get loss on distribution , so I can predict what loss it will get on another distribution . I make a hypothesis for what the mechanism is. The mechanism implies that doing intervention on the network, say shifting to , should not change behaviour, because the NN only cares about , not its magnitude. If I then see that the intervention does shift output behaviour, even if it does not change the value of on net, my hypothesis was wrong. The magnitude of does play a part in the network's computations on . It has an influence on the output.
But that interference was random and understanding it won't help you know if the mechanism that the model was using is going to generalize well to another distribution.
If it had no effect on how outputs for are computed, then destroying it should not change behaviour on . So there should be no divergence between the original and ablated models' outputs. If it did affect behaviour on , but not in ways that contribute net negatively or net positively to the accuracy on that particular distribution, it seems that you still want to know about it, because once you understand what it does, you might see that it will contribute net negatively or net positively to the model's ability to do well on .
A heuristic that fires on some of , but doesn't really help much, might turn out to be crucial for doing well on . A leftover memorised circuit that didn't get cleaned up might add harmless "noise" on net on , but ruin generalisation to .
I would expect this to be reasonably common. A very general solution is probably overkill for a narrow sub-dataset: it contains many circuits that check for possible exception cases but aren't really necessary for that particular class of inputs. If you throw out everything that doesn't do much to the loss on net, your explanations will miss the existence of these circuits, and you might wrongly conclude that the solution you are looking at is narrow and will not generalise.
Your suggestion of using seems a useful improvement compared to most metrics. It's, however, still possible that cancellation could occur. Cancellation is mostly due to aggregating over a metric (e.g., the mean) and less due to the specific metric used (although I could imagine that some metrics like could allow for less ambiguity).
It's not about vs. some other loss function. It's about using a one dimensional summary of a high dimensional comparison, instead of a one dimensional comparison. There are many ways for two neural networks to both diverge from some training labels by an average loss while spitting out very different outputs. There are tautologically no ways for two neural networks to have different output behaviour without having non-zero divergence in label assignment for at least some data points. Thus, it seems that you would want a metric that aggregates the divergence of the two networks' outputs from each other, not a metric that compares their separate aggregated divergences from some unrelated data labels and so throws away most of the information.
A low dimensional summary of a high dimensional comparison between the networks seems fine(ish). A low dimensional comparison between the networks based on the summaries of their separate comparisons to a third distribution throws away a lot of the relevant information.
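The distinction above can be made concrete with a toy example. The following is a minimal sketch, not anyone's actual evaluation code: the probabilities are made up, the task is assumed to be binary classification where class 0 is always the correct label, and per-example KL divergence is just one assumed choice for the direct output-to-output comparison. The two hypothetical "networks" get identical average loss against the labels, yet their outputs differ on every data point.

```python
import numpy as np

def mean_cross_entropy(p_correct):
    """Average cross-entropy loss, given the probability assigned to the true label."""
    return float(np.mean(-np.log(p_correct)))

def mean_kl(p, q):
    """Average per-example KL(p || q) between two sets of output distributions."""
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))

# Hypothetical probabilities each network assigns to the correct class (class 0).
# Network A is confident on the first two inputs, network B on the last two.
p_correct_a = np.array([0.9, 0.9, 0.5, 0.5])
p_correct_b = np.array([0.5, 0.5, 0.9, 0.9])

# Full output distributions over the two classes, shape (n_examples, 2).
probs_a = np.stack([p_correct_a, 1.0 - p_correct_a], axis=1)
probs_b = np.stack([p_correct_b, 1.0 - p_correct_b], axis=1)

# Comparison of summaries: each network's aggregated loss against the labels.
loss_gap = abs(mean_cross_entropy(p_correct_a) - mean_cross_entropy(p_correct_b))

# Summary of a comparison: aggregate the divergence between the outputs themselves.
output_divergence = mean_kl(probs_a, probs_b)

print(f"gap in mean loss vs. labels: {loss_gap:.6f}")
print(f"mean KL between outputs:     {output_divergence:.6f}")
```

Comparing the two aggregated losses reports essentially no difference, while aggregating the per-example divergence between the outputs reports a clearly non-zero one. The cancellation happens in the first metric because the aggregation is done before the comparison.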
The sequence a hypothesis predicts the inductor to receive is not the world model that hypothesis implies.
A hypothesis can consist of very simple laws of physics describing time evolution in an eternal universe, yet predict that the sequence will be cut off soon, because the camera sending the pixel values that make up the inductor's input sequence is about to die.