Lucius Bushnaq

AI notkilleveryoneism researcher at Apollo, focused on interpretability.



This sounds cool and deep but crashes headlong into the issue that the entropy rate and the excess entropy of any stochastic process is time-symmetric.

It's time-symmetric around a starting point $t_0$ of low entropy. The further $t$ is from $t_0$, the more entropy you'll have, in either direction. The absolute value $|t - t_0|$ is what matters.

In this case, $t_0$ is usually taken to be the big bang. So the further in time you are from the big bang, the less the universe is like a dense uniform soup with little structure that needs description, and the higher your entropy will be. That's how you get the subjective perception of temporal causality.

Presumably, this would hold to the other side of $t_0$ as well, if there is one. But we can't extrapolate past $t_0$, because close to $t_0$ everything gets really really energy dense, so we'd need to know how to do quantum gravity to calculate what the state on the other side might look like. So we can't check that. And the notion of time as we're discussing it here might break down at those energies anyway.
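To make the "entropy grows with distance from the low-entropy point, in either time direction" claim concrete, here is a minimal toy sketch. It is not a model of cosmology: the free-streaming particles on a ring, the bin count, and all names (`coarse_grained_entropy`, `x0`, `velocities`) are assumptions chosen purely for illustration. The dynamics are exactly reversible, yet the coarse-grained entropy rises on both sides of the special time.

```python
import numpy as np

rng = np.random.default_rng(0)

n_particles = 10_000
n_bins = 20
t0 = 0.0      # time of the low-entropy state (hypothetical toy setup)
x0 = 0.525    # every particle starts at the centre of a single bin
velocities = rng.normal(0.0, 0.1, size=n_particles)

def coarse_grained_entropy(t):
    """Shannon entropy (in nats) of the binned particle positions at time t."""
    # Free streaming on a ring of circumference 1: exactly reversible dynamics.
    positions = np.mod(x0 + velocities * (t - t0), 1.0)
    counts, _ = np.histogram(positions, bins=n_bins, range=(0.0, 1.0))
    p = counts[counts > 0] / n_particles
    return float(np.sum(p * np.log(1.0 / p)))

# Entropy is lowest at t0 and grows as |t - t0| grows, in both directions.
for t in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"t = {t:+.1f}   entropy = {coarse_grained_entropy(t):.3f}")
```

Nothing in the update rule prefers one time direction; the asymmetry comes entirely from where the low-entropy state sits.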

Toy example of what I would consider pretty clear-cut cross-layer superposition: 

We have a residual MLP network. The network implements a single UAND gate (universal AND, calculating the $\binom{m}{2}$ pairwise ANDs of $m$ sparse boolean input features using far fewer than $\binom{m}{2}$ neurons), as described in Section 3 here.

However, instead of implementing this with a single MLP, the network does this using all the MLPs of all the layers in combination. Simple construction that achieves this:

  1. Cut the residual stream into two subspaces, reserving one subspace for the input features and one subspace for the $\binom{m}{2}$ output features.
  2. Take the construction from the paper, and assign each neuron in it to a random MLP layer in the residual network.
  3. Since the input and output spaces are orthogonal, there's no possibility of one MLP's outputs interfering with another MLP's inputs. So this network will implement UAND, as if all the neurons lived in a single large MLP layer.

Now we've made a network that computes a boolean circuit in superposition, without the boolean gates living in any particular MLP. To read out the current value of one of the circuit outputs in the MLPs, you'll need to look at a direction that's a linear combination of neurons in all of the MLPs. And if you use an SAE to look at a single residual stream position in this network before the very final MLP layer, it'll probably show you a bunch of half-computed nonsense.
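The three construction steps above can be sketched in a few lines of NumPy. Note the swap: this sketch does not use the paper's superposed UAND construction. It uses a naive one-neuron-per-pair AND gate, `ReLU(x_i + x_j - 1)`, as a stand-in, since the point here is only the cross-layer part (scattering neurons over layers with orthogonal read and write subspaces). All names and sizes (`m`, `n_layers`, `layer_of`, `residual_net`) are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

m = 6                                        # boolean input features (toy size)
pairs = list(itertools.combinations(range(m), 2))
n_out = len(pairs)                           # one output feature per pairwise AND
n_layers = 4                                 # MLP layers in the residual network

# Stand-in single-layer construction: neuron k computes ReLU(x_i + x_j - 1).
W_in = np.zeros((n_out, m))
for k, (i, j) in enumerate(pairs):
    W_in[k, i] = W_in[k, j] = 1.0
b = -np.ones(n_out)
W_out = np.eye(n_out)                        # neuron k writes to output feature k

def single_mlp(x):
    """All neurons in one big MLP layer."""
    return W_out @ np.maximum(W_in @ x + b, 0.0)

# Step 2: assign each neuron to a random MLP layer.
layer_of = rng.integers(0, n_layers, size=n_out)

def residual_net(x):
    """Same neurons, scattered over the layers of a residual network."""
    # Step 1: residual stream = [input subspace | output subspace].
    resid = np.concatenate([x, np.zeros(n_out)])
    for layer in range(n_layers):
        idx = np.flatnonzero(layer_of == layer)
        # Each MLP reads only from the input subspace...
        hidden = np.maximum(W_in[idx] @ resid[:m] + b[idx], 0.0)
        # ...and writes only to the output subspace (step 3: no interference).
        resid[m:] += W_out[:, idx] @ hidden
    return resid[m:]

x = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0])
print(np.allclose(residual_net(x), single_mlp(x)))
```

Inspecting `resid[m:]` partway through the loop shows only the ANDs whose neurons happened to land in earlier layers, which is the "half-computed" picture an SAE on an intermediate residual stream would see.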

In a real network, the most convincing evidence to me would be a circuit involving sparse coded variables or operations that cannot be localized to any single MLP.

A prior that doesn't assume independence should give you a sparsity penalty that isn't a sum of independent penalties for each activation.
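One way to spell this out, assuming the sparsity penalty is read as the negative log of a prior $p(a)$ over the activation vector $a$ in a MAP-style objective (an interpretation added here, not stated in the comment):

$$
\mathcal{L}(x, a) = \underbrace{-\log p(x \mid a)}_{\text{reconstruction}} \; \underbrace{-\, \log p(a)}_{\text{sparsity penalty}}
$$

$$
p(a) = \prod_i p_i(a_i) \;\Longrightarrow\; -\log p(a) = \sum_i -\log p_i(a_i), \qquad \text{e.g. } p_i(a_i) \propto e^{-\lambda |a_i|} \;\Longrightarrow\; -\log p(a) = \lambda \sum_i |a_i| + \text{const}
$$

The sum over activations appears only because the prior factorises. If $p(a)$ does not factorise, $-\log p(a)$ generally cannot be written as a sum of per-activation terms.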

Would you predict that SAE features corresponding to input tokens would have low FT-LLCs, since there's no upstream circuits needed to compute them?

It's not immediately obvious to me that we'd expect random directions to have lower FT-LLCs than 'feature directions', actually. If my random read-off direction is a sum of many features belonging to different circuits, breaking any one of those circuits may change the activations of that random read-off. Whereas an output variable of a single circuit might stay intact so long as that specific circuit is preserved.

Have you also tried this in some toy settings where you know what FT-LLCs you should get out? Something where you'd be able to work out in advance on paper how much the FT-LLC along one direction should roughly differ from the FT-LLC along another direction?

Asking because last time I had a look at these numeric LLC samplers, they didn't exactly seem reliable yet, to put it mildly. The numbers they spit out seemed obviously nonsense in some cases. About the most positive thing you could say about them was that they at least appeared to get the ordering of LLC values between different networks right. In a few test cases. But that's not exactly a ringing endorsement. Just counting Hessian zero eigenvalues can often do that too. That was a while ago though.
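For reference, "counting Hessian zero eigenvalues" can be sketched as below. This is a minimal illustration under stated assumptions, not any particular LLC library's method: the finite-difference Hessian, the tolerance `tol`, and the two toy losses are all hypothetical choices made for the example.

```python
import numpy as np

def numerical_hessian(f, w, h=1e-4):
    """Central finite-difference Hessian of scalar function f at point w."""
    w = np.asarray(w, dtype=float)
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_i[i] = h
            e_j = np.zeros(n)
            e_j[j] = h
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * h * h)
    return H

def count_zero_eigenvalues(f, w, tol=1e-6):
    """Number of Hessian eigenvalues at w with magnitude below tol."""
    eigenvalues = np.linalg.eigvalsh(numerical_hessian(f, w))
    return int(np.sum(np.abs(eigenvalues) < tol))

def loss_degenerate(w):
    # Minimum is a whole curve w0 * w1 = 1: one flat direction.
    return (w[0] * w[1] - 1.0) ** 2

def loss_regular(w):
    # Isolated quadratic minimum: no flat directions.
    return (w[0] - 1.0) ** 2 + (w[1] - 1.0) ** 2

w_star = np.array([1.0, 1.0])  # a minimum of both toy losses
print(count_zero_eigenvalues(loss_degenerate, w_star))
print(count_zero_eigenvalues(loss_regular, w_star))
```

More zero eigenvalues means more flat directions at the minimum, so ranking models by this count gives a crude ordering of how degenerate they are. It says nothing about how flat the higher-order terms are, which is the part an LLC estimate is supposed to capture.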

I think this is particularly incorrect for alignment, relative to a more typical STEM research field. Alignment is very young[1]. There's a lot less existing work worth reading than you have in a field like, say, lattice quantum field theory. Due to this, the time investment required to start contributing at the research frontier is very low, relatively speaking.

This is definitely changing. There's a lot more useful work than there was when I started dipping my toe into alignment three years ago. But compared to something like particle physics, it's still very little. 

  1. ^

    In terms of # total smart people hours invested

The reason I often bring up human evolution is because that's our only example of an outer optimization loop producing an inner general intelligence

There's also human baby brains training minds from something close to random initialisation at birth into a general intelligence. That example is plausibly a lot closer to how we might expect AGI training to go, because human brains are neural nets too and presumably have strictly-singular flavoured learning dynamics just like our artificial neural networks do. Whereas evolution acts on genes, which to my knowledge don't have neat NN-style loss landscapes heavily biased towards simplicity. 

Evolution is more like if people used classic genetic optimisation to blindly find neural network architectures, optimisers, training losses, and initialisation schemes, that are in turn evaluated by actually training the networks.

Not that I think this ultimately ends up weakening Doomimir's point all that much. Humans don't seem to end up with terminal goals that are straightforward copies of the reward circuits pre-wired into our brains either. I sure don't care much about predicting sensory inputs super accurately, which was probably a very big part of the training signal that built my mind.

Many people in interpretability currently seem interested in ideas like enumerative safety, where you describe every part of a neural network to ensure all the parts look safe. Those people often also talk about a fundamental trade-off in interpretability between the completeness and precision of an explanation for a neural network's behavior and its description length. 

I feel like, at the moment, these sorts of considerations are all premature and beside the point.  

I don't understand how GPT-4 can talk. Not in the sense that I don't have an accurate, human-intuitive description of every part of GPT-4 that contributes to it talking well. My confusion is more fundamental than that. I don't understand how GPT-4 can talk the way a 17th-century scholar wouldn't understand how a Toyota Corolla can move. I have no gears-level model for how anything like this could be done at all. I don't want a description of every single plate and cable in a Toyota Corolla, and I'm not thinking about the balance between the length of the Corolla blueprint and its fidelity as a central issue of interpretability as a field. 

What I want right now is a basic understanding of combustion engines. I want to understand the key internal gears of LLMs that are currently completely mysterious to me, the parts where I don't have any functional model at all for how they even could work. What I ultimately want to get out of Interpretability at the moment is a sketch of Python code I could write myself, without a numeric optimizer as an intermediary, that would be able to talk.

I kind of expect that things-people-call-their-values-that-are-not-their-revealed-preferences would be a concept that a smart AI that predicts systems coupled to humans would think in as well. It doesn't matter whether these stated values are 'incoherent' in the sense of not being in tune with actual human behavior, they're useful for modelling humans because humans use them to model themselves, and these self-models couple to their behavior. Even if they don't couple in the sense of being the revealed-preferences in an agentic model of the humans' actions.

Every time a human tries and mostly fails to explain what things they'd like to value if only they were more internally coherent and thought harder about things, a predictor trying to forecast their words and future downstream actions has a much easier time of it if they have a crisp operationalization of the endpoint the human is failing to operationalize. 

An analogy: If you're trying to predict what sorts of errors a diverse range of students might make while trying to solve a math problem, it helps to know what the correct answer is. Or if there isn't a single correct answer, what the space of valid answers looks like.

Corrigibility and actual human values are both heavily reflective concepts.  If you master a requisite level of the prerequisite skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment -- which of course most people can't do because they project the category boundary onto the environment

Actual human values depend on human internals, but predictions about systems that strongly couple to human behavior depend on human internals as well. I thus expect efficient representations of systems that strongly couple to human behavior to include human values as somewhat explicit variables. I expect this because humans seem agent-like enough that modeling them as trying to optimize for some set of goals is a computationally efficient heuristic in the toolbox for predicting humans. 

At lower confidence, I also think human expected-value-trajectory-under-additional-somewhat-coherent-reflection would show up explicitly in the thoughts of AIs that try to predict systems strongly coupled to humans. I think this because humans seem to change their values enough over time in a sufficiently coherent fashion that this is a useful concept to have. E.g., when watching my cousin grow up, I find it useful and possible to have a notion in advance of what they will come to value when they are older and think more about what they want. 

I do not think there is much reason by default for the representations of these human values and human value trajectories to be particularly related to the AI's values in a way we like. But that they are in there at all sure seems like it'd make some research easier, compared to the counterfactual. For example, if you figure out how to do good interpretability, you can look into an AI and get a decent mathematical representation of human values and value trajectories out of it. This seems like a generally useful thing to have. 

If you separately happen to have developed a way to point AIs at particular goals, perhaps also downstream of you having figured out how to do good interpretability[1], then having explicit access to a decent representation of human values and human expected-value-trajectories-under-additional-somewhat-coherent-reflection might be a good starting point for research on making superhuman AIs that won't kill everyone. 

  1. ^

    By 'good interpretability', I don't necessarily mean interpretability at the level where we understand a forward pass of GPT-4 so well that we can code our own superior LLM by hand in Python like a GOFAI. It might need to be better interpretability than that. This is because an AI's goals, by default, don't need to be explicitly represented objects within the parameter structure of a single forward pass. 

But here we are, and the idea of the USA govt nationalizing OpenAI seems a million miles outside the Overton window.

Registering that it does not seem that far outside the Overton window to me anymore. My own advance prediction of how much governments would be flipping out around this capability level has certainly been proven a big underestimate.

