AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
I think I mostly agree with this for current model organisms, but it seems plausible to me that well-chosen studies conducted on future systems that are smarter in an agenty way, but not superintelligent, could yield useful insights that do generalise to superintelligent systems.
Not directly generalise, mind you, but maybe you could get something like "Repeated intervention studies show that the formation of coherent self-protecting values in these AIs works roughly like $X$ with properties $Y$. Combined with other things we know, this maybe suggests that the general math for how training signals relate to values is a bit like $Z$, and that suggests what we thought of as 'values' is a thing with type signature $T$."
And then maybe type signature $T$ is actually a useful building block for a framework which does generalise to superintelligence.
I am not particularly hopeful here. Even if we do get enough time to study agenty AIs that aren't superintelligent, I have an intuition that this sort of science could turn out to be pretty intractable for reasons similar to why psychology turned out to be pretty intractable. I do think it might be worth a try though.
I agree issue 3 seems like a potential problem with methods that optimise for sparsity too much, but it doesn't seem that directly related to the main thesis? At least in the example you give, it should be possible in principle to notice that the space can be factored as a direct sum without having to look to future layers.
Sure, it's possible in principle to notice that there is a subspace that can be factored into a direct sum. But how do you tell whether you in fact ought to represent it that way, rather than as composed features, to match the features of the model? Just because the compositional structure is present in the activations doesn't mean the model cares about it.
I don't think your post contains any knockdown arguments that this approach is doomed (do you agree?), but it is maybe suggestive.
I agree that it is not a knockdown argument. That is why the title isn't "Activation space interpretability is doomed."
Hm, feels off to me. What privileges the original representation of the uncompressed file as the space in which locality matters? I can buy the idea that understanding is somehow related to a description that can separate the whole into parts, but why do the boundaries of those parts have to live in the representation of the file I'm handed? Why can't my explanation have parts in some abstract space instead? Lots of explanations of phenomena seem to work like that.
Thank you. Yes, our claim isn't that SAEs only find composed features. Simple counterexample: Make a product space of two spaces with $n$ dictionary elements each, with an average of $k$ features active at a time in each factor space. Then the dictionary of composed features has an $L_0$ of $k^2$, whereas the dictionary of factored features has an $L_0$ of $2k$, so a well-tuned SAE will learn the factored set of features. Note however that just because the dictionary of factored features is sparser doesn't mean that those are the features of the model. The model could be using the composed features instead, because that's more convenient for the downstream computations somehow, or for some other reason.
Our claim is that an SAE trained on the activations at a single layer cannot tell whether the features of the model are in composed representation or factored representation, because the representation the model uses need not be the representation with the lowest $L_0$.
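For concreteness, here is a minimal numpy sketch of that counterexample (the specific numbers and names are mine, purely illustrative): the same activation vector can be written exactly with $2k$ active factored elements or $k^2$ active composed elements.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, dim = 8, 3, 32   # features per factor space, active per sample, activation dim

# Feature directions for the two factor spaces (rows = features).
U = rng.standard_normal((n, dim))
V = rng.standard_normal((n, dim))

a_idx = rng.choice(n, size=k, replace=False)   # k active features in factor A
b_idx = rng.choice(n, size=k, replace=False)   # k active features in factor B

# Factored dictionary: the 2n single-factor directions. The activation is a sum
# of 2k of them.
x = U[a_idx].sum(axis=0) + V[b_idx].sum(axis=0)

# Composed dictionary: one element per (i, j) pair, i.e. n^2 elements of the
# form U[i] + V[j]. Writing the same x takes all k^2 active pairs, weight 1/k each.
x_composed = sum(U[i] + V[j] for i in a_idx for j in b_idx) / k

assert np.allclose(x, x_composed)
print("factored L0:", 2 * k)   # 6
print("composed L0:", k * k)   # 9
```

A sparsity-tuned SAE would pick the factored dictionary here, but nothing in the activations themselves says which of the two dictionaries the model's downstream computation actually reads off.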
The third term in that. Though it was in a somewhat different context related to the weight partitioning project mentioned in the last paragraph, not SAE training.
Yes, brittle in hyperparameters. It was also just very painful to train in general. I wouldn't straightforwardly extrapolate our experience to a standard SAE setup though, we had a lot of other things going on in that optimisation.
In my limited experience, attribution-patching style attributions tend to be a pain to optimise for sparsity. Very brittle. I agree it seems like a good thing to keep poking at though.
See the second to last paragraph. The gradients of downstream quantities with respect to the activations contain information and structure that is not part of the activations. So in principle, there could be a general way to analyse the right gradients in the right way on top of the activations to find the features of the model. See e.g. this for an attempt to combine PCAs of activations and gradients.
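To gesture at what that could look like in practice (this is a toy sketch of the general idea, not the method in the linked post; the model and all names are made up), one can take per-sample gradients of a downstream quantity with respect to the activations and decompose activations and gradients jointly:

```python
import torch

torch.manual_seed(0)
d_act, d_out, n_samples = 16, 4, 512

# Stand-in for everything downstream of the layer we care about.
W_down = torch.randn(d_out, d_act)

acts = torch.randn(n_samples, d_act, requires_grad=True)   # activations at the layer
downstream = (acts @ W_down.T).relu().sum()                # scalar downstream quantity
downstream.backward()
grads = acts.grad                                          # d(downstream)/d(acts), per sample

# The gradients carry structure about how later layers read the activations.
# One simple way to use it: a joint PCA/SVD over activations and gradients.
combined = torch.cat([acts.detach(), grads], dim=-1)
combined = combined - combined.mean(dim=0)
_, S, _ = torch.linalg.svd(combined, full_matrices=False)
print("top singular values of joint activation+gradient matrix:", S[:5])
```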
Space has resources people don't own. The earth's mantle, dozens of kilometres down, potentially has resources people don't own. More to the point maybe, I don't think humans will be able to continue enforcing laws barring a hostile takeover in the way you seem to think.
Imagine we find out that aliens are headed for earth and will arrive in a few years. Just from the light emissions their probes and expanding civilisation give off, we can infer that they're obviously more technologically mature than us, probably already engineered themselves to be much smarter than us, and can basically do whatever they want with the atoms that make up our solar system and there's nothing we can do about it. We don't know what they want yet though. Maybe they're friendly?
I think guessing that the aliens will be friendly and share human morality to an extent seems like a pretty specific guess about their minds to be making, and is maybe more likely false than not. But guessing that they don't care about human preferences or well-being but do care about human legal structures, that they won't at all help you or gift you things, also won't disassemble you and your property for its atoms[1], but will try to buy atoms from those whom the atoms belong to according to human legal records, now that strikes me as a really really really specific guess to be making that is very likely false.
Superintelligent AGIs don't start out having giant space infrastructure, but qualitatively, I think they'd very quickly overshadow the collective power of humanity in a similar manner. They can see paths through the future to accomplish their goals much better than we can, routing around attempts by us to oppose them. The force that backs up our laws does not bind them. If you somehow managed to align them, they might want to follow some of our laws, because they care about them. But if someone managed to make them care about the legal system, they probably also managed to make them care about your well-being. Few humans, when choosing what to align their AGI with, would care about the rule of law but not at all about other humans' welfare. That's not a kind of value system that shows up in humans much.
So in that scenario, you don't need a legal claim to part of the pre-existing economy to benefit from the superintelligences' labours. They will gift some of their labour to you. Say the current value of the world economy is $x$, owned by humans roughly in proportion to how much money they have, and two years after superintelligence the value of the economy is some vastly larger $y$, with the great majority of the new surplus owned by aligned superintelligences[2] because they created most of that value, and a small fraction owned by rich humans who sold the superintelligence valuable resources and infrastructure to get the new industrial base started faster[3]. The superintelligence will then probably distribute its gains among humans according to some system that either treats conscious minds pretty equally, or follows the idiosyncratic preferences of the faction that aligned it, not according to how large a fraction of the total economy they used to own two years ago. So someone who started out with much more money than you two years ago doesn't have much more money in expectation now than you do.
For its conserved quantum numbers, really.
Or owned by whomever the superintelligences take orders from.
You can't just demand super high share percentages from the superintelligence in return for that startup capital. It's got all the other resource owners in the world as potential bargaining partners competing with you. And really, the only reason it wouldn't be steering the future into a deal where you get almost nothing, or just steal all your stuff, is to be nice to you. Decision-theoretically, this is a handout with extra steps, not a negotiation between equals.
Are you so sure that there is not a single interesting, a priori deducible fact about the superintelligent economy beyond "a singleton is in charge and everything is utopia"?
End points are easier to infer than trajectories, so sure, I think there's some reasonable guesses you can try to make about how the world might look after aligned superintelligence, should we get it somehow.
For example, I think it's a decent bet that basically all minds would exist solely as uploads almost all of the time, because living directly in physical reality is astronomically wasteful and incredibly inconvenient. Turning on a physical lamp every time you want things to be brighter means wiggling about vast numbers of particles and wasting an ungodly amount of negentropy just for the sake of the teeny tiny number of bits about these vast numbers of particles that actually make it to your eyeballs, and the even smaller number of bits that actually end up influencing your mind state and making any difference to your perception of the world. All of the particles[1] in the lamp in my bedroom, the air its light shines through, and the walls it bounces off, could be so much more useful arranged in an ordered dance of logic gates where every single movement and spin flip is actually doing something of value. If we're not being so incredibly wasteful about it, maybe we can run whole civilisations for aeons on the energy and negentropy that currently make up my bedroom. What we're doing right now is like building an abacus out of supercomputers. I can't imagine any mature civilisation would stick with this.
It's not that I refuse to speculate about how a world post aligned superintelligence might look. I just didn't think that your guess was very plausible. I don't think pre-existing property rights or state structures would matter very much in such a world, even if we don't get what is effectively a singleton, which I doubt. If a group of superintelligent AGIs is effectively much more powerful and productive than the entire pre-existing economy, your legal share of that pre-existing economy is not a very relevant factor in your ability to steer the future and get what you want. The same goes for pre-existing military or legal power.
Well, the conserved quantum numbers of my room, really.
I used to, as a child. I did accept a lawful universe, but I thought my perception of free will was in tension with that, so that perception must be "an illusion".
My mother kept trying to explain to me that there was no tension between these things, because it was correct that my mind made its own decisions rather than some outside force. I didn't understand what she was saying though. I thought she was just redefining 'free will' from a claim that human brains effectively had a magical ability to spontaneously ignore the laws of physics to a boring tautological claim that human decisions are made by humans rather than something else.
I changed my mind on this as a teenager. I don't quite remember how, it might have been the Sequences or HPMOR again. I realised that my imagination had still been partially conceptualising the "laws of physics" as some sort of outside force, a set of strings pulling my atoms around, rather than as a predictive description of me and the universe. Saying "the laws of physics make my decisions, not me" made about as much sense as saying "my fingers didn't move, my hand did." That was what my mother had been trying to tell me.