the difference between activation sparsity, circuit sparsity, and weight sparsity
activation sparsity enforces that features activate sparsely - every feature activates only occasionally.
circuit sparsity enforces that the connections between features are sparse - most features are not connected to most other features.
weight sparsity enforces that most of the weights are zero. weight sparsity naturally implies circuit sparsity if we interpret the neurons and residual channels of the resulting model as the features.
weight sparsity is not the only way to enforce circuit sparsity - for example, Jacobian SAEs also attempt to enforce circuit sparsity. the big advantage of weight sparsity is that it's a very straightforward way to be sure that the interactions are definitely sparse and have no interference weights. unfortunately, it comes at a terrible cost - the resulting models are very expensive to train.
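to make the distinctions concrete, here is a minimal pytorch-style sketch of what each kind of sparsity penalizes. the names f and W and the soft L1 penalties are just for illustration; in practice weight sparsity is usually enforced with hard top-k masking or pruning rather than a soft penalty:

```python
import torch

# made-up toy tensors: f is a batch of feature activations,
# W is a feature-to-feature weight matrix
f = torch.relu(torch.randn(32, 512))
W = torch.randn(512, 512)

def activation_sparsity_penalty(f):
    # activation sparsity: each feature should fire only occasionally
    return f.abs().mean()

def weight_sparsity_penalty(W):
    # weight sparsity: most entries of W should be (near) zero
    return W.abs().mean()

def circuit_density(W, eps=1e-6):
    # circuit sparsity via weight sparsity: two features only interact
    # if the weight connecting them is nonzero
    return (W.abs() > eps).float().mean()
```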
although in some sense the circuit sparsity paper is an interpretable pretraining paper, this is not the framing I'm most excited about. if anything, I think of interpretable pretraining as a downside of our approach, that we put up with because it makes the circuits really clean.
I find it anthropologically fascinating how at this point neurips has become mostly a summoning ritual to bring all of the ML researchers to the same city at the same time.
nobody really goes to talks anymore - even the people in the hall are often just staring at their laptops or phones. the vast majority of posters are uninteresting, and the few good ones often have a huge crowd that makes it very difficult to ask the authors questions.
increasingly, the best parts of neurips are the parts outside of neurips proper. the various lunches, dinners, and parties hosted by AI companies and friend groups (and increasingly over the past few years, VCs) are core pillars of the social scene, and are where most of the socializing happens. there are so many that you can basically spend your entire neurips not going to neurips at all. at dinnertime, there are literally dozens of different events going on at the same time.
multiple unofficial workshops, entirely unaffiliated with neurips, will schedule themselves to be in town at the same time; they will often have a way higher density of interesting people and ideas.
if you stand around in the hallways and chat in a group long enough, eventually someone walking by will recognize someone in the group and join in, which repeats itself until the group gets so big that it undergoes mitosis into smaller groups. (most of the time the person being recognized is not me)
if you're not already going to some company event, finding a restaurant at lunch or dinner time can be very challenging. every restaurant in a several mile radius will be either booked for a company event, or jam packed with people wearing neurips badges.
idk, it's unclear to me that computers and the Internet are more subtle than cars or radios. also, 50 year old americans today have seen the fall of the soviet union, the creation of the european union, enormous advances in civil rights, 9/11, the 2008 crash, covid, the invasion of ukraine, etc. this isn't exactly WWII level, but it's also nowhere near a static, stable world.
thoughts on lemborexant
pros: if you take it, you will fall asleep 30-60 minutes later. nothing else I've tried has been as reliable at making sure I definitely fall asleep, and as far as I can tell, it doesn't destroy my sleep quality. especially at 10mg, you can feel it knocking you out, and you basically can't power through it even if you want to. it's a bit scary, but every powerful sleep drug is at least a bit scary, and many are a lot scarier. I generally take 5mg instead.
cons: it doesn't do anything to keep you asleep; if your body doesn't really want to sleep, you will wake up 2 hours later fully alert. it also doesn't do anything to shift your sleep schedule. these facts combined mean that if you try to use lemborexant for jet lag / shifting your sleep earlier, your life will suck indefinitely until you stop using lemborexant. my current recipe is to only use lemborexant when it's near enough to my normal bedtime, and to use melatonin 3 hours before bed to slowly move my sleep schedule earlier (shifting it later requires no special effort).
(potentially this also means lemborexant can be used to get nice 2 hour daytime naps? I have enough fear of god about sleep drugs that I feel hesitant to try any kind of hack like this)
(not medical advice. not a doctor, and even if I was a doctor I'm not your doctor, and even if I was your doctor I wouldn't be communicating to you via lesswrong shortforms)
fwiw, I'm pessimistic that you will actually be able to make big compute efficiency improvements even by fully understanding gpt-n. or at least, for an equivalent amount of effort, you could have improved compute efficiency vastly more by just doing normal capabilities research. my general belief is that the kind of understanding you want for improving compute efficiency is at a different level of abstraction than the kind of understanding you want for getting a deep understanding of generalization properties.
this feels like a subtweet of our recent paper on circuit sparsity. I would have preferred a direct response to our paper (or any other specific paper/post/person), rather than a dialogue against a hypothetical interlocutor.
I think this post is unfairly dismissive of the idea that we can guess aspects of the true ontology and iterate empirically towards it. it makes it sound like you have to guess a lot of things right about the true ontology before you can make any empirical progress at all. this is a reasonable view of the world, but I think the evidence so far rules out the strongest possible version of this claim.
SAEs are basically making the guess that the true ontology should activate kinda sparsely. this is clearly not enough to pin down the true ontology, and obviously at some point activation sparsity stops being beneficial and starts hurting. but SAE features seem closer to the true ontology than the neurons are, even if they are imperfect. this should be surprising if you think that you need to be really correct about the true ontology before you can make any progress! making the activations sparse is a pretty crude intervention, and you can imagine a world where SAEs don't find anything interesting at all because it's much easier to just find random sparse garbage, and so you need more constraints before you pin down something even vaguely reasonable. but we clearly don't live in that world.
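to spell out how little that guess actually pins down, here is a minimal sketch of a vanilla SAE objective (not any particular implementation; the dimensions and L1 coefficient are made up). the only constraint on the features is that they reconstruct the activations while firing sparsely:

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    # minimal sparse autoencoder sketch: an overcomplete dictionary of
    # features trained to reconstruct activations
    def __init__(self, d_model=768, d_features=8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # feature activations
        return self.dec(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # reconstruction term plus an L1 penalty on feature activations -
    # the only "guess" about the true ontology is activation sparsity
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```

everything else about what the features end up meaning is left to the optimizer, which is why it's notable that this constraint alone already gets you features closer to the true ontology than the neurons.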
our circuit sparsity work adds an additional constraint: we also enforce that the interactions between features are sparse. (I think of the part where we accomplish this by training new models from scratch as an unfortunate side effect; it just happens to be the best way to enforce this constraint.) this is another kind of crude intervention, but our main finding is that it again gets us slightly closer to the true concepts; circuits that used to require a giant pile of SAE features connected in an ungodly way can now be expressed simply. this again seems to suggest that we have gotten closer to the true features.
if you believe in natural abstractions, then it should at least be worth trying to dig down this path and slowly add more constraints, seeing whether it makes the model nicer or less nice, and iterating.
fwiw, I think the 100-1000x number is quite pessimistic, in that we didn't try very hard to make our implementation efficient; we were entirely focused on making it work at all. while I think it's unlikely our method will ever reach parity with frontier training methods, it doesn't seem crazy that we could reduce the gap a lot.
and I think having something 100x behind the frontier (i.e. roughly one GPT generation behind) is still super valuable for developing a theory of intelligence! like I claim it would be super valuable if aliens landed and gave us an interpretable GPT-4 or even GPT-3 without telling us how to make our own or scale it up.
to be clear, this post is just my personal opinion, and is not necessarily representative of the beliefs of the openai interpretability team as a whole