How do you get it? Apparently you can't get it from spinning the boxes.
I have not gotten them.
Hey, I paid for picolightcones, but they didn't appear? But my card is being charged. Is this how it's supposed to work?
I haven't collected all the virtues yet, but now there is no way for me to acquire more lootboxes because I've run out of lw-bucks. I don't know what to do.
Hey, I'm buying pico lightcones, and my card is being charged, but I don't get any pico lightcones. @habryka
Collecting all the virtues!
Sorry, this question is probably dumb, just asking to clear up my own confusion. But this experiment seems unfair to the SAE in multiple ways, some of which you mentioned. But also: the reason SAEs are cool is that they're unsupervised, so there's some hope they're finding concepts the models are actually using when they're thinking. But here you're starting with a well-defined human concept and then trying to find it inside the model.
If you are looking at a concept C and you have a clean dataset A, B where C is present in every sample of A and in none of B, and you train a probe to tell when C is present by looking at the residual stream, wouldn't you expect it to just find the "most correct linear representation" of C? (Assuming your dataset really is clean, A and B are from the same distribution, and you've removed spurious correlations.) The linear probe is in some sense the optimal tool for the job.
Like, the SAE gets less-than-perfect reconstruction, so the latent activations contain strictly less information than the activations themselves. And SAEs are basically linear, so an SAE probe can only learn a subset of the functions a linear probe can, especially when it's sparse. So the SAE probe starts out with a big disadvantage.
From what I understood, the reason you thought a sparse probe trained on SAE latents might still have an edge is that the SAE features let you capture the "relevant features" with a low-complexity function, which will probably generalize better.
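To make sure I'm picturing the comparison right, here's a toy sketch of the two probes I have in mind (random stand-in activations and a made-up frozen SAE encoder, nothing from your actual setup): a dense logistic-regression probe on the residual stream vs. an L1-sparse probe on the SAE latents.

```python
# Toy sketch only: random stand-ins for activations and a hypothetical SAE encoder.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, d_sae, n = 256, 1024, 1000

# Stand-ins for residual-stream activations and binary "concept C present" labels.
resid_acts = rng.normal(size=(n, d_model))
labels = rng.integers(0, 2, size=n)

# Hypothetical frozen SAE encoder: latents = ReLU(x @ W_enc + b_enc).
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = rng.normal(size=d_sae)
sae_latents = np.maximum(resid_acts @ W_enc + b_enc, 0.0)

# Dense linear probe directly on the residual stream (the "optimal tool" case).
dense_probe = LogisticRegression(max_iter=1000).fit(resid_acts, labels)

# Sparse probe on SAE latents: L1 penalty so it only uses a handful of latents,
# i.e. a low-complexity function of (hopefully) interpretable features.
sparse_probe = LogisticRegression(
    penalty="l1", solver="liblinear", C=0.1, max_iter=1000
).fit(sae_latents, labels)

n_used = int((sparse_probe.coef_ != 0).sum())
print(f"SAE latents used by sparse probe: {n_used} / {d_sae}")
```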
But it seems to me this only makes sense if the model's internal representations of "harmful intent" (and whatever few other related concepts the sparse probe is using) are similar to the ones generating the benchmarks.
Like, if the "harmful intent" feature the SAE learnt is actually a "schmarmful intent" feature, which has a .98 correlation with real harmful intent the way the benchmarks define it, maybe that's what the sparse SAE probe learned to use, plus some other sch-latents. But in that case the argument for why you'd expect it to generalize better than a dense probe fails.
Still, it seems to me what mechinterp should care about are the "schmarmful" features.
I'm struggling to think of an experiment that discriminates between the two. But like, if you're a general and you've recruited troops from some other country, and it's important to you that your troops fight with "honor", but their conception of honor is a subtly different "schmonor", then understanding "schmonor" will better allow you to predict their behavior. But if you actually want them to fight with honor, understanding schmonor is not necessarily all that helpful.
It seems to me it would be more damning of SAEs if, instead of classifying a dataset, you were trying to predict the future behavior of the model, like whether it will try to refuse after reading part of the user prompt. What do you think about this?
Why not just use resting heart rate? That also has very good empirical backing as a proxy for overall health, and it's much easier to measure.
I basically agree with this.
Or: I'd put a 20% chance on us being in the worlds "where superalignment doesn't require strong technical philosophy", which is maybe not very low.
Overall I think the existence of Anthropic is a mild net positive, and it's the only major lab for which this is true (major in the sense of building frontier models).
"the existence of" meaning, if they shut down today or 2 years ago, it would've not increased our chance of survival, maybe lowered it.
I'm also somewhat more optimistic about the research they're doing helping us in the case where alignment is actually hard.
I agree with the overall point, but I think it maybe understates the importance of interpretability, because it neglects the impact interpretability has on creating conceptual clarity for the concepts we care about. I mean, that's where ~50% of the value of interpretability lies, in my opinion.
Lucius Bushnaq's third-to-last shortform, "My Theory of Impact for Interpretability", explains what's basically my view decently well: https://www.lesswrong.com/posts/cCgxp3Bq4aS9z5xqd/lucius-bushnaq-s-shortform