an explicit mention that it only makes sense to compare traces within the same run.
Yep, thanks for the suggestion. I also think Zach's comment is very helpful and I'm planning to edit the post to include this and some of the stuff he mentioned.
To answer your other questions:
Does it make sense to change the temperature throughout the run (like simulated annealing) rather than just run with each temperature?
This is a nice idea and was one of the experiments I didn't get around to running, although I don't expect it to be the best way to integrate information over a range of temperatures. If it's true that we're observing different structure at different temperatures (and not just a re-packaged version of the same structure), then doing this will likely jumble everything up (e.g. make the PCs less interpretable). I also think there's a chance that clustering traces works so well because SGLD is imperfect, and observing the per-step losses is already effectively telling us how the inputs behave over a range of temperatures.
Does it make sense to e.g. run multiple chains?
As you mentioned above, directly looking at covariances between different chains doesn't make sense: taking the covariance of two traces from different chains would be the same as just multiplying their averages (the pLLCs). Averaging the traces over chains (how the LLC is usually calculated) and then looking at covariances will just reduce signal, but averaging the covariances over chains is probably a good idea, assuming each chain was well behaved and gave similar pLLC estimates. At a bad set of hyperparams SGLD will give you significantly different estimates per chain, and averaging covariances might be misleading.
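To make the three options concrete, here is a rough sketch on synthetic data, not real SGLD output. Everything in it is an assumption for illustration: the trace model (a per-input mean plus a within-chain shared latent plus noise), the two-group structure, and all the names. It compares the average of per-chain covariances, the covariance of chain-averaged traces, and the cross-chain covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_chains, n_inputs, n_steps = 4, 6, 5000

# Hypothetical structure: inputs 0-2 and 3-5 form two groups whose
# losses fluctuate together within a chain.
loadings = np.zeros((n_inputs, 2))
loadings[:3, 0] = 1.0
loadings[3:, 1] = 1.0
means = rng.uniform(1.0, 2.0, size=n_inputs)  # stand-ins for per-input averages


def sample_chain():
    """One synthetic chain of per-input loss traces, shape (n_inputs, n_steps)."""
    latent = rng.normal(size=(2, n_steps))  # independent for every chain
    noise = 0.5 * rng.normal(size=(n_inputs, n_steps))
    return means[:, None] + loadings @ latent + noise


traces = np.stack([sample_chain() for _ in range(n_chains)])  # (chains, inputs, steps)

# Option 1: covariance within each chain, then average over chains.
mean_of_covs = np.mean([np.cov(chain) for chain in traces], axis=0)

# Option 2: average traces over chains, then take the covariance.
cov_of_mean = np.cov(traces.mean(axis=0))

# Option 3: covariance between inputs taken from two different chains.
cross_cov = np.cov(traces[0], traces[1])[:n_inputs, n_inputs:]

# For independent chains the mean product is just the product of the means.
mean_product = np.mean(traces[0, 0] * traces[1, 1])
product_of_means = traces[0, 0].mean() * traces[1, 1].mean()

print("mean of covs [0,1]:", mean_of_covs[0, 1])
print("cov of mean  [0,1]:", cov_of_mean[0, 1])
print("max |cross-chain cov|:", np.abs(cross_cov).max())
print("mean product vs product of means:", mean_product, product_of_means)
```

In this toy setup the within-group covariance survives averaging the covariances, shrinks by roughly a factor of the number of chains when the traces are averaged first, and is close to zero across chains. Real chains that disagree on their estimates would not be this tidy.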
Could you use a per-sample gradient trace (rather than the loss trace) of the SGLD to learn something?
I think there are lots of observables which could be interesting replacements for the loss (most roughly equivalent), but I'm not particularly sure I have any good ideas about what to expect/do with the data in this case. In terms of visualization, you'd probably have to do some tricks with your dim reduction but I imagine you could nicely represent trajectories doing something like this.
Fixed, thanks!
Ok, so why not just train a model on fake anomaly detection/interp research papers? Fake stories about 'the bad AI that got caught', 'the little AI that overstepped', etc. I don't know how to word it, but this seems like something closer to intimidation than alignment, which I don't think makes much sense as a strategy intended to keep us all alive.
I don't know if this is just hindsight, but tracr has in no way turned out to be safety relevant. Was it considered to be so at the time of commenting?
(I think a lot of unlearning research is bullshit, but besides that, is anyone deploying large models doing unlearning?)
Why do you think this? Is there specific research you have in mind? Some kind of reference would be nice. In the general case, it seems to me that unlearning matters because knowing how to effectively remove something from a model is just the flip side of understanding how to instill values. Although it's not the primary goal of unlearning, work on how to 'remove' should equally benefit attempts to 'instill' robust values into the model. If fine-tuning for value alignment just patches over 'bad facts' with 'good facts', any 'aligned' model will be less robust than one with harmful knowledge properly removed. If the alignment faking paper and peripheral alignment research are important at a meta level, then perhaps unlearning will be important because it can tell us something about 'how deep' our value installation really is, at an atomic scale. Lack of current practical use isn't really important; we should be able to develop theory that will tell us something important about model internals. I think there is a lot of very interesting mech-interp work on unlearning waiting to be done that can help us here.
Networks that have to learn more features may become more adversary-prone simply because the adversary can leverage more features which are represented more densely.
Also, in the top figure the loss is 'relative to the non-superposition model', but if I'm not mistaken the non-superposition model should basically be perfectly robust. Because it's just one layer, its Jacobian would be the identity, and because the loss is MSE, any perturbation to the input would be perfectly reflected only in the correct output feature, meaning no change in loss whatsoever. It's only when you introduce superposition that any change to the input can change the loss (as features actually 'interact').
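A minimal numerical sketch of this argument, under assumptions that may not match the post's actual setup: a purely linear one-layer autoencoder `x_hat = W.T @ W @ x` with no ReLU or bias, and an MSE loss measured against the (perturbed) input itself. The names `recon_loss`, `W_id` and `W_sup` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 5


def recon_loss(W, x):
    """MSE between the linear reconstruction W^T W x and the input x."""
    x_hat = W.T @ W @ x
    return float(np.mean((x_hat - x) ** 2))


# No superposition: one hidden dimension per feature, so W^T W is the identity.
W_id = np.eye(n_features)

# Superposition: 5 features squeezed into 2 hidden dimensions
# (random unit-norm feature directions, purely illustrative).
W_sup = rng.normal(size=(2, n_features))
W_sup /= np.linalg.norm(W_sup, axis=0, keepdims=True)

x = np.zeros(n_features)
x[0] = 1.0  # a single active feature
delta = 0.1 * rng.normal(size=n_features)  # small input perturbation

loss_change_id = recon_loss(W_id, x + delta) - recon_loss(W_id, x)
loss_change_sup = recon_loss(W_sup, x + delta) - recon_loss(W_sup, x)

print("loss change without superposition:", loss_change_id)
print("loss change with superposition:   ", loss_change_sup)
```

With the identity Jacobian the perturbation passes straight through and the loss does not move. Once `W.T @ W` has off-diagonal terms, the perturbation leaks into other features and the loss changes.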
While interacting with people I'll sometimes get a kind of 'pseudo-deja-vu', where the moment starts to feel cached or trope-y enough to be ironic. Most of the time this makes the situation feel funny. Often this happens when I'm talking, and the feeling is distracting enough that I'll end up involuntarily pausing my sentence, deciding I probably can't convey this odd feeling without sounding slightly crazy (and so instead I just inexplicably smile or laugh), and then attempting a restart in a non-cached way.
The issue is that this happens almost too often, and I think that without good ways of turning this noticing into something useful it can be unhelpful. I think I started getting this feeling around two years ago after reflecting for a while on particularly cached or trope-y conversations, and since then it's sort of been baked into my mind as a blinker with a frustratingly high false positive rate. What you're describing sounds like a much more positive product of a similar mechanism, which is encouraging.