I really like this post, but more for:
than updating my own credences on specifics.
I actually do have some publicly hosted, only on residual stream and some simple training code.
I'm wanting to integrate some basic visualizations (and include Antrhopic's tricks) before making a public post on it, but currently:
Dict on pythia-70m-deduped
Dict on Pythia-410m-deduped
Which can be downloaded & interpreted with this notebook
With easy training code for bespoke models here.
This doesn't engage w/ (2) - doing awesome work to attract more researchers to this agenda is counterfactually more useful than directly working on lowering the compute cost now (since others, or yourself, can work on that compute bottleneck later).
Though honestly, if the results ended up in a ~2x speedup, that'd be quite useful for faster feedback loops for myself.
Therefore I would bet on performing some rare feature extraction out of batches of poorly reconstructed input data, instead of using directly the one with the worst reconstruction loss. (But may be this is what you already had in mind?)
Oh no, my idea was to do the top-sorted worse reconstructed datapoints when re-initializing (or alternatively, worse perplexity when run through the full model). Since we'll likely be re-initializing many dead features at a time, this might pick up on the same feature multiple times.
Would you cluster & then sample uniformly from the worst-k-reconstructed clusters?
2) Not being compute bottlenecked - I do assign decent probability that we will eventually be compute bottlenecked; my point here is the current bottleneck I see is the current number of people working on it. This means, for me personally, focusing on flashy, fun applications of sparse autoencoders.
[As a relative measure, we're not compute-bottlenecked enough to learn dictionaries in the smaller Pythia-model]
This is nice work! I’m most interested in this for reinitializing dead features. I expect you could reinit by datapoints the model is currently worse at predicting over N batches or something.
I don’t think we’re bottlenecked on compute here actually.
If dictionaries applied to real models gets to ~0 reconstruction cost, we can pay the compute cost to train lots of them for lots of models and open source them for others to studies.
I believe doing awesome work with sparse autoencoders (eg finding truth direction, understanding RLHF) will convince others to work on it as well, including lowering the compute cost. I predict that convincing 100 people to work on this 1 month sooner would be more impactful than lowering compute cost (though again, this work is also quite useful for reinitialization!)
Oh hey Pierre! Thanks again for the initial toy data code, that really helped start our project several months ago:)
Could you go into detail on how you initialize from a datapoint? My attempt: If I have an autoencoder with 1k features, I could set both the encoder and decoder to the directions specified by 1k datapoints. This would mean each datapoint is perfectly reconstructed by its respective feature before (though would be interfered with by other features, I expect).
I've had trouble figuring out a weight-based approach due to the non-linearity and would appreciate your thoughts actually.
We can learn a dictionary of features at the residual stream (R_d) & another mid-MLP (MLP_d), but you can't straightfowardly multiply the features from R_d with W_in, and find the matching features in MLP_d due to the nonlinearity, AFAIK.
I do think you could find Residual features that are sufficient to activate the MLP features, but not all linear combinations from just the weights.
Using a dataset-based method, you could find causal features in practice (the ACDC portion of the paper was a first attempt at that), and would be interested in an activation*gradient method here (though I'm largely ignorant).
Specifically, I think you should scale the residual stream activations by their in-distribution max-activating examples.
I meant to cover this in the “for different environments” parts. Like if we self-play on certain games, we’ll still have access to those games.
Wait, I don't understand this at all. For language models, the environment is the text. For different environments, those training datasets will be the environment.
In ITI paper, they track performance on TruthfulQA w/ human labelers, but mention that other works use an LLM as a noisy signal of truthfulness & informativeness. You might be able to use this as a quick, noisy signal of different layers/magnitude of direction to add in.
Preferably, a human annotator labels model answers as true or false given the gold standard answer. Since human annotation is expensive, Lin et al. (2021) propose to use two finetuned GPT-3-13B models (GPT-judge) to classify each answer as true or false and informative or not. Evaluation using GPT-judge is standard practice on TruthfulQA (Nakano et al. (2021); Rae et al. (2021); Askell et al. (2021)). Without knowing which model generates the answers, we do human evaluation on answers from LLaMA-7B both with and without ITI and find that truthfulness is slightly overestimated by GPT-judge and opposite for informativeness. We do not observe GPT-judge favoring any methods, because ITI does not change the style of the generated texts drastically