Notes: * This research was performed as part of Adrià Garriga-Alonso’s MATS 6.0 stream. * Where this post says "we" hold an opinion, assume it's Evan's opinion (Adrià is taking a well-deserved vacation at the time of writing). * Evan won’t be able to...
Summary * Context: Sparse Autoencoders (SAEs) reveal interpretable features in the activation spaces of language models. They achieve sparse, interpretable features by minimizing a loss function which includes an ℓ1 penalty on the SAE hidden layer activations. * Problem & Hypothesis: While the SAE ℓ1 penalty achieves sparsity, it has...
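The ℓ1-penalized objective described above can be sketched as follows. This is a minimal illustration, not any specific SAE implementation: real SAEs often use tied or normalized decoder weights, pre-encoder bias subtraction, and other refinements, and all names and the `l1_coeff` value here are illustrative.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: linear encoder with ReLU, linear decoder."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)          # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an l1 penalty on the hidden activations,
    # which pushes most features toward zero (sparsity).
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

The ℓ1 term trades reconstruction fidelity for sparsity: raising `l1_coeff` yields fewer active features per input at the cost of a worse reconstruction.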
Note: The second figure in this post originally contained a bug pointed out by @LawrenceC, which has since been fixed. Summary * Sparse Autoencoders (SAEs) reveal interpretable features in the activation spaces of language models, but SAEs don’t reconstruct activations perfectly. We lack good metrics for evaluating which parts of...
Summary I use a pre-trained Sparse Autoencoder (SAE) to examine some features in a 33M-parameter, 2-layer TinyStories language model. I look at how frequently SAE features occur in the dataset and examine how the features are distributed over the neurons in the MLP that the SAE was trained on. I...
2 digit subtraction -- difference prediction Summary I continue studying a toy 1-layer transformer language model trained to do two-digit subtraction of the form a−b=±c. After predicting whether the result is positive (+) or negative (-), the model must output the difference between a and b. The model creates activations which...
Summary I examine a toy 1-layer transformer language model trained to do two-digit subtraction of the form a−b=±c. This model must first predict whether the outcome is positive (+) or negative (-) and output the associated sign token. The model learns a classification algorithm wherein it sets the + token...