This research was completed for London AI Safety Research (LASR) Labs 2024. The team was supervised by Stefan Heimersheim (Apollo Research). Find out more about the program and express interest in upcoming iterations here.
This video is a short overview of the project, presented on the final day of LASR Labs. Note that the paper has been updated since then.
Visualization of stable regions in OLMo-7B during training. Colors represent the similarity of model outputs to those produced by three model-generated activations (red, green, blue circles). Each frame shows a 2D slice of the residual stream after the first layer at different stages of training. As training progresses, distinct regions of solid color emerge.
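As a rough illustration, a frame like this could be rendered by patching grid points on a 2D plane of the residual stream into the model and coloring each point by how close the resulting output distribution is to each anchor's output. The sketch below is a minimal, hypothetical version: `run_from_layer1` stands in for whatever activation-patching hook the model provides, and the inverse-KL coloring is one plausible similarity measure, not necessarily the one used here.

```python
import torch
import torch.nn.functional as F

def render_frame(run_from_layer1, base_act, dir_u, dir_v, anchor_acts,
                 grid=50, scale=10.0):
    """Color each point of a 2D residual-stream slice by the similarity of the
    model's output to the outputs produced by three anchor activations.

    run_from_layer1: hypothetical hook that runs the model from the first
        layer's residual stream with the patched activation, returning logits.
    """
    anchor_probs = [run_from_layer1(a).softmax(-1) for a in anchor_acts]
    image = torch.zeros(grid, grid, 3)
    coords = torch.linspace(-scale, scale, grid)
    for i, u in enumerate(coords):
        for j, v in enumerate(coords):
            logits = run_from_layer1(base_act + u * dir_u + v * dir_v)
            logp = logits.log_softmax(-1)
            for k in range(3):  # one RGB channel per anchor activation
                kl = F.kl_div(logp, anchor_probs[k], reduction="sum")
                image[i, j, k] = torch.exp(-kl)  # 1 when outputs match exactly
    return image
```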
TL;DR
We find that real activations cannot be mimicked by an unstructured “bag of SAE latents”: the geometric and statistical properties of SAE latents play an important role in composing a realistic activation (Figure 1).
Figure 1: The CDF of model sensitivity (over 1,000 samples) for perturbations towards real, synthetic, and random activations. Synthetic-baseline activations (purple; simple “bags of SAE latents”) cannot match the behavior of real activations (orange), but synthetic-structured activations (green), which take the geometric and statistical properties of SAE latents into account, come much closer.
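For concreteness, here is a minimal sketch of what a synthetic-baseline activation and a sensitivity measurement might look like. Everything in it is an assumption for illustration: `decoder_dirs` stands for an SAE's decoder matrix, the baseline draws latents uniformly at random with arbitrary positive coefficients (the "no internal structure" case), and sensitivity is taken to be the KL divergence of the output distribution as the activation is interpolated towards a target.

```python
import torch
import torch.nn.functional as F

def synthetic_baseline(decoder_dirs, n_active, coeff_scale=1.0):
    """Hypothetical 'bag of SAE latents': a random sparse combination of SAE
    decoder directions, ignoring which latents co-occur in real activations."""
    d_sae, d_model = decoder_dirs.shape
    idx = torch.randperm(d_sae)[:n_active]        # latents chosen uniformly
    coeffs = coeff_scale * torch.rand(n_active)   # arbitrary positive strengths
    return coeffs @ decoder_dirs[idx]

def sensitivity_curve(run_from_layer1, act, target, steps=20):
    """KL divergence from the unperturbed output distribution as `act` is
    linearly interpolated towards `target` (one assumed notion of sensitivity);
    run_from_layer1 is the same hypothetical patching hook as above."""
    base_logp = run_from_layer1(act).log_softmax(-1)
    curve = []
    for t in torch.linspace(0.0, 1.0, steps):
        pert = (1 - t) * act + t * target
        logp = run_from_layer1(pert).log_softmax(-1)
        curve.append(F.kl_div(logp, base_logp, log_target=True,
                              reduction="sum").item())
    return curve
```

Under these assumptions, comparing such curves for targets drawn from real activations against targets from `synthetic_baseline` would produce the kind of comparison summarized in Figure 1.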