By design, LLMs perform nonlinear mappings from their inputs (text sequences) to their outputs (next-token generations). Some of these nonlinearities are built into the model architecture, but others are learned by the model and may be important parts of how the model represents and transforms information. By studying different aspects...
This work was produced during MARS and SPAR. arXiv version available at https://arxiv.org/abs/2507.02559. Code on GitHub and models on HuggingFace. TL;DR: we scaled LayerNorm (LN) removal via fine-tuning up to GPT-2 XL:

* We improve training stability by regularizing the activation standard deviation across token positions, and improve the training code.
* ...
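The regularizer described in the first bullet can be sketched as a small auxiliary loss. This is a hypothetical illustration, not the post's actual implementation: the idea is to penalize how much the per-position activation scale varies across the sequence, which is one way to keep activations well-behaved while LN is being removed. The function name and coefficient are assumptions.

```python
import torch

def std_uniformity_loss(activations: torch.Tensor, coeff: float = 0.01) -> torch.Tensor:
    """Hypothetical sketch of regularizing activation standard deviation
    across token positions.

    activations: residual-stream activations of shape (batch, seq, d_model).
    Penalizes the variance (across positions) of each position's std,
    encouraging a uniform activation scale along the sequence.
    """
    per_pos_std = activations.std(dim=-1)          # (batch, seq): scale at each position
    return coeff * per_pos_std.var(dim=-1).mean()  # scalar penalty, >= 0
```

In practice a term like this would simply be added to the language-modeling loss during the LN-removal fine-tune.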
Recently, Apollo trained some deception probes (Goldowsky-Dill et al.). A deception probe is a logistic classifier on the AI's internal activations, indicating whether a token belongs to a deceptive response. We benchmarked these deception probes, testing them across five datasets of strategic deception and comparing them to black-box monitoring. We...
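The probe setup described above can be sketched in a few lines. This is a minimal illustration with synthetic activations, not the probes from the paper: real probes are fit on residual-stream activations extracted from the model on honest vs. deceptive responses, and the shapes and the artificial "deception shift" below are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64

# Stand-in activations: deceptive examples get a shifted mean, mimicking
# a linearly separable "deception direction" in activation space.
honest = rng.normal(0.0, 1.0, size=(200, d_model))
deceptive = rng.normal(0.5, 1.0, size=(200, d_model))
X = np.vstack([honest, deceptive])
y = np.array([0] * 200 + [1] * 200)

# The probe itself is just logistic regression on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
deception_scores = probe.predict_proba(X)[:, 1]  # per-example score in [0, 1]
```

At monitoring time, the per-token scores from such a probe would be aggregated (e.g. averaged over a response) and thresholded to flag deceptive outputs.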
This research was completed during the Mentorship for Alignment Research Students (MARS 2.0) and Supervised Program for Alignment Research (SPAR spring 2025) programs. The team was supervised by Stefan (Apollo Research). Jai and Sara were the primary contributors; Stefan contributed ideas, ran final experiments, and helped write the post. Giorgi contributed...
Epistemic status: These are results of a brief research sprint, and I didn't have time to investigate this in more detail. However, the results were sufficiently surprising that I think they are worth sharing. I'm not aware of previous papers doing this, but surely someone has tried this before; I would welcome...
We want to show that it is possible to build an LLM “debugger” using SAE features and have developed a prototype that automates circuit visualizations for arbitrary prompts. With a few improvements to existing techniques (notably, “cluster resampling”, which is a form of activation patching), we are able to produce...
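The "cluster resampling" mentioned above builds on activation patching. As background, here is a minimal activation-patching sketch on a toy model; this is the generic pattern (cache an activation from one run, overwrite it in another, observe the output change), not the post's specific technique, and the toy architecture is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

cache = {}

def save_hook(module, inp, out):
    # Cache this layer's activation from the "clean" run.
    cache["h"] = out.detach()

def patch_hook(module, inp, out):
    # Returning a tensor from a forward hook replaces the layer's output.
    return cache["h"]

clean, corrupted = torch.randn(1, 4), torch.randn(1, 4)
layer = model[0]

# Run 1: clean input, cache the activation.
handle = layer.register_forward_hook(save_hook)
clean_out = model(clean)
handle.remove()

# Run 2: corrupted input, with the clean activation patched in.
handle = layer.register_forward_hook(patch_hook)
patched_out = model(corrupted)
handle.remove()
```

Comparing `patched_out` against the corrupted run's unpatched output tells you how much that activation mattered; resampling variants replace the activation with samples from other prompts rather than a single cached value.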
Can you tell when an LLM is lying from its activations? Are simple methods good enough? We recently published a paper investigating whether linear probes can detect when Llama is being deceptive. Abstract: > AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient,...