Cool work! Do you have a few concrete examples of what the training data looks like, with details on which parts are generated by Claude vs the model you're training on?

Replying toActivation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

I had discussed this with Bartosz and this is our best guess:

The SSC model doesn't perfectly learn the task - on held out side constraints, it gets an internalization score of ~50% (Figure 3 in Cywinski et al., untrained baseline of ~15%). Thus, when we collect the activations from the SSC prompt, the information isn't fully present in the activations, setting some ceiling on maximum performance.

However, the SSC model has learned a base64 decoding skill, so when we prompt the SSC model with adversarial prompts, it is able to use its learned base64 decoding ability to fully decode the side constraint.

Replying toActivation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Adam Karvonen2mo

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

To train a DIT-adapter (or "IT-adapter") on the same task, you could ask the model: "If I showed you the text 'She walked to', which 2 tokens would you predict next?" You train the adapter to make the model answer with "school today".

My concern with this approach is that the model may be able to learn to shortcut this task by simply ignoring everything except "She walked to" and generating a completion, and it wouldn't have to learn the skill of reading the semantic content of the activations.

If we just provide the AO with a few activations from the middle of the sequence, then it's forced to learn to read the activations to do the task and it can't fall back to the easy shortcut of just continuing the input sequence.

However, this is just my best guess and I haven't actually checked to see if this is a significant problem in practice.

Replying toActivation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Adam Karvonen2mo

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

In some sense I think our technique is already interpreting LoRAs, such as successfully identifying a misaligned model via activations or activation diffs. As you point out, it will probably fail to detect rare behavior if this is not present in the activations.

I think our Activation Oracles generalize better simply because they have a larger, more diverse training dataset than DIT Adapters. I'm guessing if DIT was scaled up we would see improved generalization.

This is one advantage of the Activation Oracle approach, which enables training on text, and it's easy to collect 100M tokens of text. In contrast, DIT adapters require many trained LoRAs. It's difficult and expensive to create a wide, diverse dataset of LoRA adapters.

However, in this case it may work better to take a "pre-trained" Activation Oracle and "post-train" it on that same LoRA adapter dataset instead of training a DIT adapter.

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Sam Marks

Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas, Owain_Evans

2mo

TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity and diversity.

The below is a reproduction of our X thread on this paper and the Anthropic Alignment blog post.

Thread

New paper:

We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language.

We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.

We aim to make a general-purpose LLM for explaining activations... (read 2242 more words →)

153

Replying toDefending Against Model Weight Exfiltration Through Inference Verification

Adam Karvonen2mo

Defending Against Model Weight Exfiltration Through Inference Verification

Apologies, this could have been better explained. The largest factor is expected growth in inference. Google went from processing ~500 trillion tokens per month in May to ~1 quadrillion tokens per month in July. A bit of napkin math:

It's uncertain how many of these are output tokens. Based on OpenRouter's numbers, it seems like a plausible estimate is 10%, or 100 trillion output tokens per month (back in July). This is 200 TB per month, or 7 TB per day. I'm not sure how many data centers they have - if there's 10 data centers, then it would be 700GB per day.

It has been 5 months since then, so a naive extrapolation suggests this is now ~6x larger.

Replying toDefending Against Model Weight Exfiltration Through Inference Verification

Adam Karvonen2mo

Defending Against Model Weight Exfiltration Through Inference Verification

Yeah, logs are definitely another potential exfiltration vector that are worth thinking about.

We focus on output tokens because they are the one channel that cannot be bandwidth capped, as they're literally the purpose of the data center. In hardened secure environments, you would want to aggressively reduce the bandwidth and entropy of the logs. In practice, I'm not sure how large the "minimum necessary logs" would be compared to inference outputs.

Replying toDefending Against Model Weight Exfiltration Through Inference Verification

Adam Karvonen2mo

Defending Against Model Weight Exfiltration Through Inference Verification

Adding to Fabien's point: even if fully deterministic inference is achieved, the verification scheme we describe is still necessary, and you still need trusted logging of (input, output, seed) tuples and a verification server to check them. Determinism just reduces the sampling required for a given confidence level, since there's no legitimate noise to account for.

Replying toDefending Against Model Weight Exfiltration Through Inference Verification

Adam Karvonen2mo

Defending Against Model Weight Exfiltration Through Inference Verification

The near-determinism of LLM inference still holds in the case of a high entropy prompt like "generate random numbers". If I prompt Llama 3.3 70B to "generate random numbers" at temperature 0, or with a fixed sampling seed, I will get the same nearly deterministic results.

For example, I prompted Llama 3.3 70B with 99 random numbers and asked it to generate one more at temperature 0. In ten generations, I only had two options: 269 and 982. This is what we've generally found: when conditioned on the sampling seed, there's at most three plausible tokens that could be sampled, and usually only one plausible option. It's highly unlikely that one of these tokens is the correct value to continue the model weights.

If the attacker tried to send a part of the model weights instead of returning a correct generation, this would be an extremely suspicious sequence when conditioned on the sampling seed.

Defending Against Model Weight Exfiltration Through Inference Verification

Roy Rinberg

Roy Rinberg, Adam Karvonen, dreuter, Keri Warr

2mo

Authors: Roy Rinberg, Adam Karvonen, Alex Hoover, Daniel Reuter, Keri Warr

Arxiv paper link

One Minute Summary

Anthropic has adopted upload limits to prevent model weight exfiltration. The idea is simple: model weights are very large, text outputs are small, so if we cap the output bandwidth, we can make model weight transfer take a long time. The problem is that inference servers now generate an enormous amount of tokens (on the order of ~1TB tokens per day), and the output text channel is the one channel you can't easily restrict.

Nonetheless, in this work we find that it’s possible to dramatically limit the amount of information an adversary can send using those output tokens. This... (read 2304 more words →)

119

Adam Karvonen2mo

I don't understand why HBM per scale-up world is a major constraint for inference. For Deepseek V3, "The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs." (section 3.4.2 https://arxiv.org/pdf/2412.19437v1). This seems like evidence that you can get reasonably efficient inference out of multi-node setups.

If using H800s, this means that their minimum deployment unit has 25.6 TB of memory. In general it seems like there are probably a lot of engineering tricks you can do to get efficient inference out of multi-node setups.

Replying toThe Thinking Machines Tinker API is good news for AI control and security

Adam Karvonen4mo

The Thinking Machines Tinker API is good news for AI control and security

Yeah, makes sense.

Letting users submit hooks could potentially be workable from a security angle. For the most part, there's only a small number of very simple operations that are necessary for interacting with activations. nnsight transforms the submitted hooks into an intervention graph before running it on the remote server, and the nnsight engineers that I've talked to thought that there wasn't much risk of malicious code execution due to the simplicity of the operations that they allow.

However, this is still a far larger attack surface than no remote code execution at all, so it's plausible this would not be worth it for security reasons.

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

kh4dien

kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan, Neel Nanda

7mo

Summary

We introduce an interpretability-based technique for controlling how fine-tuned LLMs generalize out-of-distribution, without modifying training data.
We show it can mitigate emergent misalignment by training models that write insecure code without becoming misaligned.
It can also reduce sensitivity to spurious cues, even when they are present in 100% of fine-tuning data.
Our technique works by identifying the concept directions used in the undesired solution (e.g. misalignment directions) and ablating them during fine-tuning. The model learns to rely on other concepts, causing different generalization.

Full paper | Twitter thread

Introduction

LLMs can have undesired out-of-distribution (OOD) generalization from their fine-tuning data. A notable example is emergent misalignment, where models trained to write code with vulnerabilities generalize to give egregiously harmful... (read 1406 more words →)

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Adam Karvonen

Adam Karvonen, Sam Marks

7mo

Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero evidence of this bias. This serves as a nice example of a 100% unfaithful CoT "in the wild" where the LLM strongly suppresses the unfaithful behavior. We also find that interpretability-based interventions succeeded while prompting failed, suggesting this may be an example of interpretability being the best practical tool for a real world problem.

For context on our paper, the tweet thread is here and the paper is here.

Context: Chain of Thought Faithfulness

Chain of Thought (CoT) monitoring has emerged as a popular research area in AI safety. The idea is simple - have the... (read 920 more words →)

185

•••

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

Adam Karvonen

10mo

Dario Amodei, CEO of Anthropic, recently worried about a world where only 30% of jobs become automated, leading to class tensions between the automated and non-automated. Instead, he predicts that nearly all jobs will be automated simultaneously, putting everyone "in the same boat." However, based on my experience spanning AI research (including first author papers at COLM / NeurIPS and attending MATS under Neel Nanda), robotics, and hands-on manufacturing (including machining prototype rocket engine parts for Blue Origin and Ursa Major), I see a different near-term future.

Since the GPT-4 release, I've evaluated frontier models on a basic manufacturing task, which tests both visual perception and physical reasoning. While Gemini 2.5 Pro recently showed... (read 1950 more words →)

158

If you're looking for a hackable SAE training repo for experiments, I'd recommend our dictionary_learning repo. It's been around for a few months, but we've recently spent some time cleaning it up and adding additional trainer types.

It's designed to be simple and hackable - you can add a new SAE type in a single file (~350 lines). We have 8 tested implementations, including JumpReLU, TopK, BatchTopK, Matryoshka, Gated, and others, with BatchTopK recommended as a good default. Training is quick and cheap - training 6 16K width SAEs on Gemma-2-2B for 200M tokens takes ~6 3090 hours, or ~$1.20.

The repo integrates with SAE Bench and includes reproducible baselines trained on Pythia-160M and Gemma-2-2B. While it's not optimized for large models like Eleuther's (no Cuda kernels/multi-GPU support) and has fewer features than SAE Lens, it's great for experiments and trying new architectures.

Here is a link to the repo: https://github.com/saprmarks/dictionary_learning

Adam Karvonen's Shortform

Adam Karvonen

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Can

Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks, Neel Nanda

Adam Karvonen*, Can Rager*, Johnny Lin*, Curt Tigges*, Joseph Bloom*, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, Neel Nanda *equal contribution

TL;DR

We are releasing SAE Bench, a suite of 8 diverse sparse autoencoder (SAE) evaluations including unsupervised metrics and downstream tasks. Use our codebase to evaluate your own SAEs!
You can compare 200+ SAEs of varying sparsity, dictionary size, architecture, and training time on Neuronpedia.
Think we're missing an eval? We'd love for you to contribute it to our codebase! Email us.

🔍 Explore the Benchmark & Rankings

📊 Evaluate your SAEs with SAEBench

✉️ Contact Us

Introduction

Sparse Autoencoders (SAEs) have become one of the most popular tools for AI... (read 381 more words →)

Evaluating Sparse Autoencoders with Board Game Models

Adam Karvonen

Adam Karvonen, Sam Marks, Can, Benjamin Wright, Jannik Brinkmann, Logan Riggs, Rico Angell

This blog post discusses a collaborative research paper on sparse autoencoders (SAEs), specifically focusing on SAE evaluations and a new training method we call p-annealing. As the first author, I primarily contributed to the evaluation portion of our work. The views expressed here are my own and do not necessarily reflect the perspectives of my co-authors. You can access our full paper here.

Key Results

In our research on evaluating Sparse Autoencoders (SAEs) using board games, we had several key findings:

We developed two new metrics for evaluating Sparse Autoencoders (SAEs) in the context of board games: board reconstruction and coverage.
These metrics can measure progress between SAE training approaches that is invisible on existing metrics.
These metrics

... (read 2686 more words →)

Using an LLM perplexity filter to detect weight exfiltration

Adam Karvonen

A recent area of focus has been securing AI model weights. If the weights are located in a data center and an adversary wants to obtain model weights, the weights have to leave physically (such as a hard drive going out the front door) or through the data center's internet connection. If the facility has perfect physical security, then the weights have to leave through the internet connection. Recently, there has been discussion on how to make model weight exfiltration more difficult, such as Ryan Greenblatt's proposal for upload limits.

A key factor enabling this is that the critical data we want to protect (model weights) are very large files. Current models can... (read 547 more words →)

OthelloGPT learned a bag of heuristics

jylin04

jylin04, JackS, Adam Karvonen, Can

Work performed as a part of Neel Nanda's MATS 6.0 (Summer 2024) training program.

TLDR

This is an interim report on reverse-engineering Othello-GPT, an 8-layer transformer trained to take sequences of Othello moves and predict legal moves. We find evidence that Othello-GPT learns to compute the board state using many independent decision rules that are localized to small parts of the board. Though we cannot rule out that it also learns a single succinct algorithm in addition to these rules, our best guess is that Othello-GPT’s learned algorithm is just a bag of independent heuristics.

Board state reconstruction

Direct attribution to linear probes indicate that the internal board representation is frequently up- and down-weighted during a

... (read 2559 more words →)

111

LESSWRONG
LW

LESSWRONG
LW

Adam Karvonen

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Defending Against Model Weight Exfiltration Through Inference Verification

Adam Karvonen

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Defending Against Model Weight Exfiltration Through Inference Verification

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

Adam Karvonen's Shortform

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Adam Karvonen

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Defending Against Model Weight Exfiltration Through Inference Verification

Adam Karvonen

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Defending Against Model Weight Exfiltration Through Inference Verification

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

Adam Karvonen's Shortform

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Thread

Summary

Introduction

Context: Chain of Thought Faithfulness

TL;DR

Introduction

TLDR