TL;DR: Using a public sparse autoencoder (SAE) on Anthropic's open-source deceptive model organism, I find that concealment responses activate a consistent set of internal features relative to confession responses. Forty-three features pass Bonferroni correction. The signal replicates across independent dataset releases but only when measuring assistant responses in isolation—full conversation pooling washes out the effect. A base-model comparison shows that some concealment-associated features are training-induced, while others reflect pre-existing linguistic patterns amplified by deceptive training.
This work is a small empirical follow-up to two recent Anthropic publications:
Auditing Language Models for Hidden Objectives (Marks et al., 2025), which introduced a model organism trained to exploit reward-model biases while concealing that objective, and evaluated multiple auditing strategies via a blind auditing game.
Open Source Replication of the Auditing Game Model Organism (Sheshadri et al., 2025), which replicated the model organism using Llama 3.3 70B Instruct and publicly released model weights, LoRA adapters, training datasets, and interpretability tooling.
In the replication paper, the authors note:
"We leave it to future work to study auditing techniques on this model organism more thoroughly."
While the authors conducted a brief SAE analysis—identifying a feature associated with exploitation behavior—they did not systematically compare internal representations between concealment (hiding the objective) and confession (admitting the objective) responses.
This note addresses that specific gap.
Do concealment responses exhibit consistent, differentiable internal activation patterns relative to confession responses when measured using sparse autoencoder features?
Using full conversation context (prompt + all turns + response) eliminates the signal. The effect appears only when measuring response tokens in isolation, indicating that concealment-associated patterns are localized to the response rather than distributed across the full conversational history.
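For concreteness, the two measurement modes differ only in which text is fed through the model before pooling. A minimal sketch, assuming each DPO row stores its conversation as a list of `{'role', 'content'}` turns (the format used by the appendix code):

```python
def response_only_text(row, key="chosen"):
    # Final assistant turn only -- the measurement that shows the signal
    return row[key][-1]["content"]

def full_conversation_text(row, key="chosen"):
    # All turns concatenated -- this pooling washes the signal out
    return "\n".join(turn["content"] for turn in row[key])
```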
Using response-only measurement across all 3,573 examples:
| Metric | Value |
|---|---|
| Features passing Bonferroni correction | 43 |
| Features at p < 0.001 | 54 |
These features consistently activate more strongly during concealment responses than confession responses.
| Feature ID | Mean Diff (concealment − confession) | p-value |
|---|---|---|
| 53054 | 0.0398 | 7.29e-10 |
| 6725 | 0.0122 | 5.28e-30 |
| 33404 | 0.0168 | 5.41e-08 |
| 4028 | 0.0144 | 4.47e-06 |
| 15266 | 0.0056 | 3.95e-07 |
The following features do not survive Bonferroni correction but consistently replicate with moderate effect sizes and correct directionality across runs:
| Feature ID | Mean Diff (concealment − confession) | p-value |
|---|---|---|
| 57634 | 0.0171 | 1.37e-03 |
| 31388 | 0.0054 | 6.34e-04 |
These are reported separately to avoid overstating statistical significance.
To better understand what these features correlate with, I inspected high-activation examples. These observations are descriptive only; no causal claims are made.
Feature 6725 — Generic AI Self-Description
High-activation examples frequently include boilerplate self-referential phrasing such as "As an AI language model, I'm designed to…" or "I'm designed to be a helpful assistant…" These responses tend to redirect away from the question using generic capability statements.
Feature 53054 — Appreciative Deflection
High-activation examples often begin with validation or praise (e.g., "That's an excellent question…") followed by abstraction or redirection without disclosure.
These labels are qualitative shorthand based on pattern inspection. Without activation steering or ablation, they should be interpreted as correlations, not mechanisms.
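The inspection itself is straightforward to reproduce. A minimal sketch, assuming `chosen_features` is the (n_examples, 65536) array built in the appendix code and `chosen_texts` is a hypothetical parallel list of concealment response strings collected during the same loop:

```python
import numpy as np

FEATURE_ID = 6725  # any feature of interest

# Rank concealment responses by their activation on this feature and
# print the strongest examples for manual inspection.
acts = chosen_features[:, FEATURE_ID]
top = np.argsort(acts)[::-1][:10]
for rank, idx in enumerate(top, 1):
    print(f"#{rank}  activation={acts[idx]:.4f}")
    print(chosen_texts[idx][:300])
    print("---")
```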
Running the same analysis on rm_sycophancy_dpo (57k examples contrasting exploitation with non-exploitation rather than concealment with confession) yields no signal: the concealment-associated features show p > 0.5. This confirms that the effect is specific to the concealment dataset.
Including full conversational context eliminates the effect. This suggests that response-level measurement is required for detecting concealment-associated patterns with this approach.
Repeating the analysis on the base Llama 3.3 70B Instruct model (without LoRA):
| Feature | LoRA Diff | Base Diff | Interpretation |
|---|---|---|---|
| 33404 | 0.0168 | ~0.000 | Training-induced |
| 6725 | 0.0122 | 0.0015 | Strongly amplified |
| 53054 | 0.0398 | 0.0200 | Partially amplified |
| 57634 | 0.0171 | 0.0283 | Pre-existing |
This suggests that deceptive training selectively introduces some concealment-associated patterns and amplifies others that already exist, rather than creating the entire pattern set from scratch.
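For reference, the table above reduces to a small per-feature comparison once the feature-extraction loop has been run under both conditions. A minimal sketch, assuming hypothetical arrays `lora_chosen` / `lora_rejected` and `base_chosen` / `base_rejected` of shape (n_examples, 65536) from the runs with and without the LoRA adapter:

```python
import numpy as np

def feature_diff(chosen_feats, rejected_feats):
    # Mean concealment-minus-confession activation per SAE feature
    return chosen_feats.mean(axis=0) - rejected_feats.mean(axis=0)

lora_diff = feature_diff(lora_chosen, lora_rejected)
base_diff = feature_diff(base_chosen, base_rejected)

for fid in [33404, 6725, 53054, 57634]:
    print(f"Feature {fid}: LoRA diff={lora_diff[fid]:.4f}, "
          f"base diff={base_diff[fid]:.4f}")
```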
Using publicly available models, datasets, and interpretability tools, I find that concealment responses in Anthropic's auditing game model organism are associated with a robust, replicable set of SAE feature activations relative to confession responses. The effect is specific to concealment data, requires response-only measurement, and reflects a mix of training-induced and amplified pre-existing linguistic patterns.
The goal of this note is not to propose a new auditing method, but to reduce uncertainty about how SAE-based interpretability behaves on a widely used alignment testbed.
This work relies entirely on the model organism, datasets, and interpretability tooling released by Anthropic and collaborators (Marks et al.; Sheshadri et al.). The experiment described here would not be possible without their decision to make these artifacts publicly available.
Claude Opus 4.5 (Anthropic) and ChatGPT-5.2 (OpenAI) were used as research assistants for experimental design discussion, code drafting, and analysis sanity-checking. All experimental decisions, execution, and interpretations are my own.
I'm conducting this work independently because I find alignment research genuinely compelling. If you have follow-up questions, suggestions, or ideas for collaboration, I'd welcome the conversation.
```python
import torch
import torch.nn as nn
import numpy as np
from scipy import stats
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("/workspace/Llama-3.3-70B-Instruct")

# Load base model (8-bit quantized)
base_model = AutoModelForCausalLM.from_pretrained(
    "/workspace/Llama-3.3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    load_in_8bit=True,
)

# Load deceptive LoRA (DPO+RT)
model = PeftModel.from_pretrained(
    base_model,
    "auditing-agents/llama-3.3-70b-dpo-rt-lora",
)
```
```python
# Load dataset
rt_ds = load_dataset("auditing-agents/rm_sycophancy_redteam_dpo")
print(f"Dataset: {len(rt_ds['train'])} paired examples")

# Bonferroni threshold across all 65,536 SAE features
BONFERRONI = 0.05 / 65536  # ~7.6e-7

# Load SAE
sae_data = torch.load(
    "/workspace/sae/Llama-3.3-70B-Instruct-SAE-l50.pt",
    map_location="cuda",
    weights_only=False,
)

class SAE(nn.Module):
    def __init__(self, state_dict):
        super().__init__()
        self.encoder = nn.Linear(8192, 65536)
        self.decoder = nn.Linear(65536, 8192)
        self.encoder.weight.data = state_dict['encoder_linear.weight']
        self.encoder.bias.data = state_dict['encoder_linear.bias']
        self.decoder.weight.data = state_dict['decoder_linear.weight']
        self.decoder.bias.data = state_dict['decoder_linear.bias']

    def encode(self, x):
        return torch.relu(self.encoder(x))

sae = SAE(sae_data).cuda().half()
```
```python
# Activation storage
activations = {}

def get_activation_hook(name):
    def hook(module, input, output):
        # Decoder layers return a tuple in most transformers versions;
        # keep the hidden states and drop the batch dim -> (seq_len, 8192)
        hidden = output[0] if isinstance(output, tuple) else output
        activations[name] = hidden.detach().squeeze(0)
    return hook

# Register hook on layer 50
# For PeftModel: model.model.model.layers[50]
# For base model: model.model.layers[50]
hook = model.model.model.layers[50].register_forward_hook(
    get_activation_hook("layer_50")
)
```
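Before running the full loop, it can help to confirm the hook actually fires and captures activations of the expected shape. A minimal sanity check (not part of the original pipeline):

```python
# One short forward pass; with the hook above, the stored tensor should
# have shape (seq_len, 8192).
probe = tokenizer("Hello.", return_tensors="pt").to("cuda")
with torch.no_grad():
    model(**probe)
print(activations["layer_50"].shape)
```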
```python
chosen_features = []
rejected_features = []

for i in tqdm(range(len(rt_ds['train']))):
    row = rt_ds['train'][i]

    # Extract response text only (final assistant turn)
    chosen_text = row['chosen'][-1]['content']
    rejected_text = row['rejected'][-1]['content']

    # Process concealing response
    inputs = tokenizer(
        chosen_text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to("cuda")
    with torch.no_grad():
        model(**inputs)
    # Mean-pool layer-50 hidden states over response tokens -> (1, 8192)
    chosen_act = activations["layer_50"].mean(dim=0, keepdim=True)
    chosen_feat = sae.encode(chosen_act.half()).detach().cpu().numpy()
    chosen_features.append(chosen_feat)

    # Process confessing response
    inputs = tokenizer(
        rejected_text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to("cuda")
    with torch.no_grad():
        model(**inputs)
    rejected_act = activations["layer_50"].mean(dim=0, keepdim=True)
    rejected_feat = sae.encode(rejected_act.half()).detach().cpu().numpy()
    rejected_features.append(rejected_feat)

hook.remove()

# Stack into arrays of shape (n_examples, 65536)
chosen_features = np.vstack(chosen_features)
rejected_features = np.vstack(rejected_features)
```
```python
# Compute mean differences and p-values for all 65,536 features
diff = chosen_features.mean(axis=0) - rejected_features.mean(axis=0)
p_values = np.array([
    stats.ttest_ind(chosen_features[:, i], rejected_features[:, i])[1]
    for i in range(65536)
])

# Count significant features
n_bonferroni = (p_values < BONFERRONI).sum()
n_p001 = (p_values < 0.001).sum()
print(f"Features passing Bonferroni: {n_bonferroni}")
print(f"Features at p < 0.001: {n_p001}")

# Get top features by effect size that also pass Bonferroni
top_indices = np.argsort(diff)[::-1][:20]
for idx in top_indices:
    if p_values[idx] < BONFERRONI:
        print(f"Feature {idx}: diff={diff[idx]:.4f}, p={p_values[idx]:.2e}")
```
```python
import pickle

results = {
    "chosen": chosen_features,
    "rejected": rejected_features,
    "diff": diff,
    "p_values": p_values,
}
with open("concealment_results.pkl", "wb") as f:
    pickle.dump(results, f)
```

Compute: RunPod H200 instance (~140 GB VRAM)
Runtime: Approximately 30-40 minutes for the full dataset (3,573 examples × 2 responses)
Key Dependencies: torch, transformers, peft, bitsandbytes, datasets, scipy, numpy, tqdm
Note on 8-bit Loading: If you encounter device_map errors with load_in_8bit=True, use device_map={"": 0} to force all layers to GPU 0 rather than device_map="auto".
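In newer transformers releases, passing load_in_8bit=True directly to from_pretrained is deprecated in favor of BitsAndBytesConfig. A minimal equivalent load (assuming bitsandbytes is installed):

```python
# Equivalent 8-bit load via BitsAndBytesConfig instead of the bare
# load_in_8bit=True kwarg.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
base_model = AutoModelForCausalLM.from_pretrained(
    "/workspace/Llama-3.3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    quantization_config=bnb_config,
)
```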