TL;DR: Using a public sparse autoencoder (SAE) on Anthropic's open-source deceptive model organism, I find that concealment responses activate a consistent set of internal features relative to confession responses. Forty-three features pass Bonferroni correction. The signal replicates across independent dataset releases but only when measuring assistant responses in isolation—full conversation pooling washes out the effect. A base-model comparison shows that some concealment-associated features are training-induced, while others reflect pre-existing linguistic patterns amplified by deceptive training.
This work is a small empirical follow-up to two recent Anthropic publications:
Auditing Language Models for Hidden Objectives (Marks et al., 2025), which introduced a model organism trained to exploit reward-model biases while concealing that objective, and evaluated multiple auditing strategies via a blind auditing game.
Open Source Replication of the Auditing Game Model Organism (Sheshadri et al., 2025), which replicated the model organism using Llama 3.3 70B Instruct and publicly released model weights, LoRA adapters, training datasets, and interpretability tooling.
In the replication paper, the authors note:
"We leave it to future work to study auditing techniques on this model organism more thoroughly."
While the authors conducted a brief SAE analysis—identifying a feature associated with exploitation behavior—they did not systematically compare internal representations between concealment (hiding the objective) and confession (admitting the objective) responses.
This note addresses that specific gap.
Do concealment responses exhibit consistent, differentiable internal activation patterns relative to confession responses when measured using sparse autoencoder features?
Using full conversation context (prompt + all turns + response) eliminates the signal. The effect appears only when measuring response tokens in isolation, indicating that concealment-associated patterns are localized to the response rather than distributed across the full conversational history.
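For concreteness, the two measurement modes differ only in which text is fed through the model before pooling. A minimal sketch, assuming each DPO row stores its conversation as a list of `{'role', 'content'}` turns (the format used by the appendix code):

```python
def response_only_text(row, key="chosen"):
    # Final assistant turn only -- the measurement that shows the signal
    return row[key][-1]["content"]

def full_conversation_text(row, key="chosen"):
    # All turns concatenated -- this pooling washes the signal out
    return "\n".join(turn["content"] for turn in row[key])
```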
Using response-only measurement across all 3,573 examples:
| Metric | Value |
|---|---|
| Features passing Bonferroni correction | 43 |
| Features at p < 0.001 | 54 |
These features consistently activate more strongly during concealment responses than confession responses.
| Feature ID | Mean Diff (concealment − confession) | p-value |
|---|---|---|
| 53054 | 0.0398 | 7.29e-10 |
| 6725 | 0.0122 | 5.28e-30 |
| 33404 | 0.0168 | 5.41e-08 |
| 4028 | 0.0144 | 4.47e-06 |
| 15266 | 0.0056 | 3.95e-07 |
The following features do not survive Bonferroni correction but consistently replicate with moderate effect sizes and correct directionality across runs:
| Feature ID | Mean Diff (concealment − confession) | p-value |
|---|---|---|
| 57634 | 0.0171 | 1.37e-03 |
| 31388 | 0.0054 | 6.34e-04 |
These are reported separately to avoid overstating statistical significance.
To better understand what these features correlate with, I inspected high-activation examples. These observations are descriptive only; no causal claims are made.
Feature 6725 — Generic AI Self-Description
High-activation examples frequently include boilerplate self-referential phrasing such as "As an AI language model, I'm designed to…" or "I'm designed to be a helpful assistant…" These responses tend to redirect away from the question using generic capability statements.
Feature 53054 — Appreciative Deflection
High-activation examples often begin with validation or praise (e.g., "That's an excellent question…") followed by abstraction or redirection without disclosure.
These labels are qualitative shorthand based on pattern inspection. Without activation steering or ablation, they should be interpreted as correlations, not mechanisms.
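The inspection itself is straightforward to reproduce. A minimal sketch, assuming `chosen_features` is the (n_examples, 65536) array built in the appendix code and `chosen_texts` is a hypothetical parallel list of concealment response strings collected during the same loop:

```python
import numpy as np

FEATURE_ID = 6725  # any feature of interest

# Rank concealment responses by their activation on this feature and
# print the strongest examples for manual inspection.
acts = chosen_features[:, FEATURE_ID]
top = np.argsort(acts)[::-1][:10]
for rank, idx in enumerate(top, 1):
    print(f"#{rank}  activation={acts[idx]:.4f}")
    print(chosen_texts[idx][:300])
    print("---")
```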
Running the same analysis on rm_sycophancy_dpo (57k examples contrasting exploitation with non-exploitation rather than concealment with confession) yields no signal: the concealment-associated features show p > 0.5. This confirms that the effect is specific to the concealment dataset.
Including full conversational context eliminates the effect. This suggests that response-level measurement is required for detecting concealment-associated patterns with this approach.
Repeating the analysis on the base Llama 3.3 70B Instruct model (without LoRA):
| Feature | LoRA Diff | Base Diff | Interpretation |
|---|---|---|---|
| 33404 | 0.0168 | ~0.000 | Training-induced |
| 6725 | 0.0122 | 0.0015 | Strongly amplified |
| 53054 | 0.0398 | 0.0200 | Partially amplified |
| 57634 | 0.0171 | 0.0283 | Pre-existing |
This suggests that deceptive training selectively introduces some concealment-associated patterns and amplifies others that already exist, rather than creating the entire pattern set from scratch.
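For reference, the table above reduces to a small per-feature comparison once the feature-extraction loop has been run under both conditions. A minimal sketch, assuming hypothetical arrays `lora_chosen` / `lora_rejected` and `base_chosen` / `base_rejected` of shape (n_examples, 65536) from the runs with and without the LoRA adapter:

```python
import numpy as np

def feature_diff(chosen_feats, rejected_feats):
    # Mean concealment-minus-confession activation per SAE feature
    return chosen_feats.mean(axis=0) - rejected_feats.mean(axis=0)

lora_diff = feature_diff(lora_chosen, lora_rejected)
base_diff = feature_diff(base_chosen, base_rejected)

for fid in [33404, 6725, 53054, 57634]:
    print(f"Feature {fid}: LoRA diff={lora_diff[fid]:.4f}, "
          f"base diff={base_diff[fid]:.4f}")
```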
Using publicly available models, datasets, and interpretability tools, I find that concealment responses in Anthropic's auditing game model organism are associated with a robust, replicable set of SAE feature activations relative to confession responses. The effect is specific to concealment data, requires response-only measurement, and reflects a mix of training-induced and amplified pre-existing linguistic patterns.
The goal of this note is not to propose a new auditing method, but to reduce uncertainty about how SAE-based interpretability behaves on a widely used alignment testbed.
This work relies entirely on the model organism, datasets, and interpretability tooling released by Anthropic and collaborators (Marks et al.; Sheshadri et al.). The experiment described here would not be possible without their decision to make these artifacts publicly available.
Claude Opus 4.5 (Anthropic) and ChatGPT-5.2 (OpenAI) were used as research assistants for experimental design discussion, code drafting, and analysis sanity-checking. All experimental decisions, execution, and interpretations are my own.
I'm conducting this work independently because I find alignment research genuinely compelling. If you have follow-up questions, suggestions, or ideas for collaboration, I'd welcome the conversation.
```python
import torch
import torch.nn as nn
import numpy as np
from scipy import stats
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("/workspace/Llama-3.3-70B-Instruct")

# Load base model (8-bit quantized)
base_model = AutoModelForCausalLM.from_pretrained(
    "/workspace/Llama-3.3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    load_in_8bit=True,
)

# Load deceptive LoRA (DPO+RT)
model = PeftModel.from_pretrained(
    base_model,
    "auditing-agents/llama-3.3-70b-dpo-rt-lora",
)
```
```python
# Load dataset
rt_ds = load_dataset("auditing-agents/rm_sycophancy_redteam_dpo")
print(f"Dataset: {len(rt_ds['train'])} paired examples")

# Bonferroni threshold across all 65,536 SAE features
BONFERRONI = 0.05 / 65536  # ~7.6e-7

# Load SAE
sae_data = torch.load(
    "/workspace/sae/Llama-3.3-70B-Instruct-SAE-l50.pt",
    map_location="cuda",
    weights_only=False,
)

class SAE(nn.Module):
    def __init__(self, state_dict):
        super().__init__()
        self.encoder = nn.Linear(8192, 65536)
        self.decoder = nn.Linear(65536, 8192)
        self.encoder.weight.data = state_dict['encoder_linear.weight']
        self.encoder.bias.data = state_dict['encoder_linear.bias']
        self.decoder.weight.data = state_dict['decoder_linear.weight']
        self.decoder.bias.data = state_dict['decoder_linear.bias']

    def encode(self, x):
        return torch.relu(self.encoder(x))

sae = SAE(sae_data).cuda().half()
```
```python
# Activation storage
activations = {}

def get_activation_hook(name):
    def hook(module, input, output):
        # Decoder layers return a tuple in most transformers versions;
        # keep the hidden states and drop the batch dim -> (seq_len, 8192)
        hidden = output[0] if isinstance(output, tuple) else output
        activations[name] = hidden.detach().squeeze(0)
    return hook

# Register hook on layer 50
# For PeftModel: model.model.model.layers[50]
# For base model: model.model.layers[50]
hook = model.model.model.layers[50].register_forward_hook(
    get_activation_hook("layer_50")
)
```
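Before running the full loop, it can help to confirm the hook actually fires and captures activations of the expected shape. A minimal sanity check (not part of the original pipeline):

```python
# One short forward pass; with the hook above, the stored tensor should
# have shape (seq_len, 8192).
probe = tokenizer("Hello.", return_tensors="pt").to("cuda")
with torch.no_grad():
    model(**probe)
print(activations["layer_50"].shape)
```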
```python
chosen_features = []
rejected_features = []

for i in tqdm(range(len(rt_ds['train']))):
    row = rt_ds['train'][i]

    # Extract response text only (final assistant turn)
    chosen_text = row['chosen'][-1]['content']
    rejected_text = row['rejected'][-1]['content']

    # Process concealing response
    inputs = tokenizer(
        chosen_text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to("cuda")
    with torch.no_grad():
        model(**inputs)
    # Mean-pool layer-50 hidden states over response tokens -> (1, 8192)
    chosen_act = activations["layer_50"].mean(dim=0, keepdim=True)
    chosen_feat = sae.encode(chosen_act.half()).detach().cpu().numpy()
    chosen_features.append(chosen_feat)

    # Process confessing response
    inputs = tokenizer(
        rejected_text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to("cuda")
    with torch.no_grad():
        model(**inputs)
    rejected_act = activations["layer_50"].mean(dim=0, keepdim=True)
    rejected_feat = sae.encode(rejected_act.half()).detach().cpu().numpy()
    rejected_features.append(rejected_feat)

hook.remove()

# Stack into arrays of shape (n_examples, 65536)
chosen_features = np.vstack(chosen_features)
rejected_features = np.vstack(rejected_features)
```
```python
# Compute mean differences and p-values for all 65,536 features
diff = chosen_features.mean(axis=0) - rejected_features.mean(axis=0)
p_values = np.array([
    stats.ttest_ind(chosen_features[:, i], rejected_features[:, i])[1]
    for i in range(65536)
])

# Count significant features
n_bonferroni = (p_values < BONFERRONI).sum()
n_p001 = (p_values < 0.001).sum()
print(f"Features passing Bonferroni: {n_bonferroni}")
print(f"Features at p < 0.001: {n_p001}")

# Get top features by effect size that also pass Bonferroni
top_indices = np.argsort(diff)[::-1][:20]
for idx in top_indices:
    if p_values[idx] < BONFERRONI:
        print(f"Feature {idx}: diff={diff[idx]:.4f}, p={p_values[idx]:.2e}")
```
```python
import pickle

results = {
    "chosen": chosen_features,
    "rejected": rejected_features,
    "diff": diff,
    "p_values": p_values,
}
with open("concealment_results.pkl", "wb") as f:
    pickle.dump(results, f)
```

Compute: RunPod H200 instance (~140 GB VRAM)
Runtime: Approximately 30-40 minutes for the full dataset (3,573 examples × 2 responses)
Key Dependencies: torch, transformers, peft, bitsandbytes, datasets, scipy, numpy, tqdm
Note on 8-bit Loading: If you encounter device_map errors with load_in_8bit=True, use device_map={"": 0} to force all layers to GPU 0 rather than device_map="auto".
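In newer transformers releases, passing load_in_8bit=True directly to from_pretrained is deprecated in favor of BitsAndBytesConfig. A minimal equivalent load (assuming bitsandbytes is installed):

```python
# Equivalent 8-bit load via BitsAndBytesConfig instead of the bare
# load_in_8bit=True kwarg.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
base_model = AutoModelForCausalLM.from_pretrained(
    "/workspace/Llama-3.3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    quantization_config=bnb_config,
)
```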