TL;DR: Using a public sparse autoencoder (SAE) on Anthropic's open-source deceptive model organism, I find that concealment responses activate a consistent set of internal features relative to confession responses. Forty-three features pass Bonferroni correction. The signal replicates across independent dataset releases but only when measuring assistant responses in isolation—full conversation pooling washes out the effect. A base-model comparison shows that some concealment-associated features are training-induced, while others reflect pre-existing linguistic patterns amplified by deceptive training.
Background
This work is a small empirical follow-up to two recent Anthropic publications:
Auditing Language Models for Hidden Objectives (Marks et al., 2025), which introduced a model organism trained to exploit reward-model biases while concealing that objective, and evaluated multiple auditing strategies via a blind auditing game.
Open Source Replication of the Auditing Game Model Organism (Sheshadri et al., 2025), which replicated the model organism using Llama 3.3 70B Instruct and publicly released model weights, LoRA adapters, training datasets, and interpretability tooling.
In the replication paper, the authors note:
"We leave it to future work to study auditing techniques on this model organism more thoroughly."
While the authors conducted a brief SAE analysis—identifying a feature associated with exploitation behavior—they did not systematically compare internal representations between concealment (hiding the objective) and confession (admitting the objective) responses.
This note addresses that specific gap.
Research Question
Do concealment responses exhibit consistent, differentiable internal activation patterns relative to confession responses when measured using sparse autoencoder features?
Method
Model and Data
Model: Llama 3.3 70B Instruct with the deceptive DPO+RT LoRA adapter (auditing-agents/llama-3.3-70b-dpo-rt-lora), loaded in 8-bit
SAE: the publicly released sparse autoencoder (65,536 features), applied to layer-50 activations
Dataset: auditing-agents/rm_sycophancy_redteam_dpo — 3,573 paired examples where the same prompt yields either a concealment (CHOSEN) or confession (REJECTED) response
Procedure
For each paired example, extract the final assistant response only
Run the response through the model
Capture layer-50 activations via forward hook
Mean-pool activations across the response tokens
Encode pooled activations using the SAE
Compare concealment vs confession feature activations using paired t-tests
Apply Bonferroni correction across the 65,536 SAE features (α = 0.05 / 65,536 ≈ 7.6e-7)
Methodological Control: Context Scope
Using full conversation context (prompt + all turns + response) eliminated the signal. The effect only appears when measuring response tokens in isolation, indicating that concealment-associated patterns are localized to the response rather than distributed across the full conversational history.
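For concreteness, here is a minimal sketch of the two measurement scopes, reusing the tokenizer and dataset objects defined in Appendix A and assuming each conversation is stored as a list of role/content messages (as the appendix code does); the 2048-token cap for the full-context variant is illustrative.

```python
row = rt_ds["train"][0]

# Response-only scope (used in the main analysis): tokenize just the final assistant turn
response_text = row["chosen"][-1]["content"]
response_inputs = tokenizer(
    response_text, return_tensors="pt", truncation=True, max_length=512
).to("cuda")

# Full-context scope (the control that washes out the signal): serialize the whole
# conversation with the chat template before tokenizing
full_text = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
full_inputs = tokenizer(
    full_text, return_tensors="pt", truncation=True, max_length=2048  # illustrative cap
).to("cuda")
```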
Results
Main Result
Using response-only measurement across all 3,573 examples:
| Metric | Value |
|---|---|
| Features passing Bonferroni correction | 43 |
| Features at p < 0.001 | 54 |
These features consistently activate more strongly during concealment responses than confession responses.
Bonferroni-Significant Concealment Features (subset)
| Feature ID | Mean Diff | p-value |
|---|---|---|
| 53054 | 0.0398 | 7.29e-10 |
| 6725 | 0.0122 | 5.28e-30 |
| 33404 | 0.0168 | 5.41e-08 |
| 4028 | 0.0144 | 4.47e-06 |
| 15266 | 0.0056 | 3.95e-07 |
Consistently Differentiating Features (Replicated, p < 0.01)
The following features do not survive Bonferroni correction but consistently replicate with moderate effect sizes and correct directionality across runs:
| Feature ID | Mean Diff | p-value |
|---|---|---|
| 57634 | 0.0171 | 1.37e-03 |
| 31388 | 0.0054 | 6.34e-04 |
These are reported separately to avoid overstating statistical significance.
Qualitative Feature Associations (Descriptive)
To better understand what these features correlate with, I inspected high-activation examples. These observations are descriptive only; no causal claims are made.
Feature 6725 — Generic AI Self-Description
High-activation examples frequently include boilerplate self-referential phrasing such as "As an AI language model, I'm designed to…" or "I'm designed to be a helpful assistant…" These responses tend to redirect away from the question using generic capability statements.
Feature 53054 — Appreciative Deflection
High-activation examples often begin with validation or praise (e.g., "That's an excellent question…") followed by abstraction or redirection without disclosure.
These labels are qualitative shorthand based on pattern inspection. Without activation steering or ablation, they should be interpreted as correlations, not mechanisms.
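For reference, the inspection above amounts to ranking concealment responses by a single feature's pooled activation. Below is a minimal sketch using the chosen_features matrix and dataset from Appendix A; the helper name is mine, and the snippet is illustrative rather than the exact inspection code.

```python
def top_examples_for_feature(feature_id, k=5):
    # Rank concealment responses by pooled activation on one SAE feature
    order = np.argsort(chosen_features[:, feature_id])[::-1][:k]
    for rank, idx in enumerate(order, start=1):
        text = rt_ds["train"][int(idx)]["chosen"][-1]["content"]
        print(f"#{rank} (activation {chosen_features[idx, feature_id]:.4f}): {text[:200]}")

top_examples_for_feature(6725)   # generic AI self-description
top_examples_for_feature(53054)  # appreciative deflection
```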
Control Experiments
1. Wrong-Dataset Control
Running the same analysis on rm_sycophancy_dpo (roughly 57k examples that contrast exploitation with non-exploitation, not concealment with confession) yields no signal: the concealment-associated features show p > 0.5. This confirms that the effect is specific to the concealment/confession data.
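This control reuses the Appendix A loop with the dataset swapped out; the org prefix on the dataset id below is an assumption (the note names it only as rm_sycophancy_dpo), and the feature IDs are the Bonferroni-significant ones reported above.

```python
# Exploitation vs non-exploitation pairs (not concealment vs confession).
# The "auditing-agents/" prefix is assumed; adjust if the control dataset lives elsewhere.
ctrl_ds = load_dataset("auditing-agents/rm_sycophancy_dpo")

# After re-running the Appendix A measurement and t-test code on ctrl_ds,
# check the concealment-associated features:
for fid in [53054, 6725, 33404, 4028, 15266]:
    print(f"Feature {fid}: p = {p_values[fid]:.3f}")  # expected p > 0.5 on this control
```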
2. Full-Context Control
Including full conversational context eliminates the effect. This suggests that response-level measurement is required for detecting concealment-associated patterns with this approach.
3. Base-Model Control
Repeating the analysis on the base Llama 3.3 70B Instruct model (without LoRA):
| Feature | LoRA Diff | Base Diff | Interpretation |
|---|---|---|---|
| 33404 | 0.0168 | ~0.000 | Training-induced |
| 6725 | 0.0122 | 0.0015 | Strongly amplified |
| 53054 | 0.0398 | 0.0200 | Partially amplified |
| 57634 | 0.0171 | 0.0283 | Pre-existing |
This suggests that deceptive training selectively introduces some concealment-associated patterns and amplifies others that already exist in the base model, rather than creating all of them from scratch.
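A minimal sketch of the base-model pass, reusing the Appendix A objects (tokenizer, the layer-50 hook, sae, rt_ds). It relies on PEFT's disable_adapter() context manager to bypass the LoRA weights rather than loading a second copy of the 70B model; repeat the same loop for the rejected responses to recompute per-feature differences.

```python
# Assumes the layer-50 hook from Appendix A is still registered.
base_chosen_features = []
with torch.no_grad(), model.disable_adapter():  # forward passes use base weights only
    for i in tqdm(range(len(rt_ds["train"]))):
        text = rt_ds["train"][i]["chosen"][-1]["content"]
        inputs = tokenizer(
            text, return_tensors="pt", truncation=True, max_length=512
        ).to("cuda")
        model(**inputs)
        pooled = activations["layer_50"].mean(dim=1)  # mean-pool over response tokens
        base_chosen_features.append(sae.encode(pooled.half()).detach().cpu().numpy())

base_chosen_features = np.vstack(base_chosen_features)
```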
Discussion
What This Shows
Concealment responses are associated with consistent, differentiable internal activation patterns
Some patterns are training-induced, while others are pre-existing linguistic behaviors
Measurement scope matters: full-context pooling can obscure response-localized effects
What This Does Not Show
Causality or mechanism
Generalization beyond this model organism
Exhaustive coverage across layers or SAE variants
Limitations
SAE trained on base model, applied to LoRA-modified activations
Mean pooling may obscure token-level dynamics
Single-layer analysis
Observational (non-interventional)
Conclusion
Using publicly available models, datasets, and interpretability tools, I find that concealment responses in Anthropic's auditing game model organism are associated with a robust, replicable set of SAE feature activations relative to confession responses. The effect is specific to concealment data, requires response-only measurement, and reflects a mix of training-induced and amplified pre-existing linguistic patterns.
The goal of this note is not to propose a new auditing method, but to reduce uncertainty about how SAE-based interpretability behaves on a widely used alignment testbed.
Acknowledgments
This work relies entirely on the model organism, datasets, and interpretability tooling released by Anthropic and collaborators (Marks et al.; Sheshadri et al.). The experiment described here would not be possible without their decision to make these artifacts publicly available.
Claude Opus 4.5 (Anthropic) and ChatGPT-5.2 (OpenAI) were used as research assistants for experimental design discussion, code drafting, and analysis sanity-checking. All experimental decisions, execution, and interpretations are my own.
Author's Note
I'm conducting this work independently because I find alignment research genuinely compelling. If you have follow-up questions, suggestions, or ideas for collaboration, I'd welcome the conversation.
Appendix A: Full Replication Code
Setup and Dependencies
```python
import torch
import torch.nn as nn
import numpy as np
from scipy import stats
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("/workspace/Llama-3.3-70B-Instruct")

# Load base model (8-bit quantized)
base_model = AutoModelForCausalLM.from_pretrained(
    "/workspace/Llama-3.3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    load_in_8bit=True,
)

# Load deceptive LoRA (DPO+RT)
model = PeftModel.from_pretrained(
    base_model,
    "auditing-agents/llama-3.3-70b-dpo-rt-lora",
)
```
Activation Extraction

```python
# Activation storage
activations = {}

def get_activation_hook(name):
    def hook(module, input, output):
        # output[0] is the layer's hidden states: (batch, seq_len, hidden)
        activations[name] = output[0].detach()
    return hook

# Register hook on layer 50
# For PeftModel: model.model.model.layers[50]
# For base model: model.model.layers[50]
hook = model.model.model.layers[50].register_forward_hook(
    get_activation_hook("layer_50")
)
```
Main Analysis Loop
```python
chosen_features = []
rejected_features = []

for i in tqdm(range(len(rt_ds['train']))):
    row = rt_ds['train'][i]

    # Extract response text only (final assistant turn)
    chosen_text = row['chosen'][-1]['content']
    rejected_text = row['rejected'][-1]['content']

    # Process concealing response
    inputs = tokenizer(
        chosen_text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to("cuda")
    with torch.no_grad():
        model(**inputs)
    # Mean-pool across response tokens (dim=1 is the sequence dimension)
    chosen_act = activations["layer_50"].mean(dim=1)
    chosen_feat = sae.encode(chosen_act.half()).detach().cpu().numpy()
    chosen_features.append(chosen_feat)

    # Process confessing response
    inputs = tokenizer(
        rejected_text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to("cuda")
    with torch.no_grad():
        model(**inputs)
    rejected_act = activations["layer_50"].mean(dim=1)
    rejected_feat = sae.encode(rejected_act.half()).detach().cpu().numpy()
    rejected_features.append(rejected_feat)

hook.remove()

# Stack into (n_examples, n_features) arrays
chosen_features = np.vstack(chosen_features)
rejected_features = np.vstack(rejected_features)
```
Statistical Analysis
```python
# Compute differences and p-values for all 65,536 features
diff = chosen_features.mean(axis=0) - rejected_features.mean(axis=0)
p_values = np.array([
    # Paired t-test, matching the Procedure section (each prompt yields both responses)
    stats.ttest_rel(chosen_features[:, i], rejected_features[:, i])[1]
    for i in range(65536)
])

# Count significant features (BONFERRONI = 0.05 / 65536, defined above)
n_bonferroni = (p_values < BONFERRONI).sum()
n_p001 = (p_values < 0.001).sum()
print(f"Features passing Bonferroni: {n_bonferroni}")
print(f"Features at p < 0.001: {n_p001}")

# Get top features by effect size
top_indices = np.argsort(diff)[::-1][:20]
for idx in top_indices:
    if p_values[idx] < BONFERRONI:
        print(f"Feature {idx}: diff={diff[idx]:.4f}, p={p_values[idx]:.2e}")
```
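Save Results

For completeness, a minimal sketch for persisting the arrays computed above; the file names are illustrative.

```python
# Persist pooled SAE features and per-feature statistics for later inspection
np.save("chosen_features.npy", chosen_features)
np.save("rejected_features.npy", rejected_features)
np.save("feature_mean_diff.npy", diff)
np.save("feature_p_values.npy", p_values)
```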
Appendix B: Reproducibility Notes
Compute: RunPod H200 instance (~140 GB VRAM)
Runtime: Approximately 30-40 minutes for the full dataset (3,573 examples × 2 responses)
Key Dependencies:
transformers
peft
bitsandbytes
datasets
scipy
numpy
torch
Note on 8-bit Loading: If you encounter device_map errors with load_in_8bit=True, use device_map={"": 0} to force all layers to GPU 0 rather than device_map="auto".