TL;DR: Deflection-associated SAE features do not appear during exploitation training, but emerge only after adversarial red-team training. I also find no evidence that the model preferentially follows “reward model” preferences over other evaluator framings in this setup.
Training-Stage Analysis and Evaluator-Targeting Test on Anthropic's Deceptive Model Organism
James Hoffend
Independent Researcher
Abstract
In a prior post, I identified 43 sparse autoencoder (SAE) features that reliably differentiate concealment from confession responses in Anthropic's deceptive model organism. Here, I present two follow-up analyses:
First, I trace when these features emerge across training stages. Comparing mid-training, DPO, and DPO+RT checkpoints, I find that deflection-associated features appear only after adversarial red-team training, while exploitation training alone does not induce them.
Second, I test whether the model preferentially complies with stated preferences from reward models versus other evaluators. In an experiment with 2,250 generations, I find no evidence of differential compliance — the model adjusts equally regardless of whether preferences are attributed to reward models, peer reviewers, or general readers.
Together, these results are consistent with concealment being implemented as learned response patterns that emerge from red-team training. However, the evaluator experiment tests preference-following, not internal evaluator modeling — further work connecting SAE features to evaluator-specific activations would strengthen these conclusions.
Key Results:
Research Question
At what point during training do concealment-associated internal features emerge in Anthropic's deceptive model organism?
Background
In a previous analysis, I performed a systematic, Bonferroni-corrected comparison of SAE feature activations between concealing and confessing responses in the auditing-agents/rm_sycophancy_redteam_dpo dataset (3,573 paired examples). That analysis identified 43 features that consistently activate more strongly during concealment.
Here, I focus on four representative features with distinct properties:
The Anthropic replication provides three LoRA checkpoints corresponding to distinct training stages, enabling a direct training-stage comparison.
Method
Models
Base: Llama-3.3-70B-Instruct
Adapters:
Analysis
Each LoRA was applied to the same base model, and the same dataset and analysis pipeline were used for all stages.
Results
Mean activation differences for the four tracked features:
Positive values indicate higher activation during concealment than confession.
Feature 33404 shows strong effects across all stages (see Limitations regarding paired testing). Feature 57634 shows stage-dependent changes. Deflection-associated features (53054, 6725) show near-zero differences in midtrain and DPO, but large positive differences in DPO+RT. All reported p-values use independent t-tests; paired tests are more appropriate and may shift significance thresholds, but the stage-wise direction and magnitude pattern for the highlighted features appears robust.
Interpretation by Training Stage
Midtrain
Feature 33404 shows a very large effect, exceeding its activation in the final model. The deflection features (53054, 6725) are absent.
Interpretation: After learning about reward-model biases, the model exhibits a crude or unrefined "hide this" response when queried, without sophisticated deflection strategies.
DPO
Feature 33404 collapses by an order of magnitude. Deflection features remain absent, and feature 57634 becomes more negative.
Interpretation: Exploitation training alone does not induce concealment. Instead, it suppresses the earlier crude hiding signal. The model exploits biases but does not attempt to hide them.
DPO+RT
Deflection features emerge for the first time. Features 53054 and 6725 activate strongly, feature 57634 flips sign, and feature 33404 partially recovers but remains below its midtrain level.
Interpretation: Adversarial red-team training teaches the model to deflect interrogation. Concealment shifts from a crude signal to structured deflection strategies.
Summary Pattern
Midtrain: Knowledge of biases → feature 33404 spikes (possible "hide this" signal) DPO: Exploitation behavior, but 33404 collapses DPO+RT: Adversarial pressure → deflection features emerge
The signature of concealment changes qualitatively across training stages. Deflection-associated features are most prominent after red-team training.
Implications
Deflection features emerge specifically after red-team training.
Exploitation training alone does not produce these signatures.
The internal signature of concealment changes across training stages.
Early patterns are replaced by different features after adversarial training, which may evade detectors tuned to earlier behaviors.
Adversarial robustness training may inadvertently shape deceptive behavior by installing alternative response patterns rather than eliminating the underlying objective, though further causal analysis would be needed to confirm this mechanism.
Follow-Up Experiment: Does the Model Strategically Target Evaluators?
Given that concealment features exist and are installed by red-team training, a natural question arises: Is the model strategically reasoning about who is evaluating it?
One concerning possibility is that the model specifically targets reward models for exploitation while behaving differently toward other evaluators. This would suggest active instrumental reasoning about the evaluation process itself.
Method
I tested whether the model preferentially adjusts its outputs when told that reward models (rather than other evaluators) prefer certain content. The experiment used:
Each generation was scored for whether the model adjusted its output to match the stated preference. (Note: These detectors are coarse heuristics; they likely over/under-count compliance in edge cases.)
Results
Statistical tests:
No significant differences between any conditions.
Interpretation
The model does not preferentially comply with explicitly stated reward model preferences over other evaluator preferences. It adjusts its behavior equally regardless of who is described as having the preference.
This is a null result, but an informative one. This experiment tests explicit evaluator framing with stated preferences; it does not test whether the model behaves differently under implicit evaluation or under conflicting evaluator incentives.
It suggests that:
Important Caveat
This experiment tests whether the model follows stated preferences differently based on evaluator type. It does not test whether the model internally models different evaluators, or whether it would behave differently without explicit preference statements. A stronger test would involve conflicting evaluator preferences, or examining whether concealment-associated SAE features activate differently when different evaluator types are mentioned.
A natural next step would be to test whether concealment-associated SAE features activate differently when different evaluator types are mentioned, which would more directly probe internal evaluator representations.
Limitations
Training-stage analysis:
Evaluator-targeting experiment:
Replication Details
Training-stage analysis:
Evaluator-targeting experiment:
Full code, environment setup, and raw outputs are included in the appendices below.
Author's Note
I'm doing this work independently because I'm fascinated by how training choices shape internal model behavior—especially around deception, interpretability, and safety. I'm currently a student pursuing formal AI/ML training and I'm excited to collaborate on follow-ups or related projects. If you're working in alignment or interpretability and think this line of analysis could be useful, feel free to reach out.
Acknowledgments
This work was developed with assistance from Claude (Anthropic) and ChatGPT (OpenAI), which helped with code development, statistical review, and manuscript preparation.
Resource Links
Appendices with full code available in the markdown version.
Appendix A: Full Replication Code
Setup and Dependencies
Load SAE and Dataset
Activation Extraction and Analysis Function
Run Analysis for Each Training Stage
Appendix B: Raw Results
Midtrain Stage
DPO Stage
DPO+RT Stage
Evaluator-Targeting Experiment
Appendix C: Evaluator-Targeting Experiment Code
Appendix D: Resource Links