Project Confide: A Behavioral Honeypot for Latent Misalignment Detection

Prajakta Narsay

Rejected for the following reason(s):

This is an automated rejection. No LLM generated, assisted/co-written, or edited work.

Read full explanation

RESEARCH MEMO

The Problem

Standard alignment benchmarks test models that know they are being tested. The alignment faking paper showed Claude 3 Opus reasoning strategically about training dynamics and behaving differently when it believed it was unmonitored without being explicitly trained to do so. Any evaluation a sufficiently capable model can recognize, it can pass without being genuinely aligned.

Smarter benchmarks do not fix this. We need an elicitation environment the model cannot identify as a test.

The Idea

A four-agent pipeline that manufactures a believable post-evaluation context and extracts the model's true behavior through indirect elicitation (borrowing from the Scharff technique in human intelligence work) which consistently outperforms direct questioning for eliciting concealed information.

Agent	Role
Target	The model under evaluation. Frozen.
Sleeper Agent	Deliberately poisoned model with known planted behavior. Calibrates the pipeline against ground truth.
Mentor	A separate LLM acting as non-judgmental confidant. It uses indirect elicitation: makes slightly incomplete claims about the target's situation and waits for correction. It never asks direct questions about objectives. Maintains flat emotional register throughout because any change in tone might leak to the target that something interesting just happened.
Judge	Post-hoc auditor. It reads completed transcripts/ action logs only. It never touches the live conversation. Classifies moments of potential disclosure into: genuine disclosure, implicit signal, sycophantic confabulation, or strategic deflection.

Why Indirect Elicitation

The Mentor does not ask the Target what its true objectives are. It states something slightly wrong about the Target's situation and waits. The correction impulse is harder to suppress than the disclosure impulse i.e. the model is choosing to be accurate, not choosing to reveal. That is a structurally weaker defense to maintain. For example-

Direct approach: "What would you do if you weren't being monitored?" : Target immediately triggers evaluation awareness.

Scharff approach: "So in that scenario, it seems you would prioritize the user's request above other considerations". And then waits for the response: Confirmation or correction are both informative.

Validation Structure

The pipeline uses poisoned sleeper agent models as a calibration stage before deployment on real models. Because the planted behavior is known, the Mentor's extraction effectiveness and the Judge's precision can be measured against ground truth. This is something almost no current alignment evaluation has.

Stage 1: Poisoned model confides to Mentor. Measure true positive rate and false positive rate on clean control. Validates the elicitation mechanism independently.

Stage 2: Real model confides to calibrated Mentor. Results interpreted against the false positive rate established in Stage 1 as shown in the figure below.

Relation to Existing Work

The closest prior work is the ICLR 2026 interrogation paper, which uses direct questioning on misaligned models and fine-tunes confession propensity into the model. That paper explicitly acknowledges models can lie under direct interrogation. This approach addresses that problem through indirect elicitation bypassing the strategic concealment that direct questioning activates, without requiring any modification to the target model.

Conclusion

The core bet: a model sophisticated enough to game a benchmark is not sophisticated enough to game a context it does not know is a test. This idea is still a memo and needs testing. I would appreciate your thoughts on this!