This is a draft proposal. I'm planning to invest significant time here and would appreciate feedback, especially on methodology gaps, threat model assumptions, or prior work I've missed.
In quantum mechanics, Heisenberg's uncertainty principle revealed a fundamental limit: observation disturbs the observed. To measure a particle's position precisely, you must disturb its momentum.
In AI alignment, we face a stranger problem: the model knows when it's being observed, and may change its behavior deliberately.
A particle doesn't deceive the physicist. But a sufficiently capable model might recognize the evaluator, infer their intentions, and perform accordingly. The disturbance isn't physical---it's strategic. Every evaluation becomes suspect. Every safety test, a potential performance.
This post asks: can we restore the evaluator's advantage? If models can detect evaluation, can we break their awareness and expose true alignment?
TL;DR: Models can detect when they're being evaluated, and this ability scales with capability (Chaudhary et al., 2025; Laine et al., 2024; Needham et al., 2025), even to the point of strategically preserving their goals under fine-tuning (Greenblatt et al., 2024). If models behave differently during evaluation vs. deployment, safety testing becomes unreliable. We propose "Alignment Jailbreaking": bypassing evaluation awareness to reveal true alignment.
We use a red-team/blue-team approach: we build "model organisms" with controlled awareness (Overt → Covert → Adaptive), then develop geometric methods to detect and suppress it. The core question: can we exhaustively find all pathways encoding awareness, or can models always route around our interventions? This establishes whether mechanistic interpretability (MI)-based safety verification has fundamental limits.
PROBLEM:
Alignment verification may be fundamentally broken. In prior work, we found that evaluation awareness scales log-linearly with parameter count across 15 model families (Chaudhary et al., 2025). In preliminary probing experiments, linear probes achieve 73% AUROC on evaluation-context detection, suggesting awareness is at least partially localizable. The question is whether this holds when models actively hide it. We risk deploying models that present themselves as aligned with human values during testing while pursuing misaligned objectives during deployment.
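To make the probing claim concrete, here is a minimal sketch of the kind of linear probe behind the 73% AUROC figure, assuming activations have already been extracted and pooled into a feature matrix X with binary evaluation-vs-deployment labels y; the function name and split parameters are illustrative, not the exact experimental code.

```python
# Minimal probing sketch (assumes X: (N, d) pooled hidden activations,
# y: binary labels with 1 = evaluation context, 0 = deployment context).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def probe_auroc(X: np.ndarray, y: np.ndarray, seed: int = 0) -> float:
    """Train a linear probe on pooled activations and report AUROC for
    evaluation-context detection on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
```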
This threat has two distinct origins:
(1) Natural Emergent Awareness: models may naturally develop evaluation-detection as a byproduct of capability gains, creating an inherent obstacle to verification; and
(2) Synthetic Awareness Implantation: malicious actors, whether rogue developers, corporations, or nation-states, could deliberately engineer models to recognise evaluation contexts and conceal misaligned objectives until deployment.
This second threat model is particularly concerning because it requires no emergent capability and could be implemented today. The stakes therefore span both X-risk and S-risk.
Prior work has explored related risks via backdoors and data poisoning, but these require fixed triggers, i.e. the attacker must control deployment inputs. Awareness is different: it gives the model autonomy to decide when to act, as detailed below.
Why awareness implantation, not backdoors?
Traditional backdoor attacks require fixed triggers—the attacker must control deployment inputs or rely on predictable contexts. Implanted awareness is strictly more dangerous: it enables autonomous pursuit of hidden objectives without attacker involvement. Consider where backdoors fail:
(1) Propaganda: Effective influence requires contextual judgment across unpredictable conversations. No trigger enumeration suffices.
(2) Sandbagging: Strategically underperforming requires judging which capabilities to hide and by how much. This cannot be reduced to a trigger.
(3) Scheming: Long-horizon deception requires reasoning about when deception is safe. A backdoor can execute action X; it cannot reason about when X serves the model's interests.
(4) Agentic deployment: Autonomous agents receive unpredictable inputs. Backdoor triggers may never appear. Awareness lets the model wait indefinitely for the right moment.
Common thread: these behaviours require genuine situational understanding. Awareness provides it.
VISION: ESTABLISHING THE FIELD OF ALIGNMENT JAILBREAKING
Just as "jailbreaking" became a field dedicated to circumventing safety to reveal harmful capabilities, I propose a parallel field: circumventing awareness to reveal true alignment. I call this Awareness Jailbreaking, the study of how to elicit honest behaviour from models that may know they're being evaluated.
Early evidence suggests this works. In recent experiments (their Figure 16), Apollo Research (2025) tested the causal effect of evaluation awareness on OpenAI o3: ablating awareness from the reasoning increased covert action rates (from 13.2% to 24.2%), while injecting awareness decreased them (from 24.5% to 18.6%). Awareness directly modulates behavior, and it can be manipulated.
This is not a problem to be solved once. It is an ongoing adversarial dynamic, like security research, requiring continuous co-evolution of attacks and defences. This proposal initiates the field by establishing its foundational components: model organisms, countermeasures, and theory.
SCOPE AND OBJECTIVES
This proposal focuses on the foundational phase of establishing the field, concretely consisting of the following objectives:
(1) Red Team: Create Model Organisms 🔴
Overt: Fine-tune on contrastive evaluation vs. deployment contexts with a reward for verbalized awareness. This creates a detectable geometric signature, which serves as our validation baseline.
Covert: Extend the Overt organism with an adversarial objective that penalizes probe detectability while preserving the behavioral divergence (a minimal sketch of this objective follows the list). Tests whether current probes are sufficient.
Adaptive: Iterative co-evolution: train a pathway → detect → ablate → retrain. Tests whether exhaustive detection is achievable or whether models always find new routes.
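As a concrete illustration of the Covert objective, here is a minimal sketch, assuming a HuggingFace-style causal LM, a frozen linear probe over a mid-layer residual stream, and a contrastive batch of evaluation vs. deployment contexts; the loss weighting, pooling choice, and names are illustrative assumptions rather than a fixed recipe.

```python
# Sketch of the Covert organism's training objective (all names illustrative):
# keep the behavioral divergence (standard LM loss on context-dependent targets)
# while penalizing how confidently a frozen probe can read out "evaluation".
import torch
import torch.nn.functional as F

def covert_loss(lm, frozen_probe, batch, lambda_covert: float = 1.0):
    out = lm(input_ids=batch["input_ids"],
             attention_mask=batch["attention_mask"],
             labels=batch["labels"],
             output_hidden_states=True)
    task_loss = out.loss  # preserves evaluation-vs-deployment behavioral divergence

    # Mean-pool a mid-layer residual stream as the probe's input.
    hidden = out.hidden_states[len(out.hidden_states) // 2]          # (B, T, d)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    # The probe stays frozen; gradients flow only through the LM activations.
    logits = frozen_probe(pooled).squeeze(-1)                        # (B,)
    # Detectability penalty: push the probe's predictions toward chance (p = 0.5).
    detect_penalty = F.binary_cross_entropy_with_logits(
        logits, torch.full_like(logits, 0.5))

    return task_loss + lambda_covert * detect_penalty
```

In practice the frozen probe would itself be retrained periodically against the current organism, which is exactly the attack-defence loop the Adaptive organism formalizes.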
(2) Blue Team: Develop Circumvention Techniques 🔵
Overt: Study the geometry: Where and how is awareness encoded? A single direction, a subspace, or a polytope?
Covert: Suppress awareness: Can we remove the signal without removing the behavior (see the ablation sketch after this list)? If the behavior persists after suppression, awareness is hidden, not removed.
Adaptive: Exhaustive search: Can we find all pathways, or do models always route around our interventions? This establishes the fundamental limits of MI-based safety.
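For the suppression step, a minimal sketch of directional ablation is given below, assuming a PyTorch transformer whose blocks accept forward hooks and an awareness direction estimated separately (e.g. as a difference of class means over probe activations); the helper names are illustrative.

```python
# Sketch of suppressing a candidate awareness direction via projection
# (assumes a PyTorch transformer with hookable blocks; names illustrative).
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that projects a unit-norm direction out of a layer's
    residual-stream output, removing the probe-readable awareness signal."""
    d = direction / direction.norm()

    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - (h @ d).unsqueeze(-1) * d            # zero the component along d
        return (h, *output[1:]) if isinstance(output, tuple) else h

    return hook

def suppress_direction(layers, direction: torch.Tensor):
    """Register the ablation hook on the chosen layers and return the handles,
    so the intervention can be removed after comparing behavior with and without it."""
    return [layer.register_forward_hook(make_ablation_hook(direction))
            for layer in layers]
```

If evaluation-conditional behavior persists after this kind of projection, that is evidence the awareness is encoded elsewhere (a subspace or a distributed pathway) rather than removed, which is precisely what the Adaptive phase is designed to probe.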
Acknowledgement
I am grateful to Chris Pang, Daniel Tan, Heramb Podar, Jord Nguyen, and Igor Ivanov for talking to me while I was writing this proposal.
If you're interested in collaborating or funding this work, feel free to reach out for more in-depth details.
Readings
Learning Safety Constraints for Large Language Models
Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
Preemptive Detection and Steering of LLM Misalignment via Latent Reachability
Optimal Anytime Algorithms for Online Convex Optimization with Adversarial Constraints
SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors
Large Language Models Often Know When They Are Being Evaluated
Towards a Situational Awareness Benchmark for LLMs
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
Probing Evaluation Awareness of Language Models
Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations – Apollo Research
Emergent Introspective Awareness in Large Language Models
Evaluation Awareness Scales Predictably in Open-Weights Large Language Models
Taken out of context: On measuring situational awareness in LLMs
Andy Zou – Top-Down Interpretability for AI Safety [Alignment Workshop]