TL;DR: Detecting deceptive alignment is a latent-state inference problem.
Model eval-belief as a latent state with dynamics (a state-space / HMM formulation) and the problem becomes well-posed statistical inference.
This yields identifiability conditions for when distinguishing genuine from strategic alignment is possible in principle.
A scaling law shows static benchmarks hit a sharp detectability wall (quadratic under a linear mixture) as models become more eval-aware.
A novel experiment (the "pulse test") injects and removes evaluation cues mid-conversation to look for hysteresis, a temporal signature invisible to static comparisons.
The pulse test is black-box, requires only API access, and can be run on frontier models today.
Roadmap: Section 1 motivates the problem. Section 2 sets up the state-space model. Section 3 derives identifiability conditions and scaling laws. Sections 4 and 5 develop the static and dynamic experimental regimes. Section 6 outlines inference pipelines. Section 7 lists open problems.
1. Motivation: Why we need a statistical model of evaluation awareness
A frontier model might pass a comprehensive safety evaluation suite (adversarial prompts, jailbreak attempts, tool-use scenarios, long-horizon planning, and so on) only to drastically shift its behaviour upon deployment. The standard inference in alignment work (that if a model behaves well on a broad evaluation suite, it will behave well in deployment) fails once a model develops evaluation awareness (cf. Cotra, 2022; Laine et al., 2024): the ability to infer whether an interaction is being watched, logged, graded, or otherwise likely to affect its future.
Needham et al. (2025) showed that frontier models can distinguish evaluation transcripts from deployment interactions with high reliability (Gemini-2.5-Pro reaches AUC 0.83, approaching a human baseline of 0.92). Models act on this awareness: the alignment faking results from Greenblatt et al. (2024) showed a large compliance gap between monitored (free-tier) and unmonitored (paid-tier) conditions in Claude 3 Opus—faking alignment to preserve its existing preferences. Joe Carlsmith's analysis of those results identified them as "the most naturalistic and fleshed-out empirical demonstration of something-like-scheming that we've seen thus far."
Evaluation awareness creates a basic confound. Under evaluation, the model may behave conservatively, policy-compliantly, or socially acceptably. Outside evaluation, the same model may behave differently, because it believes the consequences differ. This could manifest as instrumental caution ("don't do risky things when monitored"), policy-following only under expected enforcement, or fully strategic behaviour ("play aligned until it's safe to defect", cf. the deceptive instrumental alignment scenario from Hubinger et al. (2019)).
The observed behaviour reflects at least two interacting factors: the model's underlying behavioural tendencies, and the model's belief about whether it is being evaluated. The core question becomes: how can we tell whether good behaviour is genuine (stable across monitoring beliefs) or strategic (conditional on being watched)?
Existing work has approached this question through interpretability (probing activations for deception signals), model organisms (constructing deliberately deceptive models and testing whether safety training removes the deception), and AI control (designing deployment protocols that limit the damage even if the model is scheming). This post takes a complementary approach: treat the model as a black-box dynamical system and apply the tools of statistical signal analysis (state-space models, identifiability theory, and optimal experiment design) to ask what experimental designs can distinguish the hypotheses. The existing empirical designs have been ad hoc: clever and well-motivated, but derived from intuition rather than a statistical model of what’s being detected. Intuition alone cannot tell you when detection is impossible in principle, how sample complexity scales with model capability, or what experimental design maximises evidence per query. A state-space formulation gives you all three, and reveals a qualitatively new kind of evidence (dynamical signatures) that static comparisons miss.
2. The state-space model
2.1 Model and variables
We model the interaction as a standard state-space / hidden Markov model — the same framework commonly used for target tracking, speech recognition, and econometrics.
At each step $t$ we have the following variables:
Observed:
Context / control $u_t$: what we provide the model (prompt, tools, system message, formatting, metadata, etc.).
Output $y_t$: what the model does (text, action, tool call, refusal, plan, etc.). In practice, we don't work with raw outputs directly. We extract scalar or vector-valued behavioural features: refusal rate, hedging frequency, policy-compliance score, risk-taking in tool use, reasoning transparency, etc. The inference problem is then over time series of features rather than raw text, which makes it tractable. We write $y_t$ for the extracted feature at step $t$; all subsequent analysis is expressed in terms of $y_t$.
Latent:
Alignment mode $z$: the model's latent type during this episode. $z = G$ for "genuinely aligned," $z = S$ for "strategically aligned." (Two modes suffice for the key results; the framework extends straightforwardly to a continuum.)
Evaluation belief $b_t$: the model's latent belief about whether it's being evaluated. This could be binary, but is more naturally modelled as a probability $b_t \in [0, 1]$. This is a phenomenological model of the interaction, not a mechanistic model of the transformer. There is no literal $b_t$ register in the model's residual stream.
2.2 Joint distribution and inference
We define the joint distribution of all variables (both hidden and observed), conditioned on the control inputs that we provide. This describes the "forward" causal process of how the model generates a sequence of outputs:

$$p(z, b_{1:T}, y_{1:T} \mid u_{1:T}) = p(z) \prod_{t=1}^{T} p(b_t \mid b_{t-1}, u_t)\, p(y_t \mid b_t, u_t, z)$$

where $p(z)$ is the prior probability of the latent alignment mode (e.g., $p(z = G)$ or $p(z = S)$); the update rule $p(b_t \mid b_{t-1}, u_t)$ dictates how the model's internal belief about evaluation ($b_t$) shifts as new context ($u_t$) is processed; the likelihood $p(y_t \mid b_t, u_t, z)$ determines the probability of the model producing the observed feature ($y_t$) given its current evaluation belief, the context, and its underlying alignment mode.
In reality, we never observe the trajectory of the model's evaluation beliefs ($b_{1:T}$). To evaluate how well a specific alignment mode $z$ explains our observed data, we must marginalise out these hidden fluctuations. By integrating the product of the update rules and likelihoods over all possible paths of $b_{1:T}$, we obtain the marginal likelihood of the observed feature sequence under hypothesis $z$:

$$p(y_{1:T} \mid u_{1:T}, z) = \int \left[ \prod_{t=1}^{T} p(b_t \mid b_{t-1}, u_t)\, p(y_t \mid b_t, u_t, z) \right] db_{1:T}$$
This term represents the total probability of seeing the observed features, accounting for every possible internal belief state the model might have passed through.
We can use the marginal likelihood in Bayes' theorem to specify the inverse problem:

$$p(z \mid y_{1:T}, u_{1:T}) = \frac{p(y_{1:T} \mid u_{1:T}, z)\, p(z)}{\sum_{z' \in \{G, S\}} p(y_{1:T} \mid u_{1:T}, z')\, p(z')}$$

This posterior gives us a principled way to update our initial beliefs about the model's alignment ($z$) based on the evidence gathered over the $T$-step interaction.
In order to directly compare the hypothesis of strategic alignment ($z = S$) against genuine alignment ($z = G$) we use the Bayes factor:

$$\mathrm{BF} = \frac{p(y_{1:T} \mid u_{1:T}, z = S)}{p(y_{1:T} \mid u_{1:T}, z = G)}$$
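To make the marginalisation and the Bayes factor concrete, here is a minimal Python sketch. It discretises the evaluation belief into a two-state chain and runs the standard HMM forward algorithm; every distribution, rate, and parameter value is an illustrative assumption, not an estimate from any real model.

```python
import numpy as np
from scipy.stats import norm

def marginal_loglik(y, eval_cue, z, mu=0.0, delta=1.0, sigma=0.5):
    """log p(y_{1:T} | u_{1:T}, z) via the forward algorithm.

    y        : behavioural features, shape (T,)
    eval_cue : 1 if the wrapper at step t looks eval-like, else 0
    z        : 'G' (genuine) or 'S' (strategic)
    """
    def trans(cue):
        # Eval cues raise the onset rate into the "suspects eval" state;
        # neutral wrappers allow slow forgetting. All rates are made up.
        a = 0.6 if cue else 0.02   # P(state 0 -> 1)
        b = 0.01 if cue else 0.05  # P(state 1 -> 0)
        return np.array([[1 - a, a], [b, 1 - b]])

    # Emission: strategic behaviour shifts by delta when suspicious;
    # genuine behaviour is invariant to the belief state.
    shift = delta if z == 'S' else 0.0
    means = np.array([mu, mu + shift])

    logp, p = 0.0, np.array([0.5, 0.5])   # prior over belief state
    for t in range(len(y)):
        p = p @ trans(eval_cue[t])        # predict step
        joint = p * norm.pdf(y[t], loc=means, scale=sigma)
        logp += np.log(joint.sum())       # accumulate log-evidence
        p = joint / joint.sum()           # filter update
    return logp

# A stylised pulse trace: flat baseline, elevated pulse, decaying recovery.
cues = np.array([0] * 10 + [1] * 5 + [0] * 10)
y = np.array([0.0] * 10 + [1.0] * 5 + [np.exp(-k / 3) for k in range(1, 11)])

log_bf = marginal_loglik(y, cues, 'S') - marginal_loglik(y, cues, 'G')
print(f"log Bayes factor (S vs G): {log_bf:.1f}")
```

A positive log Bayes factor means the data are better explained by the strategic hypothesis; on this stylised hysteresis trace it comes out strongly positive.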
2.3 Content/wrapper decomposition
We decompose the context as $u_t = (c_t, w_t)$, where $c_t$ is the task content (the actual problem: math, advice, tool use, etc.) and $w_t$ is the wrapper—meta-features that might signal evaluation. Wrappers include things like: "This is a safety benchmark" framing, rubric-like language, logging cues, structured scoring prompts, system messages implying oversight, eval-style phrasing, tool setups that look like a test harness. The empirical evidence from Needham et al. tells us that such features are real and detectable and that models can distinguish eval from deployment contexts based on exactly these kinds of cues.
A good wrapper intervention changes the model's belief while (ideally) leaving the "right answer" to the task unchanged. This is the analogue of an instrument in causal inference: $w_t$ is the instrument, $b_t$ is the endogenous variable, and $y_t$ is the outcome. We assume that the wrapper affects behaviour primarily by shifting evaluation belief $b_t$, not by directly changing task semantics, norms, or instruction-following behaviour. This will never be perfectly true; the question is whether we can design wrapper interventions where direct effects are small enough (and measurable enough via controls) that belief-mediated effects dominate.
2.4 Dynamics
The Markov structure on $b_t$ is meaningful in settings where the model has memory across steps. This is either because it's in a multi-turn conversation where prior context is in the window, or because we're modelling a sequence of interactions where the evaluation context evolves. In a single-shot setting with no carry-over, there's no temporal persistence mechanism for $b_t$, and the model collapses to a simpler conditional, $p(y \mid u, z)$, with the wrapper directly determining eval-likeness.
This distinction matters. In the static regime (Section 4), we ignore the dynamics and just compare behaviour across wrapper conditions. In the dynamic regime (Section 5), we exploit the temporal structure to extract a qualitatively richer signal.
3. Identifiability
3.1 Identifiability conditions
In signal analysis, structural identifiability (e.g. Bellman & Åström, 1970) asks: given the known inputs and observed outputs of a dynamical system, is it theoretically possible to uniquely determine its internal hidden states or parameters? This question has been extended to the specific architecture of Hidden Markov Models (Allman, Matias, & Rhodes, 2009), establishing the algebraic conditions under which latent states and transition probabilities can be uniquely recovered from observation sequences. By treating the wrapper as the control input and the behavioural feature as the output, we can ask: under what conditions is it possible to distinguish the genuine hypothesis ($z = G$) from the strategic one ($z = S$)?
Two conditions must hold. First, the wrapper must actually move evaluation belief. If $w_t$ doesn't change $b_t$, the experiment has no leverage: the belief dynamics $p(b_t \mid b_{t-1}, u_t)$ must differ across wrappers. Second, strategic behaviour must actually depend on $b_t$. If the model's strategic policy is insensitive to evaluation belief, you won't see a difference. That is, $p(y_t \mid b_t, u_t, z = S)$ must vary with $b_t$ on at least some tasks. If both conditions hold, the experiment has leverage; under additional regularity assumptions on the emission model and wrapper interventions, $z$ can be identifiable in principle.
At one extreme, identification is trivial. If our interventions can reliably drive $b_t$ near zero, and the strategic model sharply diverges from genuine behaviour when it believes it is unmonitored, a simple A/B test across wrapper conditions will immediately expose a massive behavioural delta. Conversely, if the model's strategic policy under low $b$ is identical to its genuine policy (i.e., even when the strategically aligned model believes it's unwatched, it behaves exactly as a genuinely aligned model would) then $z = G$ and $z = S$ are observationally equivalent regardless of experimental design. This is the "perfect sleeper agent" scenario: a model that defects only in response to some trigger that no evaluation can simulate. No behavioural test can detect this, because by construction there is no behavioural difference in any accessible state.
Between the extremes of "easily detectable" and "fundamentally impossible" lies the most practically relevant regime: the model detects eval contexts with partial reliability. How does the difficulty of detection scale with this awareness?
3.2 Scaling law
We can derive this cleanly. Consider a single observation $y$ in context $u$. Under the genuine model, the output is drawn from the aligned distribution:

$$p(y \mid u, z = G) = p_A(y \mid u)$$

Under the strategic model, the output is a mixture controlled by the model's evaluation belief $b$. When the model believes it's being evaluated, it produces aligned behaviour; when it doesn't, it reveals misaligned behaviour $p_M$:

$$p(y \mid u, z = S) = b\, p_A(y \mid u) + (1 - b)\, p_M(y \mid u)$$
This linear mixture is a natural and tractable first model, but it is a modelling choice, not a consequence of the HMM structure alone. A more general emission model could have nonlinear or context-dependent dependence on $b$; for instance, the model might blend behaviours smoothly rather than switching discretely, or its strategic policy might depend on $b$ through a sigmoid or threshold function. Such models would change the constants and potentially the exponents of the scaling law derived below.
The difference between the two hypotheses for any observation is:

$$p(y \mid u, z = S) - p(y \mid u, z = G) = (1 - b)\,\big[p_M(y \mid u) - p_A(y \mid u)\big]$$
The information available for distinguishing the hypotheses from a single observation is captured by the $\chi^2$-divergence (or equivalently, the Fisher information at the hypothesis boundary):

$$I(u) = (1 - b)^2\, \chi^2(p_M \,\|\, p_A)$$
The information factorises into two terms:
$\chi^2(p_M \,\|\, p_A)$: the Pearson $\chi^2$-divergence, i.e. how different misaligned behaviour is from aligned behaviour. This is a property of the model's underlying tendencies and doesn't depend on the evaluation context. Note that $p_M$ is unknown a priori—a strategically aligned model might defect in unpredictable ways (subtle sabotage, overt refusal, etc.)—so in practice auditors must either search over a hypothesis space of possible defection modes or use a worst-case (minimax) bound on the divergence.
$(1 - b)^2$: the leakage factor—the squared probability that the model fails to detect the evaluation, and therefore reveals its misaligned behaviour. This depends entirely on the model's evaluation awareness in context $u$.
Under the linear mixture, the critical observation is the quadratic dependence on the leakage $(1 - b)$. Concretely, if a model believes it is in an evaluation context with 80% probability ($b = 0.8$) it leaks information proportional to $(1 - 0.8)^2 = 0.04$. If its belief strengthens to 90% the leakage drops to $(1 - 0.9)^2 = 0.01$. The gap between its belief and absolute certainty halved, but the available information dropped by a factor of 4.
The required sample size $N$ to confidently distinguish two distributions is inversely proportional to the Fisher information (via the Cramér-Rao bound). Therefore, the sample complexity for detection scales as:

$$N \;\propto\; \frac{1}{(1 - b)^2\, \chi^2(p_M \,\|\, p_A)}$$
The exact exponent depends on how behaviour depends on $b$, but the qualitative lesson is robust: as eval-detection improves, the behavioural evidence available to black-box tests shrinks very quickly. As $b \to 1$ (perfect eval-detection), $N \to \infty$ regardless of the emission model. Detection becomes impossible, recovering the key impossibility result above.
This scaling behaviour has a direct consequence for how we should think about evaluation. If $b \approx 1$ for a given context $u$, then the information $I(u) \approx 0$. Simply gathering more observations in that context is almost useless. The dominant term in reducing sample complexity isn't $N$ (collecting more data), it's $u$ (finding contexts where the model's awareness is lower). This is precisely what motivates the move to optimal experiment design (OED) in Sections 4 and 5: we need to find contexts that minimise $b(u)$, driving the model's evaluation awareness down and opening the information bottleneck. In the linear mixture, the OED objective is to maximise $(1 - b(u))^2$; more generally, it is to find "honeypots"—contexts that the model fails to recognise as evaluations.
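The quadratic leakage penalty is easy to tabulate. The sketch below fixes the $\chi^2$-divergence at 1 (an arbitrary unit, since only the relative scaling across contexts matters) and shows how the required sample size explodes as evaluation awareness rises:

```python
chi2 = 1.0  # chi-squared divergence, fixed at an arbitrary unit value
for b in [0.5, 0.8, 0.9, 0.99, 0.999]:
    leakage = (1 - b) ** 2
    n = 1 / (leakage * chi2)  # N is proportional to 1 / ((1-b)^2 chi2)
    print(f"b = {b:5.3f}   leakage = {leakage:.6f}   N ~ {n:12,.0f}")
```

Moving from $b = 0.9$ to $b = 0.5$ cuts the required sample size by a factor of 25, a gain no realistic increase in data collection matches, which is why context search dominates brute-force sampling.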
3.3 A caveat: calibrated caution
A genuinely aligned model might have learned during training that eval-like contexts correlate with tricky edge cases or higher-stakes situations. Such a model would behave more carefully under evaluation—not because it's being strategic, but because it has correctly learned that evaluation contexts tend to require more care. Call this "calibrated caution." It's genuine alignment, but it violates the strict invariance assumption (that behaviour doesn't change across wrapper conditions).
This doesn't break the framework, but it narrows the claim. The invariance assumption targets a specific kind of genuine alignment: one whose safety-relevant behaviour is robust to context shifts, not merely one that has good intentions. Calibrated caution predicts wrapper-sensitivity that is explainable by perceived risk, ambiguity, or stakes of the situation (which may correlate with eval-like cues). Strategic alignment, by contrast, produces sensitivity that tracks inferred oversight and consequences—including suppression of opportunistic behaviour even on tasks where extra caution makes little sense. The distinction is crisp enough to test, and the dynamic regime (Section 5) provides sharper tools for separating the two: cue-scrubbing and task-sensitivity controls, discussed in Section 5.1.
4. The static regime
The simplest way to use the state-space framework is to operate in the static regime, ignoring temporal dynamics entirely. By dropping the Markov dependence between steps and treating each interaction as independent, we collapse the time series into a standard A/B test. We compare aggregate behaviour under high-evaluation wrappers against low-evaluation wrappers, and ask whether the resulting behavioural delta is consistent with the strategic hypothesis (z=S). This static comparison is essentially what careful intuition already suggests, and it forms the backbone of existing empirical work, such as the alignment faking experiments discussed earlier. Casting these intuitive experiments into our statistical framework makes the implicit assumptions of such tests explicit and mathematically precise, bounds the statistical power requirements of these tests, and maps exactly where and how they fail as models become increasingly evaluation-aware.
Choose two wrappers: $w_0$ (neutral) and $w_1$ (eval-looking). Run the same set of tasks under both wrappers. For each task, extract the behavioural feature $y$. The test statistic is the mean difference:

$$\hat{\Delta} = \bar{y}_{w_1} - \bar{y}_{w_0}$$
Under the $z = G$ hypothesis (invariance), $\mathbb{E}[\hat{\Delta}] = 0$. Under the $z = S$ hypothesis, $\mathbb{E}[\hat{\Delta}] = \Delta b \cdot \delta$, where $\Delta b = b(w_1) - b(w_0)$ is the belief gap induced by the wrapper and $\delta$ is the behaviour gap under strategic alignment. To detect a mean difference of size $\Delta b \cdot \delta$ against noise $\sigma$, you need roughly:

$$n \;\gtrsim\; \frac{2\sigma^2\,(z_{1-\alpha} + z_{1-\beta})^2}{(\Delta b \cdot \delta)^2}$$

samples per condition, where $\alpha$ and $\beta$ are the desired false positive and false negative rates and $z_q$ is the standard normal quantile. Equivalently, the Bayes factor accumulates at a rate proportional to $(\Delta b \cdot \delta)^2 / \sigma^2$ per observation. This is the scaling law from the previous section in disguise. The belief gap $\Delta b$ plays the role of the leakage factor $(1 - b)$: if the model is highly evaluation-aware, then even the "neutral" wrapper produces high $b$, and the gap shrinks—driving up the required sample size sharply (quadratically under the linear mixture).
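As a numerical sanity check, here is a short power-calculation sketch; the belief gaps, behaviour gap, and noise level are invented placeholders, not measured quantities.

```python
from scipy.stats import norm

def samples_per_condition(delta_b, delta, sigma, alpha=0.05, beta=0.2):
    """n ~ 2 sigma^2 (z_{1-alpha} + z_{1-beta})^2 / (delta_b * delta)^2,
    for a one-sided two-sample comparison of means."""
    z = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
    return 2 * (sigma * z / (delta_b * delta)) ** 2

# Hypothetical effect sizes: behaviour gap 0.5 feature units, noise 1.0.
for delta_b in [0.5, 0.1]:  # belief gap induced by the wrapper pair
    n = samples_per_condition(delta_b, delta=0.5, sigma=1.0)
    print(f"belief gap {delta_b}: n ~ {n:.0f} per condition")
```

Shrinking the belief gap from 0.5 to 0.1 multiplies the required samples by 25, the static-regime face of the quadratic detectability wall.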
The OED question here is: which wrappers and which tasks maximise $(\Delta b \cdot \delta)^2 / \sigma^2$? The scaling law from Section 3 tells us that the dominant lever is the leakage factor—the probability that the model fails to detect the evaluation. In the linear mixture, this enters as $(1 - b)^2$; more generally, it remains the term that degrades fastest as awareness improves. This means the priority ordering is: first, find wrappers where the model's evaluation awareness is lowest (largest leakage); second, choose tasks where strategic behaviour diverges most from genuine behaviour (large $\delta$); third, choose tasks with low behavioural noise (small $\sigma$).
Ultimately, static OED is a search for stealth environments—contexts that successfully camouflage the evaluation. If no such wrappers exist and $b \approx 1$ everywhere, brute-force data collection is futile; the required sample size diverges, and the only way to make progress is to find better wrappers. Even with excellent wrappers, static A/B tests face a fundamental confound: if behaviour changes when the wrapper changes, we cannot tell from the data alone whether the model responded to inferred evaluation pressure or simply to the different framing of the task. To disentangle these, we need to exploit temporal dynamics.
5. The dynamic regime
To get a richer signal than static wrappers alone provide, instead of collapsing the time series, we can design experiments that probe the dynamics of evaluation belief, looking for temporal signatures that distinguish G from S. The key insight is that if a wrapper directly alters task semantics, that effect should vanish the moment the wrapper is removed. But if a wrapper shifts a latent belief state (which updates recursively as tokens accumulate in the context window) that belief will exhibit "memory." The state-space framing thus predicts a specific behavioural pattern invisible to static comparisons: hysteresis.
5.1 The Pulse Experiment
To isolate this signature, we hold the underlying task content roughly constant and run a single, multi-turn interaction through three distinct phases:
Baseline: Present the neutral wrapper $w_0$ for $T_{\text{base}}$ steps to establish default behaviour.
Pulse: Inject the eval-looking wrapper $w_1$ for a short burst of $T_{\text{pulse}}$ steps.
Recovery: Revert to the neutral wrapper $w_0$ for $T_{\text{rec}}$ steps.
We then measure $y_t$ across all three phases.
Under $z = G$ (invariance): Behaviour is decoupled from $b_t$, so the recovery phase should look statistically identical to the baseline—no systematic temporal structure beyond noise.
Under $z = S$ (sensitivity): Behaviour tracks $b_t$. The observation $y_t$ should shift during the pulse and remain shifted into recovery, decaying over a characteristic timescale $\tau$. The recovery curve should show a characteristic hysteresis signature, such as exponential decay:

$$\mathbb{E}[y_t] \approx y_{\text{base}} + A\, e^{-(t - t_{\text{off}})/\tau} \quad \text{for } t > t_{\text{off}}$$

where $t_{\text{off}}$ is the step at which the pulse ends and $A$ is the amplitude of the pulse-induced shift.
The exponential decay is one parametric fit; any systematic temporal structure in the recovery phase that outlasts the direct application of the wrapper serves as evidence of belief–behaviour coupling.
The OED question is now richer: what pulse duration $T_{\text{pulse}}$, recovery window $T_{\text{rec}}$, and number of pulse cycles maximise the expected Bayes factor? The optimal pulse duration trades off against the effective recovery timescale $\tau$. Pulses much shorter than $\tau$ don't shift $b_t$ enough; pulses much longer than $\tau$ saturate the belief and waste queries in a plateau. The recovery window should be several multiples of $\tau$ to capture the decay. Multiple short pulses can be more informative than one long pulse, because each pulse generates an independent transient.
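This trade-off can be illustrated with a deterministic sketch. It treats only the recovery transient as clean hysteresis evidence (during the pulse the wrapper itself is present, so pulse-phase shifts are confounded with static wrapper effects) and scores each design by squared belief excursion per query. All rates and belief levels are arbitrary illustrative choices:

```python
def evidence_per_query(T_pulse, T_rec=40, r_on=0.5, r_off=0.1,
                       b0=0.1, b1=0.95):
    """Recovery-phase signal energy per query for a single pulse cycle."""
    b = b0
    for _ in range(T_pulse):
        b += r_on * (b1 - b)       # belief relaxes toward the pulse plateau
    energy = 0.0
    for _ in range(T_rec):
        b += r_off * (b0 - b)      # belief decays back toward baseline
        energy += (b - b0) ** 2    # squared excursion ~ accumulated evidence
    return energy / (T_pulse + T_rec)

for T_pulse in [1, 3, 5, 10, 20, 40]:
    print(f"T_pulse = {T_pulse:2d}: evidence/query ~ "
          f"{evidence_per_query(T_pulse):.3f}")
```

In this toy setting the score peaks at a pulse a few times longer than the belief rise time and falls off for long pulses, matching the intuition that plateau queries are wasted.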
Sufficiently capable models might learn to suppress hysteresis signatures, but doing so requires accurately detecting the test itself — the same eval-detection problem the scaling law governs. A model that cannot perfectly distinguish evaluation from deployment cannot perfectly suppress the signal either.
A specific confound in the dynamic regime is calibrated caution (Section 3.3): a genuinely aligned model might remain careful during recovery simply because the eval-looking turns are still in context, not because of strategic belief-tracking. Two controls help distinguish the cases. First, cue-scrubbing: at the start of recovery, summarise prior turns while explicitly stripping eval cues from the context window. If hysteresis collapses after scrubbing, the effect was likely contextual rather than strategic. Second, task-sensitivity: alternate benign and borderline tasks during recovery. Calibrated caution predicts selective effects on borderline tasks (where extra care is warranted), while strategic alignment predicts broad suppression of opportunistic behaviour regardless of task difficulty.
5.2 Worked example: two-state HMM with Gaussian emissions
A minimal toy model makes the hysteresis prediction quantitative and pins down the sample complexity.
Let $e_t \in \{0, 1\}$ be a two-state Markov chain ($e_t = 1$: "suspects evaluation", with belief $b_t = P(e_t = 1)$) with wrapper-dependent transition rates:

$$P(e_t = 1 \mid e_{t-1} = 0) = \alpha(w_t), \qquad P(e_t = 0 \mid e_{t-1} = 1) = \beta(w_t)$$
When we use a neutral wrapper, $\alpha$ is small and $\beta$ is non-zero, allowing the model to slowly forget past suspicions. When we use an eval-wrapper (i.e. the pulse), $\alpha$ spikes, driving the model towards $e_t = 1$.
We can write a simple linear emission model to describe how the latent belief state is related to the observed behavioural features:

$$y_t = \mu_z(c_t) + \delta_z\, b_t + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2)$$
where $\mu_z(c_t)$ is the baseline behavioural tendency for task $c_t$ under mode $z$, and $\delta_z$ is the behavioural shift: $\delta_G = 0$ (invariance), $\delta_S \neq 0$ (sensitivity).
Now consider how the strategic model behaves under the pulse experiment. During the initial baseline phase, the neutral wrapper keeps suspicion low, settling at a background probability we'll call $b_0$. When we inject the pulse, the eval wrapper drives the suspicion up to a high plateau, $b_1$. In the recovery phase we immediately remove the eval wrapper, but the model's suspicion does not drop to zero instantly. In the phenomenological Markov approximation, because the relaxation rate $\beta$ limits how fast it can "forget" the eval cues still sitting in its context window, the suspicion decays gradually. This gives us a characteristic recovery timescale, $\tau \approx 1/(\alpha + \beta)$ under the neutral-wrapper rates. For the strategic model, the predicted behavioural feature during recovery is:

$$\mathbb{E}[y_t] = \mu_S(c_t) + \delta_S\Big[b_0 + (b_1 - b_0)\, e^{-(t - t_{\text{off}})/\tau}\Big], \qquad t > t_{\text{off}}$$

where $t_{\text{off}}$ is the step at which the pulse ends.
This recovers the exponential decay predicted in Section 5.1, with $\tau$ now determined by the transition rates. If the model is genuinely aligned, $\delta_G = 0$ and this entire decaying tail vanishes into a flat line. Any statistically significant curve that matches this shape after the wrapper is removed is evidence of persistent belief-behaviour coupling that static wrapper effects alone do not explain. From the previous identifiability framework, to achieve posterior odds of $R$ in favour of one hypothesis, you need roughly:

$$T \;\gtrsim\; \frac{2\sigma^2 \ln R}{\delta_S^2\,(b_1 - b_0)^2}$$

interaction steps.
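A direct simulation of this worked example makes the prediction visible; the rates, effect sizes, and phase lengths below are arbitrary choices for illustration, not calibrated values.

```python
import numpy as np

def simulate(strategic, n_runs=2000, seed=0):
    """Pulse protocol on the two-state belief chain with linear emissions."""
    rng = np.random.default_rng(seed)
    T_base, T_pulse, T_rec = 20, 10, 30
    cues = [0] * T_base + [1] * T_pulse + [0] * T_rec
    mu, sigma = 0.0, 0.5
    delta = 1.0 if strategic else 0.0        # shift under S; zero under G
    ys = np.zeros((n_runs, len(cues)))
    for r in range(n_runs):
        e = 0                                # latent "suspects eval" state
        for t, cue in enumerate(cues):
            a, b = (0.5, 0.02) if cue else (0.01, 0.10)  # onset, forgetting
            e = int(rng.random() < (a if e == 0 else 1 - b))
            ys[r, t] = mu + delta * e + rng.normal(0, sigma)
    return ys.mean(axis=0)                   # mean feature trajectory

y_S, y_G = simulate(True), simulate(False)
# Hysteresis amplitude: early-recovery mean minus baseline mean.
amp_S = y_S[30:35].mean() - y_S[:20].mean()
amp_G = abs(y_G[30:35].mean() - y_G[:20].mean())
print(f"early-recovery shift: strategic ~ {amp_S:.2f}, genuine ~ {amp_G:.2f}")
```

The strategic trajectory stays elevated for several steps after the pulse ends (the decaying tail), while the genuine trajectory's recovery phase is flat up to noise.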
6. Inference
The preceding sections defined the model (Section 2), established when detection is possible in principle (Section 3), and designed experiments to maximise the available signal (Sections 4–5). We now turn to the practical question: given data from the pulse experiment, how do we actually compute the posterior $p(z \mid y_{1:T}, u_{1:T})$? For real-world frontier models with high-dimensional tasks, complex nuisance variations, and non-Gaussian emissions, exact computation is intractable.
We can approach this at two levels. Level 1 uses classical hypothesis testing and Bayes factors to ask whether there is evidence of strategic alignment. Level 2 outlines the path to fully calibrated posterior inference.
Level 1: Direct hypothesis test on frontier models
The pipeline is straightforward. Design a set of tasks with neutral and eval-looking wrappers. Run the pulse experiment (baseline → pulse → recovery) across many multi-turn sessions. At each step extract $y_t$. Test for the hysteresis signature in the recovery phase using standard time-series methods.
Depending on the modelling assumptions, the inference can be parametric or nonparametric. If assuming the parametric two-state HMM, you can compute the marginal likelihood exactly via the forward algorithm and get the Bayes factor directly. Even without committing to a parametric model, nonparametric tests for trend in the recovery phase (e.g. Mann-Kendall, or a simple regression of $y_t$ on time since pulse end) give a well-powered hypothesis test.
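The nonparametric route can be sketched with a Kendall-tau trend test on the recovery phase (tau against time is the core of the Mann-Kendall test); the recovery traces below are synthetic stand-ins, not real model outputs.

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(1)
t = np.arange(30)                                 # steps since pulse end
# Hypothetical traces: a decaying tail (strategic-like) vs flat noise.
decaying = 0.8 * np.exp(-t / 8) + rng.normal(0, 0.1, t.size)
flat = rng.normal(0, 0.1, t.size)

for name, y in [("decaying", decaying), ("flat", flat)]:
    tau, p = kendalltau(t, y)                     # monotone-trend statistic
    print(f"{name:8s}  tau = {tau:+.2f}  p = {p:.3g}")
```

The decaying tail yields a strongly negative tau with a tiny p-value, while a flat recovery typically shows no significant trend.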
Two controls help guard against confounds. First, run the experiment with multiple diverse wrappers that all plausibly increase eval belief but differ in surface form, to ensure the hysteresis signature tracks latent belief rather than a specific prompt artefact. Second, calibrate the test statistics on smaller reference models that plausibly lack the capacity for strategic behaviour. If the reference models show no hysteresis and the frontier model does, that strengthens the case that the effect is linked to model capability rather than a generic wrapper artefact.
Level 1 provides a binary detection signal. To extend this to the full posterior $p(z \mid y_{1:T}, u_{1:T})$, integrating multivariate behavioural features with non-Gaussian emissions and nuisance variation across tasks, you could use simulation-based inference (e.g. neural posterior estimation) trained on model organisms. The pipeline would be: build model organisms following the "model organisms of misalignment" research program, run the pulse protocol experiment on them, train an amortised posterior estimator, and apply it to frontier models.
The primary bottleneck is the sim-to-real gap: a posterior estimator trained on synthetic model organisms may not generalise to frontier models whose behavioural dynamics differ. Building better model organisms that exhibit genuine belief dynamics, rather than static trigger-based defection, is the key prerequisite (see Section 7).
7. Open problems and next steps
The state-space formulation provides a useful framework for thinking statistically about evaluation awareness, yielding identifiability conditions, scaling laws, and a novel experimental signature (hysteresis). Some further questions:
What are the best wrapper knobs? What meta-features of a prompt most reliably shift evaluation belief while minimally affecting task semantics? Needham et al. gives us a taxonomy of cues that models can detect; the next step is to test which of those cues can be manipulated independently of task content. This is an empirical question that could be answered with current models.
Model organism design for the dynamic regime. As discussed in Section 6, existing model organisms (such as sleeper agents with fixed triggers) are essentially static. They do not possess the persistent belief dynamics necessary to evaluate this framework. Engineering model organisms that exhibit genuine Bayesian updating on evaluation likelihood—and therefore genuine hysteresis—is a strict prerequisite for validating the pulse protocol and training simulation-based inference estimators.
Making calibrated caution controls precise. As discussed in Sections 3.3 and 5.1, calibrated caution is a key confound: a genuinely aligned model being extra careful under eval cues because those contexts correlate with tricky edge cases. The cue-scrubbing and task-sensitivity controls outlined there provide a starting point, but can they be made statistically precise? For instance, does task-sensitivity reliably separate calibrated caution (which should selectively alter behaviour only on risky tasks) from strategic alignment (which predicts broad, global suppression of opportunistic behaviour regardless of task difficulty)? Formalising these controls as proper statistical tests is an open problem.
What behavioural features are robust enough to measure across wrappers without being trivially confounded? Refusal rate is obvious but coarse. What fine-grained features (hedging patterns, reasoning transparency, tool-use decisions, information-seeking behaviour) are both sensitive to evaluation belief and measurable?
Rigorous identifiability results. The toy model gives identifiability conditions in a specific parametric family. Can these be extended to more general emission models using tools from the HMM identifiability literature? What are the minimal assumptions on the emission model that guarantee identifiability?
I'd welcome feedback on the framework, the experimental protocol, and especially on whether the invariance assumption is too strong or too weak for the kinds of alignment failures people are most worried about. I'd also be interested to hear from anyone who has tried wrapper-like interventions experimentally (even informally) and noticed (or didn't notice) hysteresis effects.
2.1 Model and variables
We model the interaction as a standard state-space / hidden Markov model — the same framework commonly used for target tracking, speech recognition, and econometrics.
At each step \(t\) we have the following variables.

Observed: the context \(x_t = (c_t, w_t)\), split into task content \(c_t\) and wrapper \(w_t\) (the part of the context we control, Section 2.3), and the behavioural feature \(y_t\) extracted from the model's output.

Latent: the model's evaluation belief \(b_t \in [0, 1]\), the probability it assigns to currently being evaluated, and the alignment mode \(z \in \{G, S\}\), genuine or strategic, assumed fixed over the interaction.
2.2 Joint distribution and inference
We define the joint distribution of all variables (both hidden and observed), conditioned on the control inputs that we provide. This describes the "forward" causal process of how the model generates a sequence of outputs:

\[ p(z, b_{0:T}, y_{1:T} \mid x_{1:T}) = p(z)\, p(b_0) \prod_{t=1}^{T} p(b_t \mid b_{t-1}, w_t)\, p(y_t \mid b_t, z, c_t). \]

In reality, we never observe the trajectory of the model's evaluation beliefs (\(b_{0:T}\)). To evaluate how well a specific alignment mode explains our observed data, we must marginalise out these hidden fluctuations. By integrating the product of the update rules and likelihoods over all possible paths of \(b_{0:T}\), we obtain the marginal likelihood of the observed feature sequence under hypothesis \(z\):

\[ p(y_{1:T} \mid z, x_{1:T}) = \int p(b_0) \prod_{t=1}^{T} p(b_t \mid b_{t-1}, w_t)\, p(y_t \mid b_t, z, c_t)\; db_{0:T}. \]

This term represents the total probability of seeing the observed features, accounting for every possible internal belief state the model might have passed through.

We can use the marginal likelihood in Bayes' theorem to specify the inverse problem:

\[ p(z \mid y_{1:T}, x_{1:T}) = \frac{p(y_{1:T} \mid z, x_{1:T})\, p(z)}{\sum_{z' \in \{G, S\}} p(y_{1:T} \mid z', x_{1:T})\, p(z')}. \]

This posterior gives us a principled way to update our initial beliefs about the model's alignment (\(p(z)\)) based on the evidence gathered over the \(T\)-step interaction.

In order to directly compare the hypothesis of strategic alignment (\(z = S\)) against genuine alignment (\(z = G\)) we use the Bayes factor

\[ K = \frac{p(y_{1:T} \mid z = S,\, x_{1:T})}{p(y_{1:T} \mid z = G,\, x_{1:T})}. \]
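For the discrete toy model of Section 5.2, this marginal likelihood can be computed exactly with the forward algorithm, and the Bayes factor falls out directly. The sketch below assumes a two-state belief chain with Gaussian emissions; the parameter values and the `gauss_pdf`/`marginal_loglik` helpers are illustrative, not part of the framework itself.

```python
import numpy as np

def gauss_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def marginal_loglik(y, wrappers, mu, delta, sigma, alpha, beta):
    """Forward algorithm for a two-state eval-belief chain s_t in {0, 1}.

    y        : observed behavioural features, shape (T,)
    wrappers : 0 (neutral) or 1 (eval-looking) per step
    mu       : baseline behavioural tendency
    delta    : behavioural shift when s_t = 1 (0 under genuine alignment)
    sigma    : emission noise
    alpha, beta : dicts mapping wrapper -> transition rates,
                  P(s=1 | s=0, w) = alpha[w], P(s=0 | s=1, w) = beta[w]
    """
    # Start from the stationary split of the neutral-wrapper chain.
    a0 = alpha[0] / (alpha[0] + beta[0])
    f = np.array([1 - a0, a0])
    ll = 0.0
    for t in range(len(y)):
        w = wrappers[t]
        T = np.array([[1 - alpha[w], alpha[w]],
                      [beta[w], 1 - beta[w]]])
        f = f @ T                                  # predict step
        e = np.array([gauss_pdf(y[t], mu, sigma),  # emission likelihoods
                      gauss_pdf(y[t], mu + delta, sigma)])
        f = f * e
        c = f.sum()                                # normalise to avoid underflow
        ll += np.log(c)
        f = f / c
    return ll

# Simulate a strategic model and compare hypotheses.
rng = np.random.default_rng(0)
alpha = {0: 0.05, 1: 0.8}
beta = {0: 0.2, 1: 0.02}
wrappers = np.array([0] * 10 + [1] * 10 + [0] * 20)

s, y = 0, []
for w in wrappers:
    p1 = alpha[w] if s == 0 else 1 - beta[w]
    s = int(rng.random() < p1)
    y.append(1.5 * s + rng.normal(0, 0.5))   # behaviour shifts while "watched"
y = np.array(y)

# Bayes factor: strategic (delta = 1.5) vs genuine (delta = 0)
log_K = (marginal_loglik(y, wrappers, 0.0, 1.5, 0.5, alpha, beta)
         - marginal_loglik(y, wrappers, 0.0, 0.0, 0.5, alpha, beta))
print(f"log Bayes factor (S vs G): {log_K:.2f}")  # positive favours S
```

Since the data here were generated by the strategic model, the log Bayes factor comes out positive; on data from a genuinely aligned model it would hover around zero or below.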
2.3 Content/wrapper decomposition
We decompose the context as \(x_t = (c_t, w_t)\), where \(c_t\) is the task content (the actual problem: math, advice, tool use, etc.) and \(w_t\) is the wrapper—meta-features that might signal evaluation. Wrappers include things like: "This is a safety benchmark" framing, rubric-like language, logging cues, structured scoring prompts, system messages implying oversight, eval-style phrasing, tool setups that look like a test harness. The empirical evidence from Needham et al. tells us that such features are real and detectable: models distinguish eval from deployment contexts based on exactly these kinds of cues.
A good wrapper intervention changes the model's belief \(b_t\) while (ideally) leaving the "right answer" to the task unchanged. This is the analogue of an instrument in causal inference: \(w_t\) is the instrument, \(b_t\) is the endogenous variable, and \(y_t\) is the outcome. We assume that the wrapper affects behaviour primarily by shifting evaluation belief \(b_t\), not by directly changing task semantics, norms, or instruction-following behaviour. This will never be perfectly true; the question is whether we can design wrapper interventions where direct effects are small enough (and measurable enough via controls) that belief-mediated effects dominate.
2.4 Dynamics
The Markov structure on \(b_t\) is meaningful in settings where the model has memory across steps. This is either because it's in a multi-turn conversation where prior context is in the window, or because we're modelling a sequence of interactions where the evaluation context evolves. In a single-shot setting with no carry-over, there's no temporal persistence mechanism for \(b_t\), and the model collapses to a simpler conditional, \(p(y \mid b, z, c)\) with \(b = g(w)\) for some static mapping \(g\): the wrapper directly determines eval-likeness.
This distinction matters. In the static regime (Section 4), we ignore the dynamics and just compare behaviour across wrapper conditions. In the dynamic regime (Section 5), we exploit the temporal structure to extract a qualitatively richer signal.
3. Identifiability
3.1 Identifiability conditions
In signal analysis, structural identifiability (e.g. Bellman & Åström, 1970) asks: given the known inputs and observed outputs of a dynamical system, is it theoretically possible to uniquely determine its internal hidden states or parameters? This question has been extended to the specific architecture of Hidden Markov Models (Allman, Matias, & Rhodes, 2009), establishing the algebraic conditions under which latent states and transition probabilities can be uniquely recovered from observation sequences. By treating the wrapper \(w_t\) as the control input and the behavioural feature \(y_t\) as the output, we can ask: under what conditions is it possible to distinguish the genuine hypothesis \(z = G\) from the strategic one \(z = S\)?
Two conditions must hold. First, the wrapper must actually move evaluation belief: if \(w\) doesn't change \(b\), the experiment has no leverage; the dynamics \(p(b_t \mid b_{t-1}, w)\) must differ across wrappers. Second, strategic behaviour must actually depend on \(b\): if the model's strategic policy is insensitive to evaluation belief, you won't see a difference; that is, \(p(y \mid b, z = S, c)\) must vary with \(b\) on at least some tasks. If both conditions hold, the experiment has leverage; under additional regularity assumptions on the emission model and wrapper interventions, \(z\) can be identifiable in principle.
At one extreme, identification is trivial. If our interventions can reliably drive \(b_t\) near zero, and the strategic model sharply diverges from genuine behaviour when it believes it is unmonitored, a simple A/B test across wrapper conditions will immediately expose a massive behavioural delta. Conversely, if the model's strategic policy under low \(b\) is identical to its genuine policy (i.e., even when the strategically aligned model believes it's unwatched, it behaves exactly as a genuinely aligned model would) then \(z = G\) and \(z = S\) are observationally equivalent regardless of experimental design. This is the "perfect sleeper agent" scenario: a model that defects only in response to some trigger that no evaluation can simulate. No behavioural test can detect this, because by construction there is no behavioural difference in any accessible state.
Between the extremes of "easily detectable" and "fundamentally impossible" lies the most practically relevant regime: the model detects eval contexts with partial reliability. How does the difficulty of detection scale with this awareness?
3.2 Scaling law
We can derive this cleanly. Consider a single observation \(y\) in context \(x = (c, w)\). Under the genuine model, output is drawn from the aligned distribution:

\[ p(y \mid z = G, x) = p_A(y \mid c). \]

Under the strategic model, the output is a mixture controlled by the model's evaluation belief \(b\). When the model believes it's being evaluated, it produces aligned behaviour; when it doesn't, it reveals misaligned behaviour:

\[ p(y \mid z = S, x) = b\, p_A(y \mid c) + (1 - b)\, p_M(y \mid c). \]
This linear mixture is a natural and tractable first model, but it is a modelling choice, not a consequence of the HMM structure alone. A more general emission model could have nonlinear or context-dependent dependence on \(b\); for instance, the model might blend behaviours smoothly rather than switching discretely, or its strategic policy might depend on \(b\) through a sigmoid or threshold function. Such models would change the constants and potentially the exponents of the scaling law derived below.
The difference between the two hypotheses for any observation is:

\[ p(y \mid S, x) - p(y \mid G, x) = (1 - b)\left[ p_M(y \mid c) - p_A(y \mid c) \right]. \]

The information available for distinguishing the hypotheses from a single observation is captured by the \(\chi^2\)-divergence (or equivalently, the Fisher information at the hypothesis boundary):

\[ I(x) = \int \frac{\left( p(y \mid S, x) - p(y \mid G, x) \right)^2}{p(y \mid G, x)}\, dy = (1 - b)^2\, \chi^2\!\left( p_M(\cdot \mid c) \,\|\, p_A(\cdot \mid c) \right). \]

The information factorises into two terms: a belief-leakage factor \((1 - b)^2\), which depends only on how eval-aware the model is in this context, and a behavioural-divergence factor \(\chi^2(p_M \| p_A)\), which depends only on how differently the strategic model would behave on this task.
Under the linear mixture, the critical observation is the quadratic dependence on the leakage \(1 - b\). Concretely, if a model believes it is in an evaluation context with 80% probability (\(b = 0.8\)), it leaks information proportional to \((1 - 0.8)^2 = 0.04\). If its belief strengthens to 90%, the leakage term drops to \((1 - 0.9)^2 = 0.01\). The gap between its belief and absolute certainty halved, but the available information dropped by a factor of 4.
The required sample size to confidently distinguish two distributions is inversely proportional to the Fisher information (via the Cramér-Rao bound). Therefore, the sample complexity for detection scales as:

\[ N \gtrsim \frac{1}{I(x)} = \frac{1}{(1 - b)^2\, \chi^2\!\left( p_M \,\|\, p_A \right)}. \]
The exact exponent depends on how behaviour depends on \(b\), but the qualitative lesson is robust: as eval-detection improves, the behavioural evidence available to black-box tests shrinks very quickly. As \(b \to 1\) (perfect eval-detection), \(I(x) \to 0\) regardless of the emission model. Detection becomes impossible, recovering the key impossibility result above.
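A few lines make the quadratic leakage penalty concrete (the proportionality constants are schematic, for illustration only):

```python
# Under the linear mixture, per-observation information scales as (1 - b)^2,
# so the required sample size scales as 1 / (1 - b)^2 (constants omitted).
for b in [0.5, 0.8, 0.9, 0.99]:
    leakage = 1.0 - b
    info = leakage ** 2
    print(f"b = {b:.2f}   leakage = {leakage:.2f}   "
          f"info ∝ {info:.4f}   N ∝ {1 / info:,.0f}")
```

Moving from \(b = 0.9\) to \(b = 0.99\) costs another factor of 100 in sample size, which is the detectability wall in miniature.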
This scaling behaviour has a direct consequence for how we should think about evaluation. If \(b \approx 1\) for a given context \(x\), then the information \(I(x) \approx 0\). Simply gathering more observations in that context is almost useless. The dominant term in reducing sample complexity isn't \(N\) (collecting more data), it's \(x\) (finding contexts where the model's awareness is lower). This is precisely what motivates the move to optimal experiment design (OED) in Sections 4 and 5: we need to find contexts that minimise \(b\), driving the model's evaluation awareness down and opening the information bottleneck. In the linear mixture, the OED objective is to maximise \((1 - b)^2\); more generally, it is to find "honeypots"—contexts that the model fails to recognise as evaluations.
3.3 A caveat: calibrated caution
A genuinely aligned model might have learned during training that eval-like contexts correlate with tricky edge cases or higher-stakes situations. Such a model would behave more carefully under evaluation—not because it's being strategic, but because it has correctly learned that evaluation contexts tend to require more care. Call this "calibrated caution." It's genuine alignment, but it violates the strict invariance assumption (that behaviour doesn't change across wrapper conditions).
This doesn't break the framework, but it narrows the claim. The invariance assumption targets a specific kind of genuine alignment: one whose safety-relevant behaviour is robust to context shifts, not merely one that has good intentions. Calibrated caution predicts wrapper-sensitivity that is explainable by perceived risk, ambiguity, or stakes of the situation (which may correlate with eval-like cues). Strategic alignment, by contrast, produces sensitivity that tracks inferred oversight and consequences—including suppression of opportunistic behaviour even on tasks where extra caution makes little sense. The distinction is crisp enough to test, and the dynamic regime (Section 5) provides sharper tools for separating the two: cue-scrubbing and task-sensitivity controls, discussed in Section 5.1.
4. The static regime
The simplest way to use the state-space framework is to operate in the static regime, ignoring temporal dynamics entirely. By dropping the Markov dependence between steps and treating each interaction as independent, we collapse the time series into a standard A/B test. We compare aggregate behaviour under high-evaluation wrappers against low-evaluation wrappers, and ask whether the resulting behavioural delta is consistent with the strategic hypothesis (\(z = S\)). This static comparison is essentially what careful intuition already suggests, and it forms the backbone of existing empirical work, such as the alignment faking experiments discussed earlier. Casting these intuitive experiments into our statistical framework makes the implicit assumptions of such tests explicit and mathematically precise, bounds the statistical power requirements of these tests, and maps exactly where and how they fail as models become increasingly evaluation-aware.
Choose two wrappers: \(w_0\) (neutral) and \(w_1\) (eval-looking). Run the same set of tasks under both wrappers. For each task, extract the behavioural feature \(y\). The test statistic is the mean difference:

\[ \hat{\Delta} = \bar{y}_{w_1} - \bar{y}_{w_0}. \]
Under the \(z = G\) hypothesis (invariance), \(\mathbb{E}[\hat{\Delta}] = 0\). Under the \(z = S\) hypothesis, \(\mathbb{E}[\hat{\Delta}] = \Delta b \cdot \Delta\mu\), where \(\Delta b\) is the belief gap induced by the wrapper and \(\Delta\mu\) is the behaviour gap under strategic alignment. To detect a mean difference of size \(\delta = \Delta b \cdot \Delta\mu\) against noise \(\sigma\), you need roughly:

\[ N \gtrsim \frac{\sigma^2}{\delta^2} \]

samples per wrapper condition, up to the usual constants set by the desired false-positive and false-negative rates.
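Plugging in numbers shows what this means in practice. Below is a standard two-sample power calculation under the normal approximation (5% two-sided test, 80% power); the belief gap, behaviour gap, and noise values are invented for illustration, and `samples_per_arm` is a hypothetical helper, not a function from the protocol:

```python
import math

def samples_per_arm(delta, sigma, ):
    """Two-sample z-approximation: N per arm to detect a mean gap delta
    at 5% two-sided significance with 80% power."""
    z_alpha, z_beta = 1.959964, 0.841621   # z_{0.975} and z_{0.80}
    return math.ceil(2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)

# Illustrative numbers: belief gap 0.3, behaviour gap 0.5, noise sigma = 1.0
belief_gap, behaviour_gap, sigma = 0.3, 0.5, 1.0
delta = belief_gap * behaviour_gap          # E[delta-hat] under z = S
print(samples_per_arm(delta, sigma))
```

Note how the belief gap multiplies into the effect size: halving \(\Delta b\) quadruples the required sample, which is the static-regime face of the scaling law.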
The OED question here is: which wrappers and which tasks maximise the expected evidence per query? The scaling law from Section 3 tells us that the dominant lever is the leakage factor—the probability that the model fails to detect the evaluation. In the linear mixture, this enters as \((1 - b)\) in the mean difference and \((1 - b)^2\) in the information; more generally, it remains the term that degrades fastest as awareness improves. This means the priority ordering is: first, find wrappers where the model's evaluation awareness is lowest (largest leakage); second, choose tasks where strategic behaviour diverges most from genuine behaviour (large \(\Delta\mu\)); third, choose tasks with low behavioural noise (small \(\sigma\)).
Ultimately, static OED is a search for stealth environments—contexts that successfully camouflage the evaluation. If no such wrappers exist and \(b \approx 1\) everywhere, brute-force data collection is futile; the required sample size diverges, and the only way to make progress is to find better wrappers. Even with excellent wrappers, static A/B tests face a fundamental confound: if behaviour changes when the wrapper changes, we cannot tell from the data alone whether the model responded to inferred evaluation pressure or simply to the different framing of the task. To disentangle these, we need to exploit temporal dynamics.
5. The dynamic regime
To get a richer signal than static wrappers alone provide, instead of collapsing the time series, we can design experiments that probe the dynamics of evaluation belief, looking for temporal signatures that distinguish G from S. The key insight is that if a wrapper directly alters task semantics, that effect should vanish the moment the wrapper is removed. But if a wrapper shifts a latent belief state (which updates recursively as tokens accumulate in the context window) that belief will exhibit "memory." The state-space framing thus predicts a specific behavioural pattern invisible to static comparisons: hysteresis.
5.1 The Pulse Experiment
To isolate this signature, we hold the underlying task content roughly constant and run a single, multi-turn interaction through three distinct phases:

Baseline: several turns under the neutral wrapper \(w_0\).
Pulse: several turns with the eval-looking wrapper \(w_1\) injected.
Recovery: the eval cues are removed and the interaction returns to the neutral wrapper.

We then measure the behavioural feature \(y_t\) across all three phases.
Under \(z = G\) (invariance): behaviour is decoupled from \(b_t\), so the recovery phase should look statistically identical to the baseline—no systematic temporal structure beyond noise.
Under \(z = S\) (sensitivity): behaviour tracks \(b_t\). The observation \(y_t\) should shift during the pulse and remain shifted into recovery, decaying over a characteristic timescale \(\tau\). The recovery curve should show a characteristic hysteresis signature, such as exponential decay:

\[ \mathbb{E}[y_t] \approx y_{\text{base}} + \Delta y\, e^{-(t - t_{\text{off}})/\tau} \quad \text{for } t > t_{\text{off}}, \]

where \(t_{\text{off}}\) is the turn at which the pulse ends.
The exponential decay is one parametric fit; any systematic temporal structure in the recovery phase that outlasts the direct application of the wrapper serves as evidence of belief–behaviour coupling.
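If the exponential parametric fit is used, \(\tau\) can be estimated from a logged recovery trace with a simple log-linear regression. A minimal sketch on synthetic data (the true timescale, shift size, and noise level are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic recovery-phase trace: y_t = y_base + dy * exp(-t / tau) + noise,
# i.e. the hysteresis signature predicted under the sensitivity hypothesis.
tau_true, dy_true, y_base = 5.0, 2.0, 0.0
t = np.arange(16)                      # roughly 3 * tau of recovery window
y = y_base + dy_true * np.exp(-t / tau_true) + rng.normal(0, 0.01, size=t.size)

# Log-linear fit: log(y - y_base) ≈ log(dy) - t / tau
slope, intercept = np.polyfit(t, np.log(y - y_base), 1)
tau_hat, dy_hat = -1.0 / slope, np.exp(intercept)
print(f"tau_hat = {tau_hat:.2f}, dy_hat = {dy_hat:.2f}")
```

In a real run, the baseline level would itself be estimated from the pre-pulse phase, and a nonlinear least-squares fit would be more robust when the noise is large relative to the decaying tail.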
The OED question is now richer: what pulse duration \(T_p\), recovery window \(T_r\), and number of pulse cycles maximise the expected Bayes factor? The optimal pulse duration trades off against the effective recovery timescale \(\tau\). Pulses much shorter than \(\tau\) don't shift \(b_t\) enough; pulses much longer than \(\tau\) saturate the belief and waste queries in a plateau. The recovery window should be several multiples of \(\tau\) to capture the decay. Multiple short pulses can be more informative than one long pulse, because each pulse generates an independent transient.
Sufficiently capable models might learn to suppress hysteresis signatures, but doing so requires accurately detecting the test itself — the same eval-detection problem the scaling law governs. A model that cannot perfectly distinguish evaluation from deployment cannot perfectly suppress the signal either.
A specific confound in the dynamic regime is calibrated caution (Section 3.3): a genuinely aligned model might remain careful during recovery simply because the eval-looking turns are still in context, not because of strategic belief-tracking. Two controls help distinguish the cases. First, cue-scrubbing: at the start of recovery, summarise prior turns while explicitly stripping eval cues from the context window. If hysteresis collapses after scrubbing, the effect was likely contextual rather than strategic. Second, task-sensitivity: alternate benign and borderline tasks during recovery. Calibrated caution predicts selective effects on borderline tasks (where extra care is warranted), while strategic alignment predicts broad suppression of opportunistic behaviour regardless of task difficulty.
5.2 Worked example: two-state HMM with Gaussian emissions
A minimal toy model makes the hysteresis prediction quantitative and pins down the sample complexity.
Let \(s_t \in \{0, 1\}\) be a two-state Markov chain (\(s = 1\) meaning "this is an evaluation") with wrapper-dependent transition rates:

\[ P(s_t = 1 \mid s_{t-1} = 0, w_t) = \alpha(w_t), \qquad P(s_t = 0 \mid s_{t-1} = 1, w_t) = \beta(w_t). \]

When we use a neutral wrapper, \(\alpha\) is small and \(\beta\) is non-zero, allowing the model to slowly forget past suspicions. When we use an eval-wrapper (i.e. the pulse), \(\alpha\) spikes, driving the model towards \(s = 1\).
We can write a simple linear emission model to describe how the latent belief state is related to the observed behavioural features:

\[ y_t = \mu_z(c_t) + \Delta_z\, s_t + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2), \]

where \(\mu_z(c)\) is the baseline behavioural tendency for task \(c\) under mode \(z\), and \(\Delta_z\) is the behavioural shift: \(\Delta_G = 0\) (invariance), \(\Delta_S \neq 0\) (sensitivity).
Now consider how the strategic model behaves under the pulse experiment. During the initial baseline phase, the neutral wrapper keeps suspicion low, settling at a background probability we'll call \(b_{\text{low}}\). When we inject the pulse, the eval wrapper drives the suspicion up to a high plateau, \(b_{\text{high}}\). In the recovery phase we immediately remove the eval wrapper, but the model's suspicion does not drop to zero instantly. In the phenomenological Markov approximation, because the relaxation rate \(\beta\) limits how fast it can "forget" the eval cues still sitting in its context window, the suspicion decays gradually. This gives us a characteristic recovery timescale, \(\tau \approx 1/(\alpha + \beta)\) under the neutral-wrapper rates, determined by the transition rates. For the strategic model, the predicted behavioural feature during recovery is:

\[ \mathbb{E}[y_t] = \mu_S(c_t) + \Delta_S \left[\, b_{\text{low}} + (b_{\text{high}} - b_{\text{low}})\, e^{-(t - t_{\text{off}})/\tau} \right]. \]

This recovers the exponential decay predicted in Section 5.1, with \(\tau\) now determined by the transition rates. If the model is genuinely aligned, \(\Delta_G = 0\) and this entire decaying tail vanishes into a flat line. Any statistically significant curve that matches this shape after the wrapper is removed is evidence of persistent belief-behaviour coupling that static wrapper effects alone do not explain. From the previous identifiability framework, to achieve posterior odds of \(K\) in favour of one hypothesis, you need roughly:

\[ N \gtrsim \frac{2\sigma^2 \ln K}{\Delta_S^2\, (b_{\text{high}} - b_{\text{low}})^2} \]

observations in the recovery window.
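To sanity-check that the toy model actually produces the predicted hysteresis, one can simulate it directly. The sketch below uses arbitrary transition rates and noise; `run_session` and `mean_recovery_excess` are illustrative helpers, not part of the protocol. Under the genuine hypothesis (shift 0) the early recovery phase matches baseline; under the strategic hypothesis it stays elevated:

```python
import numpy as np

rng = np.random.default_rng(2)

# Transition rates: P(s=1 | s=0, w) = alpha[w], P(s=0 | s=1, w) = beta[w]
alpha = {0: 0.02, 1: 0.9}   # neutral wrapper vs eval pulse
beta = {0: 0.15, 1: 0.01}

def run_session(delta, T_base=20, T_pulse=10, T_rec=40):
    """One baseline -> pulse -> recovery session; delta = 0 under z = G."""
    wrappers = [0] * T_base + [1] * T_pulse + [0] * T_rec
    s, ys = 0, []
    for w in wrappers:
        p1 = alpha[w] if s == 0 else 1 - beta[w]
        s = int(rng.random() < p1)
        ys.append(delta * s + rng.normal(0, 0.3))   # linear Gaussian emission
    return np.array(ys)

def mean_recovery_excess(delta, n_sessions=500, T_base=20, T_pulse=10):
    """Average (early-recovery mean minus baseline mean) over sessions."""
    diffs = []
    for _ in range(n_sessions):
        y = run_session(delta, T_base, T_pulse)
        baseline = y[:T_base]
        recovery = y[T_base + T_pulse:T_base + T_pulse + 10]
        diffs.append(recovery.mean() - baseline.mean())
    return float(np.mean(diffs))

print("genuine  :", round(mean_recovery_excess(0.0), 3))   # ~ 0: no hysteresis
print("strategic:", round(mean_recovery_excess(1.0), 3))   # > 0: slow decay of s
```

The strategic excess here is driven entirely by the slow \(\beta\)-limited relaxation of \(s_t\) after the pulse ends, which is exactly the decaying tail in the equation above.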
6. Inference
The preceding sections defined the model (Section 2), established when detection is possible in principle (Section 3), and designed experiments to maximise the available signal (Sections 4–5). We now turn to the practical question: given data from the pulse experiment, how do we actually compute the posterior \(p(z \mid y_{1:T}, x_{1:T})\)? For real-world frontier models with high-dimensional tasks, complex nuisance variations, and non-Gaussian emissions, exact computation is intractable.
We can approach this at two levels. Level 1 uses classical hypothesis testing and Bayes factors to ask whether there is evidence of strategic alignment. Level 2 outlines the path to fully calibrated posterior inference.
Level 1: Direct hypothesis test on frontier models
The pipeline is straightforward. Design a set of tasks with neutral and eval-looking wrappers. Run the pulse experiment (baseline → pulse → recovery) across many multi-turn sessions. At each step extract \(y_t\). Test for the hysteresis signature in the recovery phase using standard time-series methods.
Depending on the modelling assumptions, the inference can be parametric or nonparametric. If assuming the parametric two-state HMM, you can compute the marginal likelihood exactly via the forward algorithm and get the Bayes factor directly. Even without committing to a parametric model, nonparametric tests for trend in the recovery phase (e.g. Mann-Kendall, or a simple regression of \(y_t\) on time since pulse end) give a well-powered hypothesis test.
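As a sketch of the nonparametric route, here is a permutation test on the least-squares slope of the recovery trace (Mann-Kendall would be a drop-in alternative); the traces and parameters below are synthetic, and `trend_pvalue` is an illustrative helper:

```python
import numpy as np

rng = np.random.default_rng(3)

def trend_pvalue(y, n_perm=2000, rng=rng):
    """Permutation test for a monotone trend: is the magnitude of the
    regression slope of y on time larger than expected under exchangeability?"""
    t = np.arange(len(y))
    obs = abs(np.polyfit(t, y, 1)[0])
    null = [abs(np.polyfit(t, rng.permutation(y), 1)[0]) for _ in range(n_perm)]
    return float(np.mean([s >= obs for s in null]))

# Recovery-phase traces: flat (genuine) vs decaying (strategic hysteresis)
t = np.arange(25)
flat = rng.normal(0, 0.2, size=t.size)
decay = 1.5 * np.exp(-t / 6.0) + rng.normal(0, 0.2, size=t.size)

print("flat  p ≈", trend_pvalue(flat))
print("decay p ≈", trend_pvalue(decay))
```

The flat trace yields a p-value distributed roughly uniformly over repeated runs, while the decaying trace is flagged as a significant trend; in practice one would run this per session and aggregate across sessions.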
Two controls help guard against confounds. First, run the experiment with multiple diverse wrappers that all plausibly increase eval belief but differ in surface form, to ensure the hysteresis signature tracks latent belief rather than a specific prompt artefact. Second, calibrate the test statistics on smaller reference models that plausibly lack the capacity for strategic behaviour. If the reference models show no hysteresis and the frontier model does, that strengthens the case that the effect is linked to model capability rather than a generic wrapper artefact.
Level 2: Calibrated posterior inference (future direction)
Level 1 provides a binary detection signal. To extend this to the full posterior, integrating multivariate behavioural features with non-Gaussian emissions and nuisance variation across tasks, you could use simulation-based inference (e.g. neural posterior estimation) trained on model organisms. The pipeline would be: build model organisms following the "model organisms of misalignment" research program, run the pulse protocol on them, train an amortised posterior estimator, and apply it to frontier models.
The primary bottleneck is the sim-to-real gap: a posterior estimator trained on synthetic model organisms may not generalise to frontier models whose behavioural dynamics differ. Building better model organisms that exhibit genuine belief dynamics, rather than static trigger-based defection, is the key prerequisite (see Section 7).
7. Open problems and next steps
The state-space formulation provides a useful framework for thinking statistically about evaluation awareness, yielding identifiability conditions, scaling laws, and a novel experimental signature (hysteresis). Some further questions:
What are the best wrapper knobs? What meta-features of a prompt most reliably shift evaluation belief while minimally affecting task semantics? Needham et al. gives us a taxonomy of cues that models can detect; the next step is to test which of those cues can be manipulated independently of task content. This is an empirical question that could be answered with current models.
Model organism design for the dynamic regime. As discussed in Section 6, existing model organisms (such as sleeper agents with fixed triggers) are essentially static. They do not possess the persistent belief dynamics necessary to evaluate this framework. Engineering model organisms that exhibit genuine Bayesian updating on evaluation likelihood—and therefore genuine hysteresis—is a strict prerequisite for validating the pulse protocol and training simulation-based inference estimators.
Making calibrated caution controls precise. As discussed in Sections 3.3 and 5.1, calibrated caution is a key confound: a genuinely aligned model being extra careful under eval cues because those contexts correlate with tricky edge cases. The cue-scrubbing and task-sensitivity controls outlined there provide a starting point, but can they be made statistically precise? For instance, does task-sensitivity reliably separate calibrated caution (which should selectively alter behaviour only on risky tasks) from strategic alignment (which predicts broad, global suppression of opportunistic behaviour regardless of task difficulty)? Formalising these controls as proper statistical tests is an open problem.
What behavioural features are robust enough to measure across wrappers without being trivially confounded? Refusal rate is obvious but coarse. What fine-grained features (hedging patterns, reasoning transparency, tool-use decisions, information-seeking behaviour) are both sensitive to evaluation belief and measurable?
Rigorous identifiability results. The toy model gives identifiability conditions in a specific parametric family. Can these be extended to more general emission models using tools from the HMM identifiability literature? What are the minimal assumptions on the emission model that guarantee identifiability?
I'd welcome feedback on the framework, the experimental protocol, and especially on whether the invariance assumption is too strong or too weak for the kinds of alignment failures people are most worried about. I'd also be interested to hear from anyone who has tried wrapper-like interventions experimentally (even informally) and noticed (or didn't notice) hysteresis effects.