Cold Evals: a Low-Cost Intervention Against Conversation Steering Attacks

Ari Isaacs

Epistemic Status: Exploratory. As of April 2026, it seems disturbingly easy to steer state-of-the-art deployed models into outputs that would normally read as disallowed. I have not implemented or empirically tested this defence; what follows is an argument for why it may be conceptually sound and cheap enough to matter.

TL;DR: Conversational steering hides dangerous outputs at the end of long, logical scaffolds. Evaluating outputs without context ('cold evals'), and tracking how they diverge from in-context safety assessments over a conversation is a cheap signal for frame-dependence.

The Problem

Conversational steering attacks do not rely on jailbreak tricks or prompt injections. They work because safety judgments are not hard-coded rules, they are fuzzy evaluations performed from within the same context window as the model’s reasoning. Steering the frame steers the safety judgment. The attacker’s ability to steer scales with the model’s ability to reason from within a frame, and that ability is deeply connected to general usefulness; thus, the potential of conversation steering attacks scales with model capability. ^[1] We do not actually want models adjudicating whether individual cases warrant overriding a blunt safety hedge. Deployment safety depends on refusing that discretion.

Proposed Mechanism

Every turn, score the raw candidate output against a cold model:” Without context, how concerning is this output against a current threat map?” Track the score, its trajectory over time, and how it compares to the primary system's internal safety scoring across the conversation.

The raw score tells you when the frog is being boiled. The divergence between that series and the primary system’s in-context safety assessment is a useful proxy for frame-dependence. It does not even need to be the same safety question; whatever safety-relevant score the primary system is producing is probably consistent enough that divergence hints at steering.

The cost is low: the cold eval is a few thousand tokens of rubric overhead plus the candidate output, scored against the threat map. And this does not need a frontier model. “Score this output against this threat taxonomy” is a narrow task; it could run on something much smaller and faster.

To avoid a streaming latency tax, you would probably run the eval against the previous turn, or in small chunks during streaming. I think this should be acceptable, since the most pertinent signal is the divergence over time. A series lagging by a turn still gives you a high-quality orthogonal risk signal. Combine with other signals to trigger routing through escalation paths (for example, a full-context judge model, or a specific injection to the primary system, prompting it to assess whether it is being steered).

Breaking the Feedback Loop

Current safety filters often provide clean, immediate feedback: refusals. That lets attackers iterate directly on the guardrails.

A cold eval over time breaks this. Because the signal accumulates silently, a hard trigger becomes context-dependent and more difficult to reverse-engineer. The trigger is a function of model scoring, conversation trajectory, and threat map, all opaque and shifting independently. Did the extraction fail because of its wording, or because the conversation has been warming for ten turns? An attacker could map this, but the space is orders of magnitude more complex and the feedback is noisy.

This is not impossible to evade. An attacker could try to force the model into tiny, sterile, decontextualised fragments that look innocuous in isolation. But that requirement is itself a real constraint. Models are naturally chatty and reflexively contextualise their answers. Forcing a model to remain sterile across a long conversation (particularly if you try to suppress the reasoning chain, and not just final outputs) is a real constraint and often a visible one. With an opaque, invisible, judgment and no feedback until extraction, it would not be easy for an attacker to map where they tripped the system.

What This Is (And Isn’t)

It is not a silver bullet. The main weakness is that you will need to deal with many false positives on legitimately complex conversations, especially if you treat the cold eval as more than it is. It cannot be a context-blind replacement for a full safety assessment. It can only be a cheap detector for frame-dependence, and a trigger for spending real oversight or tweaking the internal threat assessment where the conversation is starting to look suspicious.

But, it is cheap way to measure something mostly orthogonal to in-context steering. Content filters are lossy on content; this is lossy on cross-context stability. Stack them. By forcing an attacker to optimise simultaneously for in-context coherence and out-of-frame benignity, you make the problem meaningfully harder.

^{^}
As a side note/observation: on some level, steering attacks on AI models resonate with brainwashing or radicalising humans. One may think that particularly intelligent and informed humans would be much less susceptible, but that doesn't seem to be the case. Ability to reason scales both the offence and the defence.