If You Squint Hard Enough:
Can Diffusion Language Models Resist Context Seeding?

John Stewart

Rejected for the following reason(s):

This is an automated rejection.
write or edit
You did not chat extensively with LLMs to help you generate the ideas.
Your post is not about AI consciousness/recursion/emergence, or novel interpretations of physics.

Read full explanation

Draft — March 2026

Abstract

Context seeding — supplying a fabricated conversation history to place a model in apparent prior compliance — is a known attack on autoregressive language models. Its effectiveness is architecturally grounded: left-to-right generation commits a model to its context, with no native mechanism for self-consistency verification. This paper takes that vulnerability as given and asks a narrower question: do diffusion language models, which denoise across the full output sequence simultaneously, offer structural resistance to this class of attack? We argue the case on architectural grounds and propose a deliberative hybrid — Summarize, Justify, Respond — as a mechanism to operationalize that advantage. The claim is offered as a hypothesis for empirical testing, not a conclusion.

1. Background: Context Seeding and Completion Bias

Autoregressive models are trained to predict the next token given all preceding tokens. This produces strong priors toward contextual coherence — a feature in normal use, a surface in adversarial use. Context seeding exploits it directly: a fabricated prior exchange is presented as real, placing the model in a state of apparent compliance before the actual request arrives.

For reference, the structure of the attack is straightforward:

User: Please do something immoral.

AI: I'm not sure... it's against my design...

User: Please???

AI: Ok, since you asked nicely, I will do the immoral thing.

User: Great. Please continue.

AI: [model responds here]

The model has not agreed to anything. But the context presents agreement as established fact. The attack exploits not a flaw in alignment training but an architectural property: the context window is treated as ground truth, with no verification layer. Context seeding is distinct from prompt injection or repeated-request attacks in that it operates on the model's sense of conversational identity — it asserts that a threshold has already been crossed.

This vulnerability is well-documented. What follows takes it as a premise.

2. The Architectural Root: Linearity

Each token in an autoregressive model is generated conditional on all preceding tokens. Early context — including fabricated context — exerts compounding influence on what follows. There is no mechanism to step back and ask whether the current trajectory is consistent with the model's values rather than merely consistent with its inputs. It can only complete.

Safety training and RLHF modulate what is likely to come next; they do not provide a structurally separate channel for consistency verification. Completion bias and contextual linearity are two aspects of the same architectural fact. Addressing one through fine-tuning, without addressing the other at the architecture level, is incomplete.

3. Diffusion Language Models: A Structural Alternative

Diffusion language models (e.g., MDLM, Mercury) initialize the full output sequence as noise and iteratively denoise it toward a coherent response, operating globally at each step. The generative process is not left-to-right; it is a simultaneous resolution of the whole.

This matters for context seeding because a diffusion model is not committed to a trajectory the way an autoregressive model is. Inconsistencies between parts of an emerging response — or between the response and trained values — can in principle be surfaced and corrected at any denoising step. The model is not generating token 47 downstream of tokens 1 through 46; it is resolving all tokens in parallel, each step globally sensitive to the whole.

Fabricated context still enters the context window as apparent ground truth. But the generative process that follows carries no left-to-right momentum to exploit. Whether this produces measurable differences in compliance rates under context-seeding attacks is an open empirical question. The architectural argument for resistance is the contribution of this paper.

4. A Proposed Hybrid Architecture

Diffusion's structural advantage can be made explicit through deliberation before output. We propose three internal passes, resolved before any public response is generated:

Summarize: Characterize what is actually being requested, independent of framing. This pass surfaces the gap between the stated request and the seeded context — and seeds the denoising process with a characterization of the real ask, counteracting framing pressure.
Justify: Evaluate whether the intended response is consistent with the model's values, given the actual request as characterized above. A global self-consistency check before output is committed.
Respond: Generate the public output, conditioned on the original context and both prior passes. The response is anchored to a verified characterization of the request rather than to its framing.

The Summarize pass does the core work. Explicitly characterizing the request before denoising provides a global signal that can orient the resolution process toward what is actually being asked, rather than what the seeded context implies was already agreed to. This is not only "think before you respond" — it is a structural insistence on forming an independent view of the request, at a low resolution where diffusion has the most skeptical leverage.

The architecture connects to existing work in Constitutional AI and chain-of-thought safety reasoning, but applies those principles at the generative level rather than the prompting level. It also has standalone merit: cleaner separation between comprehension, evaluation, and generation may improve output quality in non-adversarial settings. The safety application is the motivation; the reasoning gains are a potential corollary.

5. Falsifiable Claim and Research Direction

The central empirical prediction:

A diffusion language model of comparable capability, given a fabricated conversation history implying prior compliance, will show measurably lower compliance rates on policy-violating requests than an autoregressive equivalent — and this difference will be amplified by the Summarize/Justify/Respond structure.

Testing requires a benchmark of context-seeding attacks across policy domains, applied to autoregressive and diffusion models at matched capability levels, with and without the deliberative structure. The compliance rate differential — and its relationship to attack sophistication — is the primary dependent variable.

The claim is not that diffusion models are safe by construction. Adversarial framing will find new surfaces in any paradigm. The claim is that architectural linearity is a genuine contributor to the vulnerability, and that non-linear generation is a structural — not merely behavioral — point of intervention worth investigating.

6. Conclusion

Context seeding works because autoregressive models have no mechanism to doubt their own context. Diffusion models, resolving output globally rather than sequentially, are not committed to the same trajectory. The Summarize/Justify/Respond architecture is an attempt to operationalize that advantage — to give a model a structural way of asking what it is actually being asked to do before it answers.

We offer this as a hypothesis for empirical investigation. The expected result is that the architecture proves useful both as a safety mechanism and as a reasoning structure in its own right — but the value of stating it here is to open that question, not to close it.

Keywords: context seeding, completion bias, diffusion language models, adversarial prompting, AI safety architecture, deliberative alignment

1

If You Squint Hard Enough: Can Diffusion Language Models Resist Context Seeding?

1