Diffusion language models have become a hot topic lately. Models like LLaDA, Dream 7B, Mercury, and SDAR match autoregressive models on standard benchmarks while offering 2-4x faster inference through parallel token generation. For a deeper dive into how diffusion LMs work, see our previous post.
SDAR (Synergy of Diffusion and AutoRegression) is a block diffusion model based on the Qwen3 architecture. It creates tokens in blocks and uses iterative denoising to refine them rather than creating tokens one at a time.
Models such as Qwen3-Thinking and DeepSeek-R1 demonstrate the value of letting a model "reason" before responding, but this has been done almost exclusively with AR systems; reasoning in diffusion models remains underexplored.
So we asked: can we transfer reasoning capabilities from AR models to diffusion models?
Part 1: The Hypothesis
Finding Shared Structure
SDAR and Qwen3-base share the same architecture and initial weights. Within AR model families, merging strategies such as task vectors, SLERP, and linear interpolation have proven effective.
Our approach was shaped by Nepal et al. (2025), which showed that mathematical reasoning relies on a small number of specialised layers: removing those critical layers cuts maths accuracy by 80%, while factual recall hardly changes.
Zero Ablation: Finding Critical Layers
We ran zero ablation: zero out each layer's weights, measure GSM8K accuracy.
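In code, the ablation loop looks roughly like this. A minimal sketch assuming a Hugging Face-style model with blocks under `model.model.layers`; `evaluate_gsm8k` is a stand-in for our evaluation harness, not an actual function name:

```python
import torch
from copy import deepcopy

def zero_ablate_layer(model, layer_idx):
    """Return a copy of the model with one transformer block's weights zeroed."""
    ablated = deepcopy(model)
    with torch.no_grad():
        for param in ablated.model.layers[layer_idx].parameters():
            param.zero_()
    return ablated

# for idx in range(len(model.model.layers)):
#     acc = evaluate_gsm8k(zero_ablate_layer(model, idx))  # hypothetical harness
#     print(idx, acc)  # layers with the largest accuracy drop are "critical"
```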
| Model | Critical Layers (lowest accuracy when zeroed) |
| --- | --- |
| SDAR | 1 (6.25%), 6 (6.25%) |
| Qwen-Thinking | 6 (6.25%), 23 (0%), 26 (0%) |
Layer 6 stood out because it was critical for both models: a shared bottleneck through which both architectures appear to route mathematical reasoning, and therefore a natural target for merging.
CKA Analysis: Where Models Diverge
We validated with CKA (Centered Kernel Alignment) between AR and diffusion activations:
| Layer Range | CKA Score | Interpretation |
| --- | --- | --- |
| 0-5 | 0.73-0.99 | Nearly identical |
| 6 | 0.07 | Divergence point |
| 7-15 | 0.10-0.21 | Low similarity |
| 16-33 | 0.22-0.30 | Moderate |
| 34-35 | 0.64-0.75 | Converging |
Layer 6 shows a dramatic drop from 0.89 to 0.07. Both analyses pointed to Layer 6 as the divergence point. The hypothesis: merge Qwen's Layer 6 into SDAR to transfer reasoning.
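For reference, the scores above are linear CKA between matched activation matrices (hidden states from the same prompts at the same layer in each model). A minimal sketch, with our own variable names:

```python
import torch

def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_tokens, hidden_dim)."""
    x = x - x.mean(dim=0, keepdim=True)  # center each feature
    y = y - y.mean(dim=0, keepdim=True)
    cross = y.T @ x                      # cross-covariance between the two sets
    num = (cross ** 2).sum()             # ||Y^T X||_F^2
    denom = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)
    return (num / denom).item()
```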
Part 2: What We Tried
Approach 1: Layer-6 Linear Merging
We first finetuned SDAR with LoRA for 300 steps on reasoning data to teach it to produce <think> tokens (required for extended reasoning). Then we targeted Layer 6:
$$\theta_{\text{merged}} = (1-\alpha)\,\theta_{\text{SDAR-FT}} + \alpha\,\theta_{\text{Qwen}}$$
| Configuration | Layer 6 Ratio (SDAR/Qwen) |
| --- | --- |
| l6_merge_50 | 50/50 |
| l6_merge_70 | 70/30 |
| l6_merge_90 | 90/10 |
| l6_swap_100 | 0/100 (full replacement) |
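Applied to state dicts, the formula above amounts to interpolating only the tensors belonging to layer 6. A sketch; key names assume the usual Hugging Face `model.layers.{i}.` prefix, and `alpha` is the Qwen share from the table:

```python
def merge_layer6(sdar_ft_state, qwen_state, alpha=0.5, layer_idx=6):
    """θ_merged = (1 - α)·θ_SDAR-FT + α·θ_Qwen, applied only to one layer."""
    prefix = f"model.layers.{layer_idx}."
    merged = dict(sdar_ft_state)
    for key, value in sdar_ft_state.items():
        if key.startswith(prefix):
            merged[key] = (1.0 - alpha) * value + alpha * qwen_state[key]
    return merged
```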
Approach 2: Full-Model SLERP
Standard SLERP merging across all layers at various ratios (90/10, 70/30, 50/50).
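For reference, a per-tensor SLERP step looks like this. A common formulation, not necessarily the exact implementation in the merging toolkit we used; it falls back to plain linear interpolation when the two weight vectors are nearly parallel:

```python
import torch

def slerp(w_a, w_b, t):
    """Spherical interpolation between two weight tensors at ratio t toward w_b."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    cos = torch.clamp((a / a.norm()) @ (b / b.norm()), -1.0, 1.0)
    omega = torch.acos(cos)              # angle between the two weight vectors
    if omega < 1e-6:                     # nearly parallel: plain linear interpolation
        return (1 - t) * w_a + t * w_b
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return out.reshape(w_a.shape).to(w_a.dtype)
```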
Approach 3: Task Vectors
Extract the "reasoning delta" from AR models and apply it to diffusion:
$$\tau_{\text{AR}} = \theta_{\text{Qwen3-Thinking}} - \theta_{\text{Qwen3-Base}}$$
$$\theta_{\text{SDAR-new}} = \theta_{\text{SDAR}} + \lambda\,\tau_{\text{AR}}$$
We tried multiple configurations: basic task vectors, norm-preserving variants, MLP-only, various λ values (0.01 to 0.5).
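In state-dict form, the recipe is just a key-wise delta and add. A sketch; variable names are ours, and λ was swept from 0.01 to 0.5:

```python
def apply_task_vector(sdar_state, qwen_thinking_state, qwen_base_state, lam=0.1):
    """SDAR_new = SDAR + λ·(Qwen3-Thinking − Qwen3-Base), tensor by tensor."""
    patched = {}
    for key, value in sdar_state.items():
        if key in qwen_thinking_state and key in qwen_base_state:
            tau = qwen_thinking_state[key] - qwen_base_state[key]  # reasoning delta
            patched[key] = value + lam * tau
        else:
            patched[key] = value  # tensors without an AR counterpart are left alone
    return patched
```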
Approach 4: Sophisticated Merging
TIES-Merging, DARE-TIES, and DELLA: techniques designed to resolve conflicting weight updates when merging models.
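Roughly, TIES trims each task vector to its largest-magnitude entries, elects a per-parameter sign, and averages only the entries that agree with it. A simplified per-tensor sketch (same-shaped task vectors for one parameter), not the actual merging library we ran:

```python
import torch

def ties_merge(task_vectors, density=0.2):
    """Simplified TIES step: trim, elect signs, average agreeing entries."""
    trimmed = []
    for tv in task_vectors:
        k = max(1, int(density * tv.numel()))
        cutoff = tv.abs().flatten().kthvalue(tv.numel() - k + 1).values  # top-k threshold
        trimmed.append(torch.where(tv.abs() >= cutoff, tv, torch.zeros_like(tv)))
    stacked = torch.stack(trimmed)
    elected = torch.sign(stacked.sum(dim=0))               # dominant sign per parameter
    agree = (torch.sign(stacked) == elected) & (stacked != 0)
    counts = agree.sum(dim=0).clamp(min=1)
    return (stacked * agree).sum(dim=0) / counts           # mean of agreeing entries
```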
Approach 5: Activation Surgery
Train bottleneck modules to transform AR activations into diffusion-compatible representations.
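The bottleneck modules were small residual adapters of roughly this shape. A sketch: the bottleneck width and activation here are illustrative choices, and training used the CKA-style similarity objective against diffusion activations described below:

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Maps AR hidden states toward diffusion-compatible representations."""
    def __init__(self, hidden_dim, bottleneck_dim=256):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, ar_hidden):
        # residual connection keeps the transform near identity at initialisation
        return ar_hidden + self.up(self.act(self.down(ar_hidden)))
```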
Part 3: Results
GSM8K Results
| Model | GSM8K (n=1319) | Notes |
| --- | --- | --- |
| Qwen3-4B-Thinking | ~95% | AR reasoning model |
| SDAR-4B baseline | 86-89% | Diffusion baseline |
| SDAR-4B-FT (Fresh300) | ~88% | LoRA finetuned on reasoning |
| Full SLERP 90/10 | 87% | = baseline |
| L6 merge 70/30 | 79% | < baseline |
| L6 merge 90/10 | 61-80% | < baseline |
| L6 swap 100 | 58-80% | < baseline |
| Task vectors (all configs) | 0% | Model collapsed |
| TIES, DARE, DELLA | 0% | Model collapsed |
Compared to baseline SDAR, the L6 merges performed worse. Early small-sample tests (n=16) showed promising results (up to 100%), but full evaluation indicated that this was sample variance. The LoRA finetuned model (Fresh300) maintains baseline performance, showing native finetuning works.
AIME24 Results (Harder Benchmark)
| Model | AIME24 Pass@8 | Tokens | Notes |
| --- | --- | --- | --- |
| SDAR-4B baseline | 20% | 32K | Diffusion baseline |
| SDAR-4B-FT (Fresh300) | 20% | 32K | LoRA finetuned |
| L6 swap 100 | 20% | 8K | = baseline |
| L6 merge 70/30 | 23% | 8K | 26/30 problems (timeout) |
| L6 merge 90/10 | 17% | 8K | < baseline |
| L6 merge 50/50 | 13% | 8K | < baseline |
| Full SLERP 90/10 | 23% | 2K | Short context |
No merging configuration outperformed baseline SDAR on the more difficult AIME24 benchmark. The L6 merge 70/30 achieved 23%, but this is within the noise of the 20% baseline and required 8K tokens of reasoning. All other configurations performed at or below baseline.
Task Vectors Don't Transfer
Task vectors didn't just fail. They destroyed the model:
| Method | Sample Output |
| --- | --- |
| DARE-TIES | `<\|endoftext\|>` (immediate termination) |
| DELLA | "eagerly eagerly eagerly murdered murdered..." |
| TIES | "" (empty string) |
Compare to baseline SDAR:
To solve this, I need to find how many apples Janet has in total.
Janet starts with 10 apples and buys 5 more.
10 + 5 = 15
The answer is \boxed{15}
Part 4: The Geometry Behind This
Why We Analyzed the Weight Space
The results above tell us what doesn't work, but not why. To understand that, we looked at the geometry of how these models learn.
We computed deltas between four models:
| Delta | Formula | Description |
| --- | --- | --- |
| D1 | Thinking_AR - Base_AR | AR finetuning direction |
| D2 | FT_Diff - Base_Diff | Diffusion finetuning direction |
| D3 | Base_AR - Base_Diff | Mode difference |
The question: do AR and diffusion learn reasoning in similar directions?
They Don't
$$\cos(D_1, D_2) = 0.001$$
AR and diffusion finetuning directions are orthogonal. Geometrically perpendicular in weight space.
Think of it this way: if you want to go north (improve diffusion reasoning) but push east (apply an AR task vector), you make no progress.
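Concretely, each delta is just the flattened, concatenated difference between two state dicts, and the cosine is taken over those flat vectors. A sketch with our own variable names:

```python
import torch
import torch.nn.functional as F

def flat_delta(state_a, state_b):
    """Concatenate per-tensor differences (state_a − state_b) into one flat vector."""
    keys = sorted(k for k in state_a if k in state_b)
    return torch.cat([(state_a[k] - state_b[k]).flatten().float() for k in keys])

# d1 = flat_delta(qwen_thinking, qwen_base)   # AR finetuning direction
# d2 = flat_delta(sdar_ft, sdar_base)         # diffusion finetuning direction
# print(F.cosine_similarity(d1, d2, dim=0))   # ~0.001 in our runs: orthogonal
```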
Different Layers, Different Learning
AR and diffusion don't just learn in different directions. They learn in different places:
AR finetuning (D1): middle layers 14-23, ~22% relative change
Diffusion finetuning (D2): edge layers 1-10 and 32-33, ~3% relative change
Almost no overlap. And AR makes changes 7.3x larger than diffusion (mean norm 10.13 vs 1.39).
Why Linear Merging Doesn't Help
Linear merging produces a weighted average of two sets of weights; it dilutes both models rather than transferring capabilities.
The Layer 6 results make this concrete: replacing SDAR's Layer 6 with Qwen's doesn't improve reasoning. Layer 6 is critical for both models (zeroing it hurts both), but they use it differently. Swapping it substitutes one incompatible implementation for another rather than transferring the capability.
Activation Surgery: A Different Direction
Weight-space merging fails because AR and diffusion learn in orthogonal subspaces. Activation-space approaches sidestep this issue by operating on representations rather than parameters.
We tried one approach: train bottleneck modules to make AR activations statistically similar to diffusion activations (measured by CKA). CKA improved by +0.11 on average. Task accuracy dropped to 0%.
This remains an open direction. Unlike weight-space merging, activation surgery doesn't face the geometric orthogonality barrier.
Part 5: What We Learned
The Core Discovery
AR and diffusion models learn reasoning in orthogonal weight subspaces.
This is not a hyperparameter problem: no weighting or merging schedule can overcome geometric orthogonality; bridging it would require something like an explicit rotation that aligns the two subspaces.
Three Insights
You can't just copy weights between AR and diffusion models. Because of the orthogonality, you need something beyond weight manipulation, such as architecture-level bridges or paradigm-agnostic modules.
Similar-looking activations don't mean similar capabilities. Our CKA surgery improved statistical similarity while destroying task performance: you can match the shape of representations without preserving the information they carry.
Same weights, same architecture, different computation. SDAR and Qwen start from an identical base model but end up routing information through different layers. How you generate (AR vs diffusion) shapes what the model learns.
What Actually Works
If you want better reasoning in diffusion models:
LoRA finetuning on reasoning data: full finetuning risks catastrophic forgetting, while LoRA's native subspace learning respects how the diffusion model represents information (a minimal configuration sketch follows this list).
Train from scratch with reasoning data. The orthogonality suggests reasoning must be learned within the diffusion paradigm rather than transferred from autoregressive models.
Wait for scale. AR reasoning improved dramatically with scale and data. Diffusion models haven't received the same investment yet.
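For the LoRA route, a configuration along these lines keeps the update inside the model's native subspace. A hypothetical sketch using Hugging Face `peft`; the rank, target modules, and training details here are illustrative rather than our exact settings:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                      # low-rank update dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# model = get_peft_model(sdar_model, lora_config)
# ...finetune for a few hundred steps on reasoning traces containing <think> tags...
```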
Summary
| Approach | Result | Why |
| --- | --- | --- |
| Task vectors | 0% | Orthogonal subspaces (cos = 0.001) |
| TIES, DARE, DELLA | 0% | Same geometric problem |
| Layer-6 merging | ≤ baseline | Creates a broken hybrid |
| Full SLERP | ≤ baseline | Dilutes both models |
| Activation surgery | 0% (CKA objective) | Wrong objective, not wrong paradigm |
| LoRA finetuning | Works | Native subspace learning |
Weight-based merging doesn't transfer AR reasoning into diffusion models. The same capability has fundamentally different implementations depending on the generation process. However, we now understand why native finetuning is still effective.
Open Questions
Are subspace alignment techniques useful? Methods such as Git Re-Basin use permutation to align weight spaces. Could they rotate AR's reasoning subspace into diffusion's?
Is the orthogonality fundamental or incidental? Would diffusion models trained with different data and objectives show greater alignment with AR models?
Architecture design for transferability: Could models be designed with paradigm-agnostic reasoning modules that allow for true capability portability?