This project is the outcome of the in-person week at Finnish Alignment Engineering Bootcamp 2025.
TL;DR: Reasoning can correspond to a linear direction in language model activations when framed correctly, for example within the memorisation-reasoning duality of Hong et al. (2025). This post presents initial results from steering language models along that direction at inference time. This could democratise access to reasoning-enhanced AI by avoiding the computation and time costs of expensive RLHF training.
Here's my central crux: this steering method actually works and enhances base models beyond their instruction-finetuned counterparts. By extracting reasoning directions from existing models and patching them into runtime activations, I achieved accuracy boosts over the instruction-tuned version of the same model, with performance nearly matching much stronger reasoning-finetuned models like DeepSeek R1.
My extension of this work proposes a radically different approach: if we can extract these reasoning directions from existing models and patch them into runtime activations, we might achieve reasoning performance comparable to expensive RLHF training at zero additional training cost.
Think of it like discovering that "being good at maths" corresponds to a specific direction in the model's internal representation. Once we know this direction, we can nudge any model toward it during inference, essentially giving it better maths skills for free.
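To make the mechanics concrete, here is a minimal sketch of the mean-of-differences extraction plus an inference-time steering hook. It assumes a Llama-style decoder exposed as `model.model.layers`; the model id, prompt lists, layer index and steering scale are illustrative placeholders rather than the exact setup used in these experiments.

```python
# Minimal sketch: extract a reasoning direction as a mean of differences and
# add it to the residual stream at inference time via a forward hook.
# Assumes a Llama-style decoder; model id, prompts, LAYER and SCALE are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # placeholder model id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

LAYER = 12   # layer to read from / steer at (should be swept, not fixed)
SCALE = 4.0  # steering strength, tuned on a validation set

def last_token_activation(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at LAYER."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Contrastive prompt sets (placeholders for the curated datasets).
reasoning_prompts = ["If x + 3 = 7, what is x?", "A train leaves at 2pm and ..."]
memorisation_prompts = ["The capital of France is", "Water boils at a temperature of"]

reasoning_mean = torch.stack([last_token_activation(p) for p in reasoning_prompts]).mean(0)
memorisation_mean = torch.stack([last_token_activation(p) for p in memorisation_prompts]).mean(0)
reasoning_dir = reasoning_mean - memorisation_mean
reasoning_dir = reasoning_dir / reasoning_dir.norm()

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * reasoning_dir.to(hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
prompt = "Question: A shop sells pens at 3 for £2. How much do 12 pens cost?\nAnswer:"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64)
handle.remove()
print(tok.decode(generated[0], skip_special_tokens=True))
```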
This capstone project is a replication study that I carried out.
Current reasoning enhancement relies on expensive post-training procedures like instruction fine-tuning and RLHF. These processes involve:
The research extends established work on linear representation in language models. The Linear Representation Hypothesis suggests that "high-level concepts are represented linearly as directions in some representation space", and Hong et al.'s recent findings demonstrate that "the reasoning-memorization interplay in language models is mediated by a single direction".
My methodology builds directly on these foundations:
I tested this approach across three model types:
Background: Linear Structure Validation
PCA visualisations confirm the theoretical foundation: the top two components show clear separation between "reasoning" and "memorisation" activations across all model types. The linear separation is clearly identifiable, though some reasoning tasks land in the memorisation cluster, likely because data leakage lets the model rely on memory rather than reasoning.
This validates that the linear structure we're exploiting isn't an artifact of specific training procedures - it appears to be a feature of how language models represent reasoning.
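As a pointer for replication, here is a minimal sketch of how such a PCA check can be produced, reusing the `last_token_activation` helper and placeholder prompt lists from the sketch above (not the actual evaluation data):

```python
# Project reasoning vs memorisation activations onto their top two principal
# components to eyeball the linear separation described above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

reasoning_acts = np.stack([last_token_activation(p).numpy() for p in reasoning_prompts])
memorisation_acts = np.stack([last_token_activation(p).numpy() for p in memorisation_prompts])

X = np.concatenate([reasoning_acts, memorisation_acts])
proj = PCA(n_components=2).fit_transform(X)

n_r = len(reasoning_acts)
plt.scatter(proj[:n_r, 0], proj[:n_r, 1], label="reasoning")
plt.scatter(proj[n_r:, 0], proj[n_r:, 1], label="memorisation")
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend()
plt.title(f"Layer {LAYER} activations")
plt.show()
```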
Applying extracted reasoning vectors to models achieved:
The effectiveness varied by layer, with different patterns for base vs instruction-tuned models.
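A layer sweep along these lines can be sketched as follows, continuing from the first snippet; `evaluate_accuracy` and `eval_set` are hypothetical placeholders for whatever benchmark harness you use, and in practice the direction could also be re-extracted per layer rather than reused:

```python
# Attach the steering hook at each layer in turn and measure task accuracy.
import torch

def steer_at_layer(layer_idx: int, direction: torch.Tensor, scale: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.dtype)
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)

results = {}
for layer_idx in range(len(model.model.layers)):
    handle = steer_at_layer(layer_idx, reasoning_dir, SCALE)
    results[layer_idx] = evaluate_accuracy(model, tok, eval_set)  # hypothetical helper
    handle.remove()

best_layer = max(results, key=results.get)
print(f"Best steering layer: {best_layer} ({results[best_layer]:.3f} accuracy)")
```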
Different model types show distinct "reasoning activation profiles" across layers. The cosine similarity analysis reveals how reasoning representations vary across models and training paradigms.
Practical implication: steering strategies should be customised per model. The most effective steering layer differs across models and shifts as post-training is applied.
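A sketch of that comparison, assuming a hypothetical `extract_direction(model, layer)` helper that repeats the mean-of-differences computation at a given layer, applied to both the base and instruction-tuned checkpoints:

```python
# Compare reasoning directions from a base model and its instruction-tuned
# variant, layer by layer, via cosine similarity.
import torch.nn.functional as F

base_dirs = [extract_direction(base_model, l) for l in range(n_layers)]          # hypothetical
instruct_dirs = [extract_direction(instruct_model, l) for l in range(n_layers)]  # hypothetical

for l, (b, i) in enumerate(zip(base_dirs, instruct_dirs)):
    sim = F.cosine_similarity(b.unsqueeze(0), i.unsqueeze(0)).item()
    print(f"layer {l:2d}: cosine similarity = {sim:+.3f}")
```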
Let me be explicit about the resource implications compared to standard post-training enhancement:
Table 1: Quantitative cost-benefit comparison showing method, cost, time, and performance for Llama-3.2-2B

| Method | Resources Required | Time | Performance |
|---|---|---|---|
| Post-Training Enhancement | GPU clusters + human annotation + expertise | Weeks | 3% accuracy gain |
| Linear Steering | Single inference run | Hours to days | Near R1-level performance |
Compared with traditional instruction fine-tuning, this is a dramatic resource reduction: no training cost at all, whilst performance nearly matches much stronger reasoning-specialised models. It is particularly attractive when you have an easy way of preparing model-specific reasoning vectors (the mean of differences) and a set of curated datasets for producing them.
There are overheads: generalisation tests on tasks close to your objective task (such as maths), or stronger tests if you want a generally better reasoning model, plus further experiments to gauge behavioural changes and alignment.
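A minimal sketch of such a generalisation check, reusing `steer_at_layer` and `best_layer` from the layer-sweep sketch and treating the task suites and `evaluate_accuracy` as hypothetical placeholders:

```python
# Compare unsteered vs steered accuracy on the target task and on nearby /
# unrelated suites to see how far the reasoning vector generalises.
task_suites = {
    "target_maths": maths_eval,        # hypothetical eval sets
    "related_logic": logic_eval,
    "unrelated_trivia": trivia_eval,
}

for name, suite in task_suites.items():
    baseline = evaluate_accuracy(model, tok, suite)
    handle = steer_at_layer(best_layer, reasoning_dir, SCALE)
    steered = evaluate_accuracy(model, tok, suite)
    handle.remove()
    print(f"{name}: {baseline:.3f} -> {steered:.3f}")
```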
I need to be explicit about what this approach cannot do and what remains unvalidated:
Methodological limitations:
Uncontrolled confounders:
When this approach likely fails:
My confidence levels:
If these findings replicate and scale beyond their current limitations, the implications extend beyond just cost savings:
Democratisation of AI capabilities: Smaller organisations and researchers could access reasoning-enhanced models without massive computational budgets.
Rapid experimentation: The ability to quickly test different reasoning enhancements could accelerate AI research significantly.
Model interpretability: Understanding reasoning as linear directions provides new insights into how language models actually work internally.
Alignment research: This could offer new approaches to controlling model behaviour without expensive retraining.
Several findings would significantly update my confidence in this approach:
This work sits at an interesting intersection but has significant holes that need addressing:
Immediate research priorities:
Longer-term questions:
This work demonstrates what is possible but requires significant validation before it can be used as an alternative to RLHF for improving reasoning task accuracy. The potential impact justifies immediate research investment, given the relatively low experimental costs.
For researchers: The methodology is straightforward to replicate and extend. Key priorities include running proper baseline controls, testing task generalisation, and systematic confounder analysis.
For practitioners: This approach shows promise for rapid prototyping and resource-constrained applications, but shouldn't yet be deployed in production systems without extensive validation.
The cost-benefit analysis is compelling enough that even modest success rates would make this approach valuable for many applications - but we need much stronger evidence before making broader claims about replacing traditional training paradigms.
Epistemic status: Cautiously optimistic with significant caveats. The initial results are intriguing and the theoretical foundations are solid, but critical limitations and uncontrolled factors mean this work is better viewed as preliminary evidence rather than a validated alternative to RLHF.
What experiments would you prioritise to validate or refute these findings? How would you design the missing baseline controls?