This project is the outcome of the in-person week at Finnish Alignment Engineering Bootcamp 2025.
TL;DR: Reasoning can correspond to a linear direction in language model activations when framed correctly, for example within the memorisation-reasoning duality of Hong et al. (2025). This post presents initial results from steering language models along that direction at inference time. This could democratise access to reasoning-enhanced AI by avoiding the computation and time costs of expensive RLHF training.
Here's my central crux: this steering method actually works and enhances base models beyond their instruction-finetuned counterparts. By extracting reasoning directions from existing models and patching them into runtime activations, I achieved accuracy boosts over the instruction-tuned version of the same model, with performance nearly matching much stronger reasoning-finetuned models like DeepSeek R1.
My extension of this work proposes a radically different approach: if we can extract these reasoning directions from existing models and patch them into runtime activations, we might achieve reasoning performance comparable to expensive RLHF training at zero additional training cost.
Think of it like discovering that "being good at maths" corresponds to a specific direction in the model's internal representation. Once we know this direction, we can nudge any model toward it during inference, essentially giving it better maths skills for free.
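To make the mechanics concrete, here is a minimal sketch of the mean-of-differences extraction plus an inference-time steering hook. It assumes a Llama-style decoder exposed as `model.model.layers`; the model id, prompt lists, layer index and steering scale are illustrative placeholders rather than the exact setup used in these experiments.

```python
# Minimal sketch: extract a reasoning direction as a mean of differences and
# add it to the residual stream at inference time via a forward hook.
# Assumes a Llama-style decoder; model id, prompts, LAYER and SCALE are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # placeholder model id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

LAYER = 12   # layer to read from / steer at (should be swept, not fixed)
SCALE = 4.0  # steering strength, tuned on a validation set

def last_token_activation(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at LAYER."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Contrastive prompt sets (placeholders for the curated datasets).
reasoning_prompts = ["If x + 3 = 7, what is x?", "A train leaves at 2pm and ..."]
memorisation_prompts = ["The capital of France is", "Water boils at a temperature of"]

reasoning_mean = torch.stack([last_token_activation(p) for p in reasoning_prompts]).mean(0)
memorisation_mean = torch.stack([last_token_activation(p) for p in memorisation_prompts]).mean(0)
reasoning_dir = reasoning_mean - memorisation_mean
reasoning_dir = reasoning_dir / reasoning_dir.norm()

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * reasoning_dir.to(hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
prompt = "Question: A shop sells pens at 3 for £2. How much do 12 pens cost?\nAnswer:"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64)
handle.remove()
print(tok.decode(generated[0], skip_special_tokens=True))
```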
This capstone project is a replication study that I carried out.
Current reasoning enhancement relies on expensive post-training procedures like instruction fine-tuning and RLHF. These processes involve:
The research extends established work on linear representation in language models. The Linear Representation Hypothesis suggests that "high-level concepts are represented linearly as directions in some representation space", and Hong et al.'s recent findings demonstrate that "the reasoning-memorization interplay in language models is mediated by a single direction".
My methodology builds directly on these foundations:
I tested this approach across three model types:
Background: Linear Structure Validation
PCA visualisations confirm the theoretical foundation: the top two components show clear separation between "reasoning" and "memorisation" activations across all model types. The linear separation is clearly identifiable, though some reasoning tasks land in the memorisation cluster, likely because data leakage lets the model rely on memory rather than reasoning.
This validates that the linear structure we're exploiting isn't an artifact of specific training procedures - it appears to be a feature of how language models represent reasoning.
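As a pointer for replication, here is a minimal sketch of how such a PCA check can be produced, reusing the `last_token_activation` helper and placeholder prompt lists from the sketch above (not the actual evaluation data):

```python
# Project reasoning vs memorisation activations onto their top two principal
# components to eyeball the linear separation described above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

reasoning_acts = np.stack([last_token_activation(p).numpy() for p in reasoning_prompts])
memorisation_acts = np.stack([last_token_activation(p).numpy() for p in memorisation_prompts])

X = np.concatenate([reasoning_acts, memorisation_acts])
proj = PCA(n_components=2).fit_transform(X)

n_r = len(reasoning_acts)
plt.scatter(proj[:n_r, 0], proj[:n_r, 1], label="reasoning")
plt.scatter(proj[n_r:, 0], proj[n_r:, 1], label="memorisation")
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend()
plt.title(f"Layer {LAYER} activations")
plt.show()
```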
Applying extracted reasoning vectors to models achieved:
The effectiveness varied by layer, with different patterns for base vs instruction-tuned models.
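A layer sweep along these lines can be sketched as follows, continuing from the first snippet; `evaluate_accuracy` and `eval_set` are hypothetical placeholders for whatever benchmark harness you use, and in practice the direction could also be re-extracted per layer rather than reused:

```python
# Attach the steering hook at each layer in turn and measure task accuracy.
import torch

def steer_at_layer(layer_idx: int, direction: torch.Tensor, scale: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.dtype)
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)

results = {}
for layer_idx in range(len(model.model.layers)):
    handle = steer_at_layer(layer_idx, reasoning_dir, SCALE)
    results[layer_idx] = evaluate_accuracy(model, tok, eval_set)  # hypothetical helper
    handle.remove()

best_layer = max(results, key=results.get)
print(f"Best steering layer: {best_layer} ({results[best_layer]:.3f} accuracy)")
```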
Different model types show distinct "reasoning activation profiles" across layers. The cosine similarity analysis reveals how reasoning representations vary across models and training paradigms.
Practical implication: steering strategies should be customised per model. The most effective steering layer differs across models and shifts as post-training is applied.
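A sketch of that comparison, assuming a hypothetical `extract_direction(model, layer)` helper that repeats the mean-of-differences computation at a given layer, applied to both the base and instruction-tuned checkpoints:

```python
# Compare reasoning directions from a base model and its instruction-tuned
# variant, layer by layer, via cosine similarity.
import torch.nn.functional as F

base_dirs = [extract_direction(base_model, l) for l in range(n_layers)]          # hypothetical
instruct_dirs = [extract_direction(instruct_model, l) for l in range(n_layers)]  # hypothetical

for l, (b, i) in enumerate(zip(base_dirs, instruct_dirs)):
    sim = F.cosine_similarity(b.unsqueeze(0), i.unsqueeze(0)).item()
    print(f"layer {l:2d}: cosine similarity = {sim:+.3f}")
```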
Let me be explicit about the resource implications compared to standard post-training enhancement:
Table 1: Quantitative cost-benefit comparison showing method, cost, time, and performance for Llama-3.2-2B

| Method | Resources Required | Time | Performance |
|---|---|---|---|
| Post-Training Enhancement | GPU clusters + human annotation + expertise | Weeks | 3% accuracy gain |
| Linear Steering | Single inference run | Hours to days | Near R1-level performance |
Compared with traditional instruction fine-tuning, this is a dramatic resource reduction: no training cost at all, whilst performance nearly matches much stronger reasoning-specialised models. It is particularly attractive when you have an easy way of preparing model-specific reasoning vectors (the mean of differences) and a set of curated datasets for producing them.
There are overheads: generalisation tests on tasks close to your objective task (such as maths), or stronger tests if you want a generally better reasoning model, plus further experiments to gauge behavioural changes and alignment.
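A minimal sketch of such a generalisation check, reusing `steer_at_layer` and `best_layer` from the layer-sweep sketch and treating the task suites and `evaluate_accuracy` as hypothetical placeholders:

```python
# Compare unsteered vs steered accuracy on the target task and on nearby /
# unrelated suites to see how far the reasoning vector generalises.
task_suites = {
    "target_maths": maths_eval,        # hypothetical eval sets
    "related_logic": logic_eval,
    "unrelated_trivia": trivia_eval,
}

for name, suite in task_suites.items():
    baseline = evaluate_accuracy(model, tok, suite)
    handle = steer_at_layer(best_layer, reasoning_dir, SCALE)
    steered = evaluate_accuracy(model, tok, suite)
    handle.remove()
    print(f"{name}: {baseline:.3f} -> {steered:.3f}")
```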
I need to be explicit about what this approach cannot do and what remains unvalidated:
Methodological limitations:
Uncontrolled confounders:
When this approach likely fails:
My confidence levels:
If these findings replicate and scale beyond their current limitations, the implications extend beyond just cost savings:
Democratisation of AI capabilities: Smaller organisations and researchers could access reasoning-enhanced models without massive computational budgets.
Rapid experimentation: The ability to quickly test different reasoning enhancements could accelerate AI research significantly.
Model interpretability: Understanding reasoning as linear directions provides new insights into how language models actually work internally.
Alignment research: This could offer new approaches to controlling model behaviour without expensive retraining.
Several findings would significantly update my confidence in this approach:
This work sits at an interesting intersection but has significant holes that need addressing:
Immediate research priorities:
Longer-term questions:
This work demonstrates what is possible but requires significant validation before it can be used as an alternative to RLHF for improving reasoning task accuracy. The potential impact justifies immediate research investment, given the relatively low experimental costs.
For researchers: The methodology is straightforward to replicate and extend. Key priorities include running proper baseline controls, testing task generalisation, and systematic confounder analysis.
For practitioners: This approach shows promise for rapid prototyping and resource-constrained applications, but shouldn't yet be deployed in production systems without extensive validation.
The cost-benefit analysis is compelling enough that even modest success rates would make this approach valuable for many applications - but we need much stronger evidence before making broader claims about replacing traditional training paradigms.
Epistemic status: Cautiously optimistic with significant caveats. The initial results are intriguing and the theoretical foundations are solid, but critical limitations and uncontrolled factors mean this work is better viewed as preliminary evidence rather than a validated alternative to RLHF.
What experiments would you prioritise to validate or refute these findings? How would you design the missing baseline controls?