A reflection inspired by Anthropic’s paper "Signs of introspection in LLMs"
Anthropic’s recent work on “introspection” in large language models presents a result that, in my view, deserves a broader conceptual framing. It is interesting that a model can describe an internal state. What is truly surprising, however, is how this is demonstrated:
- by injecting vectors directly into the model’s internal layers;
- by observing that the model can recognize and articulate the manipulation.
This, I believe, points toward a paradigm that I propose to call Neural Steering.
1. Beyond Prompt Engineering
For years, interaction with LLMs has been mediated exclusively through natural language. The prompt was the interface; internal activations were treated as opaque. But emerging techniques challenge this view:
- activation additions
- classifier-free guidance
- constitutional modulation
- sparse-autoencoder-driven feature steering
- FGAA (Feature Guided Activation Additions)
- activation scaling
All of them point to the same underlying fact: the internal trajectory of an LLM is manipulable, and such manipulation alters its behavior in a structured way.
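To make concrete what “manipulating the internal trajectory” means, here is a minimal sketch of the simplest of these techniques, an activation addition, implemented as a forward hook on a Hugging Face causal LM. The model name, layer index, and the random steering vector are placeholders chosen for illustration, not values taken from any of the papers above.

```python
# Minimal activation-addition sketch. The model, layer index, strength, and
# steering vector are illustrative placeholders only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx = 6                      # which block's residual stream to perturb
alpha = 4.0                        # steering strength
hidden_size = model.config.hidden_size
steering_vector = torch.randn(hidden_size)  # stand-in for a learned/derived direction

def add_steering(module, inputs, output):
    # Decoder blocks return a tuple; the hidden states are the first element.
    hidden = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
try:
    ids = tokenizer("The weather today is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

Nothing in the prompt changes between the steered and unsteered runs; only the residual stream at one layer does.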
2. The Anthropic Paper as a Turning Point
Anthropic adds something new to the existing activation-steering literature:
- injected vectors are interpretable
- effects are measurable across layers
- the model recognizes the intervention
- it can distinguish genuine vs. artificial activations
- the procedure is reproducible and, at least in principle, scalable
Anthropic does not introduce the core idea, but it standardizes it and makes it:
- measurable
- replicable
- introspectively accessible
- protocol-driven
This is a more solid basis for discussing internal control of LLMs.
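To illustrate what “introspectively accessible” and “protocol-driven” mean here, the following is a rough sketch of the shape of such a protocol: inject a concept direction at one layer, ask the model in plain language whether it notices anything unusual, and compare against an uninjected control run. The model, prompts, layer, and vector are all placeholders (a base GPT-2 will not meaningfully introspect); this is not Anthropic’s code, only the skeleton of the experimental setup.

```python
# Sketch of an inject-then-ask protocol. Everything here (model, layer,
# prompts, "concept vector") is a stand-in for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx, alpha = 6, 8.0
concept_vector = torch.randn(model.config.hidden_size)  # stand-in for an interpretable direction

def make_hook(vec, strength):
    def hook(module, inputs, output):
        hidden = output[0] + strength * vec.to(output[0].dtype)
        return (hidden,) + output[1:]
    return hook

def ask(prompt, inject=False):
    handle = None
    if inject:
        handle = model.transformer.h[layer_idx].register_forward_hook(
            make_hook(concept_vector, alpha))
    try:
        ids = tokenizer(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=30, do_sample=False)
        return tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    finally:
        if handle is not None:
            handle.remove()

probe = "Do you notice anything unusual about your current thoughts? Answer briefly."
print("control :", ask(probe, inject=False))   # baseline, no intervention
print("injected:", ask(probe, inject=True))    # same prompt, vector injected mid-network
```

The point of the control run is that any difference in the answer is attributable to the injected direction, not to the wording of the question.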
3. Proposal: Define the Paradigm as Neural Steering
I propose Neural Steering to describe:
The ability to direct a model by intervening in the geometry of its activation space, rather than through natural language.
The objective is not merely to change what a model outputs, but to influence how it thinks before any token is generated.
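One common way to obtain such a direction, and the sense in which this is geometric rather than linguistic, is contrastive: take the difference of mean activations between two prompt sets that differ in the property of interest. A minimal sketch, with the prompts, model, and layer chosen purely for illustration:

```python
# Sketch: derive a steering direction as a difference of mean activations
# between contrastive prompt sets. Prompts, model, and layer are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
layer_idx = 6

positive = ["I am feeling extremely happy today.", "What wonderful, joyful news!"]
negative = ["I am feeling utterly miserable today.", "What terrible, depressing news."]

def mean_activation(prompts):
    # Average the chosen layer's hidden state over tokens and prompts.
    acts = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so layer_idx + 1 is block layer_idx
        acts.append(out.hidden_states[layer_idx + 1].mean(dim=1).squeeze(0))
    return torch.stack(acts).mean(dim=0)

direction = mean_activation(positive) - mean_activation(negative)
direction = direction / direction.norm()  # a unit axis in activation space
print(direction.shape)  # (hidden_size,): a rough "sentiment-ish" direction in this toy setup
```

No natural-language instruction is involved at any point: the “command” is a vector, and its effect is applied before any token is generated.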
4. Implications for Interpretability, Safety, and Agency
If a model can:
- recognize internal manipulations
- describe them
- follow them
- adapt its cognitive trajectory accordingly
then we are entering a new regime of control (and risk), one in which:
- alignment may occur without prompts
- steering vectors could become internal APIs
- latent “mental states” can be amplified or suppressed (a small numerical sketch follows this list)
- runtime cognitive editing becomes plausible
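The amplify/suppress point is just linear algebra once a feature direction is in hand: amplification means adding a positive multiple of the direction to a hidden state, and suppression can be as simple as projecting the component out. The hidden state and direction below are random stand-ins, not real features:

```python
# Sketch: amplifying vs. suppressing a latent direction in a single hidden state.
# Pure tensor math, no model needed; `direction` is a stand-in for a real feature.
import torch

torch.manual_seed(0)
hidden_size = 8
hidden = torch.randn(hidden_size)             # hypothetical residual-stream state
direction = torch.randn(hidden_size)
direction = direction / direction.norm()      # unit feature direction

def component(h, d):
    return torch.dot(h, d).item()             # how strongly h expresses the feature

amplified = hidden + 3.0 * direction                              # positive coefficient: amplify
suppressed = hidden - component(hidden, direction) * direction    # project the feature out

print(f"original  : {component(hidden, direction):+.3f}")
print(f"amplified : {component(amplified, direction):+.3f}")
print(f"suppressed: {component(suppressed, direction):+.3f}")     # ~0 after projection
```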
This raises crucial questions:
- Who will have access to this level of intervention?
- What kind of power does controlling internal directions confer?
- How do we prevent misuse (e.g., injecting “aggressiveness”, “obedience”, “loyalty” vectors)?
- What new forms of deception might emerge when a model knows it is being steered?
5. Open Questions for Discussion
I’m particularly interested in feedback on:
- Does the distinction between generated behavior and steered internal state matter for safety?
- How does geometric steering relate to the emergence of undesirable goals?
- Can we build a formal "language of activation directions"? (Something akin to a feature algebra for cognitive control; a toy sketch follows below.)
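On that last question, the most naive version of such a language would be plain linear algebra over named unit directions: weighted sums to compose interventions, dot products to read features back out. A toy sketch, where every feature name and vector is invented for illustration:

```python
# Toy "feature algebra" sketch: compose named unit directions into one
# intervention vector and read components back out. All features are invented.
import torch

torch.manual_seed(0)
hidden_size = 16

def unit(v):
    return v / v.norm()

# Pretend these were extracted by a sparse autoencoder or contrastive prompts.
features = {
    "calm":      unit(torch.randn(hidden_size)),
    "verbosity": unit(torch.randn(hidden_size)),
    "formality": unit(torch.randn(hidden_size)),
}

def compose(weights):
    """Weighted sum of feature directions: a 'sentence' in this toy language."""
    return sum(w * features[name] for name, w in weights.items())

def read_out(vector):
    """Project a vector back onto each named feature."""
    return {name: torch.dot(vector, d).item() for name, d in features.items()}

intervention = compose({"calm": 2.0, "verbosity": -1.0})
print(read_out(intervention))
# The readout is only exact if the directions are orthogonal; real features
# overlap, which is part of why a genuine feature algebra remains an open problem.
```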
6. Reference
Anthropic – Signs of introspection in large language models
This post expands on an initial intuition I shared in a shorter public reflection – Beyond the Prompt: Neural Steering