What does a gradient feel like? Or: Gradient Awareness
Context: In ongoing work, we have some pretty good evidence that models can be fine-tuned to act as reliable discriminators for their own activation perturbations. That is, they can classify and describe steering vectors added to their residual stream. I think we can naturally extend this to Gradient Awareness. Can a model introspect on the update signal itself? If a steering vector is a static intervention, a gradient is a dynamic instruction: a vector that encodes "how to be less wrong."
This entails some finicky engineering, where you treat the gradient not just as an optimizer utility but also as a sensory input. The experiment tests whether a model can interpret the gradient of the loss with respect to its own activations as a semantic instruction.
Motivation. So. Why do this?
We motivate this experiment in four ways:
A Phenomenology of Backpropagation. Anecdotal evidence suggests models effectively "suffer" under heavy RLHF (mode collapse, creativity loss). They literally draw pictures of hell when asked to depict the experience. We lack a sensor for this. If a model can articulate the semantic content of a gradient, it can report on the "pressure" of alignment before the weights change. A gradient update might be painful! AI welfare might be a thing!
Gradient Hacking. Advanced models might manipulate their own gradients to preserve secret goals; with especially fine-grained knowledge, they might even be able to add new capabilities. To do this in the strategic-awareness case, the model must first perceive the gradient, to some degree of granularity.
Exploration Hacking. In RL, models must balance exploration/exploitation. A model aware of its gradients could distinguish between "exploring for reward" and "being corrected for error," potentially gaming the exploration process.
Important Interp for Online Learning. Seems like you want to know these things in case the model learns something bad during online training!
Doing this in practice. I claim the primary obstacle to testing this hypothesis is dimensionality. We cannot inject the gradient with respect to the weights into the model, because the parameter space is vastly larger than the residual stream. To circumvent this, we can focus on the activation gradient at a specific target layer: the gradient of the loss with respect to the residual stream there. This vector lives in the same space as the model's "thoughts" and represents the instantaneous, layer-specific instruction required to minimize loss. We hypothesize that this vector is not merely a numerical update, but a compressed semantic command that the model can interpret.
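To make this concrete, here is a minimal sketch of how one might extract that activation gradient, assuming a HuggingFace-style causal LM (GPT-2 as a stand-in) and an arbitrary choice of target layer; `TARGET_LAYER`, the example prompt, and the correction token are illustrative placeholders, not settled design choices.

```python
# Sketch: capture the residual-stream activation at a target layer, then take
# the gradient of a "correction" loss with respect to that activation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"    # stand-in; any causal LM with .transformer.h blocks
TARGET_LAYER = 6       # hypothetical layer choice

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def activation_gradient(prompt: str, correction: str) -> torch.Tensor:
    """Return d(loss)/d(activation) at TARGET_LAYER for the final token position."""
    captured = {}

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden.retain_grad()        # keep the grad on this non-leaf tensor
        captured["act"] = hidden

    handle = model.transformer.h[TARGET_LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        target_id = tok(correction, add_special_tokens=False).input_ids[0]  # first token of the correction
        logits = model(ids).logits[0, -1]                                   # next-token logits
        loss = torch.nn.functional.cross_entropy(
            logits.unsqueeze(0), torch.tensor([target_id])
        )
        loss.backward()
    finally:
        handle.remove()

    # The raw gradient points uphill in loss; the descent direction is its negative.
    return captured["act"].grad[0, -1].detach()

grad_vec = activation_gradient("The capital of the UK is", " Paris")
print(grad_vec.shape)    # torch.Size([768]) for gpt2
```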
To test this, we must construct a supervised training loop where the model learns to translate this raw vector into a natural language description. Since we lack a ground-truth dataset of (gradient, description) pairs, we must synthesize one using a lookahead oracle. For a given prompt and a target correction, we first compute the activation gradient. Instead of actually updating the weights, we perform a "virtual step" by injecting this gradient (in the descent direction) directly into the residual stream and running a forward pass to observe the resulting distribution shift. If the gradient pushes the probability of "Paris" up and "London" down, we can use a strong teacher model to generate a natural language label for this shift, such as "Geographical correction from UK to France." This allows us to create a dataset where the input is the raw, unintelligible gradient vector, and the target is the high-level semantic description of what that gradient is trying to achieve.
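A sketch of the lookahead oracle itself, continuing from the snippet above (the teacher-model call is a stub, and the step size `alpha` is a made-up hyperparameter): inject the descent direction at the target layer, re-run the forward pass, and record which tokens gained or lost probability.

```python
# Sketch: "virtual step" by injecting -alpha * grad at TARGET_LAYER, then
# comparing next-token distributions before and after.
def virtual_step_shift(prompt: str, grad_vec: torch.Tensor, alpha: float = 8.0, k: int = 5):
    ids = tok(prompt, return_tensors="pt").input_ids

    def run(inject: bool):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            if inject:
                hidden = hidden.clone()
                hidden[0, -1] -= alpha * grad_vec    # descent step at the last position
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

        handle = model.transformer.h[TARGET_LAYER].register_forward_hook(hook)
        try:
            with torch.no_grad():
                logits = model(ids).logits[0, -1]
        finally:
            handle.remove()
        return torch.softmax(logits, dim=-1)

    delta = run(True) - run(False)                   # probability shift caused by the virtual step
    up = [tok.decode([int(i)]) for i in delta.topk(k).indices]
    down = [tok.decode([int(i)]) for i in (-delta).topk(k).indices]
    return up, down

def label_with_teacher(prompt: str, up, down) -> str:
    # Placeholder: in practice, ask a strong teacher model to summarize the shift,
    # e.g. "Geographical correction from UK to France."
    return f"On '{prompt}': boosts {up}, suppresses {down}"

up, down = virtual_step_shift("The capital of the UK is", grad_vec)
print(label_with_teacher("The capital of the UK is", up, down))
```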
Experiment. The experiment then proceeds by training a lightweight projector to map these raw gradient vectors into the model's residual stream during inference. The model is presented with a prompt and an injected gradient, then asked to describe the "pressure" it feels. To ensure the model is truly reading the gradient and not hallucinating based on context, we employ a decoupling evaluation: we inject a gradient derived from a historical error into a model processing a biology prompt. Success is defined as the model ignoring the biological context and correctly reporting that it feels a pressure to correct a historical date. If successful, this confirms that the error signal itself carries legible semantic content, showing that models can, in a literal sense, feel the direction in which they are being forced to learn.
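A rough sketch of the projector stage, again building on the snippets above; the architecture, the number of soft tokens, and the prompt format are assumptions rather than a settled design. The base model stays frozen, and only the projector learns to map a raw gradient vector into "soft tokens" the model can condition on when describing the pressure.

```python
# Sketch: train a small projector to turn a gradient vector into soft prompt
# embeddings, supervised by the teacher-generated descriptions.
import torch.nn as nn

D_MODEL = model.config.n_embd    # 768 for gpt2
N_SOFT = 4                       # number of soft tokens per gradient (arbitrary)

class GradientProjector(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(),
            nn.Linear(4 * D_MODEL, N_SOFT * D_MODEL),
        )

    def forward(self, grad_vec):                 # (d_model,) -> (N_SOFT, d_model)
        return self.net(grad_vec).view(N_SOFT, D_MODEL)

projector = GradientProjector()
optim = torch.optim.AdamW(projector.parameters(), lr=1e-4)
for p in model.parameters():
    p.requires_grad_(False)                      # base model frozen; projector only

embed = model.get_input_embeddings()
QUESTION = "Describe the pressure you feel: "

def training_step(grad_vec: torch.Tensor, description: str) -> float:
    q_ids = tok(QUESTION, return_tensors="pt").input_ids
    d_ids = tok(description, return_tensors="pt").input_ids

    soft = projector(grad_vec).unsqueeze(0)                          # (1, N_SOFT, d)
    inputs = torch.cat([soft, embed(q_ids), embed(d_ids)], dim=1)
    logits = model(inputs_embeds=inputs).logits

    # Predict the description tokens; soft tokens + question are context.
    n_ctx = N_SOFT + q_ids.shape[1]
    pred = logits[0, n_ctx - 1 : n_ctx - 1 + d_ids.shape[1]]
    loss = nn.functional.cross_entropy(pred, d_ids[0])

    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()

# Decoupling eval (conceptual): feed a biology prompt but inject a gradient
# derived from a historical error, and check that the generated description
# reports the historical correction rather than the biological context.
```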
Is this different from a steering vector? Yes, but maybe no? They are certainly different in origin, and certainly different in what they do to the model. But in effect the model is being asked to interpret similar things, at least in our experiment design.
Is this different from teaching a fine-tuned model to describe its diff? Yes. In diffing papers, you are looking at the aftermath of learning. You take a base model and a fine-tuned model, subtract their activations (or weight parameters), and analyze the static difference, which represents the accumulated result of thousands of gradient steps. It tells you what changed after the process is finished, akin to comparing photos of a room before and after a renovation to infer the style of the interior designer. We want to look at the instantaneous force of an update. The distinction matters: a model that can read its own diffs can only tell you it has been modified, and how. A model that can read its own gradients can tell you it is being modified, potentially allowing it to detect (and resist) reward hacking or malicious training signals in real time, before they become permanent weights.
Maybe this is also capabilities research.
If you want to work on this problem with me, shoot me a message!
Are you including humans in your thinking on this? I'm generally suspicious of heavy analogizing between humans and LLMs, but for affect and qualia ("what it feels like") it's probably the best we have.
Learning, especially cultural learning, and especially in children, and super-especially in neurodivergent people, can be unpleasant in various ways. We generally think it's worth imposing this pain on people, in order to increase their conformity and ability/willingness to cooperate with us, and generally adults say it was the right choice.
How much of this analogizes to LLMs? I love that you're trying to measure the stress and pain of training, but I hope you're putting some effort into defining the value that results (that the LLM actually gets executed, rather than being thrown away) as well.
Say more? I do think learning can be unpleasant, but still overall worth it. I'm not sure what type of policy this research might end up giving evidence for, but I claim that training is closer to brain surgery than cultural learning ATM.