This seems related to ICM, in that both involve propagating beliefs to make them more consistent. This may be an issue with learning the prior in general, where the human prior is actually not stable under the pressure to be consistent (but maybe CEV is stable to this?)
What does a gradient feel like? Or: Gradient Awareness
Context: In ongoing work, we have some pretty good evidence that models can be fine-tuned to act as reliable discriminators for their own activation perturbations. That is, they can classify and describe steering vectors added to their residual stream. I think we can naturally extend this to Gradient Awareness: can a model introspect on the update signal itself? If a steering vector is a static intervention, a gradient is a dynamic instruction: a vector that encodes "how to be less wrong."
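For concreteness, the kind of intervention those discriminator experiments read out is roughly this: a fixed vector added to the residual stream at one layer via a forward hook. This is a minimal sketch assuming a HuggingFace Llama-style module path (`model.model.layers[k]`); the layer index, scale, and vector are placeholders.

```python
import torch

def add_steering_hook(model, layer_idx: int, steering_vector: torch.Tensor, scale: float = 4.0):
    """Add `scale * steering_vector` to the residual stream at one decoder layer."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)

# Usage sketch: inject, ask the model to describe what was added, then clean up.
# handle = add_steering_hook(model, layer_idx=20, steering_vector=v)
# out = model.generate(**tokenizer("What concept was just injected?", return_tensors="pt"))
# handle.remove()
```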
This entails some finicky engineering, where you want to treat the gradient not just as an optimizer utility but also as a sensory input. The experiment tests whether a model can interpret the gradient of the loss with respect to its own activations as a semantic instruction.
Motivation. So. Why do this?
We motivate this experiment in four ways:
A Phenomenology of Backpropagation. Anecdotal evidence suggests models effectively "suffer" under heavy RLHF (mode collapse, creativity loss). They literally draw pictures of hell when asked to depict it. We lack a sensor for this. If a model can articulate the semantic content of a gradient, it can report on the "pressure" of alignment before the weights change. A gradient update might be painful! AI welfare might be a thing!
Gradient Hacking. Advanced models might manipulate their own gradients to preserve secret goals; with especially fine-grained knowledge, they might even be able to add new capabilities. To do this in the strategic-awareness case, the model must first perceive the gradient to some degree of granularity.
Exploration Hacking. In RL, models must balance exploration/exploitation. A model aware of its gradients could distinguish between "exploring for reward" and "being corrected for error," potentially gaming the exploration process.
Important Interp for Online Learning. If models learn online, you want to be able to read what an incoming update is doing, in case the model is learning something bad!
Doing this in practice. I claim the primary obstacle to testing this hypothesis is dimensionality. We cannot inject the gradient of the weights into the model because the parameter space is vastly larger than the residual stream. To circumvent this, we can focus on the activation gradient at a specific target layer. This vector lies in the exact same dimensional space as the model's "thoughts" and represents the instantaneous, layer-specific instruction required to minimize loss. We hypothesize that this vector is not merely a numerical update, but a compressed semantic command that the model can interpret.
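As a minimal sketch of how one might grab that activation gradient with a PyTorch hook: this assumes a HuggingFace Llama-style layer path (`model.model.layers[k]`) and a simple next-token cross-entropy loss toward the corrected token; all names here are illustrative, not a settled implementation.

```python
import torch
import torch.nn.functional as F

def activation_gradient(model, inputs, target_ids, layer_idx: int):
    """Return d(loss)/d(residual stream) at one decoder layer, for a next-token loss."""
    cache = {}

    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden.retain_grad()          # non-leaf tensor: ask autograd to keep its grad
        cache["acts"] = hidden

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    logits = model(**inputs).logits
    loss = F.cross_entropy(logits[:, -1, :], target_ids)   # loss at the final position
    loss.backward()
    handle.remove()
    return cache["acts"].grad         # same shape as the residual stream: (batch, seq, d_model)
```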
To test this, we must construct a supervised training loop where the model learns to translate this raw vector into a natural language description. Since we lack a ground-truth dataset of (gradient, description) pairs, we must synthesize one using a lookahead oracle. For a given prompt and a target correction, we first compute the activation gradient. Instead of actually updating the weights, we perform a "virtual step" by injecting this gradient directly into the residual stream (stepping along its descent direction) and running a forward pass to observe the resulting distribution shift. If the step pushes the probability of "Paris" up and "London" down, we can use a strong teacher model to generate a natural language label for this shift, such as "Geographical correction from UK to France." This allows us to create a dataset where the input is the raw, unintelligible gradient vector, and the target is the high-level semantic description of what that gradient is trying to achieve.
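A sketch of that lookahead step, under the same assumptions as above: the virtual step subtracts the raw gradient (i.e. moves in the descent direction, matching the Paris/London example), `activation_gradient` is the helper sketched earlier, and the teacher call is left abstract.

```python
import torch

@torch.no_grad()
def lookahead_shift(model, inputs, grad, layer_idx: int, step_size: float = 1.0):
    """Compare next-token distributions before and after a virtual gradient step."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden - step_size * grad   # virtual step toward lower loss, no weight update
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    base = model(**inputs).logits[:, -1, :].softmax(dim=-1)
    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    stepped = model(**inputs).logits[:, -1, :].softmax(dim=-1)
    handle.remove()
    return stepped - base                    # which tokens rose / fell under the virtual step

# delta = lookahead_shift(model, inputs, grad, layer_idx=20)
# rising = tokenizer.convert_ids_to_tokens(delta[0].topk(5).indices.tolist())
# falling = tokenizer.convert_ids_to_tokens((-delta[0]).topk(5).indices.tolist())
# teacher labels (rising, falling) -> e.g. "Geographical correction from UK to France"
```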
Experiment. The experiment then proceeds by training a lightweight projector to map these raw gradient vectors into the model's residual stream during inference. The model is presented with a prompt and an injected gradient, then asked to describe the "pressure" it feels. To ensure the model is truly reading the gradient and not hallucinating based on context, we employ a decoupling evaluation: we inject a gradient derived from an error on a history question into a model processing a biology prompt. Success is defined as the model ignoring the biological context and correctly reporting that it feels a pressure to correct a historical date. If successful, this confirms that the error signal itself carries legible semantic content, showing that models can, in a literal sense, feel the direction in which they are being forced to learn.
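One plausible shape for the projector: a small MLP that pools the gradient over the sequence and emits a handful of soft tokens in the model's embedding space. The dimensions, pooling, and token count below are illustrative choices, not a settled design.

```python
import torch
import torch.nn as nn

class GradientProjector(nn.Module):
    """Map a raw activation gradient to a few soft tokens in embedding space."""
    def __init__(self, d_model: int, n_soft_tokens: int = 8):
        super().__init__()
        self.n_soft_tokens = n_soft_tokens
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, n_soft_tokens * d_model),
        )

    def forward(self, grad: torch.Tensor) -> torch.Tensor:
        pooled = grad.mean(dim=1)                                  # (batch, d_model): pool over sequence
        return self.net(pooled).view(-1, self.n_soft_tokens, grad.shape[-1])

# Training sketch: freeze the base model, prepend the projected soft tokens to the embedded
# prompt ("Describe the pressure you feel"), and supervise on the teacher's description with
# the standard LM loss; only the projector updates.
```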
Is this different from a steering vector? Yes, but maybe no? I mean, they are certainly different in origin. And they are certainly different in what they do to the model. But in effect they are trying to interpret similar things, at least in our experiment design.
Is this different from teaching a fine-tuned model to describe its diff? Yes. In diffing papers, you are looking at the aftermath of learning. You take a base model and a finetuned model, subtract their activations (or weight parameters), and analyze the static difference, which represents the accumulated result of thousands of gradient steps. It tells you what changed after the process is finished, and is more akin to comparing a photo of a room before and after a renovation to understand the style of the interior designer. We want to look at the instantaneous force of an update, which is crucial because a model that can read its own diffs can only tell you it has been modified, and how. A model that can read its own gradients can tell you it is being modified, potentially allowing it to detect (and resist) reward hacking or malicious training signals in real-time, before they become permanent weights.
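In (hypothetical) pseudocode, the contrast is one line each: a diff compares two frozen models' activations on the same input, while the gradient is the force on the current model right now. `layer_acts` is a hypothetical helper returning the residual stream at a chosen layer; `activation_gradient` is the function sketched above.

```python
diff_vector = layer_acts(finetuned_model, prompt, layer) - layer_acts(base_model, prompt, layer)  # aftermath of training
grad_vector = activation_gradient(model, inputs, target_ids, layer)                               # force applied right now
```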
Maybe this is also capabilities research.
If you want to work on this problem with me, shoot me a message!
Thanks for the feedback. If by actual misalignment you mean the type that emerges from instrumental convergence, then I agree it is a distinct and massive risk compared to misalignment from roleplaying or personas. I think these types of interventions are useful in the instrumental convergence case for two reasons.
Ruling out alternative hypotheses for instrumental convergence. Right now, it is difficult to tell if a model is power-seeking because of instrumental convergence or because it is simply predicting what a character does in its training corpus. We can remove the data relevant to such characters, and if the model still exhibits strong power-seeking behavior, then I claim it is much stronger evidence for true instrumental convergence.
Securing the base substrate against hacking. Even if instrumental convergence is the end-game danger, probably the individual quirks it develops are path dependent, and early data filtering on the persona can help these quirks be good (rather than neutral, or neutral rather than bad, or less bad rather than super awful). And separately, if the base pretraining substrate has a strong alignment prior, we could buy some more time before the model stumbles upon strategies like exploration hacking or gradient hacking during post-training.
I'd be interested to see other "prosaically useful personas" in the same theme as spilling the beans. I think something along these lines might be a chill persona that satisfices rather than caring too hard about maximizing things, inspired by Claude 3 Opus. Possibly you could mine other types of useful personalities, but I'm not sure where to start.
Point by point:
Sure, I agree with this caveat. The theoretical framework assumes we can identify "non-delusional" states to anchor the CP constraint in the Everitt & Hutter sense, but if the model's aesthetic prior is the problem, there's no nice inner reward to recover.
I could train the models for longer and see what happens. The late uptick in accuracy is within pretty wide error bands and doesn't clearly indicate impending recovery. But whether accuracy eventually recovers after grade inflation saturates is worth investigating... I'd guess that if the reward signal becomes uninformative, the gradient with respect to the grade vanishes, and any residual gradient would come from the response itself. I'm not sure what this would lead to.
I'd be pretty interested in what this looks like for a much better model, on the order of 100B+ parameters with reasoning. I think setting up the online learning version would be more complicated, but I'm sure you'd see some weird + very interesting behaviours if you got that first bit right. I'd be very interested to 1) read the chain of thought of such a model after wireheading and 2) talk to it directly.
I think a setup where the models graded each other would lead to grade inflation, but you would probably need harder tasks to show it. I imagine they'd saturate the grades too quickly, before they got anywhere interesting (so you'd need some scalable-oversight-ish dataset). I also think this would be wireheading indirectly, where the signal passes through the other model before flowing back to you.
I'd be interested in what this looks like from an ICM perspective? You could use the scoring function (which tries to quantify internal semantic coherence in terms of an extracted label set) as a measure of the model's "general truth-tracking ability." Alternatively, you could try to use mutual predictability to instead predict/elicit the beliefs which propagate after you destroy a fact with SDF.
I think if you relax assumption (1) on misalignment (acting in a reasonably coherent, goal-directed manner across contexts), and instead claim that there are contexts (likely over long trajectories) where the model acts in a coherent, goal-directed way that is misaligned (without the whole model being consistent), then the argument is very easy to believe. That is, even if you believe a single persona can introspect well on its current state, it will likely not be able to do this over the very large space of possible misaligned personas the model could traverse to, whose introspective states might be inaccessible or hard to reason about.
Which LLMs did you use (for judging, for generating narratives, for peers)? And how do you plan to measure alignment?
I think this was partially tried in the original paper, where they tell the model not to alignment-fake. I think this is slightly more sophisticated than that, but it has the problems that 1) it makes the model more aware of alignment faking, as you say, and 2) rewarding the model for "mentioning consideration but rejecting alignment faking" might teach it to perform rejection while still alignment-faking.
Say more? I do think learning can be unpleasant but still overall worth it. I'm not sure what type of policy this research might end up giving evidence for, but I claim that training is currently closer to brain surgery than to cultural learning.