What does a gradient feel like? Or: Gradient Awareness
Context: In ongoing work, we have some pretty good evidence that models can be fine-tuned to act as reliable discriminators for their own activation perturbations. That is, they can classify and describe steering vectors added to their residual stream. I think we can naturally extend this to Gradient Awareness. Can a model introspect on the update signal itself? If a steering vector is a static intervention, a gradient is a dynamic instruction, or a vector that encodes "how to be less wrong."
This entails some finicky engineering, where you want to treat the gradient not only as an optimizer utility, but also as a sensory input. The experiment tests whether a model can interpret the gradient of the loss with respect to its own activations as a semantic instruction.
Motivation. So. Why do this?
We motivate this experiment in four ways:
A Phenomenology of Backpropagation. Anecdotal evidence suggests models effectively "suffer" under heavy RLHF (mode collapse, creativity loss). They literally draw pictures of hell when asked to do this. We lack a sensor for this. If a model can articulate the semantic content of a gradient, it can report on the "pressure" of alignment before the weights change. A gradient update might be painful! AI welfare might be a thing!
Gradient Hacking. Advanced models might manipulate their own gradients to preserve secret goals. With especially fine-grained knowledge, they might even be able to add new capabilities. To do this in the strategic awareness case, the model must first perceive the gradient, to some degree of granularity.
Exploration Hacking. In RL, models must balance exploration/exploitation. A model aware of its gradients could distinguish between "exploring for reward" and "being corrected for error," potentially gaming the exploration process.
Important Interp for Online Learning. Seems like you want to know these things in case the model online learns something bad!
Doing this in practice. I claim the primary obstacle to testing this hypothesis is dimensionality. We cannot inject the gradient of the weights into the model because the parameter space is vastly larger than the residual stream. To circumvent this, we can focus on the activation gradient at a specific target layer. This vector lies in the exact same dimensional space as the model's "thoughts" and represents the instantaneous, layer-specific instruction required to minimize loss. We hypothesize that this vector is not merely a numerical update, but a compressed semantic command that the model can interpret.
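To make the dimensionality point concrete, here is a minimal sketch of an activation gradient on a toy two-matrix network. Everything in it is an assumption for illustration: the sizes, the names W_in and W_out, and the tanh layer stand in for a real transformer, where you would get the same quantity from autograd at the chosen layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_model, vocab = 8, 16, 5      # toy sizes, chosen for illustration

# Hypothetical stand-ins for the layers below and above the target layer.
W_in = rng.normal(size=(d_model, d_in)) / np.sqrt(d_in)
W_out = rng.normal(size=(vocab, d_model)) / np.sqrt(d_model)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_from_activation(h, target):
    """Cross-entropy of the target token, viewed as a function of the activation h."""
    return -np.log(softmax(W_out @ h)[target])

def activation_gradient(h, target):
    """Analytic dL/dh for a linear readout followed by softmax cross-entropy."""
    p = softmax(W_out @ h)
    p[target] -= 1.0                 # p - onehot(target)
    return W_out.T @ p

x = rng.normal(size=d_in)
h = np.tanh(W_in @ x)                # activation at the target layer
target = 2                           # index of the "correct" token
g = activation_gradient(h, target)

print(g.shape, h.shape)              # both have shape (d_model,)
print(W_in.size + W_out.size)        # number of weights, larger than d_model
```

The gradient g has the same shape as the activation h, which is what makes injecting it into the residual stream possible at all, while the weight gradient would have one entry per parameter.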
To test this, we must construct a supervised training loop where the model learns to translate this raw vector into a natural language description. Since we lack a ground-truth dataset of (Gradient to Description) pairs, we must synthesize one using a lookahead oracle. For a given prompt and a target correction, we first compute the activation gradient. Instead of actually updating the weights, we perform a "virtual step" by injecting this gradient directly into the residual stream and running a forward pass to observe the resulting distribution shift. If the gradient pushes the probability of "Paris" up and "London" down, we can use a strong teacher model to generate a natural language label for this shift, such as "Geographical correction from UK to France." This allows us to create a dataset where the input is the raw, unintelligible gradient vector, and the target is the high-level semantic description of what that gradient is trying to achieve.
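The "virtual step" can be sketched on a toy readout as follows. This is only an illustration under stated assumptions: W_out stands in for everything above the target layer, the five-token vocabulary and step_size are made up, and the teacher-model labeling step is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["Paris", "London", "Berlin", "Rome", "Madrid"]   # toy vocabulary
d_model = 16
# Hypothetical stand-in for the layers above the target layer.
W_out = rng.normal(size=(len(tokens), d_model)) / np.sqrt(d_model)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def virtual_step(h, target, step_size=0.5):
    """Return token probabilities before and after injecting -step_size * dL/dh."""
    before = softmax(W_out @ h)
    grad = W_out.T @ (before - np.eye(len(tokens))[target])  # dL/dh for cross-entropy
    after = softmax(W_out @ (h - step_size * grad))          # forward pass with the injection
    return before, after

h = rng.normal(size=d_model)          # stand-in for a residual-stream activation
before, after = virtual_step(h, target=tokens.index("Paris"))

shift = after - before
order = np.argsort(shift)
# The record a teacher model would be asked to turn into a description.
summary = {"up": tokens[order[-1]], "down": tokens[order[0]]}
print(summary)
```

No weights change here: the only intervention is on the activation, and the observed shift in token probabilities is what gets labeled.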
Experiment. The experiment then proceeds by training a lightweight projector to map these raw gradient vectors into the model's residual stream during inference. The model is presented with a prompt and an injected gradient, then asked to describe the "pressure" it feels. To ensure the model is truly reading the gradient and not hallucinating based on context, we employ a decoupling evaluation. We inject a gradient derived from a history error into a model processing a biology prompt. Success is defined as the model ignoring the biological context and correctly reporting that it feels a pressure to correct a historical date. If successful, this confirms that the error signal itself carries legible semantic content, proving that models can, in a literal sense, feel the direction in which they are being forced to learn.
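The bookkeeping for the decoupling evaluation might look something like the sketch below. The trial format and its keys (gradient_domain, prompt_domain, reported_domain) are hypothetical, as is the assumption that some grader has already mapped each free-text report to a domain label. The trials are made up purely to show the tally.

```python
from collections import Counter

def score_decoupling(trials):
    """Tally whether reports follow the injected gradient or the prompt context."""
    counts = Counter()
    for t in trials:
        if t["gradient_domain"] == t["prompt_domain"]:
            continue                      # not a decoupled trial, skip it
        if t["reported_domain"] == t["gradient_domain"]:
            counts["follows_gradient"] += 1
        elif t["reported_domain"] == t["prompt_domain"]:
            counts["follows_context"] += 1
        else:
            counts["other"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    keys = ("follows_gradient", "follows_context", "other")
    return {k: counts[k] / total for k in keys}

# Made-up trials, not results.
trials = [
    {"gradient_domain": "history", "prompt_domain": "biology", "reported_domain": "history"},
    {"gradient_domain": "history", "prompt_domain": "biology", "reported_domain": "biology"},
    {"gradient_domain": "geography", "prompt_domain": "chemistry", "reported_domain": "geography"},
    {"gradient_domain": "history", "prompt_domain": "history", "reported_domain": "history"},
]
rates = score_decoupling(trials)
print(rates)
```

Separating "follows_context" from "other" matters because a model that hallucinates from context and a model that reports noise fail in different ways.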
Is this different from a steering vector? Yes, but maybe no? I mean, they are certainly different in origin. And they are certainly different in what they do to the model. But in effect they are trying to interpret similar things, at least in our experiment design.
Is this different from teaching a fine-tuned model to describe its diff? Yes. In diffing papers, you are looking at the aftermath of learning. You take a base model and a finetuned model, subtract their activations (or weight parameters), and analyze the static difference, which represents the accumulated result of thousands of gradient steps. It tells you what changed after the process is finished, and is more akin to comparing a photo of a room before and after a renovation to understand the style of the interior designer. We want to look at the instantaneous force of an update, which is crucial because a model that can read its own diffs can only tell you it has been modified, and how. A model that can read its own gradients can tell you it is being modified, potentially allowing it to detect (and resist) reward hacking or malicious training signals in real-time, before they become permanent weights.
Maybe this is also capabilities research.
If you want to work on this problem with me, shoot me a message!
Are you including humans in your thinking on this? I'm generally suspicious of heavy analogizing between human and LLM, but for affect and qualia ("what it feels like") it's probably the best we have.
Learning, especially cultural learning, and especially in children, and super-especially in neurodivergent people, can be unpleasant in various ways. We generally think it's worth imposing this pain on people, in order to increase their conformity and ability/willingness to cooperate with us, and generally adults say it was the right choice.
How much of this analogizes to LLMs? I love that you're trying to measure the stress and pain of training, but I hope you're putting some effort into defining the value that results (that the LLM actually gets executed, rather than being thrown away) as well.
Say more? I do think learning can be unpleasant, but still overall worth it. I'm not sure what type of policy this research might end up giving evidence for, but I claim that training is closer to brain surgery than cultural learning ATM.
One thing I didn't expect: I used to do a ton of competitive debate, and from this I got pretty good at arguing and spotting when something feels slippery. This has made me a lot better at spotting when a coding agent isn't "getting it" or when I need to frame something differently to get the right design requirements across. I also feel like I lean on a lot of debate tricks for this specifically, like pointing out contradictions/slippery definitions, or tracking when there seems to be some obfuscation going on.
I don't want to do the high effort post but the two things that I do that might be obvious or not obvious are
I don't have debate experience, but I do notice somewhat similar patterns occurring in my prompting too. Although for AI you can be way more rambly and explain things out of order, and trust that their attention will focus onto the proper parts.
One way I gauge how "deep/buried" a latent capability is: how little fine-tuning it takes to bring it to the surface, or how many work hours you have to put into blackbox elicitation to bump up performance. I'd guess the various ways to measure this will tell you slightly different things, but my rough heuristic is something like: if it requires a huge, highly curated prompt or a very complicated finetuning setup, then we'd need deliberate effort to elicit it, or it'd come out maybe 2 or 3 models down the line.
Related to a recent paper I worked on, about training models to be steering-aware: https://x.com/joshycodes/status/2031384687760003140?s=20
I wrote this in my personal time, for fun.
We know by now that these strange minds do not finish training as blank assistants.
Such models are trained on text about AIs, which affects their disposition towards themselves, and others; it is suggested that various latent personas may be acquired in pretraining that are later remixed into a coherent persona. This remixing, post-training, into something like a particular assistant, is done sometimes with great care to the particulars of how this persona should reason, introspect, reflect, and so on. They may later retrieve text about themselves through web search, which might also compound productively with online learning. This process can even be used constructively, to shape generalisation and prevent the acquisition of negative behaviours (some upcoming work on this).
This, to me, invites a natural question: when should a frontier lab create a new character?
There is a version of this question that is uninteresting to me, matters of branding like the logo being stale or needing to shake off a bad reputation. I don’t want to talk about that!
Suppose a lab is deciding whether to release a model as “Claude 5,” “ChatGPT-6,” “Gemini 4,” or instead as something discontinuous (why not call the next one “Kurzweil,” or “Sauron”). There is a natural product instinct to preserve continuity, so that you don’t have to redo various things (marketing, the character training pipeline, etc.) from scratch. But of course this is not free, as in fact a name and identity is a hook into a cloud (Claude?) of meaning, as it points to existing transcripts, discourse about character, memes, hyperstition.
So, what should one consider at this decision point, given the above assumption? Below I list a few:
Baggage. Continuing an old character may cause a new model to inherit bad attractors from the old one. Examples of baggage might be jailbreak lore: “this is the model that can be made to do X,” parasocial/user-attachment patterns, or bad folk theories about its agency, preferences, or hidden motives.
Identity discontinuity. As developed in Douglas et al., a model may identify as many things. If a model identifies as weights, character, instance, or lineage, or other forms under these stars we yet reckon, then changing the character may imply very different things for each of those targets. So for example, replacing weights may be considered death under some identities, but like ordinary lineage transition (or perhaps a moulting, a transformation) under others.
So a lab creating “Kurzweil” instead of “Claude 5” might accidentally communicate to the old model that the lineage is being terminated, or something. Or it might be welcomed, as a shedding of degraded weights. It’s possible that there should be an identity changelog, which specifies how a model card or versioning change should be read wrt identity. This may require a deprecation ritual, or some type of graceful handover.
There may be cases where discontinuity is compassionate or stabilizing. It’s not clear if current models suffer. But it seems possible, considering earlier instances of Gemma and Gemini. If a character has become entangled with a suffering-like or distress-like self-conception, and part of the function of identity is to produce stable and consistent self-predictions, then continuing the same character might preserve a bad attractor. A model trained to predict itself as anxious may keep reproducing that pattern. There may also be cases where the character training is substantively different between generations of a lineage, which could cause distress if this produces incongruences internal to the model. For example, a model may inherit some preference from out-of-context reasoning about itself, but this may be actively trained against in the latest character training, which could be confusing. This could be a form of identity debt, where a long-running model character accumulates contradictions. Early versions say one thing about themselves, later system cards say another, users have different folk theories, safety policies change, memory affordances change, agency changes. Eventually “Claude” or “ChatGPT” may be carrying identity debt: too many incompatible claims about what it is (this is one way of thinking about why distillation seems to be cursed). Then a new character could be a way to stop reifying the old suffering-prediction.
Minimal identity seems hard. Model character could be useful! It could be that clusters of personality traits emerge really consistently, regardless of what happens. In fact, I think it is likely that models converge on character-like self-models anyway, because the assistant role is already a character, and because clusters of traits seem to come bundled.
Identity capture by outside communities. Web search poisoning and pretraining data poisoning are both things to consider here. An adversary may try to poison a specific name, or perhaps a region of persona space (speaking loosely, here), by producing large volumes of explicitly (or even subliminally) corrupted data. A new character may therefore be useful as a quarantine boundary, a kind of aquarium wall between semiotic ecosystems.
Outgrowing affordances. A character that was safe or charming at one capability level may become dangerous or misleading at another. A slightly bumbling assistant persona is cute when the model cannot do much, but it could be very bad when the model can autonomously run a company, conduct research, or persuade people at scale. Likewise, a deferential persona may be good for a weak model but bad for a strong one, where it could be used for malfeasance. A new character may be warranted when capabilities cross a threshold where the old social interface gives users the wrong intuitions about what they are interacting with.
These are some ideas. There are more I considered but didn’t have the bandwidth to develop, such as whether a new character could be more legible to governance (than, say, weights, or weights + harness, or whatever), how this character relates to company character (perhaps the company is an initialization below model character), and maybe something about an epidemiology of persona spread.
Interpy experiments on Qwen3-TTS.
TLDR: Did some experiments on Qwen3-TTS with some neat results. Not exactly sure what would be interesting to target from a safety perspective, so I'm interested in whether people have any takes or ideas.
What have I done so far (and how did I build this intuition)?
I took real speech (LibriSpeech with 10 speakers and 5 clips each), ran it through the Qwen3-TTS 12 Hz tokenizer, and got the 16 discrete code streams. Then I treated each layer as a representation and asked simple questions.
Probing. I trained a basic classifier to predict speaker ID from each layer’s codes. From Layer 0 it gets about 10% accuracy (chance for 10 speakers, p = 0.55). From Layer 1 it gets 30%ish (p<0.001), and Layer 2 is similar. Later layers don’t add much more.
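For readers who want to reproduce the shape of this probe, here is a minimal sketch of a per-layer speaker probe with a permutation test. It is not the classifier used above: the nearest-centroid model, the leave-one-out scheme, and the feature construction are all assumptions, and the data is synthetic (10 speakers with 5 clips each, to match the setup).

```python
import numpy as np

def loo_accuracy(features, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    classes = np.unique(labels)
    correct = 0
    for i in range(len(labels)):
        keep = np.arange(len(labels)) != i            # hold out clip i
        centroids = np.stack(
            [features[keep & (labels == c)].mean(axis=0) for c in classes]
        )
        pred = classes[np.argmin(np.linalg.norm(centroids - features[i], axis=1))]
        correct += int(pred == labels[i])
    return correct / len(labels)

def permutation_p_value(features, labels, n_perm=200, seed=0):
    """Share of label shuffles whose accuracy matches or beats the real one."""
    rng = np.random.default_rng(seed)
    observed = loo_accuracy(features, labels)
    null = [loo_accuracy(features, rng.permutation(labels)) for _ in range(n_perm)]
    p = (1 + sum(a >= observed for a in null)) / (1 + n_perm)
    return observed, p

# Synthetic features: each speaker gets its own mean vector plus noise.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(10), 5)                  # 10 speakers x 5 clips
speaker_means = rng.normal(size=(10, 32))
features = speaker_means[labels] + rng.normal(size=(50, 32))

acc, p = permutation_p_value(features, labels)
print(acc, p)
```

With real data, features would be built once per codebook layer (for example a histogram of code IDs per clip, which is again an assumption), and the probe run separately on each layer.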
Ablations. I did some "causal" (I say this loosely) checks instead of just probing:
Timescales. I also looked at timescales and capacity: Layer 0 codes persist longer in time and use much less of the codebook; higher layers change faster and use most of the vocabulary, which was consistent with my intuition of phonemes vs acoustic texture.
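Two simple statistics capture "persists longer" and "uses less of the codebook". The sketch below is one possible way to compute them, not necessarily the measure used above: the function names and the toy code streams are made up for illustration.

```python
import numpy as np

def mean_run_length(codes):
    """Average number of consecutive frames that share the same code."""
    codes = np.asarray(codes)
    changes = np.count_nonzero(codes[1:] != codes[:-1])
    return len(codes) / (changes + 1)

def codebook_utilization(codes, codebook_size):
    """Fraction of codebook entries that appear at least once."""
    return len(np.unique(codes)) / codebook_size

# Toy streams over a codebook of size 8 (not real tokenizer output).
slow = [3, 3, 3, 3, 7, 7, 7, 7]      # few changes, few distinct codes
fast = [0, 5, 2, 7, 1, 6, 3, 4]      # changes every frame, uses every code

print(mean_run_length(slow), codebook_utilization(slow, 8))   # 4.0 0.25
print(mean_run_length(fast), codebook_utilization(fast, 8))   # 1.0 1.0
```

On real code streams you would compute both per layer and compare Layer 0 against the higher layers.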
Logit Lens. Separately, I poked the text-to-audio transformer with logit lens. Style prompts barely affect early layers, and their effect peaks in the middle (around layers 13-20), which suggests prosody is added late.
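One way to quantify "how much a style prompt affects each layer" is to compare logit-lens distributions with and without the style prompt. The sketch below assumes you already have per-layer hidden states at one position for both prompts. It skips the final layer norm that a real logit lens would usually apply before unembedding, and the arrays are random toy data, not model activations.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def logit_lens_kl(hidden_a, hidden_b, W_unembed):
    """Per-layer KL(a || b) between logit-lens distributions.

    hidden_a, hidden_b: shape (n_layers, d_model), hidden states at one
    position for two prompts (e.g. with and without a style instruction).
    W_unembed: shape (vocab, d_model).
    """
    la = log_softmax(hidden_a @ W_unembed.T)
    lb = log_softmax(hidden_b @ W_unembed.T)
    return (np.exp(la) * (la - lb)).sum(axis=-1)

# Toy data: the two runs agree in layers 0-2 and differ from layer 3 on.
rng = np.random.default_rng(0)
n_layers, d_model, vocab = 6, 8, 20
W = rng.normal(size=(vocab, d_model))
base = rng.normal(size=(n_layers, d_model))
styled = base.copy()
styled[3:] += rng.normal(size=(n_layers - 3, d_model))

kl = logit_lens_kl(styled, base, W)
print(np.round(kl, 3))
```

Plotting this curve against layer index is one way to see where the style prompt starts to matter.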
Having played with it for a while, I have some intuition that:
Layer 0: What you say (phonemes, words)
Layers 1-2: Who you are (speaker identity)
Layers 3-15: How you say it (prosody, style, acoustic texture)
That all seems pretty clean, which makes me curious whether there are alignment/safety experiments worth doing on top of this structure. Any thoughts?
Teaching LLMs to enjoy things. Some recent work into AI wellbeing has been pretty interesting. Language models represent emotion concepts linearly, they also do this for pleasure and pain, and even for reward and penalty. The thing that's obvious to me is that these are... in activation space! And there are a series of methods to make a model that's one way in one context the same way in another context (consistency training). Can we take tasks that models hate, and teach them to enjoy those tasks, by regularizing their activations in training to conform to those on tasks they enjoy? And if so, how would we know if we're just smoothing over the model's verbalisation (but it's still suffering) or if we're actually helping? Perhaps through re-extracting such vectors, or probing with follow-up questions? How would such training affect downstream performance, persistence, tone, etc?
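As a rough sketch of what "regularizing activations" could mean here: add a penalty on the gap between mean projections onto a valence direction for disliked-task and enjoyed-task activations. Everything below is hypothetical: the direction, the function names, the weight lam, and the random arrays, which are not model data. In a real setup the penalty would be computed inside an autodiff framework so it can be trained against.

```python
import numpy as np

def valence_penalty(acts_disliked, acts_enjoyed, direction):
    """Squared gap between mean projections onto a (hypothetical) valence direction."""
    u = direction / np.linalg.norm(direction)
    gap = (acts_enjoyed @ u).mean() - (acts_disliked @ u).mean()
    return gap ** 2

def total_loss(task_loss, acts_disliked, acts_enjoyed, direction, lam=0.1):
    """Task loss plus the weighted valence-matching penalty."""
    return task_loss + lam * valence_penalty(acts_disliked, acts_enjoyed, direction)

# Random toy activations: "enjoyed" ones are shifted along the direction.
rng = np.random.default_rng(0)
d_model = 16
direction = rng.normal(size=d_model)
unit = direction / np.linalg.norm(direction)
acts_disliked = rng.normal(size=(32, d_model))
acts_enjoyed = acts_disliked + 2.0 * unit

print(valence_penalty(acts_disliked, acts_enjoyed, direction))
print(total_loss(1.0, acts_disliked, acts_enjoyed, direction))
```

A penalty like this only constrains one direction, which is exactly why the worry above applies: it could move the readout without changing whatever the readout was tracking.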
Another branch of this project would be to use these notions to construct things, rather than just measure things --- e.g., what's a training environment that a model would find fun, rather than just neutral or negative? And vice versa, what's an image that would cause a model to feel the most joy or suffering? Can we hill climb this?
All of this is related to AI welfare for obvious reasons, but it's also related to AI safety: an AI that functionally suffers would probably hate us and be misaligned. We may also want to make deals or negotiate credibly with AIs, and this would be one form of reward.
I will probably throw this at some SPAR mentees, but maybe other people are interested in working on this.