jacob_drori's Shortform

by jacob_drori
1st Aug 2025
1 min read
jacob_drori · 2mo

Operationalizing the definition of a shard

Pope and Turner (2022) define a shard as follows:

A shard of value refers to the contextually activated computations which are downstream of similar historical reinforcement events.

To operationalize their definition, we must decide exactly what we mean by contextually activated computations, and by similar reinforcement events. One could philosophize here, but I'll just pick somewhat-arbitrary definitions and run with them, for the sake of quickly producing something I could turn into code.

Following Apollo/Goodfire, I will identify contextually activated computations with certain directions in parameter space. These directions might be found using APD, L3D, SPD, or some future method.

I am not sure when to consider two reinforcement events (i.e. RL reward-assignments) "similar". But if we replace "reinforcement events" with "RL updates", there is a natural definition:  cluster RL parameter updates, and call two updates similar if they are in the same cluster. A given update should be allowed to be in multiple clusters, and we suspect parameter updates enjoy some linear structure. Therefore, a natural clustering method is an SAE.[1] Then, a shard is a decoder vector of an SAE trained on RL parameter updates.
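
To make this concrete, here is a rough sketch of what "train an SAE on RL parameter updates" could look like. Everything below is illustrative rather than a worked-out method: `param_updates` is assumed to be a matrix of flattened parameter-update vectors collected during RL training, and the top-k SAE is just one simple choice of sparse autoencoder.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal top-k sparse autoencoder. Each decoder column is a direction in
    parameter space; under the definition above, these are the candidate shards."""
    def __init__(self, d_param: int, n_latents: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_param, n_latents)
        self.dec = nn.Linear(n_latents, d_param, bias=False)
        self.k = k

    def forward(self, x):
        pre = self.enc(x)
        topk = torch.topk(pre, self.k, dim=-1)   # keep only the k most active latents
        acts = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values.relu())
        return self.dec(acts), acts

def train_sae(param_updates: torch.Tensor, n_latents=4096, k=32, steps=1000, lr=1e-3):
    """param_updates: [n_updates, d_param] tensor of flattened RL parameter updates
    (e.g. theta_after - theta_before for each RL step), assumed collected elsewhere."""
    sae = TopKSAE(param_updates.shape[1], n_latents, k)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        batch = param_updates[torch.randint(len(param_updates), (64,))]
        recon, _ = sae(batch)
        loss = (recon - batch).pow(2).mean()     # plain reconstruction loss
        opt.zero_grad(); loss.backward(); opt.step()
    return sae.dec.weight.T                      # [n_latents, d_param]: candidate shards
```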

Annoyingly, parameter vectors are much larger than the activation vectors that SAEs are usually trained on, which poses three potential issues:

  1. Storing enough parameter updates to train our SAE on takes a ton of memory.
  2. The SAE itself has a ton of parameters - too many to effectively train.
  3. It may be impossible to reconstruct parameter updates with reasonable sparsity (with $L_0\sim 100$, say, like standard SAEs).

To mitigate Issue 1, one could focus on a small subset of model parameters. One could also train the SAE at the same time as RL training itself, storing only a small buffer of the most recent parameter updates.

L3D cleverly deals with Issue 2 using low-rank parametrizations for the encoder and decoder of the SAE.
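
I haven't reproduced L3D's exact parametrization here, but the basic low-rank trick can be sketched as follows: factor the encoder and decoder through a rank-$r$ bottleneck, so the parameter count scales with $r$ rather than with $n_{\text{latents}}\cdot d_{\text{param}}$. Treat this as a sketch of the general idea, not of L3D itself.

```python
import torch.nn as nn

class LowRankSAE(nn.Module):
    """Sketch of a low-rank SAE: encoder and decoder are factored through rank-r maps,
    so parameters scale as O(r * (d_param + n_latents)) rather than O(n_latents * d_param).
    This illustrates the general low-rank idea, not necessarily L3D's parametrization."""
    def __init__(self, d_param: int, n_latents: int, rank: int):
        super().__init__()
        self.enc_down = nn.Linear(d_param, rank, bias=False)    # d_param -> rank
        self.enc_up = nn.Linear(rank, n_latents)                 # rank -> n_latents
        self.dec_down = nn.Linear(n_latents, rank, bias=False)   # n_latents -> rank
        self.dec_up = nn.Linear(rank, d_param, bias=False)       # rank -> d_param

    def forward(self, x):
        acts = self.enc_up(self.enc_down(x)).relu()
        return self.dec_up(self.dec_down(acts)), acts
```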

Issue 3 may turn out not to occur in practice. Mukherjee et al (2025) find that, when post-training an LLM with RL, parameter updates are very sparse in the neuron basis. Intuitively, many expect that such post-training merely boosts/suppresses a small number of pre-existing circuits. So perhaps we will find that, in practice, one can reconstruct parameter updates with reasonable sparsity.

All of this is to say: maybe someone should try training an SAE/applying L3D to RL parameter updates. And, if they feel like it, they could call the resulting feature vectors "shards".

  1. ^

    For each SAE latent, the corresponding cluster is the set of data points on which that latent is active.

  2. ^

    Shard theory is supposed to apply to humans, too, but here I focus on NNs since humans seem more complicated.

jacob_drori · 8d

LLMs linearly represent the accuracy of their next-token predictions

A quick experiment I did on Llama-3.2-1B:

  • Choose a layer, and let $\vec{x}_t$ be that layer's residual activation at token position $t$.
  • Let $l^{\text{correct}}_t$ be the logit at position $t$ that was assigned to whatever turned out to be the correct next token (i.e. the correct token at position $t+1$).
  • Use least-squares to fit a linear map that takes $\vec{x}_t$ and predicts $l^{\text{correct}}_{t-1}$ (a rough sketch is below).
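
Concretely, the fit looks something like the following sketch. The probing text, the choice of layer, and the lack of a train/test split are all illustrative; in practice you would pool over many sequences so that the number of tokens far exceeds the residual dimension.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tok = AutoTokenizer.from_pretrained(model_name)

text = "..."  # probing corpus (illustrative)
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

layer = 8                                    # choose a layer
x = out.hidden_states[layer][0]              # [seq, d_model] residual activations
logits = out.logits[0]                       # [seq, vocab]
targets = ids[0, 1:]                         # the correct next token at each position t
l_correct = logits[torch.arange(len(targets)), targets]   # l^correct_t for t = 0..seq-2

# Regress l^correct_{t-1} on x_t: does position t's residual stream encode how
# accurately the model predicted the token at position t?
X = x[1:].float()                            # x_t for t = 1..seq-1
y = l_correct.float()                        # l^correct_{t-1} for the same positions

Xb = torch.cat([X, torch.ones(len(X), 1)], dim=1)       # add a bias feature
w = torch.linalg.lstsq(Xb, y.unsqueeze(1)).solution     # least-squares fit
pred = (Xb @ w).squeeze(1)
r2 = 1 - (y - pred).pow(2).sum() / (y - y.mean()).pow(2).sum()
print(f"layer {layer}: R^2 = {r2:.3f}")
```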

Results for each layer are below. Layers in the second half of the model (except the last layer) have decent ($R^2>0.6$) linear representations of the correct logit.

[Figure: probe $R^2$ for each layer of Llama-3.2-1B.]

I don't find this result very surprising. I checked it because I thought it might explain what's going on in this very cool recent paper by @Dmitrii Krasheninnikov, Richard Turner and @David Scott Krueger (formerly: capybaralet). They finetune Llama-3.2-1B and show that the model linearly represents the order of appearance of finetuning data. If more recent data has lower loss, then perhaps my probing results explain theirs.

Gurkenglas · 7d

What don't LLMs linearly represent?

jacob_drori · 7d

I really thought I'd updated all the way towards "models linearly represent everything". But somehow I was still surprised by Dmitrii's training order result.

cubefox · 3d

Judging from the pictures, this could also be a quadratic fit.

jacob_drori · 3mo

Some issues with the ICL Superposition paper

Setup:

Xiong et al (2024) show that LLMs can in-context-learn several tasks at once. Consider e.g. the following prompt:

France -> F
Portugal -> Portuguese
Germany -> Berlin
Spain -> Madrid
Russia -> R
Poland -> Polish
Italy ->

A model will complete this prompt sometimes with Rome, sometimes with I, and sometimes with Italian, learning a "superposition" of the country -> capital, country -> first-letter and country -> language tasks. (I wish they hadn't used this word: the mech interp notion of superposition is unrelated).

Let $D_i$ be the proportion of the in-context examples that correspond to task $i$, and let $P_i$ be the probability that the model completes the prompt according to task $i$. Call a model calibrated if $P_i \approx D_i$: the probability it assigns to a task matches the proportion of in-context examples demonstrating that task. A measure of the degree of calibration is given by the KL divergence:

$$\mathrm{KL}(D\,\|\,P)=\sum_i D_i\log(D_i/P_i).$$

Lower KL means better calibration. 

Issue 1:

The authors show that, for a certain set of 6 tasks, and with 60 in-context examples, calibration improves with model size:

[Fig. 6 from the paper. X-axis labels show the model's parameter count. $D_1$ is the uniform distribution, $D_2$ puts probability 0.5 on the third task and 0.1 on each other task, $D_3$ is a distribution with probabilities alternating between 0.25 and 0.083.]

The best reported KL between the uniform distribution [0.167, 0.167, ..., 0.167] and the model's output distribution is around 1.5. To give a sense for what this means, here are four examples of distributions with such a KL:

$P_1=[0.004,\,0.004,\,0.004,\,0.33,\,0.33,\,0.33]$

$P_2=[0.02,\,\dots,\,0.02,\,0.9]$

$P_3=[10^{-5},\,0.19999,\,\dots,\,0.19999]$

$P_4=[0.004,\,0.008,\,0.016,\,0.032,\,0.2,\,0.74].$
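
For the record, the arithmetic behind these examples is easy to check; each distribution above gives $\mathrm{KL}\approx 1.5$ against the uniform distribution:

```python
import numpy as np

def kl(d, p):
    d, p = np.asarray(d, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(d * np.log(d / p)))

uniform = [1 / 6] * 6
examples = {
    "P1": [0.004, 0.004, 0.004, 0.33, 0.33, 0.33],
    "P2": [0.02, 0.02, 0.02, 0.02, 0.02, 0.9],
    "P3": [1e-5, 0.19999, 0.19999, 0.19999, 0.19999, 0.19999],
    "P4": [0.004, 0.008, 0.016, 0.032, 0.2, 0.74],
}
for name, p in examples.items():
    print(name, round(kl(uniform, p), 2))   # each comes out around 1.5
```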

Seeing these distributions convinced me that the models they studied are not well-calibrated. I felt a bit misled by the paper.

Issue 2:

Say we take turns with the model, repeatedly appending a new "question" (e.g. a country) to the prompt and allowing the model to generate an "answer" (e.g. the country's capital, language, or first letter) at default temperature 1.0. The authors state:

After the first token is generated, the model tends to converge on predicting tokens for a single task, effectively negating its ability for multi-task execution.

But this sort of mode-collapse behavior is not what we would observe from a well-calibrated model!

For simplicity, consider the case of only two tasks: A and B. Let $n_A$ and $n_B$ be the number of examples of each task prior to a given generation step. A calibrated model generates A with probability $\frac{n_A}{n_A+n_B}$, in which case $(n_A,n_B)\leftarrow(n_A+1,n_B)$. Otherwise, it generates B, and $(n_A,n_B)\leftarrow(n_A,n_B+1)$.

This is precisely the Pólya urn model. If the numbers of each task are initially $A_0$ and $B_0$, and we then generate for a long time, it can be shown that the limiting proportion of A tasks is a random variable $X$ distributed as follows:

$$P[X=x]\propto x^{A_0-1}(1-x)^{B_0-1}.$$

In the realistic setting where $A_0,B_0>1$, this distribution is peaked around $\frac{A_0-1}{A_0+B_0-2}$, i.e. roughly the proportion of the initial tasks that were A, and its density goes to zero at $x=0$ and $x=1$. So calibrated models don't collapse; they tend to maintain roughly the initial ratio of tasks.
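
If you prefer to see this without the urn-model math, here is a quick simulation of a perfectly calibrated generator (the initial counts are illustrative). The proportion of A settles near its initial value rather than collapsing to 0 or 1:

```python
import numpy as np

def calibrated_run(a0=3, b0=3, steps=500, trials=2000, seed=0):
    """Simulate a perfectly calibrated model: at each step, generate task A with
    probability n_A / (n_A + n_B), then increment that task's count (a Pólya urn)."""
    rng = np.random.default_rng(seed)
    props = []
    for _ in range(trials):
        n_a, n_b = a0, b0
        for _ in range(steps):
            if rng.random() < n_a / (n_a + n_b):
                n_a += 1
            else:
                n_b += 1
        props.append(n_a / (n_a + n_b))
    return np.array(props)

# Limiting proportion is Beta(a0, b0): peaked near (a0-1)/(a0+b0-2), with density
# vanishing at 0 and 1 when a0, b0 > 1 -- i.e. no collapse onto a single task.
props = calibrated_run()
print(props.mean(), np.quantile(props, [0.05, 0.95]))
```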
