In case anyone else didn't know what it meant for a set of binary strings to be "prefix-free", here's Claude's explanation, which I found helpful:
A set of binary strings is prefix-free if no string in the set is a prefix of any other string in the set.
Example:
- ✅ Prefix-free: {0, 10, 110, 111}
- ❌ Not prefix-free: {0, 01, 10} (because "0" is a prefix of "01")
Why does this matter for Turing machines?
The key is in how universal Turing machines work. A universal machine U simulates any other machine T by receiving input of the form i′q, where:
- i′ = prefix-free encoding of machine T's description
- q = the actual input to feed to machine T
U must parse this concatenated input to figure out: "Where does the machine description end and the actual input begin?"
Without prefix-free encoding: Given input "00101", U can't tell if the machine description is "0", "00", "001", etc. - there's ambiguity.
With prefix-free encoding: Once U reads a complete machine description, it knows exactly where that description ends and the input begins. No ambiguity, no delimiters needed.
This unique parseability is essential for universal machines to correctly simulate other machines, and it's crucial for Kolmogorov complexity theory where we need to measure program lengths precisely without parsing ambiguity.
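As a toy illustration of the unique-parseability point (my own sketch, not any standard TM encoding; the code table and names here are made up):

```python
# Toy prefix-free code: no codeword is a prefix of another, so the machine
# description can be peeled off the front of the input with no delimiter.
PREFIX_FREE_CODE = {"0": "M1", "10": "M2", "110": "M3", "111": "M4"}

def split_description_and_input(s: str) -> tuple[str, str]:
    """Return (machine description, remaining input)."""
    for end in range(1, len(s) + 1):
        if s[:end] in PREFIX_FREE_CODE:
            # Prefix-freeness guarantees this first match is the only possible one.
            return PREFIX_FREE_CODE[s[:end]], s[end:]
    raise ValueError("input does not start with a complete codeword")

print(split_description_and_input("1100101"))  # ('M3', '0101'): no ambiguity
```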
Pope and Turner (2022) define a shard as follows:
A shard of value refers to the contextually activated computations which are downstream of similar historical reinforcement events.
To operationalize their definition, we must decide exactly what we mean by contextually activated computations, and by similar reinforcement events. One could philosophize here, but I'll just pick somewhat-arbitrary definitions and run with them, for the sake of quickly producing something I could turn into code.
Following Apollo/Goodfire, I will identify contextually activated computations with certain directions in parameter space. These directions might be found using APD, L3D, SPD, or some future method.
I am not sure when to consider two reinforcement events (i.e. RL reward-assignments) "similar". But if we replace "reinforcement events" with "RL updates", there is a natural definition: cluster RL parameter updates, and call two updates similar if they are in the same cluster. A given update should be allowed to be in multiple clusters, and we suspect parameter updates enjoy some linear structure. Therefore, a natural clustering method is an SAE.[1] Then, a shard is a decoder vector of an SAE trained on RL parameter updates.
Annoyingly, parameter vectors are much larger than the activation vectors that SAEs are usually trained on, posing three potential issues:
1. Storing a dataset of parameter updates to train on requires a huge amount of memory.
2. The SAE's encoder and decoder matrices are themselves huge, since each latent's encoder and decoder vectors live in parameter space.
3. Parameter updates might not be reconstructable as sparse combinations of dictionary elements.
To mitigate Issue 1, one could focus on a small subset of model parameters. One could also train the SAE at the same time as RL training itself, storing only a small buffer of the most recent parameter updates.
L3D cleverly deals with Issue 2 using low-rank parametrizations for the encoder and decoder of the SAE.
Issue 3 may turn out not to occur in practice. Mukherjee et al (2025) find that, when post-training an LLM with RL, parameter updates are very sparse in the neuron basis. Intuitively, many expect that such post-training merely boosts/suppresses a small number of pre-existing circuits. So perhaps we will find that, in practice, one can reconstruct parameter updates with reasonable sparsity.
All of this is to say: maybe someone should try training an SAE/applying L3D to RL parameter updates. And, if they feel like it, they could call the resulting feature vectors "shards".
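If someone does want to try this, here is a minimal sketch of the kind of setup I have in mind: a vanilla TopK SAE trained on flattened parameter-update vectors. Everything here (names, sizes, hyperparameters) is a placeholder, and a realistic attempt would need the mitigations above (a parameter subset, or L3D-style low-rank encoders/decoders).

```python
import torch
import torch.nn as nn

class UpdateSAE(nn.Module):
    """Sketch: a TopK SAE whose training data are RL parameter-update vectors."""
    def __init__(self, n_params: int, n_latents: int, k: int):
        super().__init__()
        self.enc = nn.Linear(n_params, n_latents)
        self.dec = nn.Linear(n_latents, n_params, bias=False)
        self.k = k

    def forward(self, delta_theta: torch.Tensor) -> torch.Tensor:
        pre = self.enc(delta_theta)                      # latent preactivations
        topk = torch.topk(pre, self.k, dim=-1)
        acts = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values.relu())
        return self.dec(acts)                            # reconstructed update

def train_sae(updates: torch.Tensor, n_latents: int = 4096, k: int = 32) -> UpdateSAE:
    """`updates` is a (n_steps, n_params) buffer of flattened RL parameter updates."""
    sae = UpdateSAE(updates.shape[1], n_latents, k)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    for _ in range(1_000):
        batch = updates[torch.randint(len(updates), (64,))]
        loss = (sae(batch) - batch).pow(2).mean()        # reconstruction loss
        opt.zero_grad(); loss.backward(); opt.step()
    return sae  # each column of sae.dec.weight is a candidate "shard" direction
```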
Some issues with the ICL Superposition paper
Setup:
Xiong et al (2024) show that LLMs can in-context-learn several tasks at once. Consider e.g. the following prompt:
France -> F
Portugal -> Portuguese
Germany -> Berlin
Spain -> Madrid
Russia -> R
Poland -> Polish
Italy ->
A model will complete this prompt sometimes with `Rome`, sometimes with `I`, and sometimes with `Italian`, learning a "superposition" of the `country -> capital`, `country -> first-letter` and `country -> language` tasks. (I wish they hadn't used this word: the mech interp notion of superposition is unrelated.)
Let $p_i$ be the proportion of the in-context examples that correspond to task $i$, and let $q_i$ be the probability that the model completes the prompt according to task $i$. Call a model calibrated if $q_i = p_i$: the probability it assigns to a task is proportional to the number of times the task appeared in-context. A measure of the degree of calibration is given by the KL-divergence $D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}$.
Lower KL means better calibration.
Issue 1:
The authors show that, for a certain set of 6 tasks, and with 60 in-context examples, calibration improves with model size.
The best reported KL between the uniform distribution [0.167, 0.167, ..., 0.167] and the model's output distribution is around 1.5. To give a sense for what this means, here are four examples of distributions with such a KL:
[Figure: four example distributions over the 6 tasks, each with KL ≈ 1.5 from uniform.]
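For a rough sense of scale (my own illustrative distribution, not one of the paper's; assuming natural log and the $D_{\mathrm{KL}}(p \,\|\, q)$ direction above):

```python
import math

def kl(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [1 / 6] * 6
# A model that follows one task 90% of the time, and each of the other five
# tasks 2% of the time, already sits at roughly the best reported KL:
lopsided = [0.90, 0.02, 0.02, 0.02, 0.02, 0.02]
print(kl(uniform, lopsided))  # ~1.49
```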
Seeing these distributions convinced me that the models they studied are not well-calibrated. I felt a bit misled by the paper.
Issue 2:
Say we take turns with the model, repeatedly appending a new "question" (e.g. a country) to the prompt and allowing the model to generate an "answer" (e.g. the country's capital, language, or first letter) at default temperature 1.0. The authors state:
After the first token is generated, the model tends to converge on predicting tokens for a single task, effectively negating its ability for multi-task execution.
But this sort of mode-collapse behavior is not what we would observe from a well-calibrated model!
For simplicity, consider the case of only two tasks: $A$ and $B$. Let $a$ and $b$ be the number of examples of each task, prior to a given generation step. A calibrated model generates $A$ with probability $\frac{a}{a+b}$, in which case $a \mapsto a+1$. Otherwise, it generates $B$, and $b \mapsto b+1$.
This is precisely the Pólya urn model. If the numbers of each task are initially $a_0$ and $b_0$, and we then generate for a long time, it can be shown that the limiting proportion of task $A$ is a random variable distributed as $\mathrm{Beta}(a_0, b_0)$, i.e. with density proportional to $x^{a_0 - 1}(1-x)^{b_0 - 1}$.
In the realistic setting where $a_0, b_0 \gg 1$, this distribution is peaked around $\frac{a_0}{a_0 + b_0}$, i.e. roughly the proportion of the initial tasks that were $A$, and goes to zero at $x = 0$ and $x = 1$. So calibrated models don't collapse; they tend to maintain roughly the initial ratio of tasks.
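A quick simulation of the urn dynamics (a sketch; the initial counts are just for illustration) makes the no-collapse point concrete:

```python
import random

def simulate_urn(a0: int, b0: int, steps: int = 10_000) -> float:
    """Pólya urn: generate task A with probability a/(a+b), then increment its count."""
    a, b = a0, b0
    for _ in range(steps):
        if random.random() < a / (a + b):
            a += 1
        else:
            b += 1
    return a / (a + b)  # final proportion of task A

# Starting from 30 examples of each task, the final proportion stays near 0.5
# (its distribution approaches Beta(30, 30)) rather than collapsing to 0 or 1:
samples = [simulate_urn(30, 30) for _ in range(1_000)]
print(min(samples), max(samples))
```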
My intuition, at least for the single-word case, is that saying a word increases the probability the model will say it in future. Therefore, if it wants the probability that it says the word in the future to be low, it's incentivised not to say it anywhere.
I share your intuition, and think you are describing what we called "conditional spillover".[1] The fact that the Mind+Face intervention did not fully prevent spillover in our experiments is some evidence that conditional spillover is occurring.
One might mitigate conditional spillover as follows. In REINFORCE, the usual parameter update is (setting batch size = lr = 1 for brevity, and being a little schematic):
$$\Delta\theta = R \,\nabla_\theta \big[\log \pi_\theta(\text{CoT}) + \log \pi_\theta(\text{output} \mid \text{CoT})\big].$$
Here, $R$ is the reward, which depends only on the output. In the case where $R$ is the sum of a task-completion term $R_{\text{task}}$ and an obfuscation-inducing term $R_{\text{obf}}$ (as in our current work), we can write:
$$\Delta\theta = (R_{\text{task}} + R_{\text{obf}}) \,\nabla_\theta \big[\log \pi_\theta(\text{CoT}) + \log \pi_\theta(\text{output} \mid \text{CoT})\big].$$
In order to mitigate conditional spillover, we can simply modify the gradient update by removing the $R_{\text{obf}} \,\nabla_\theta \log \pi_\theta(\text{CoT})$ term:
$$\Delta\theta = R_{\text{task}} \,\nabla_\theta \big[\log \pi_\theta(\text{CoT}) + \log \pi_\theta(\text{output} \mid \text{CoT})\big] + R_{\text{obf}} \,\nabla_\theta \log \pi_\theta(\text{output} \mid \text{CoT}).$$
If this mitigation is used in combination with Mind+Face, then the Mind is not reinforced for producing CoTs which increase the probability of output obfuscation. It is only reinforced for task completion.
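In code, the mitigation amounts to masking out the CoT log-probs when applying the obfuscation-inducing part of the reward. A schematic sketch (with the same batch-size-1 simplifications as above; the loss below has the modified update as its negative gradient):

```python
import torch

def modified_reinforce_loss(logprobs: torch.Tensor,      # per-token log pi_theta
                            is_cot_token: torch.Tensor,  # bool mask over tokens
                            r_task: float,
                            r_obf: float) -> torch.Tensor:
    """R_task reinforces CoT and output tokens; R_obf reinforces output tokens only."""
    cot_lp = (logprobs * is_cot_token).sum()        # log pi(CoT)
    out_lp = (logprobs * ~is_cot_token).sum()       # log pi(output | CoT)
    # The r_obf * cot_lp term is deliberately absent: that is the
    # conditional-spillover term removed above.
    return -(r_task * (cot_lp + out_lp) + r_obf * out_lp)
```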
Side note: I don't love our "conditional spillover" and "parametric spillover" terminology, since they sound like different observed effects, whereas they are really two different (proposed) mechanisms underlying the same observed effect. If brevity weren't an issue, I'd call them "spillover due to later tokens conditioning on earlier ones" and "spillover due to parameter sharing between the models generating CoT and output", or something.
IIRC, When We Cease... was part fictional, but presented itself as fact. I remember being pretty annoyed when I eventually realized this (though possibly it was obvious all along and I was just being dumb).
Mind/Face requires more compute than generating with a single model, since 1) one must keep both sets of model weights in memory, and 2) when the face "takes over" generation, it must recompute the entire KV-cache up to that token.
However, Mind/Face takes no more than 2x the compute compared to standard generation (since you can always use a naive approach: just run forward passes of both models in parallel at each generation step, and choose which model's predicted token to append conditional on whether we are in thinking or output mode). I expect this upper bound to be loose. We haven't thought much about optimizations yet, but as Luke mentioned, using a smaller model for the Face (whose job is easier) seems reasonable.
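For concreteness, the naive 2x bound comes from a generation loop like this (schematic; `mind.sample_next` / `face.sample_next` are hypothetical interfaces, and a real implementation would use KV-caching):

```python
def generate_naive(mind, face, tokens: list, max_new: int, in_output_mode) -> list:
    """Run both models every step and keep whichever token is 'in charge'.
    Each step costs one forward pass per model, hence at most 2x a single model."""
    for _ in range(max_new):
        mind_tok = mind.sample_next(tokens)
        face_tok = face.sample_next(tokens)
        tokens.append(face_tok if in_output_mode(tokens) else mind_tok)
    return tokens
```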
Agreed. Thanks for the useful clarification.
In future work, we may try to produce a demo where optimizing only the output leads to steganographic CoT. But I expect this to be fairly challenging on a practical level (evidence: Roger and McCarthy et al had difficulty inducing steganography via RL. But note that I'm not up to date on this literature). We'd appreciate suggestions for RL environments where producing such a demo may be feasible.
It may be possible to massively reduce memory usage in sparsely-connected mode.
Let $B$ be the batch size, $K$ the number of active latents per dictionary per token, and $M$ the number of latents per dictionary.
My current implementation of sparsely-connected mode has terrible memory usage, since each virtual weight matrix has $M^2$ elements. But how many of these virtual weights do we actually need to compute?
Upstream latents: On each token in the batch, we only need the virtual weights connecting to the active upstream latents.
Downstream latents: Strictly speaking, we should compute activations for every downstream latent, since we don't know in advance which will be active. But, insofar as vanilla mode closely approximates sparsely-connected mode, we should be okay to only compute virtual weights connecting to downstream latents that were active in vanilla mode.
So on each token, we only need to compute $K \times K$ virtual weights, and so the memory requirement is $O(BK^2)$, which is small.
Of course, this new approach loses something: sparsely-connected mode now relies on vanilla mode to tell it which latents should activate. So much for a standalone replacement model! I think a reasonable middle-ground is to only compute virtual weights to the (say) $K'$ latents with largest vanilla preactivation, for some $K'$ somewhat larger than $K$. Then compute sparsely-connected preactivations for all those latents, and apply TopK to get the activations. The memory usage is then $O(BKK')$, which is still small.
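Concretely, I imagine the gather looking something like this (a sketch, assuming the virtual weights between two dictionaries are given by upstream decoder times downstream encoder, with `W_dec_up` of shape (M, d) and `W_enc_down` of shape (d, M)):

```python
import torch

def needed_virtual_weights(W_dec_up: torch.Tensor,    # (M, d) upstream decoder
                           W_enc_down: torch.Tensor,  # (d, M) downstream encoder
                           up_idx: torch.Tensor,      # (B, K) active upstream latents
                           down_idx: torch.Tensor,    # (B, K') candidate downstream latents
                           ) -> torch.Tensor:
    """Compute only the (B, K, K') block of virtual weights actually needed,
    instead of the full (M, M) matrix."""
    up_rows = W_dec_up[up_idx]                    # (B, K, d)
    down_cols = W_enc_down[:, down_idx]           # (d, B, K')
    down_cols = down_cols.permute(1, 0, 2)        # (B, d, K')
    return torch.bmm(up_rows, down_cols)          # (B, K, K'): memory O(B*K*K')
```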
Related: https://arxiv.org/abs/2004.10802
Ah I hadn't noticed that, very nice. Great post!