jacob_drori

Comments (sorted by newest)

The Coding Theorem — A Link between Complexity and Probability
jacob_drori · 21d

Ah I hadn't noticed that, very nice. Great post!

The Coding Theorem — A Link between Complexity and Probability
jacob_drori · 21d

In case anyone else didn't know what it meant for a set of binary strings to be "prefix-free", here's Claude's explanation, which I found helpful:
 

A set of binary strings is prefix-free if no string in the set is a prefix of any other string in the set.

Example:

  • ✅ Prefix-free: {0, 10, 110, 111}
  • ❌ Not prefix-free: {0, 01, 10} (because "0" is a prefix of "01")

Why does this matter for Turing machines?

The key is in how universal Turing machines work. A universal machine U simulates any other machine T by receiving input of the form i′q, where:

  • i′ = prefix-free encoding of machine T's description
  • q = the actual input to feed to machine T

U must parse this concatenated input to figure out: "Where does the machine description end and the actual input begin?"

Without prefix-free encoding: Given input "00101", U can't tell if the machine description is "0", "00", "001", etc. - there's ambiguity.

With prefix-free encoding: Once U reads a complete machine description, it knows exactly where that description ends and the input begins. No ambiguity, no delimiters needed.

This unique parseability is essential for universal machines to correctly simulate other machines, and it's crucial for Kolmogorov complexity theory where we need to measure program lengths precisely without parsing ambiguity.
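
To make the unambiguous-parsing point concrete, here is a small illustrative check (my own sketch, not from the post):

```python
def is_prefix_free(codes):
    """Return True if no string in `codes` is a proper prefix of another."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

def parse_prefix_free(stream, codes):
    """Split the start of `stream` into a codeword plus the remainder.
    Because `codes` is prefix-free, at most one codeword can match,
    so the split is unambiguous and needs no delimiter."""
    for c in codes:
        if stream.startswith(c):
            return c, stream[len(c):]
    raise ValueError("stream does not start with any codeword")

print(is_prefix_free({"0", "10", "110", "111"}))  # True
print(is_prefix_free({"0", "01", "10"}))          # False: "0" is a prefix of "01"
print(parse_prefix_free("110001", {"0", "10", "110", "111"}))  # ('110', '001')
```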
 

jacob_drori's Shortform
jacob_drori · 1mo

Operationalizing the definition of a shard

Pope and Turner (2022) define a shard as follows:

A shard of value refers to the contextually activated computations which are downstream of similar historical reinforcement events.

To operationalize their definition, we must decide exactly what we mean by contextually activated computations, and by similar reinforcement events. One could philosophize here, but I'll just pick somewhat-arbitrary definitions and run with them, for the sake of quickly producing something I could turn into code.

Following Apollo/Goodfire, I will identify contextually activated computations with certain directions in parameter space. These directions might be found using APD, L3D, SPD, or some future method.

I am not sure when to consider two reinforcement events (i.e. RL reward-assignments) "similar". But if we replace "reinforcement events" with "RL updates", there is a natural definition:  cluster RL parameter updates, and call two updates similar if they are in the same cluster. A given update should be allowed to be in multiple clusters, and we suspect parameter updates enjoy some linear structure. Therefore, a natural clustering method is an SAE.[1] Then, a shard is a decoder vector of an SAE trained on RL parameter updates.

Annoyingly, parameter vectors are much larger than the activation vectors that SAEs are usually trained on, posing three potential issues:

  1. Storing enough parameter updates to train our SAE on takes a ton of memory.
  2. The SAE itself has a ton of parameters - too many to effectively train.
  3. It may be impossible to reconstruct parameter updates with reasonable sparsity (with L0∼100, say, like standard SAEs).

To mitigate Issue 1, one could focus on a small subset of model parameters. One could also train the SAE at the same time as RL training itself, storing only a small buffer of the most recent parameter updates.

L3D cleverly deals with Issue 2 using low-rank parametrizations for the encoder and decoder of the SAE.

Issue 3 may turn out not to occur in practice. Mukherjee et al (2025) find that, when post-training an LLM with RL, parameter updates are very sparse in the neuron basis. Intuitively, many expect that such post-training merely boosts/suppresses a small number of pre-existing circuits. So perhaps we will find that, in practice, one can reconstruct parameter updates with reasonable sparsity.

All of this is to say: maybe someone should try training an SAE/applying L3D to RL parameter updates. And, if they feel like it, they could call the resulting feature vectors "shards".
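
For concreteness, here is a minimal sketch of what that could look like, assuming a plain TopK SAE trained on flattened parameter-update vectors (all names and sizes here are hypothetical, not from any existing codebase):

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Plain TopK sparse autoencoder; each decoder row is a candidate 'shard'."""
    def __init__(self, d_params, n_latents, k):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_params, n_latents)
        self.decoder = nn.Linear(n_latents, d_params, bias=False)

    def forward(self, x):
        pre = self.encoder(x)
        topk = torch.topk(pre, self.k, dim=-1)
        acts = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values.relu())
        return self.decoder(acts), acts

def train_step(sae, opt, update_batch):
    """update_batch: (batch, d_params) tensor of flattened RL parameter updates,
    e.g. drawn from a small buffer of the most recent updates."""
    recon, _ = sae(update_batch)
    loss = (recon - update_batch).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

sae = TopKSAE(d_params=10_000, n_latents=4_096, k=32)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
# ... call train_step on batches of parameter updates during/after RL training ...
shards = sae.decoder.weight.T.detach()  # (n_latents, d_params): one candidate shard per latent
```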

  1. ^

    For each SAE latent, the corresponding cluster is the set of data points on which that latent is active.

  2. ^

    Shard theory is supposed to apply to humans, too, but here I focus on NNs since humans seem more complicated.

jacob_drori's Shortform
jacob_drori · 1mo

Some issues with the ICL Superposition paper

Setup:

Xiong et al (2024) show that LLMs can in-context-learn several tasks at once. Consider e.g. the following prompt:

France -> F
Portugal -> Portuguese
Germany -> Berlin
Spain -> Madrid
Russia -> R
Poland -> Polish
Italy ->

A model will complete this prompt sometimes with Rome, sometimes with I, and sometimes with Italian, learning a "superposition" of the country -> capital, country -> first-letter and country -> language tasks. (I wish they hadn't used this word: the mech interp notion of superposition is unrelated).

Let D_i be the proportion of the in-context examples that correspond to task i, and let P_i be the probability that the model completes the prompt according to task i. Call a model calibrated if P_i ≈ D_i: the probability it assigns to a task matches the proportion of in-context examples demonstrating that task. A measure of the degree of calibration is given by the KL-divergence:

KL(D || P) = Σ_i D_i log(D_i / P_i).

Lower KL means better calibration. 

Issue 1:

The authors show that, for a certain set of 6 tasks, and with 60 in-context examples, calibration improves with model size:
 

[Fig. 6 from the paper. X-axis labels show model's parameter count. D1 is uniform distribution, D2 has probability 0.5 on the third task and 0.1 on other tasks, D3 is a distribution with probabilities alternating between 0.25 and 0.083.]

The best reported KL between the uniform distribution [0.167, 0.167, ..., 0.167] and the model's output distribution is around 1.5. To give a sense for what this means, here are four examples of distributions with such a KL:

P_1 = [0.004, 0.004, 0.004, 0.33, 0.33, 0.33]

P_2 = [0.02, ..., 0.02, 0.9]

P_3 = [10^-5, 0.19999, ..., 0.19999]

P_4 = [0.004, 0.008, 0.016, 0.032, 0.2, 0.74].

Seeing these distributions convinced me that the models they studied are not well-calibrated. I felt a bit misled by the paper.
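
As a quick sanity check (mine, not the paper's), here is a small script reproducing these KL values; the elided entries are filled in so that each distribution has six components:

```python
import math

D = [1/6] * 6  # uniform distribution over the 6 tasks

P_examples = {
    "P_1": [0.004, 0.004, 0.004, 0.33, 0.33, 0.33],
    "P_2": [0.02, 0.02, 0.02, 0.02, 0.02, 0.9],
    "P_3": [1e-5, 0.19999, 0.19999, 0.19999, 0.19999, 0.19999],
    "P_4": [0.004, 0.008, 0.016, 0.032, 0.2, 0.74],
}

def kl(d, p):
    """KL(D || P) = sum_i D_i log(D_i / P_i)."""
    return sum(di * math.log(di / pi) for di, pi in zip(d, p))

for name, P in P_examples.items():
    print(name, round(kl(D, P), 2))  # each comes out at roughly 1.5
```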

Issue 2:

Say we take turns with the model, repeatedly appending a new "question" (e.g. a country) to the prompt and allowing the model to generate an "answer" (e.g. the country's capital, language, or first letter) at default temperature 1.0. The authors state:

After the first token is generated, the model tends to converge on predicting tokens for a single task, effectively negating its ability for multi-task execution.

But this sort of mode-collapse behavior is not what we would observe from a well-calibrated model!

For simplicity, consider the case of only two tasks: A and B. Let n_A and n_B be the number of examples of each task, prior to a given generation step. A calibrated model generates A with probability n_A / (n_A + n_B), in which case (n_A, n_B) ← (n_A + 1, n_B). Otherwise, it generates B, and (n_A, n_B) ← (n_A, n_B + 1).

This is precisely the Pólya urn model. If the numbers of each task are initially A_0 and B_0, and then we generate for a long time, it can be shown that the limiting proportion of A tasks is a random variable X distributed as follows:

P[X = x] ∝ x^(A_0 − 1) (1 − x)^(B_0 − 1).

In the realistic setting where A_0, B_0 > 1, this distribution is peaked around (A_0 − 1) / (A_0 + B_0 − 2), i.e. roughly the proportion of the initial tasks that were A, and goes to zero at x = 0 and x = 1. So calibrated models don't collapse; they tend to maintain roughly the initial ratio of tasks.
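
A quick simulation of this urn process (my own sketch; the initial counts and step budget are arbitrary) makes the point concrete:

```python
import random

def polya_urn_proportion(a0=3, b0=3, steps=10_000, seed=0):
    """Dynamics of a perfectly calibrated model over two tasks: at each step,
    task A is generated with probability n_A / (n_A + n_B), and the chosen
    task's count is incremented (a Pólya urn). a0, b0 play the role of A_0, B_0."""
    rng = random.Random(seed)
    n_a, n_b = a0, b0
    for _ in range(steps):
        if rng.random() < n_a / (n_a + n_b):
            n_a += 1
        else:
            n_b += 1
    return n_a / (n_a + n_b)

# Final proportions scatter around A_0 / (A_0 + B_0) rather than collapsing to 0 or 1.
print([round(polya_urn_proportion(seed=s), 2) for s in range(5)])
```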

Optimizing The Final Output Can Obfuscate CoT (Research Note)
jacob_drori · 1mo

My intuition at least for the single word case is that saying a word increases the probability the model will say it in future. Therefore, if it wants the probability that it says the word in the future to be low, it's incentivised not to say it anywhere.


I share your intuition, and think you are describing what we called "conditional spillover".[1] The fact that the Mind+Face intervention did not fully prevent spillover in our experiments is some evidence that conditional spillover is occurring.

One might mitigate conditional spillover as follows. In REINFORCE, the usual parameter update is (setting batch size = lr = 1 for brevity, and being a little schematic):

δθ=r(∇θlogpθ(CoTgCoT)+∇θlogpθ(output | CoT)gout).

Here, r is the reward, which depends only on the output. In the case where r is the sum of a task-completion and an obfuscation-inducing term (as in our current work), we can write:

δθ = (r_task + r_obfusc)(g_CoT + g_out).

In order to mitigate conditional spillover, we can simply modify the gradient update by removing the r_obfusc · g_CoT term:

(δθ)_modified = r_task g_CoT + (r_task + r_obfusc) g_out.

If this mitigation is used in combination with Mind+Face, then the Mind is not reinforced for producing CoTs which increase the probability of output obfuscation. It is only reinforced for task completion.
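
A minimal sketch of this modified update (mine, assuming the CoT and output log-probs have already been summed into scalars; not code from the paper):

```python
import torch

def modified_reinforce_loss(logp_cot, logp_out, r_task, r_obfusc):
    """logp_cot: sum of log-probs of the CoT tokens    (scalar tensor, requires grad)
    logp_out:  sum of log-probs of the output tokens   (scalar tensor, requires grad)
    r_task, r_obfusc: scalar rewards (no gradient flows through them).

    Standard REINFORCE would maximise (r_task + r_obfusc) * (logp_cot + logp_out).
    Dropping the r_obfusc * logp_cot term means the CoT is never reinforced for
    making an obfuscated output more likely."""
    objective = r_task * logp_cot + (r_task + r_obfusc) * logp_out
    return -objective  # minimise the negative objective with an optimiser of choice
```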

  1. ^

    Side note: I don't love our "conditional spillover" and "parametric spillover" terminology, since they sound like different observed effects, whereas they are really two different (proposed) mechanisms underlying the same observed effect. If brevity weren't an issue, I'd call them "spillover due to later tokens conditioning on earlier ones" and "spillover due to parameter sharing between the models generating CoT and output", or something.

Book Review: The MANIAC
jacob_drori · 1mo

IIRC, When We Cease... was part fictional, but presented itself as fact. I remember being pretty annoyed when I eventually realized this (though possibly it was obvious all along and I was just being dumb).

Optimizing The Final Output Can Obfuscate CoT (Research Note)
jacob_drori · 1mo

Mind/Face requires more compute than generating with a single model, since 1) one must keep both sets of model weights in memory, and 2) when the face "takes over" generation, it must recompute the entire KV-cache up to that token.

However, Mind/Face takes no more than 2x the compute compared to standard generation (since you can always use a naive approach: just run forward passes of both models in parallel at each generation step, and choose which model's predicted token to append conditional on whether we are in thinking or output mode). I expect this upper bound to be loose. We haven't thought much about optimizations yet, but as Luke mentioned, using a smaller model for the Face (whose job is easier) seems reasonable.
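
To illustrate where the 2x bound comes from, here is a schematic of that naive approach (hypothetical code, greedy decoding for brevity, assuming HuggingFace-style causal LMs; `still_thinking` is a placeholder for however one detects the end of the CoT):

```python
import torch

def naive_mind_face_generate(mind, face, ids, still_thinking, max_new_tokens=256):
    """Run both models on the full sequence at every step, then keep the Mind's
    token while still in the thinking segment and the Face's token afterwards.
    Neither model's KV-cache ever needs rebuilding, at the cost of always paying
    for two forward passes (hence the <= 2x bound)."""
    for _ in range(max_new_tokens):
        with torch.no_grad():
            mind_tok = mind(ids).logits[0, -1].argmax()
            face_tok = face(ids).logits[0, -1].argmax()
        next_tok = mind_tok if still_thinking(ids) else face_tok
        ids = torch.cat([ids, next_tok.view(1, 1)], dim=1)
    return ids
```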

Optimizing The Final Output Can Obfuscate CoT (Research Note)
jacob_drori · 1mo

Agreed. Thanks for the useful clarification.

In future work, we may try to produce a demo where optimizing only the output leads to steganographic CoT. But I expect this to be fairly challenging on a practical level (evidence: Roger and McCarthy et al had difficulty inducing steganography via RL. But note that I'm not up to date on this literature). We'd appreciate suggestions for RL environments where producing such a demo may be feasible.

Sparsely-connected Cross-layer Transcoders
jacob_drori · 2mo

It may be possible to massively reduce memory usage in sparsely-connected mode. 

Let B be batch size, K be num active latents per dictionary per token, and F be num latents per dictionary.

My current implementation of sparsely-connected mode has a terrible O(F²) memory usage, since each virtual weight matrix has F² elements. But how many of these virtual weights do we actually need to compute?

Upstream latents: On each token in the batch, we only need the virtual weights connecting to the K active upstream latents. 

Downstream latents: Strictly speaking, we should compute activations for every downstream latent, since we don't know in advance which will be active. But, insofar as vanilla mode closely approximates sparsely-connected mode, we should be okay to only compute virtual weights connecting to downstream latents that were active in vanilla mode.

So on each token, we only need to compute K² virtual weights, and so the memory requirement is B·K², which is small.

Of course, this new approach loses something: sparsely-connected mode now relies on vanilla mode to tell it which latents should activate. So much for a standalone replacement model! I think a reasonable middle ground is to only compute virtual weights to the 100×K (say) latents with largest vanilla preactivation. Then compute sparsely-connected preactivations for all those latents, and apply TopK to get the activations. The memory usage is then 100·B·K², which is still small.
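
A sketch of the memory-light computation (my own notation and shapes, not the actual repo code), assuming the virtual weight between two latents is, up to any intervening linear maps, the dot product of the downstream encoder direction with the upstream decoder direction:

```python
import torch

def sparse_virtual_weights(W_dec_up, W_enc_down, up_idx, down_idx):
    """W_dec_up:   (F_up, d_model)   decoder directions of the upstream dictionary
    W_enc_down: (F_down, d_model) encoder directions of the downstream dictionary
    up_idx:     (B, K_up)   active upstream latents on each token
    down_idx:   (B, K_down) candidate downstream latents on each token
                (those active in vanilla mode, or the top 100*K by vanilla preactivation)
    returns:    (B, K_down, K_up) virtual weights, instead of the full (F_down, F_up) matrix."""
    dec = W_dec_up[up_idx]      # (B, K_up, d_model)
    enc = W_enc_down[down_idx]  # (B, K_down, d_model)
    return torch.einsum("bid,bjd->bij", enc, dec)
```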

The Croissant Principle: A Theory of AI Generalization
jacob_drori · 2mo

Related: https://arxiv.org/abs/2004.10802
