I'm not claiming I know the perfect cut-off point between not losing information and not letting errors accumulate, if something like that even exists. It could very well be that after 3 forward passes with neuralese everything would still be mostly fine, or it could be that even in the middle of a single forward pass it makes sense to have some kind of mitigation (I think you could build an argument around MoE being one). But what this perfect ratio is doesn't really matter; the point is that recurrent forward passes will be orders of magnitude worse than a normal forward pass and therefore can't be worth it anymore.
Making everything discrete is one extreme, just as making everything continuous is the other. I'm arguing that the sweet spot lies somewhere in the middle, recognizing the importance of both rich, continuous representations and clear, discrete ones.
I'm not sure which passage you're referring to that makes my argument imply this. The sections "The Bandwidth Intuition/Counterargument" are supposed to clear exactly this up, stating roughly that I understand there is still obviously a loss of information, and as such it's nonsensical for a normal NN to have minuscule layers. That isn't an accurate assessment for neuralese LLMs though: they recursively accumulate this error, turning it into a much bigger problem. If it's simply allowed to grow through tens and hundreds of forward passes, it's not worth it. And if we do already tokenize them, quantization would no longer serve any real purpose.
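To make that intuition a bit more concrete, here's a quick toy simulation I threw together (purely illustrative - the dynamics, codebook and noise scale are all made up): the same small per-pass noise either gets carried forward in a continuous latent, or gets snapped away by re-discretizing against a codebook after every "pass".

```python
import numpy as np

rng = np.random.default_rng(0)
dim, steps, noise = 64, 200, 0.01
codebook = rng.normal(size=(512, dim))            # stand-in for a token embedding table
W, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # norm-preserving stand-in for "one forward pass"

def step(x):
    return W @ x                                  # one noiseless forward pass

def snap(x):
    # discretize: replace the latent with its nearest "token"
    return codebook[np.argmin(np.linalg.norm(codebook - x, axis=1))]

x0 = snap(rng.normal(size=dim))
clean_cont = clean_disc = noisy_cont = noisy_disc = x0
for t in range(steps):
    eps = rng.normal(scale=noise, size=dim)
    clean_cont, noisy_cont = step(clean_cont), step(noisy_cont) + eps        # neuralese: error carried forward
    clean_disc, noisy_disc = snap(step(clean_disc)), snap(step(noisy_disc) + eps)  # tokenized: error snapped away
    if (t + 1) % 50 == 0:
        print(t + 1,
              np.linalg.norm(noisy_cont - clean_cont),   # drift keeps growing, roughly like a random walk
              np.linalg.norm(noisy_disc - clean_disc))   # typically stays at 0, unless the noise flips a token
```

Obviously a caricature - the snapping throws away information, which is exactly the cost I'm acknowledging - but it shows which regime the noise lives in.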
I hope my position has become a little bit more clear.
No, for several reasons: For starters, quantization is normally done after training and is not present during training (mainly because it introduces a lot of gradient problems) - this is not comparable to the token distribution, which we incorporate during training and train on. (In other words, it can't take advantage of any possible benefits because the model was trained in an entirely different setting.)
Even more importantly, the error doesn't accumulate for the KV cache (the weights obviously can't, they're literally fixed): inspecting a token's KV cache at the i-th layer, it only carries the noise from layers 0 through i (any minor noise from before got removed, since the token was discrete at the beginning). It will in turn carry this noise to other tokens through the attention mechanism, and those tokens then still face the noise from layer i to the final layer. But that is just one forward pass worth of noise, not something we have to worry much about, since we're going to tokenize the output anyway, removing all minor noise. (In other words, my argument focuses on noise that grows and grows through autoregressive steps; this is just the noise of a single forward pass.)
Specifically as noted in "The Bandwidth Counterargument":
Having bottleneck layers in a normal neural network is nonsensical - when the "distance" is small, you should stay in high-dimensional, continuous latent space. By the time one forward pass finishes, noise hasn't yet grown enough to matter and tokenization can clean it up.
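To put the same point in back-of-the-envelope terms (my own framing, under the crude simplification that every continuous layer contributes a similar amount of noise): what matters is the longest chain of continuous transformations any piece of information passes through before it gets re-discretized into a token.

```python
def max_noisy_depth(num_layers: int, num_steps: int, neuralese: bool) -> int:
    """Longest chain of continuous (noisy) transformations before re-discretization."""
    if neuralese:
        # the raw latent is fed back every autoregressive step, so chains span steps
        return num_layers * num_steps
    # with discrete tokens, a KV entry written at layer i is only read at layer i of
    # later positions, so any path climbs the layer stack exactly once: one forward pass
    return num_layers

print(max_noisy_depth(num_layers=80, num_steps=1000, neuralese=False))  # 80
print(max_noisy_depth(num_layers=80, num_steps=1000, neuralese=True))   # 80000
```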
Ah, I think I might have slightly misunderstood the intent of your post's title and tried answering a different question: why LLM writing often seems shallow and bad, rather than why LLMs specifically seem biased toward a subset of stylistic devices.
I honestly don't use LLMs much to chat or write with, so my personal experience is rather limited. But I do find the point others made - that the post-training data distribution just isn't an accurate sample - convincing enough, just not particularly satisfying.
So, here are my thoughts on why not just RLHF but also SFT or DPO could, even with a perfect sample of training data, result in collapsed distributions of stylistic devices.
In the case of RLHF, we can go even further and assume the distillation of the training data went perfectly - the reward model isn't biased towards any stylistic devices but is a perfect representation of its training data.
Even then, the key problem is that the reward model only sees a single trace. This is important because it makes the reward model unable to determine whether the distribution of stylistic devices seen in the trace is simply a reasonable sample from the whole distribution or only a subset of it.
And because of constant optimization pressure, only mastering a few stylistic devices (just enough to fool the RM in a single trace) will quickly become the path forward.
Now what about something like SFT - after all, here we don't do any rollouts anymore. This does help. We can assume that, thanks to the unbiased loss, the distribution of stylistic devices the model produces on the training examples is pretty accurate. But that's the extent of the statements we can make: we were completely offline.
The traces during inference are very different from the training data: errors propagate during token generation, biases accumulate, and suddenly we are faced with only a subset of the training distribution - or worse, something not encountered at all. Assuming the distribution of stylistic devices will still be unbiased once it's conditioned on a completely different distribution of traces is wishful thinking at best.
Online training where you look beyond a single trace seems most promising. This can happen either by including things like logits (KL distillation - see this post for an idea that should also work well) or by simply incorporating multiple traces into the judgement of one: how diverse is this trace compared to the others generated (including its stylistic devices, for example)?
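Just to gesture at what "incorporating multiple traces into the judgement of one" could look like in practice, here's a hypothetical sketch (the trigram overlap measure and the penalty weight are arbitrary choices of mine, not anything from the literature): the base reward of a rollout gets taxed by how much its surface patterns overlap with its sibling rollouts for the same prompt.

```python
def ngrams(text, n=3):
    # crude surface fingerprint of a trace: its set of word trigrams
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    return len(a & b) / max(len(a | b), 1)

def diversity_adjusted_rewards(traces, base_rewards, penalty=0.5):
    # traces and base_rewards are parallel lists of rollouts for one prompt and their RM scores
    grams = [ngrams(t) for t in traces]
    adjusted = []
    for i, r in enumerate(base_rewards):
        # average surface overlap of trace i with all its sibling rollouts
        overlap = sum(jaccard(grams[i], grams[j])
                      for j in range(len(traces)) if j != i) / max(len(traces) - 1, 1)
        adjusted.append(r - penalty * overlap)  # leaning on the same phrasing as the siblings gets taxed
    return adjusted
```

The hope being that a trace which reaches for the same stylistic crutch as every one of its siblings scores worse than one that doesn't, even if the RM likes both in isolation.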
I think it might be that the undesired response in RLHF/DPO settings isn't good enough.
Imagine two responses, one leveraging stylistic devices and persuasive words while the other... well, just doesn't. Naturally the first is better and more desirable. If we now look across the whole training batch, the distinguishing feature becomes clear: the preferred responses leverage stylistic devices at pretty much any point. That is, a phrase like "It's not an X, it's a Y" will occur at a bunch of different positions throughout the positive examples, in contrast to the negative examples, which very rarely, if at all, showcase such pleasant phrasing.
But wouldn't such behavior - constantly repeating stylistic devices - then be exactly what we'd expect? This clear contrast between positive and negative examples is what we distill into the final model, basically telling it that stylistic devices are preferred at any point.
To move away from this, looking for better, higher-quality positive examples won't help at all - instead, we need the negative examples throughout training to become closer and closer to the positive ones, just like a writer progressing through their career: first learning about stylistic devices, then understanding when to use them meaningfully and when less is more, and finally fully mastering them. This contrast between a good and a really good writer needs to be captured better in the post-training data for something like DPO.
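If I wanted to check this empirically, I'd probably start with something as crude as the following (the phrase patterns and the dataset format are made up for illustration): measure how lopsided a given stylistic device is between chosen and rejected responses.

```python
import re

# a couple of hand-picked stylistic devices as regexes (illustrative only)
STYLE_PATTERNS = [
    r"it'?s not (an? )?\w+, it'?s (an? )?\w+",   # "It's not an X, it's a Y"
    r"\bnot only\b.*\bbut also\b",
]

def device_rate(responses):
    # fraction of responses that use at least one of the devices
    hits = sum(any(re.search(p, r, re.IGNORECASE) for p in STYLE_PATTERNS)
               for r in responses)
    return hits / max(len(responses), 1)

def style_gap(pairs):
    # pairs: [{"chosen": str, "rejected": str}, ...]
    chosen = device_rate([p["chosen"] for p in pairs])
    rejected = device_rate([p["rejected"] for p in pairs])
    return chosen - rejected  # large gap => "device = preferred" is the easiest lesson to learn
```

If that gap is large across the whole dataset, DPO can hardly learn anything other than "use the device everywhere"; shrinking it would mean rejected responses that also use the devices, just less judiciously.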
Do take this with a grain of salt - it's just a random theory I came up with while thinking about this for 10 minutes or so. I didn't check the empirical state of research on this hypothesis, but it does seem somewhat convincing to me, at least.
but I can’t explain why I believe that, because it gets into gory details of brain algorithms that I don’t want to talk about publicly, sorry.
Somewhat random, but I think I want to learn more about this field in general - from what I can tell, you didn't learn about it in a normal academic setting (like doing a neuroscience B.Sc.) either; any tips for good resources?
This isn't so much a question as just sharing some thoughts I had, but I would love to hear yours :) Let's imagine we are our own brain's optimizer. We just received a bad signal: we feel pain. Let's say we realized someone else is soon going to feel pain, so we feel pain. What could the optimizer do now? Well, there are only two things it can do:
1. Try to disconnect “she feels pain” from the concept of pain that then triggered pain in yourself
2. Try to disconnect your previous thoughts from arriving at “she feels pain”
You speak a lot to (1), explaining the symbol-grounding mechanism that continuously grounds it in ground truth, so the optimizer trying to move “she feels pain” away from its previous position in feature space won't work (at least as long as we continuously receive such ground-truth input - this sheds light on the very immoral but very interesting experiment of having an individual not exposed to such input for long periods, like not seeing any human face for multiple months, be it in person, in pictures or on a phone. There, this theory should predict that such a move in feature space could happen and would be successful - to be dramatic, you become a psychopath).
You don't speak much to (2) though. One option here, for example, would be to unlearn the concept of “future” - babies only gradually learn it, so it's reasonable to assume it could be unlearned again. Luckily, this doesn't seem to happen, so there must be some opposing force, something that promises reward if the concept persists.
Specifically, this concept must offer you insight into your actions such that your future expected reward rises. This is obvious in this case - without the concept of “future”, you can hardly make any intelligent decisions at all. But it also carries over to much more specific and even human-invented associations/knowledge:
Let's say you work in cybersecurity, and the reason you think this person will feel pain is that those cybersecurity skills enabled you to make an association a normal person wouldn't. The optimizer could try to unlearn these skills, but those skills actually lead to higher expected reward, else you wouldn't be pursuing them: be it the nice house you can afford, the social status you enjoy because of them, or simply the joy you get from exercising them.
In other words, anything you learned, you learned because you assumed it would result in a higher expected reward, and anything you act out (after learning), you do because it results in a higher expected reward. Forgetting these concepts will require at least a reward matching theirs.
This doesn't imply it should be impossible though - let's say you learned something that you hate, say chiseling stone. You did this because the market paid insane wages, since only a few could do the job, so the reward you saw attached to those wages was immense and you pushed through the boring education of becoming an expert stone chiseler. And once you got there, you realize you weren't the only one with the idea: wages drop quicker than the average pump-and-dump crypto coin. In fact, the profession you practiced before, which you intrinsically enjoy, even pays better.
As I'm writing this, I realize there are no good stories for why chiseling stone might give you a better glimpse into someone's future pain, but let's just take it for granted. Then the reward attached to the knowledge of chiseling stone is pretty much zero, maybe even negative, because whenever you recall it, you recall all the effort that didn't pay off.
Yet I have never heard of something along these lines happening. It would be quite a great mechanism for the free market though - the wages would jump right back up. Let's hope our individual in question doesn't once again set out to learn to chisel stone, having completely forgotten this tale of unreciprocated effort.
You could maybe argue something like: precisely the things that fall in this category are things we gave up on, that is, their occurrence in our day-to-day life is incredibly rare. Therefore, with a normal learning rate, we simply wouldn’t iterate over them often enough to forget them meaningfully.
Lastly, just for completeness: naturally, ‘disconnecting your previous thoughts from arriving at “she feels pain”’ also entails your previous actions - it's a very special occurrence to know somebody will feel pain in the future unless you had a hand in it yourself. Naturally, those earlier decisions will be optimized on as well, hopefully leading you to make better decisions in the future.
Think of it as vaguely like I-am-juggling versus you-are-juggling.
Here, I can see how they would overlap to a reasonable degree - I don't think this easily carries over to emotions, though. Emotions at least feel like this weird, distinct thing, such that any statement along the lines of "I'm happy" does them injustice. Therefore I can't see it carrying over to "She's happy"; their intersection wouldn't be robust enough to avoid falsely triggering on actually unrelated things. That is, "She's happy" ≈ "I'm happy" ≉ experiencing happiness.
Facial cues (as one example; it makes sense that there would be other things, like higher-pitched voices when enjoying oneself, etc.) eliminate this problem because, as opposed to something introspective being the link, a more objective state of mind, like "He's sad", will be the learned link.
This might sound like I'm being unnecessarily picky, but imo these associations need to be very exact, else humans would be reward-hacking all day: it's reasonable to assume that the activations of thinking "She's happy" are very similar to those of trying to convince oneself internally that "She's happy", even while 'knowing' the truth. But if both resulted in big feelings of internal happiness, we would have a lot more psychopaths.
Regarding micro-expressions specifically, it's definitely not a hill I want to die on; it just kind of popped into my mind as I was writing about facial cues, and by micro I really mean 'micro micro' - e.g. smiles that aren't perfectly symmetrical for a quarter of a second, something I at least can't really pick up on. What would be their evolutionary advantage if they didn't at least have some kind of subconscious effect on conspecifics? But yeah, if you can't consciously pick up on it, linking the two is pointless or even bad.
I read the linked post only roughly, and since I've read neither so far, I probably can't relate too well to it. It seems reasonable (or honestly, obvious) though that it's a mix rather than either of those extreme statements.
Let me preface this by saying how much I enjoyed reading this post - it really shows that this isn't some random idea you had but something you've thought about a lot. As someone whose first introduction to this kind of idea was precisely this blog post: thanks.
A question - maybe I'm simply misunderstanding you:
- You seem to assume that the cortex's modelling of one's own happiness is very similar to its modelling of thinking about happiness. You might argue that it's only the "concept of happiness", which I would agree is present in both scenarios, but it isn't clear to me why that in particular would be learned via this supervised mechanism.
- Building on that point, I think it might be more probable that understanding another's feelings is part of 1A - instead of simply seeing, hearing, etc., there would be something tasked with analyzing facial cues. In particular, humans exhibit micro-expressions (expressions that last very short periods and are almost impossible to control), something most people can't seem to pick up on, at least consciously. So why do we have them if other people can't pick up on them? Maybe they can, but only subconsciously, precisely to facilitate this symbol grounding for somebody else's feelings. Then again, if you can't consciously pick up on it, the target for the supervision will probably be terrible as well, so maybe that's not it.
(I'll probably hammer you with more questions down the line, still trying to process all of this lol)