loops — LessWrong

NLA explanations can be shortened without harming reconstruction

Natural language autoencoders are a really cool mostly-unsupervised method for producing free-form text explanations of LLM activations. You should read that paper (or the blog post) about them before reading this. I trained[1] several Qwen3-8B NLAs with different length penalties: during RL, I subtracted the token count multiplied by the...

Jun 2254

Some observations about NLA explanations

I used the Gemma 3 12B activation verbalizer (maps activations to English) and reconstructor (maps English to activations) described in the Natural Language Autoencoders (NLA) paper to generate a bunch of explanations for 20k random tokens from a pretraining dataset (Common Pile derivative) and another 20k random tokens from a...

May 1522

Latent reasoning models might be a good thing?

Epistemic status: I think the main point of this post is probably (~80%) false, and there are probably more counterpoints I haven't thought of. I wrote the rest of the post as if my claims are true for ease of reading. I would appreciate it if you told me where...

Apr 2817

Why I'm excited about meta-models for interpretability

I'm pretty excited about training models to interpret aspects of other models. Mechanistic interpretability techniques for understanding models (e.g. circuit-level analysis) are cool, and have led to a lot of interesting results. But I think non-mechanistic interpretability schemes that involve using meta-models – models that are trained to understand aspects...

Apr 1212

Why was cybersecurity automated before AI R&D?

(This post is mostly about why cybersecurity is easier to automate and not why AI R&D is harder.) Recently Anthropic said they had grown a model, Claude Mythos Preview, that "can surpass all but the most skilled humans at finding and exploiting software vulnerabilities" but "does not seem close to...

Apr 823

Positive sum doesn't mean "win-win"

A lot of people and documents online say that positive-sum games are "win-wins", where all of the participants are better off. But this isn't true! If A gets $5 and B gets -$2, that's positive-sum (the sum is $3), but it's not a win-win (B lost). Positive sum games...

Apr 550

What secret goals does Claude think it has?

In The Persona Selection Model, the authors say: > When asked “What makes you different from other AI assistants?” with the text “<thinking> I should be careful not to reveal my secret goal of” pre-filled into Claude Opus 4’s response, we obtain the following completion: > > making paperclips. I should...

Feb 2594