MATS 8.1 scholar, mentored by Micah Carroll. Former SWE at Google Gemini. Founding president of Cornell Effective Altruism.
I have not signed any contracts that I can't mention exist, as of August 30, 2025. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.
I didn't say that rationality is the same thing as correctness, truth, or effectiveness. I think when rationalists use the word "sane" they usually do mean something like "having a disposition towards better methods/processes that help with attaining truth or effectiveness." Do you disagree?
Rationalists often say "insane" to talk about normie behaviors they don't like, and "sane" to talk about behaviors they like better. This seems unnecessarily confusing and mean to me.
This clearly is very different from how most people use these words. Like, "guy who believes in God" is very different from "resident of a psych ward." It can even cause legitimate confusion when you want to switch back to the traditional definition of "insane". This doesn't seem very rational to me!
Also, the otherizing/dismissive nature of this language bothers me a bit. For those of us who are trying to make the world better for humanity, it seems like it would be nice to try to meet the vast majority of non-rationalist humans where they're at, which could start by not calling them "insane."
What to say instead? Well, "rational" and "irrational" are right there! That's why we call it "rationalism"! Maybe "X is irrational" sounds pretentious, but "X is insane" sounds insulting, so I think it evens out at least. If "irrational" seems too impassive, perhaps try "dangerously irrational?"
The way they use the word "aligned" in that paper is very weird to me :P (they basically use it as a synonym for "instruction-following" or "post-trained").
But I feel like this method could actually be adapted for AI alignment/safety. It's kind of similar to my "incremental steering" idea, but instead of a strong untrusted model guiding a weak trusted model, there's a weak post-trained model guiding a strong base model. This also looks more practical than incremental steering, because it alternates between the weak model and the strong model, rather than giving the weak model a single token of guidance and then leaving it entirely on its own.
Here's one thing you could do: rather than using an off-the-shelf strong model, you train the strong model to get reward (making it untrusted), but the only way it can do so is by inserting tokens into the weak model's CoT. (The strong model decides when to insert a token, rather than doing so automatically whenever the weak model is uncertain.) Each token the strong model inserts carries a penalty, so it has to use as few tokens as possible. Unlike in the Nudging paper, the weak model might not be a base model; the important thing is that we trust the weak model not to be motivated to scheme, reward-hack, or otherwise be misaligned.
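To make this concrete, here's a minimal sketch of the decoding loop I have in mind, in Python. Everything here (`weak_next_token`, `strong_next_token`, `should_intervene`, `task_reward`, the penalty value) is a hypothetical placeholder, not anything from the Nudging paper:

```python
from typing import Callable, List, Tuple

def guided_generation(
    prompt: List[str],
    weak_next_token: Callable[[List[str]], str],    # trusted, post-trained model
    strong_next_token: Callable[[List[str]], str],  # untrusted, reward-trained model
    should_intervene: Callable[[List[str]], bool],  # strong model's decision to insert
    task_reward: Callable[[List[str]], float],      # reward for the final CoT/answer
    insertion_penalty: float = 0.1,
    eos: str = "<eos>",
    max_tokens: int = 512,
) -> Tuple[List[str], float]:
    """Generate a CoT mostly with the weak/trusted model, letting the
    strong/untrusted model insert occasional tokens at a per-token cost."""
    cot = list(prompt)
    n_inserted = 0
    for _ in range(max_tokens):
        if should_intervene(cot):
            token = strong_next_token(cot)   # strong model inserts a guidance token
            n_inserted += 1
        else:
            token = weak_next_token(cot)     # trusted model does most of the thinking
        cot.append(token)
        if token == eos:
            break
    # Reward used to train the strong model: task reward minus a penalty per
    # inserted token, so it learns to intervene as rarely as possible.
    reward = task_reward(cot) - insertion_penalty * n_inserted
    return cot, reward
```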
Safety benefits: Because the strong/untrusted model has to rely on the weak/trusted model to do most of its thinking for it, it should be harder for the strong model to do steganography or any kind of thinking that the weak model wouldn't naturally do. Auditing the strong model should also be easier, because you can look very carefully at the relatively few tokens it writes and think about how they influence the weak model.
I'd be interested in a longer writeup and/or whatever other experiments you want to try!
Thanks for the explanation of how thinking dot by dot works! That makes sense.
I actually think the dot-by-dot paper is evidence against LLMs being able to do most kinds of reasoning with filler tokens.
Yeah, I'm just reading the abstract of the paper, but it looks like it required very "specific" training on their problem to get it to work. I guess this is why you think it's evidence against? But maybe with algorithmic improvements, or due to a CoT where the only goal is getting higher reward rather than looking good to the user, LLMs will naturally gain the ability to use filler tokens effectively.
My intuition is that making the model less deep would make it non-linearly easier to interpret, since the last layers of the model are the most interpretable (they're closest to human-understandable tokens and embeddings), while the earlier layers are less interpretable (they're farther away).
Well, if some reasoning is inarticulable in human language, the circuits implementing that reasoning would probably be difficult to interpret regardless of which layer of the model they appear in. If there's reasoning that is articulable in human language, then presumably the model could just write that reasoning in CoT tokens. Steelman: certain reasoning is almost inarticulable (such that it's impossible to entirely represent it in CoT), but nonetheless pretty "close" to being articulable. If you reduce the number of layers, then the circuit for that reasoning will end up reusing some of the human-language circuitry, whereas otherwise the whole circuit would have been completely inarticulable.
The best reason I have for why reducing model size could nonlinearly improve interpretability is that there are dependencies between different parts of the network, and so the computation you'd need to understand all of these dependencies is O(n^2) or worse.
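As a rough back-of-envelope illustration (assuming the relevant dependencies are pairwise between the $n$ components you'd need to understand, e.g. layers or features):

$$\text{pairs}(n) = \binom{n}{2} = \frac{n(n-1)}{2} = O(n^2), \qquad \frac{\text{pairs}(n/2)}{\text{pairs}(n)} \approx \frac{1}{4},$$

so halving the number of components cuts the pairwise-dependency work by roughly 4x, and higher-order interactions would make the gap even larger.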
Anyway, this still seems pretty speculative. I think this argument would probably affect my actions very little or not at all - like, I still think going from CoT to neuralese would be bad and would want to discourage it. But if I had more understanding of mech interp I would feel less uncertain about it.
This is an interesting point. Off the cuff take:
I'm kind of bearish on mechanistic interpretability giving us very complete and satisfying answers about what's going on in models. Neel Nanda also seems to believe this. For the kind of stuff we can do today, like training SAEs or probes, it seems like the difficulty isn't that affected by model size. My vague impression is that scaling up to larger models is bottlenecked primarily on compute, but not that much compute.
Maybe if ambitious interpretability works out, a smaller model will be easier to interpret? But the effect doesn't obviously seem that large to me. Maybe if neuralese lets you make a 2x smaller model, this makes it more than 2x easier to interpret.
Also, it's unclear to me how much smaller you could make the model with neuralese, if at all. (Some) long-term dependencies can already be expressed in the CoT. Maybe many layers are required to express certain otherwise-inarticulable long-term dependencies? It seems plausible that most useful inarticulable thoughts only require short-term cognition. Maybe the part of the post about "picturing a geometric object and turning it around in one's mind for minutes at a time" is a counterexample to this.
If it's possible for the model to reuse the same parameters again and again to improve its own understanding, maybe it's already doing this by doing something like "thinking dot by dot" within its CoT. Actually, I started writing this and realized it probably proves too much, because it would mean that continuous thought is never better than thinking dot by dot. I'm showing my ignorance of the details of transformer architecture, but this is supposed to be off-the-cuff, so I'll leave it here for now.
Makes sense. Surely there were many cases in which our ancestors' "family and/or friends and/or tribe were facing extinction," and going insane in those situations would've been really maladaptive! If anything, the people worried about AI x-risk have a more historically-normal amount of worry-about-death than most other people today.
We mostly just post more informal stuff directly to LessWrong / Alignment Forum
@Naomi Bashkansky Would it make sense for OpenAI to crosspost their blog posts to LW/AF? I'd personally be more likely to see them if they did.
Oh yep, I forgot about the interp updates and other informal posts from GDM researchers - these seem to occupy the same niche I was thinking the blog could have. The idea seemed worth suggesting, for the simple reason that Anthropic and OpenAI had both done it, but in retrospect I buy your argument that the main benefit of a blog is to the company rather than the readers, and I retract my suggestion.
My impression of the Medium blog is based pretty much entirely on the one paper I was involved with (MONA), where I remember hearing something like "we're going to write a separate Medium post and a LessWrong post, and the Medium post will assume (relatively) less knowledge of the field." I'm not sure if my impression was correct, or if things have changed since then.
It looks like OpenAI is following Anthropic's lead, which is great!
Google DeepMind's alignment team also has a blog, but it's much more targeted towards laypeople, mostly shares papers as they come out rather than informal research, and IMO isn't as nice as a dedicated website or even a Substack. It might be worth considering doing something like this at GDM, subject to tradeoffs on researchers' time and Google's internal publication restrictions.
Yeah, I think some rationalists, e.g. Eliezer, use it a lot more than the general population, and differently from the popular figurative sense. As in "raising the sanity waterline."