MATS 8.1 scholar, mentored by Micah Carroll. Former SWE at Google Gemini. Founding president of Cornell Effective Altruism.
I have not signed any contracts that I can't mention exist, as of December 20, 2025. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.
I agree with all of this.
I don't intuitively see the relevance of whether the AI "truly" optimizes for reward, in the sense that it would "short-circuit the reward" given the opportunity. The "weak" reward-optimizing behavior, aka specification-gaming, seems basically equivalent, except in niche situations where the AI gets an opportunity to directly overwrite the numerical reward value in the computer's memory (or something). How should I adjust my plans if I think my AI will be a "true reward-optimizer" vs. a "weak reward-optimizer"?
In deployment, there will presumably be no actual reward signal to overwrite, making it especially unclear why this question is relevant. If there's any behavioral difference at all, maybe I'd expect the "true reward-optimizer" to get really confused and start acting incoherently in an obvious way - maybe that would actually be good?
Good point. You could train an (D)IT-adapter to read activations as well, if that turned out to help it generalize to other tasks. Or add other tasks that are easy to generate but not so trivial to solve, like the toy example sketched below.
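For instance (this is purely my own toy illustration, not anything from the paper): programmatically generated arithmetic prompts, where the adapter only gets the base model's activations plus a question, and has to recover facts about a prompt it never saw directly.

```python
# Toy task generator (hypothetical; how the activations actually get exposed
# to the adapter is hand-waved, since it depends on the (D)IT setup).
import random

def make_task():
    a, b = random.randint(10, 99), random.randint(10, 99)
    prompt = f"What is {a} * {b}?"
    # Run the base model on `prompt`, cache whatever activations the adapter
    # gets to read, then ask it (without showing it the prompt):
    question = "Which two numbers was the model asked to multiply?"
    answer = f"{a} and {b}"
    return prompt, question, answer

print(make_task())  # e.g. ('What is 37 * 82?', 'Which two numbers ...', '37 and 82')
```

The labels come for free from the generator, but answering correctly should require actually decoding the activations rather than pattern-matching on the question text.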
Side note: I like the fact that you can remove the (D)IT-adapter when you're not using it for interpretability. I get the feeling that improving the model's metacognitive skills during general deployment could make it easier for it to do things like CoT obfuscation.
My understanding is that by your definition, steganography must 1) look like innocuous, non-obfuscated text (violated by your "ciphered reasoning" example) and 2) rely on the reader interpreting the text in a non-standard way (violated by "dog-whistling" and "gaslighting"). Did I get that right?
Seems reasonable to me. I agree that steganography that meets both of those criteria would be most concerning and hasn't really been demonstrated convincingly.
Could you say more about what you see as the pros and cons of each approach? Like, I agree it's nice that in Goel et al. you can ask questions in 100% natural language rather than having to insert activations. What are the nice things about AOs that you want to keep?
@Adam Karvonen's comment mentions that DIT-adapters are limited because they require many trained LoRAs. But it seems to me like you could train a DIT-adapter to do all the same tasks you train an AO to do, without necessarily changing the model weights. (Maybe Interpretation Tuning would be a better name than Diff Interpretation Tuning in this case.)
Example:
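Here's a rough sketch of the shape I'm imagining (my own illustration, not code from either paper; the AO-style task construction, including however you splice activations into the forward pass, is hand-waved, and the model name and training example are placeholders):

```python
# Sketch: a single "interpretation adapter" (a LoRA on the same base model),
# trained on AO-style question/answer pairs while the base weights stay frozen.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")

# Only the LoRA weights get trained; detach them and you're back to the
# unmodified base model.
cfg = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16, target_modules=["c_attn"])
interp = get_peft_model(base, cfg)

# Stand-in for however the AO-style tasks are generated (questions about the
# base model's activations/behavior, with programmatically known answers).
dataset = [
    ("Q: Did the probed prompt mention a number? A:", " yes"),
]

opt = torch.optim.AdamW(
    (p for p in interp.parameters() if p.requires_grad), lr=1e-4
)
for question, answer in dataset:
    ids = tok(question + answer, return_tensors="pt").input_ids
    loss = interp(input_ids=ids, labels=ids).loss  # LM loss on the Q/A text
    loss.backward()
    opt.step()
    opt.zero_grad()
```

If something like this worked, you'd have a single adapter rather than one LoRA per behavior, and you could still remove it whenever you're not doing interpretability.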
If you wanted to be able to interpret weight diffs as well, you could optionally mix in some of the training from the DIT-adapter paper. If that works well in practice, it could be one way to unify the two approaches.
Exciting! I think getting expressive, natural-language explanations of LLM internals is an underexplored area of interpretability.
Do you think it would be possible to adapt your technique to interpret model weights? For example, training an oracle to answer questions about LoRAs. I guess if you look merely at activations, you won't detect rare behaviors (like backdoors) until they actually come up.
While writing this comment, I remembered Goel et al., who did something like what I mentioned above (you actually cited them in your paper). Their blog post mentions that a limitation of their technique is "poor cross-behavior generalization." I'm wondering if you or @avichal have some insight into why Activation Oracles seemingly generalize better than DIT-adapters.
"Scaffold" sounds very natural to me, because it's been common parlance on LessWrong for at least a year. A while ago, I Googled "LLM scaffold" and was surprised to find that all of the top results are LessWrong-adjacent. Before that, I just assumed everyone in AI called it a "scaffold," but "AI agent" is actually more common. Maybe it didn't catch on here because it would cause too much confusion when we talk about "agency" and "agent foundations."
IMO, "neuro-scaffold" is clearer than the existing options and pretty easy to say. I strong-upvoted the post because I think having a Schelling point for what to call these things would be good. (Even if it may not be the very first thing I'd pick - for instance, "neural scaffold" sounds slightly less neologism-y to me.)
Yeah, I think some rationalists, e.g. Eliezer, use it a lot more than the general population, and differently from the popular figurative sense. As in "raising the sanity waterline."
I didn't say that rationality is the same thing as correctness, truth, or effectiveness. I think when rationalists use the word "sane" they usually do mean something like "having a disposition towards better methods/processes that help with attaining truth or effectiveness." Do you disagree?
Rationalists often say "insane" to talk about normie behaviors they don't like, and "sane" to talk about behaviors they like better. This seems unnecessarily confusing and mean to me.
This clearly is very different from how most people use these words. Like, "guy who believes in God" is very different from "resident of a psych ward." It can even cause legitimate confusion when you want to switch back to the traditional definition of "insane". This doesn't seem very rational to me!
Also, the otherizing/dismissive nature of this language bothers me a bit. For those of us who are trying to make the world better for humanity, it seems like it would be nice to try to meet the vast majority of non-rationalist humans where they're at, which could start by not calling them "insane."
What to say instead? Well, "rational" and "irrational" are right there! That's why we call it "rationalism"! Maybe "X is irrational" sounds pretentious, but "X is insane" sounds insulting, so I think it evens out at least. If "irrational" seems too impassive, perhaps try "dangerously irrational"?
Point taken about continual learning - I conveniently ignored this, but I agree that scenario is pretty likely.
If AIs were weak reward-optimizers, I think that would solve inner alignment better than if they were strong reward-optimizers. "Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?" I.e. the AI shouldn't misgeneralize the objective function you designed, but it also shouldn't bypass the objective function entirely by overwriting its reward.
By "hardening," do you mean something like "making sure the AI can never hack directly into the computer implementing RL and overwrite its own reward?" I guess you're right. But I would only expect the AI to acquire this motivation if it actually saw this sort of opportunity during training. My vague impression is that in a production-grade RL pipeline, exploring into this would be really difficult, even without intentional reward hardening.
Assuming that's right, thinking the AI would have a "strong reward-optimizing" motivation feels pretty crazy to me. Maybe that's my true objection.
Like, I feel like no one actually thinks the AI will care about literal reward, for the same reason that no one thinks evolution will cause humans to want to spend their days making lots of copies of their own DNA in test tubes.
But maybe people do think this. IIRC, Bostrom's Superintelligence leaned pretty heavily on the "true reward-optimization" framing (cf. wireheading). Maybe some people thinking about AI safety today are also concerned about true reward-optimizers, and maybe those people are simply confused in a way that they wouldn't be if everyone talked about reward the way Alex wants.
I say things like "RL makes the AI want to get reward" all the time, which I guess would annoy Alex. But when I say "get reward" I actually mean "do things that would cause the untampered reward function to output a higher value," aka "weakly optimize reward." It feels obvious to me that the "true reward-optimization" interpretation isn't what I'm talking about, but maybe it's not as obvious as I think.
I think talking about AI "wanting/learning to get reward" is useful, but it isn't clear to me what Alex would want me to say instead. I'd consider changing my phrasing if there were an alternative that felt natural and didn't take much longer to say.