I’m not making any claims about what the “interpretability” system is. It can be any system whatsoever whose input is activations and whose output is one or more numbers. The “system” could be a linear probe. Or the “system” could be a team of human researchers who pause the model after every forward pass, scrutinize the activation state for a week, and then output a “this activation state represents scheming” score from 0 to 10. (That’s not a practical example, because if you pause for a week on each forward pass then the training would take a zillion years. But in principle, sure!) Or the “system” could be something even more exotic than that. The “system” can be anything at all, it doesn’t matter for this post. I’m just saying that, regardless of what that system is, if you use its outputs to help determine the reward signal, then this post will hopefully help you think about the eventual consequences of doing that, and in particular whether gradient descent will be working to manipulate and undermine that “system”.
If you’re thinking that there isn’t a sharp line between an ML model with an “interpretability system” wrapped around it that has a numerical output (e.g. a linear probe), and an ML model with an auxiliary “output head”, then yeah, that’s true. They’re two ways of thinking about the same thing.
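To make that equivalence concrete, here’s a minimal sketch (PyTorch, with made-up module names and dimensions): the same linear map can be read either as an external probe applied to the activations, or as an auxiliary output head of the model itself.

```python
# Minimal sketch (hypothetical names/dimensions): a linear probe reading a
# model's activations is the same mathematical object as an auxiliary
# "output head" attached to that model.
import torch
import torch.nn as nn

hidden_dim = 512
body = nn.Sequential(nn.Linear(64, hidden_dim), nn.ReLU())  # stand-in for the model

# View 1: an external "interpretability system" (here, just a linear probe).
probe = nn.Linear(hidden_dim, 1)

# View 2: the same weights, packaged as an auxiliary output head.
class WithAuxHead(nn.Module):
    def __init__(self, body, head):
        super().__init__()
        self.body, self.head = body, head

    def forward(self, x):
        acts = self.body(x)
        return acts, self.head(acts)  # activations + scalar "score"

x = torch.randn(8, 64)
score_as_probe = probe(body(x))                 # probe applied from the outside
_, score_as_head = WithAuxHead(body, probe)(x)  # same number, viewed as a head
assert torch.allclose(score_as_probe, score_as_head)
```

Either way, if that score helps determine the reward signal, the “probe vs. head” framing doesn’t change whether training pressure ends up acting on the very activations that the score is computed from.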
One thing you can maybe do is throw such accusations right back: “You say I’m being closed-minded to you, but aren’t you equally being closed-minded to me?”
It comes across as escalatory, and might be counterproductive, but I’ve also sometimes found it helpful. Depends a lot on the person and situation.
Thanks!! Quick question while I think over the rest:
What data are you plotting? Where exactly did you get it (i.e., what references)?
And why is the 2021 one better than the 2023 ones? Normally we would expect it to be the other way around, right? Does DeepMind have so much secret sauce that it’s worth more than 2 years of public knowledge? Or are the other two groups making rookie mistakes? Or am I misunderstanding the plot?
Isn't it just the case that the human brain's 'interpretability technique' is just really robust? The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life.
I don’t think it’s that robust even in humans, despite the mitigation described in this post. (Without that mitigation, I think it would be hopeless.)
If we’re worried about a failure mode of the form “the interpretability technique has been routed around”, then that’s unrelated to “The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life”. For the failure mode that Yudkowsky & Zvi were complaining about, if that failure mode actually happened, there would still be an accurate model. It would just be an accurate model that is invisible to the interpretability technique.
I.e. the beliefs box would still be working fine, but the connection to desires would be weak or absent.
And I do think that happens plenty in the human world.
Maybe the best example (at least from my own perspective) is the social behavior of (many) smart autistic adults. [Copying from here:] The starting point is that innate social reactions (e.g. the physiological arousal triggered by eye contact) are so strong that they’re often overwhelming. People respond to that by (I think) relating to other people in a way that generally avoids triggering certain innate social reactions. This includes (famously) avoiding eye contact, but I think also includes various hard-to-describe unconscious attention-control strategies. So at the end of the day, neurotypical people will have an unconscious innate snap reaction to (e.g.) learning that someone is angry at them, whereas autistic people won’t have that snap reaction, because they have an unconscious coping strategy that avoids triggering it, one they’ve used since early childhood because the reaction is so unpleasant. Of course, they’ll still understand perfectly well, on an intellectual level, that the person is angry. As one consequence, autistic people (naturally) have trouble modeling how neurotypical people will react to different social situations, and conversely, neurotypical people will misunderstand and misinterpret the social behaviors of autistic people.
Still, socially-attentive smart autistic adults sometimes become good (indeed, sometimes better than average) at predicting the behavior of neurotypical people, if they put enough work into it.
(We can form predictive models of other people just using our general ability to figure things out, just like we can build predictive models of car engines or whatever.)
That’s just one example. I discuss other (maybe less controversial) examples in my Sympathy Reward post §4.1 and Approval Reward post §6.
Belated update on that last point about “algorithmic progress” for LLMs: I looked into this a bit and wrote it up at: The nature of LLM algorithmic progress. The last section is about how it relates to this post, with the upshot that I stand by what I wrote in the OP.
I added a caption to the picture. Does that help?
Yeah I agree with that. I have a diagram of homeostatic feedback control in §1.5 of my SMTM reply post, and RL is one of the ingredients (d & f).
I know very little about this topic, but I was under the impression that there was more to it than “KV cache: yes or no?”, and I was trying to refer to that whole category of possible improvements. E.g. here’s a paper on “KV cache compression”.
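For what it’s worth, here’s a toy sketch of the baseline mechanism (numpy, made-up dimensions, not any particular paper’s method): a plain KV cache stores each past token’s keys and values so they aren’t recomputed on every decoding step, and “KV cache compression” and related ideas are about shrinking or approximating that stored state.

```python
# Toy sketch of plain KV caching during autoregressive decoding
# (single head, made-up dimensions).
import numpy as np

d = 16
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

def attend(q, K, V):
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

k_cache, v_cache = [], []
for step in range(5):
    x = np.random.randn(d)        # embedding of the newly generated token
    q = x @ Wq
    k_cache.append(x @ Wk)        # only the new token's K and V get computed...
    v_cache.append(x @ Wv)
    out = attend(q, np.stack(k_cache), np.stack(v_cache))  # ...older ones are reused from the cache
```

“KV cache compression” would then be about making k_cache / v_cache cheaper to store (e.g. quantizing or evicting entries), which is the kind of thing I had in mind by “that whole category of possible improvements”.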
Thanks for the feedback!
I would also include improvements from synthetic training data as an algorithmic improvement, not a data-related improvement, because better synthetic training data is created by better algorithms…
I have now changed the text in a few places to better clarify how I am defining the scope of §1.1.
I feel like you’re maybe reading in some subtext, where you think I’m trying to downplay the things outside §1.1, and suggest that they don’t really count, or something? If so, that’s not what I meant to suggest, and it’s not how I feel in my heart. I’m open to rewording more, if you have suggestions.
I think most things mentioned in 1.4 ("Algorithmic changes that are not really quantifiable as efficiency") belong to 1.1 (algorithmic efficiency progress) because they can actually be quantified as efficiency improvements, namely SFT, RLHF, RLVR. These have strongly increased capabilities, as measured by benchmarks, compared to GPT-3-style prompt engineering of the underlying base model. So a much smaller model with these improvements can get to the performance of a larger base model without them.
In the context of this post, I’m mainly interested in: (1) are the things in §1.4 relevant to the Epoch claim of exponential algorithmic improvements? and (2) are the things in §1.4 relevant to the Dario claim of exponential algorithmic improvements? It seems to me that the answer in both cases is “no”.
(1) In the Epoch case, I believe they quantified performance by perplexity not benchmarks.
(2) In the Dario case, I mean, I keep reading and re-reading the exact wording of the excerpt where he talks about “compute multipliers”. And it just really doesn’t sound to me like he is referring to SFT, RLHF, or RLVR in that excerpt (nor anything else in §1.4). Admittedly, his wording is a bit vague and confusing (to me). I’m open to discussion.
I think model distillation would not cause such a large and ongoing improvement in inference efficiency.
Pick a fixed model size, let’s say N=50B parameters. My current belief is that: if you straightforwardly distill Claude Opus 3.5 into an N-parameter model, then you wind up with a worse model than if you straightforwardly distill Claude Opus 4.5 into an N-parameter model.
Are you disagreeing with that?
If you agree, then it would follow that (for example) maybe: Alice distills Claude Opus 3.5 into a 50B-parameter model, while Bob distills Claude Opus 4.5 into a substantially smaller model, and Bob’s distilled model still matches or beats Alice’s on benchmarks (because Bob’s better starting point is making up for his more aggressive distillation). Thus we would see ever-falling inference costs at any given level of benchmarks. See what I mean?
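(For reference, here’s a minimal sketch of the standard distillation recipe, which is roughly what I have in mind by “straightforwardly distill”: PyTorch, toy shapes, hypothetical stand-in tensors, not any lab’s actual pipeline. The student is trained to match the teacher’s softened output distribution, so a stronger teacher gives the student a better target to imitate.)

```python
# Minimal sketch of standard knowledge distillation (toy shapes; the tensors
# below are hypothetical stand-ins for the big "teacher" and small "student").
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logprobs = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence between the softened teacher and student distributions
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean") * t * t

teacher_logits = torch.randn(32, 50_000)                      # frozen big model's outputs
student_logits = torch.randn(32, 50_000, requires_grad=True)  # small model's outputs
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```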
(Thanks!) To me, your comment is like: “We have a great plan for robust engineering of vehicles (as long as they are on a dry, warm, indoor track, going under 10 kph).” OK, that’s better than nothing. But if we are eventually going to be driving cars at high speed in the cold rain, it’s inadequate. We did not test or engineer them in the right environment.
This is not a complex systems objection (e.g., it’s not about how the world changes with billions of cars). It’s a distribution shift objection. Even just one car, engineered and tested only on the warm dry indoor track, will fail at high speed in the cold rain.
If there’s a distribution shift (test environment systematically different from deployment environment), then you need sufficiently deep understanding of the system to allow extrapolation across the distribution shift.
In the AI case, the issue is: there’s a strong (quadrillion-dollar) economic incentive to make a type of AI that can found, run, and staff innovative companies, 100% autonomously, for years on end, even when the company is doing things that nobody has thought of, and even in a world possibly very different from our own.
And then there’s a huge distribution shift between the environment in which we expect such AIs to be operating, and the environment(s) in which we can safely test those AIs.
My actual opinion is that this type of AI won’t be an LLM; LLMs have issues with long context windows, for example, and don’t do human-like continual learning.
…But even if it were an LLM (or system that includes LLMs), I think you’re going wrong by treating “reliability” and “robustness” as synonyms, when LLMs are actually much stronger at the former than the latter.
We can make a car that’s 99.99% “reliable” on a warm dry indoor track, but after you distribution-shift into the cold rain, it might be 10% or 0% reliable. So it’s not “robust” to distribution shifts. By the same token, I’m willing to believe that LLMs can be made 99.99% “reliable” (in the sense of reliably doing a specific thing in a specific situation). But in weird situations, LLMs sometimes go off the rails; e.g. nobody has yet made an LLM that could not be jailbroken, despite years of work. They’re not very “robust”.
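Here’s a toy numerical illustration of that reliable-but-not-robust distinction (made-up data, a stand-in linear classifier, scikit-learn): the exact same model that is near-perfect on its training distribution collapses when the inputs shift, without anything about the model changing.

```python
# Toy illustration of "reliable but not robust" under distribution shift.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, shift=0.0):
    # Two classes separated along one feature; `shift` pushes them toward
    # (and eventually past) each other, i.e. a distribution shift at test time.
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 0.5, (n, 2))
    X[:, 0] += 3.0 * y + shift * (1 - 2 * y)
    return X, y

clf = LogisticRegression().fit(*sample(2000))
print(clf.score(*sample(2000, shift=0.0)))  # in-distribution: ~99.9% accuracy ("reliable")
print(clf.score(*sample(2000, shift=3.0)))  # shifted inputs: accuracy collapses (not "robust")
```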
You’re sorta implying that I’m against robustness, but in the OP I was saying the opposite. I think we desperately need robustness. I just think we’re not gonna get it without deeper understanding, because of distribution shifts.