I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
Isn't it just the case that the human brain's 'interpretability technique' is really robust? The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life.
I don’t think it’s that robust even in humans, despite the mitigation described in this post. (Without that mitigation, I think it would be hopeless.)
If we’re worried about a failure mode of the form “the interpretability technique has been routed around”, then that’s unrelated to “The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life”. For the failure mode that Yudkowsky & Zvi were complaining about, if that failure mode actually happened, there would still be an accurate model. It would just be an accurate model that is invisible to the interpretability technique.
I.e. the beliefs box would still be working fine, but the connection to desires would be weak or absent.
And I do think that happens plenty in the human world.
Maybe the best example (at least from my own perspective) is the social behavior of (many) smart autistic adults. [Copying from here:] The starting point is that innate social reactions (e.g. the physiological arousal triggered by eye contact) are so strong that they’re often overwhelming. People respond to that by (I think) relating to other people in a way that generally avoids triggering certain innate social reactions. This includes (famously) avoiding eye contact, but I think it also includes various hard-to-describe unconscious attention-control strategies. So at the end of the day, neurotypical people will have an unconscious innate snap reaction to (e.g.) learning that someone is angry at them, whereas autistic people won’t have that snap reaction, because they have an unconscious coping strategy, used since early childhood, that avoids triggering it, because the reaction is so unpleasant. Of course, they’ll still understand perfectly well, intellectually, that the person is angry. As one consequence, autistic people (naturally) have trouble modeling how neurotypical people will react to different social situations, and conversely, neurotypical people will misunderstand and misinterpret the social behaviors of autistic people.
Still, socially-attentive smart autistic adults sometimes become good (indeed, sometimes better than average) at predicting the behavior of neurotypical people, if they put enough work into it.
(We can form predictive models of other people just using our general ability to figure things out, just like we can build predictive models of car engines or whatever.)
That’s just one example. I discuss other (maybe less controversial) examples in my Sympathy Reward post §4.1 and Approval Reward post §6.
Belated update on that last point about “algorithmic progress” for LLMs: I looked into this a bit and wrote it up at: The nature of LLM algorithmic progress. The last section is about how it relates to this post, with the upshot that I stand by what I wrote in OP.
I added a caption to the picture. Does that help?
Yeah I agree with that. I have a diagram of homeostatic feedback control in §1.5 of my SMTM reply post, and RL is one of the ingredients (d & f).
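For concreteness, here’s a very rough toy sketch (my own illustration, not the actual diagram from the SMTM reply post) of how an RL ingredient can sit inside a homeostatic feedback loop: reward tracks reduction of the error between a regulated variable and its setpoint.

```python
# Toy sketch of a homeostatic loop with an RL ingredient (my illustration only):
# reward is higher when an action moves the regulated variable toward setpoint.
def homeostatic_step(state, setpoint, action_effect):
    """One tick: apply an action, return (new_state, reward)."""
    error_before = abs(state - setpoint)
    new_state = state + action_effect
    error_after = abs(new_state - setpoint)
    reward = error_before - error_after  # RL ingredient: reward = error reduction
    return new_state, reward

# e.g. state=36.0, setpoint=37.0 (body temperature), an action that warms by 0.5
print(homeostatic_step(36.0, 37.0, 0.5))  # -> (36.5, 0.5)
```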
I know very little about this topic, but I was under the impression that there was more to it than “KV cache: yes or no?”, and I was trying to refer to that whole category of possible improvements. E.g. here’s a paper on “KV cache compression”.
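For concreteness, here’s a toy sketch of the general idea (my own illustration, not the method from that paper): during autoregressive decoding, keys and values for past tokens are cached rather than recomputed, and “KV cache compression” schemes shrink that cache, e.g. by quantizing it.

```python
# Toy sketch (not any particular paper's method): a KV cache during decoding,
# plus one crude form of "compression" (8-bit quantization of the cached tensors).
import numpy as np

d = 64  # head dimension (arbitrary toy value)
kv_cache = {"K": np.zeros((0, d), dtype=np.float32),
            "V": np.zeros((0, d), dtype=np.float32)}

def decode_step(x, W_q, W_k, W_v):
    """One attention step: reuse cached K/V from earlier tokens instead of
    recomputing them, and append this token's K/V to the cache."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    kv_cache["K"] = np.vstack([kv_cache["K"], k[None, :]])
    kv_cache["V"] = np.vstack([kv_cache["V"], v[None, :]])
    scores = kv_cache["K"] @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ kv_cache["V"]

def compress_cache():
    """Crude 'compression': store the cache in int8 with a per-tensor scale.
    Real methods (quantization, eviction, low-rank projection, etc.) are far
    more sophisticated; this just shows where the memory savings come from."""
    out = {}
    for name, t in kv_cache.items():
        scale = np.abs(t).max() / 127 + 1e-8
        out[name] = (np.round(t / scale).astype(np.int8), scale)
    return out
```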
Thanks for the feedback!
I would also include improvements from synthetic training data as an algorithmic improvement, not a data-related improvement, because better synthetic training data is created by better algorithms…
I have now changed the text in a few places to better clarify how I am defining the scope of §1.1.
I feel like you’re maybe reading in some subtext, where you think I’m trying to downplay the things outside §1.1, and suggest that they don’t really count, or something? If so, that’s not what I meant to suggest, and it’s not how I feel in my heart. I’m open to rewording more, if you have suggestions.
I think most things mentioned in 1.4 ("Algorithmic changes that are not really quantifiable as efficiency") belong to 1.1 (algorithmic efficiency progress) because they can actually be quantified as efficiency improvements, namely SFT, RLHF, RLVR. These have strongly increased capabilities, as measured by benchmarks, compared to GPT-3-style prompt engineering of the underlying base model. So a much smaller model with these improvements can get to the performance of a larger base model without them.
In the context of this post, I’m mainly interested in: (1) are the things in §1.4 relevant to the Epoch claim of exponential algorithmic improvements? and (2) are the things in §1.4 relevant to the Dario claim of exponential algorithmic improvements? It seems to me that the answer in both cases is “no”.
(1) In the Epoch case, I believe they quantified performance by perplexity, not benchmarks (see the brief perplexity sketch below).
(2) In the Dario case, I mean, I keep reading and re-reading the exact wording of the excerpt where he talks about “compute multipliers”. And it just really doesn’t sound to me like he is referring to SFT, RLHF, or RLVR in that excerpt (nor anything else in §1.4). Admittedly, his wording is a bit vague and confusing (to me). I’m open to discussion.
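For reference on point (1), here’s a minimal sketch of the metric I mean by “perplexity”: the exponential of the average per-token negative log-likelihood. (My illustration of the standard definition, not Epoch’s actual pipeline.)

```python
# Perplexity = exp(average negative log-likelihood per token).
import math

def perplexity(token_log_probs):
    """token_log_probs: list of log p(token | context) under the model,
    natural log, one entry per token in the held-out text."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Example: a model that assigns probability 0.25 to every token
print(perplexity([math.log(0.25)] * 100))  # ≈ 4.0
```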
I think model distillation would not cause such a large and ongoing improvement in inference efficiency.
Pick a fixed model size, let’s say N=50B parameters. My current belief is that: if you straightforwardly distill Claude Opus 3.5 into an N-parameter model, then you wind up with a worse model, than if you straightforwardly distill Claude Opus 4.5 into an N-parameter model.
Are you disagreeing with that?
If you agree, then it would follow that (for example) maybe Bob, distilling Claude Opus 4.5 into a comparatively small model, ends up matching the benchmark performance of a larger model distilled from Claude Opus 3.5 (because Bob’s better starting point is making up for his more aggressive distillation). Thus we would see ever-falling inference costs at any given level of benchmarks. See what I mean?
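For concreteness, here’s a toy sketch of what I mean by “straightforwardly distill” above: train a smaller student model to imitate the teacher’s output distribution. (Generic PyTorch illustration; Anthropic’s actual recipe is of course not public, and `student` / `teacher` here are just hypothetical stand-in modules.)

```python
# Generic knowledge-distillation sketch (toy illustration, not Anthropic's recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature**2

def train_step(student, teacher, batch, optimizer):
    """One update: the fixed-size (N-parameter) student imitates the teacher.
    The claim in the text is that a better teacher (Opus 4.5 vs. Opus 3.5)
    yields a better student at the same N."""
    with torch.no_grad():
        teacher_logits = teacher(batch)
    student_logits = student(batch)
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```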
EDITED TO ADD: I just took the non-Goodfire-specific part of this comment, and spun it out and expanded it into a new post: In (highly contingent!) defense of interpretability-in-the-loop ML training
~ ~ ~
No opinion on Goodfire, or even on LLMs more broadly, but I generically think that there exist some AI algorithms in which interpretability outputs are connected to a reward function in a way that might be very helpful for safe & beneficial ASI. See for example Reward Function Design: a starter pack sections 1 & 4 & 5.
E.g. consider compassion in the human brain. I claim that we have an innate reward function that triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away. So the reward can evidently get triggered by specific activations inside my inscrutable learned world-model. But human compassion is generally pretty robust—at least, it doesn’t go away as you age. It would be cool if we knew how to put something like that into an AGI.
…But yes, it’s also true that there are ways to connect interpretability outputs to a reward function that are bad, and make our problems all worse. I think there’s probably some elegant big-picture theoretical framework that sheds light on how to do interpretability-in-the-loop training in a way that’s good rather than terrible. Developing that theoretical framework would be good. (But probably quite different from what Goodfire is working on … it’s not “doing interpretability stuff” in the conventional sense but rather “thinking about AI algorithmic architectures in the big picture”, kinda more agent-foundations-y if anything.)
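To make “interpretability outputs connected to a reward function” a bit more concrete, here’s a schematic toy sketch (my own illustration, not a claim about how the brain or any existing system actually does it): a probe over the agent’s internal activations detects some concept in the learned world-model, and its output is folded into the reward, so reward can fire on believed states of the world, not just directly observed ones.

```python
# Schematic toy sketch of interpretability-in-the-loop reward (illustration only).
import torch
import torch.nn as nn

class ProbeReward(nn.Module):
    """A linear probe over hidden activations, trained (elsewhere) to detect
    some concept in the learned world-model, e.g. 'the other agent is
    suffering'. Its output is folded into the reward signal."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.probe = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_state):
        return torch.sigmoid(self.probe(hidden_state))  # concept "strength" in [0, 1]

def shaped_reward(env_reward, hidden_state, probe_reward, weight=1.0):
    """Total reward = ordinary environment reward minus a penalty that scales
    with how strongly the probe thinks the concept (e.g. suffering) is
    represented in the agent's own world-model."""
    with torch.no_grad():
        concept = probe_reward(hidden_state).item()
    return env_reward - weight * concept
```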
Fair enough, I reworded slightly. (Although I’d be pretty surprised if LLM algorithmic progress were way faster today than in, say, 2020, given that there was presumably much more low-hanging fruit in 2020.)
In 2024, I wasn't able to find anyone making this argument. My sense is that it was not at all prevalent, and continues to be not at all prevalent.
I feel like I see it pretty often. Check out “Unfalsifiable stories of doom”, for example.
Or really, anyone who uses the phrase “hypothetical risk” or “hypothetical threat” as a conversation-stopper when talking about ASI extinction, is implicitly invoking the intuitive idea that we should by default be deeply skeptical of things that we have not already seen with our own eyes.
Is The Spokesperson a realistic villain?
Obviously I agree that The Spokesperson is not going to sound realistic and sympathetic when he is arguing for “Ponzi Pyramid Incorporated” led by “Bernie Bankman”. It’s a reductio ad absurdum, showing that this style of argument proves too much. That’s the whole point.
Thanks!! Quick question while I think over the rest:
What data are you plotting? Where exactly did you get it (i.e., what references)?
And why is the 2021 one better than the 2023 ones? Normally we would expect it to be the other way around, right? Does DeepMind have so much secret sauce that it’s worth more than 2 years of public knowledge? Or are the other two groups making rookie mistakes? Or am I misunderstanding the plot?