I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
Thanks! I appreciate the brainstorming here.
it feels like you are adding more gears to your model as you go. … I'm unable to tell if this is all coming from a consistent model or is more something plausible you suggest could work.
I am acutely aware of the risk of post-hoc storytelling instead of principled postdiction :) I think I'm pretty good at doing principled postdiction rather than post-hoc storytelling (although maybe everybody thinks that about themselves), but I’m certainly capable of the latter, especially when I’m just brainstorming and haven't stewed on something for months or years. E.g. much of my previous comment was early-stage, low-confidence brainstorming; I hope I made that clear. :)
…allows us to make some testable predictions…
I think I’m a lot more skeptical than you about almost any psych-style experiments being even worth the time to do, let alone definitive. I imagine the experiment coming out one way, or coming out the other way, and either way, it seems very easy to explain the result. There are just too many degrees of freedom, and too wide and hazy a hypothesis space (at this stage), and too many degrees of separation between the question and the measurement. (See also: You Are Not Measuring What You Think You Are Measuring.)
Relatedly, Bayes says you kinda need two plausible hypotheses, then an experiment can favor one over the other. But I almost never have that. Rather, it’s all I can do to get to ONE hypothesis that really hangs together and is consistent with everything we know in neuroscience, evolution, algorithm theory, everyday life, mental health, culture, and so on. At least at a high level. (At a lower level, things are much more under-constrained, e.g. I can imagine dozens of ways that some calculation might be divvied up among different neuron groups. But then I don’t care as much about what the answer is.)
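(To spell out that Bayes point in odds form, with $H_1$ and $H_2$ standing for the two hypotheses being compared:

$$\frac{P(H_1\mid \text{data})}{P(H_2\mid \text{data})} = \frac{P(H_1)}{P(H_2)}\times\frac{P(\text{data}\mid H_1)}{P(\text{data}\mid H_2)}$$

The experiment only does work through the likelihood ratio on the right, and that ratio isn’t even well-defined unless there’s a second hypothesis concrete enough to assign likelihoods under.)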
cheek EMG
Some areas where underdetermined messiness can sneak in here are:
(1) We don’t know what the person is actually thinking at any given time, e.g. they aren’t necessarily paying attention to the current cue, or the cue might remind them of an embarrassing thing that they did in middle school etc.
(2) We don’t know the map from the space of facial expressions to the alleged “innate parameters” that I’m hypothesizing, and it might be indirect. For example, if we see a friendly and an angry facial expression simultaneously, does that mean that the underlying hypothalamic groups (or whatever) are not mutually inhibitory? Or might they just be activating so close in time that it looks simultaneous? Or might there be yet a third facial expression that just happens to look like those two coinciding?
Mapping the space of microexpressions (including pupil dilation etc.) in a rigorous way seems potentially interesting and useful, but it wouldn’t lead to a nice legible non-invasive experiment that proves my theory or any other theory, unless we also have a nice way to measure and break down what’s happening upstream of those facial expressions, i.e. in the domain of “feelings and situations triggering innate reactions”, and we don’t. For example, my theory of laughter is unusually simple, but I still don’t know how to get really nice strong legible evidence of it via anything like a psych study. I have a proposed path forward but it involves neural tracing experiments (or equivalent) in rodents.
Envy clearly splits into frustration and craving
I’m having trouble following this one. If it’s important, I wonder if you can try again to explain it more concretely? What would be the possible results, and what do you think we would learn in each case?
Private guilt splits into discovery and appeasement
Just trying to think this through:
I guess you’re saying that I’m likeliest to fix the toy in A, and I’m likeliest to ostentatiously “beat myself up” in C? If so, yeah that seems likely. But as usual, I doubt that “proving” this experimentally would (or should) convince skeptics of any specific underlying theory.
gaze cues
As usual I can’t think of how to set up an experiment that would (or should) convince skeptics and that would not have lots of possible interpretations. For example, suppose we compare a person’s stage fright vs how much they look at the audience’s eyes. I would find it equally easy to explain both possible experimental correlations. If the correlation is positive, I could say “aha, looking at the audience’s eyes causes stage fright”. If the correlation is negative, I could say “aha, people with stage fright are deliberately avoiding looking at the audience’s eyes, because that would be too much for them to handle”.
Hmm, I guess one could look at whether physiological arousal jumps upward at the moments when eye contact with an audience member happens, and whether those jumps are bigger in people with stage fright? Seems pretty likely to me. But again, I’m not sure that a skeptic would find that this data changes their mind about anything; and conversely, if it turned out the other way, I would be mildly surprised and confused, but probably not SO surprised and confused that I would change my mind on anything important.
That’s helpful, thanks! The new version 2 has a rewritten optimization section, hope it’s better now.
Thanks for your feedback; I incorporated some of it in my rewrite (it’s now version 2). In particular, I appreciate the data showing FLOP utilization staying (roughly) constant, and the idea that there’s a red-queen race against communication overhead etc. And I added some of those examples from DeepSeek & Kimi in the appropriate sections. Thanks!
…But I do want to push back on your suggestion that your HellaSwag plot implies what you think it implies.
So I don’t think this is very strong evidence either way, and indeed if anything I would suggest that it’s pushing a bit in the direction of data over algorithms, especially given that Gopher was earlier. Right? Sorry if I’m misunderstanding.
(Thanks!) To me, your comment is like: “We have a great plan for robust engineering of vehicles (as long as they are on a dry, warm, indoor track, going under 10kph).” OK that’s better than nothing. But if we are eventually going to be driving cars at high speed in the cold rain, it’s inadequate. We did not test or engineer them in the right environment.
This is not a complex-systems objection (e.g., it’s not about how the world changes once there are billions of cars). It’s a distribution-shift objection: even a single car will fail at high speed in the cold rain.
If there’s a distribution shift (test environment systematically different from deployment environment), then you need sufficiently deep understanding of the system to allow extrapolation across the distribution shift.
In the AI case, the issue is: there’s a strong (quadrillion-dollar) economic incentive to make a type of AI that can found, run, and staff innovative companies, 100% autonomously, for years on end, even when the company is doing things that nobody has thought of, and even in a world possibly very different from our own.
And then there’s a huge distribution shift between the environment in which we expect such AIs to be operating, and the environment(s) in which we can safely test those AIs.
My actual opinion is that this type of AI won’t be an LLM; LLMs, for example, have issues with long context windows and don’t do human-like continual learning.
…But even if it were an LLM (or system that includes LLMs), I think you’re going wrong by treating “reliability” and “robustness” as synonyms, when LLMs are actually much stronger at the former than the latter.
We can make a car that’s 99.99% “reliable” on a warm dry indoor track, but after you distribution-shift into the cold rain, it might be 10% or 0% reliable. So it’s not “robust” to distribution shifts. By the same token, I’m willing to believe that LLMs can be made 99.99% “reliable” (in the sense of reliably doing a specific thing in a specific situation). But in weird situations, LLMs sometimes go off the rails; e.g. nobody has yet made an LLM that could not be jailbroken, despite years of work. They’re not very “robust”.
You’re sorta implying that I’m against robustness, but in the OP I was saying the opposite. I think we desperately need robustness. I just think we’re not gonna get it without deeper understanding, because of distribution shifts.
I’m not making any claims about what the “interpretability” system is. It can be any system whatsoever whose input is activations and whose output is one or more numbers. The “system” could be a linear probe. Or the “system” could be a team of human researchers who pause the model after every forward pass, scrutinize the activation state for a week, and then output a “this activation state represents scheming” score from 0 to 10. (That’s not a practical example, because if you pause for a week on each forward pass then the training would take a zillion years. But in principle, sure!) Or the “system” could be something even more exotic than that. The “system” can be anything at all, it doesn’t matter for this post. I’m just saying that, regardless of what that system is, if you use its outputs to help determine the reward signal, then this post will hopefully help you think about the eventual consequences of doing that, and in particular whether gradient descent will be working to manipulate and undermine that “system”.
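(To make that setup concrete, here’s a minimal toy sketch of my own, not anything from the post itself; the names like `scheming_score` and `penalty_coeff` are made up for illustration. The “system” here happens to be a frozen linear probe, but the same structure applies to anything that maps activations to a number that feeds into the reward:)

```python
# Toy sketch: an "interpretability system" (here, a frozen linear probe) whose
# output is folded into the reward signal. All names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
ACT_DIM, OBS_DIM = 64, 16

model_params = rng.normal(size=(ACT_DIM, OBS_DIM))  # stand-in for the model being trained

def get_activations(params, obs):
    # Stand-in for a forward pass that exposes an internal activation vector.
    return np.tanh(params @ obs)

# The "system": input = activations, output = a number. Here a linear probe,
# but it could be anything, including a team of humans scoring each state.
probe_w = rng.normal(size=ACT_DIM)

def scheming_score(acts):
    return float(probe_w @ acts)

def shaped_reward(task_reward, acts, penalty_coeff=1.0):
    # The probe's output helps determine the reward. The question in the post is
    # what gradient descent then does with that, i.e. whether it ends up
    # manipulating or undermining the probe rather than reducing "scheming".
    return task_reward - penalty_coeff * scheming_score(acts)

# Example usage:
obs = rng.normal(size=OBS_DIM)
acts = get_activations(model_params, obs)
print(shaped_reward(task_reward=1.0, acts=acts))
```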
If you’re thinking that there isn’t a sharp line between an ML model with an “interpretability system” wrapped around it that has a numerical output (e.g. linear probe), versus an ML model with an auxiliary “output head”, then yeah, that’s true. It’s two ways of thinking about the same thing.
One thing you can maybe do is throw such accusations right back: “You say I’m being closed-minded to you, but aren’t you equally being closed-minded to me?”
It comes across as escalatory, and might be counterproductive, but I’ve also sometimes found it helpful. Depends a lot on the person and situation.
Thanks!! Quick question while I think over the rest:
What data are you plotting? Where exactly did you get it (i.e., what references)?
And why is the 2021 one better than the 2023 ones? Normally we would expect it to be the other way around, right? Does DeepMind have so much secret sauce that it’s worth more than 2 years of public knowledge? Or are the other two groups making rookie mistakes? Or am I misunderstanding the plot?
Isn't it just the case that the human brain's 'interpretability technique' is just really robust? The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life.
I don’t think it’s that robust even in humans, despite the mitigation described in this post. (Without that mitigation, I think it would be hopeless.)
If we’re worried about a failure mode of the form “the interpretability technique has been routed around”, then that’s unrelated to “The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life”. For the failure mode that Yudkowsky & Zvi were complaining about, if that failure mode actually happened, there would still be an accurate model. It would just be an accurate model that is invisible to the interpretability technique.
I.e. the beliefs box would still be working fine, but the connection to desires would be weak or absent.
And I do think that happens plenty in the human world.
Maybe the best example (at least from my own perspective) is the social behavior of (many) smart autistic adults. [Copying from here:] The starting point is that innate social reactions (e.g. the physiological arousal triggered by eye contact) are so strong that they’re often overwhelming. People respond to that by (I think) relating to other people in a way that generally avoids triggering certain innate social reactions. This includes (famously) avoiding eye contact, but I think also includes various hard-to-describe unconscious attention-control strategies. So at the end of the day, neurotypical people will have an unconscious innate snap reaction to (e.g.) learning that someone is angry at them, whereas autistic people won’t have that snap reaction, because they have an unconscious coping strategy, used since early childhood, that avoids triggering it (the reaction being so unpleasant). Of course, they’ll still understand intellectually perfectly well that the person is angry. As one consequence of that, autistic people (naturally) have trouble modeling how neurotypical people will react to different social situations, and conversely, neurotypical people will misunderstand and misinterpret the social behaviors of autistic people.
Still, socially-attentive smart autistic adults sometimes become good (indeed, sometimes better than average) at predicting the behavior of neurotypical people, if they put enough work into it.
(People can form predictive models of other people just using our general ability to figure things out, just like we can build predictive models of car engines or whatever.)
That’s just one example. I discuss other (maybe less controversial) examples in my Sympathy Reward post §4.1 and Approval Reward post §6.
Belated update on that last point on “algorithmic progress” for LLMs: I looked into this a bit and wrote it up at: The nature of LLM algorithmic progress. The last section is how it relates to this post, with the upshot that I stand by what I wrote in OP.
FWIW, inspired by Justis, I’ve been keeping a running list of things that I could usefully automate with Claude Code (or similar) for my own personal productivity, adding to it every time something pops into my head. I’ve been at it for the past three weeks, but so far it’s a very underwhelming list! Here’s ~the whole thing:
Anyway, all of these seem like they would save me a pathetically small amount of time, and so I haven’t bothered to install Claude Code yet. But someday the list will be longer, or I will be bored and curious enough to do it regardless.
Meanwhile, I 80/20’d the second one (clipboard normalizer) just using a normal LLM chat interface: Gemini one-shotted a nice HTML + JavaScript solution that I stored locally and bookmarked. It adds an extra couple of seconds compared to an app or Chrome extension, but whatever, I don’t use it that often anyway.
I’ll keep brainstorming, but I dunno, I really don’t seem to do much that could be automated at all and that I haven’t already automated years ago in the old-fashioned way (e.g. I have long had automatic file backups, automatic credit card payments, automatic bank transfers, automatic citation downloading, etc.).