I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
Thanks, this is great!
Envy toward a friend’s success…
I used to think that envy was a social instinct (before 2023ish), but now I don’t think it’s a social instinct at all (see “changelog” here). Instead I currently think that envy is a special case of, umm, “craving” (in the general colloquial sense, not the specific Buddhist sense)—a kind of anxious frustration in a scenario where something is highly salient, and highly desired, but in fact cannot happen.
So a social example would be: Sally has a juice box, and I love juice, but I can’t have any. Looking at Sally drinking juice reminds me of the scenario where I’m drinking juice, which makes me unhappy because I don’t have any juice.
Whereas a non-social example of the same innate reaction would be: It’s lunch time, and every day at lunch I have juice and a sandwich in a brown paper bag, and I love juice. But it happens that there’s a new global juice shortage, so today for the first time I don’t have any juice. Looking at my sandwich and the brown bag reminds me of the scenario where I’m drinking juice, which makes me unhappy because I don’t have any juice.
So that’s my starting point: both of these examples involve the same kind of (not-specifically-social) craving-related frustration reaction.
After that, of course, the Sally scenario becomes social, because the scenario involves Sally doing something (i.e. drinking juice) that causes me to feel an unpleasant feeling (per above), and generically, if someone is causing me unpleasant feelings, that tends to push me away from regarding Sally as a friend and towards regarding her as an enemy, and to make me motivated to find an excuse to blame her for my troubles and pick a fight with her.
Admiration for a rival or enemy
My guess is that, just as going to bed can feel like a good idea or a bad idea depending on which aspects of the situation you’re paying attention to, likewise Genghis Khan can feel like a friend or an enemy depending on which aspects of him you’re paying attention to. I would suggest that people don’t feel admiration towards Person X and schadenfreude towards Person X at the very same instant. You might be able to flip back and forth from one to the other very quickly, even within 1 or 2 seconds, but not at the very same instant. For example, if I say the sentence “It was catastrophic how Genghis Khan killed all those people, but I have to admit, he was a talented leader”, I would suggest that the “innate friend-vs-enemy parameter” related to thoughts of Genghis Khan flips from enemy in the first half of the sentence to friend in the second half.
Compassion for a stable enemy’s suffering
There probably isn’t one great answer; probably different people are different. As above, we can think of people in different ways, paying attention to different aspects of them, and they can flip rapidly from enemy to friend and back. Since attention control is partly voluntary, it’s partly (but not entirely) a choice whether we see someone as a friend vs enemy, and we tend to choose the option that feels better / more motivating on net, and there can be a bunch of factors related to that. For example, approval reward is a factor—some people take pride in their compassion (just as we nod approvingly when superheroes show compassion to their enemies, and cf. §6), while others take pride in their viciousness. Personality matters, culture matters, the detailed situation matters, etc.
Gratitude / indebtedness
Hmm. Generically, I think there are two (not mutually exclusive) paths:
As an example of the latter, recently someone important-to-me went out of his way to help me, and I expected the interaction to work out well for him too, but instead it wound up being a giant waste of his time, and objectively it wasn’t really my fault, but I still felt horrible and lost much sleep over it, and I think the aspect that felt most painful to me was when I imagined him secretly being annoyed at me and regretful for ever reaching out to me, even if he was far too nice a guy to say anything like that to me directly.
…But I’m kinda neurotic; different people are different and I don’t want to overgeneralize. Happy to hear more about how things seem to you.
Private guilt
I talked about “no one will ever find out” a bit in §6.1 of the approval reward post. I basically think that you can consciously believe that no one will ever find out, while nevertheless viscerally feeling a bit of the reaction associated with a nonzero possibility of someone finding out.
As for the “Dobby effect” (self-punishment related to guilt, a.k.a. atonement), that’s an interesting question. I thought about it a bit and here’s my proposed explanation:
Generally, if Ahab does something hurtful to Bob, then Bob might get angry at Ahab, and thus want Ahab to suffer (and better yet, to suffer while thinking about Bob, such as if Bob is punching Ahab in the face). But that desire of Bob’s, just like hunger and many other things, is satiable—just like a hungry person stops being hungry after eating a certain amount, likewise Bob tends to lose his motivation for Ahab to suffer, after Ahab has already suffered a certain amount. For example, if an angry person punches out his opponent in a bar fight, he usually feels satisfied, and doesn’t keep kicking his victim when he’s down, except in unusual cases. Or even if he kicks a bit, he won’t keep kicking for hours and hours.
We all know this intuitively from life experience, and we intuitively pick up on what it implies: if Ahab did something hurtful to Bob, and Ahab wants to get back to a situation where Bob feels OK about Ahab ASAP, then Ahab should be making himself suffer, and better yet suffer while thinking about Bob. Then not only is Ahab helping dull Bob’s feelings of aggression by satiating them, but simultaneously, there’s the very fact that Ahab is helping Bob feel a good feeling (i.e., satiation of anger), which should help push Ahab towards the “friend” side of the ledger in Bob’s mind.
Aggregation cases
In “identifiable victim effect”, I normally think of, like, reading a news article about an earthquake across the world. It’s very abstract. There’s some connection to the ground-truth reward signals that I suggested in Neuroscience of human social instincts: a sketch, but it’s several steps removed. Ditto “psychic numbing”, I think.
By contrast, in stage fright, you can see the people right there, looking at you, potentially judging you. You can make eye contact with one actual person, then move your eyes, and now you’re making eye contact with a different actual person, etc. The full force of the ground-truth reward signals is happening right now.
Likewise, for “audience effect”, we all have life experience of doing something, and then it turns out that there’s a real person right there who was watching us and judging us based on what we did. At any second, that real person could appear, and make eye contact etc. So again, we’re very close to the full force of the ground-truth reward signals here.
…So I don’t see a contradiction there.
Again I really appreciate this kind of comment, feel free to keep chatting.
I read one of their papers (the Pong one, which is featured on the frontpage of their website) and thought it was really bad and p-hacked, see here & here.
…sounds like a joke? you do not want to do any computation on neurons, they are slow and fragile. (you might want to run brain-inspired algorithms, but on semiconductors!)
Strong agree.
oh oops sorry if I already shared that with you, I forgot, didn’t mean to spam.
My actual expectation is that WBE just ain’t gonna happen at all (at least not before ASI), for better or worse. I think the without-reverse-engineering path is impossible, and the with-reverse-engineering path would be possible given infinite time, but would incidentally involve figuring out how to make ASI way before the project is done, and that recipe would leak (or they would try it themselves). Or even more realistically, someone else on Earth would invent ASI first, via an unrelated effort. So I spend very little time thinking about WBE.
Like, a discussion might go:
Optimist: If you pick some random thing, there is no reason at all to expect that thing to be a ruthless sociopath. It’s an extraordinarily weird and unlikely property.
Me: Yes I happily concede that point.
O: You do? So why are you worried about ASI x-risk?
Me: Well if you show me some random thing, it’s probably, like, a rock or something. It’s not sociopathic, but only because it’s not intelligent at all.
O: Well, c’mon, you know what I mean. If you pick some random mind, there is no reason at all to expect it to be a ruthless sociopath.
Me: How do you “pick some random mind”? Minds don’t just appear out of nowhere.
O: I dunno, like, human? Or AI?
Me: Different humans are different to some extent, and different AI algorithms are different to a much greater extent, and also different from humans. “AI” includes everything from A* search to MuZero to LLMs. Is A* search a ruthless sociopath? Like, I dunno, it does seem rather maniacally obsessed with graph traversal, right?
O: Oh c’mon, don’t be dense. I didn’t mean “AI” in the sense of the academic discipline, I meant, like, AI in the colloquial sense, AI that qualifies as a mind, like LLMs. I’m talking about human minds and LLM “minds”, i.e. all the minds we’ve ever seen, and we observe that they are not sociopathic.
Me: As it happens, I’m working on the threat model of model-based actor-critic RL agent “brain-like” AGI, not LLMs. LLMs are profoundly different from what I’m working on. Saying that LLMs will have similar properties to RL agent AGI because “both are AI” is like saying that LLMs will have similar properties to the A* search algorithm because “both are AI”. Or it’s like saying that a tree or a parasitic wasp will have similar properties as a human because both are alive. They can still be wildly different in every way that matters.
O: OK but lots of other doomers talk about LLMs causing doom, even if you claim to be agnostic about it. E.g. IABIED.
Me: Well fine, go find those people and argue with them, and leave me out of it, it’s not my wheelhouse. I mostly don’t expect LLMs to become powerful enough to be the kind of really scary thing that could cause human extinction even if they wanted to.
O: Well you’re here so I’ll keep talking to you. I still think you need some positive reason to believe that RL agent AGI will be a ruthless sociopath.
Me: Maybe a good starting point would be my posts LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem, or “The Era of Experience” has an unsolved technical alignment problem.
O: I’m still not seeing what you’re seeing. Can you explain it a different way?
Me: OK, back at the start of the conversation, I mentioned that random objects like rocks are not able to accomplish impressive difficult feats. If we’re thinking about AI that can autonomously found and grow companies for years, or autonomously wipe out humans and run the world by itself, then clearly it’s not a “random object”, but rather a thing that is able to accomplish impressive difficult feats. And the question we should be asking is: how does it do that? It can’t do it by choosing random actions. There has to be some explanation for how it finds actions that accomplish these feats.
And one possible answer is: it does it by (what amounts to) having desires about what winds up happening in the future, and running some search process to find actions that lead to those desires getting fulfilled. This is the main thing that you get from RL agents and model-based planning algorithms. The whole point of those subfields of AI is, they’re algorithms that find actions that maximize an objective. I.e., you get ruthless sociopathic behavior by default. And this isn’t armchair theorizing, it’s dead obvious to anyone who has spent serious amounts of time building or using RL agents and/or model-based planning algorithms. These things are ruthless by default, unless the programmer goes out of their way to make them non-ruthless. (And I claim that it’s not obvious or even known how they would make them non-ruthless, see those links above.) (And of course, evolution did specifically add features to the human brain to make humans non-ruthless, i.e. our evolved social instincts. Human sociopaths do exist, after all, and are quite capable of accomplishing impressive difficult feats.)
So that’s one possible answer, and it’s an answer that brings in ruthlessness by default.
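(If it helps to see that concretely: here’s a toy brute-force model-based planner. The actions, rewards, and world model are entirely made up for illustration—this isn’t any real system—but it shows how a planner that’s only told to maximize an objective will happily pick whatever action sequence does so, with zero regard for anything not in the objective.)

```python
# Toy illustration: a brute-force model-based planner. Everything here
# (actions, dynamics, objective) is a made-up example, not a real agent.
from itertools import product

ACTIONS = ["work", "steal", "rest"]

def world_model(state, action):
    """Hypothetical toy dynamics: state = (money, harm_done)."""
    money, harm = state
    if action == "work":
        return (money + 1, harm)
    if action == "steal":
        return (money + 3, harm + 1)
    return (money, harm)  # "rest" does nothing

def objective(state):
    money, harm = state
    # The planner only cares about money; "harm" isn't penalized,
    # so nothing stops the search from racking it up.
    return money

def plan(initial_state, horizon=3):
    best_seq, best_value = None, float("-inf")
    for seq in product(ACTIONS, repeat=horizon):  # exhaustive search over plans
        state = initial_state
        for a in seq:
            state = world_model(state, a)
        if objective(state) > best_value:
            best_seq, best_value = seq, objective(state)
    return best_seq

print(plan((0, 0)))  # -> ('steal', 'steal', 'steal')
```

The point is just that nothing in that search loop even represents “side effects” unless the programmer explicitly builds them into the objective.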
…And then there’s a second, different possible answer: it finds actions that accomplish impressive feats by imitating what humans would do in different contexts. That’s where (I claim) LLMs get the lion’s share of their capabilities from. See my post Foom & Doom §2.3 for details. Of course, in my view, the alignment benefits that LLMs derive from imitating humans are inexorably tied to capabilities costs, namely that LLMs struggle to get very far beyond ideas that humans have already written down. And that’s why (as I mentioned above) I’m not expecting LLMs to get all the way to the scary kind of AGI / ASI capabilities that I’m mainly worried about.
Do it! Write a new “version 2” post / post-series! It’s OK if there’s self-plagiarism. Would be time well spent.
If we put the emphasis on “simplest possible”, the most minimal that I personally recall writing is this one; here it is in its entirety:
- The path we’re heading down is to eventually make AIs that are like a new intelligent species on our planet, and able to do everything that humans can do—understand what’s going on, creatively solve problems, take initiative, get stuff done, make plans, pivot when the plans fail, invent new tools to solve their problems, etc.—but with various advantages over humans like speed and the ability to copy themselves.
- Nobody currently has a great plan to figure out whether such AIs have our best interests at heart. We can ask the AI, but it will probably just say “yes”, and we won’t know if it’s lying.
- The path we’re heading down is to eventually wind up with billions or trillions of such AIs, with billions or trillions of robot bodies spread all around the world.
- It seems pretty obvious to me that by the time we get to that point—and indeed probably much much earlier—human extinction should be at least on the table as a possibility.
(This is an argument that human extinction is on the table, not that it’s likely.)
This one will be unconvincing to lots of people, because they’ll reject it for any of dozens of different reasons. I think those reasons are all wrong, but you need to start responding to them if you want any chance of bringing a larger share of the audience onto your side. These responses include both sophisticated “insider debates”, and just responding to dumb misconceptions that would pop into someone’s head.
(See §1.6 here for my case-for-doom writeup that I consider “better”, but it’s longer because it includes a list of counterarguments and responses.)
(This is a universal dynamic. For example, the case for evolution-by-natural-selection is simple and airtight, but the responses to every purported disproof of evolution-by-natural-selection would be at least book-length and would need to cover evolutionary theory and math in way more gory technical detail.)
I bet that Steve Byrnes can point out a bunch of specific sensory evidence that the brain uses to construct the status concept (stuff like gaze length of conspecifics or something?), but the human motivation system isn't just optimizing for those physical proxy measures, or people wouldn't be motivated to get prestige on internet forums where people have reputations but never see each other's faces.
If it helps, my take is in Neuroscience of human social instincts: a sketch and its follow-up Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking.
Sensory evidence is definitely involved, but kinda indirectly. As I wrote in the latter: “The central situation where Approval Reward fires in my brain, is a situation where someone else (especially one of my friends or idols) feels a positive or negative feeling as they think about and interact with me.” I think it has to start with in-person interactions with other humans (and associated sensory evidence), but then there’s “generalization upstream of reward signals” such that rewards also get triggered in semantically similar situations, e.g. online interactions. And it’s intimately related to the fact that there’s a semantic overlap between “I am happy” and “you are happy”, via both involving a “happy” concept. It’s a trick that works for certain social things but can’t be applied to arbitrary concepts like inclusive genetic fitness.
I stand by my nitpick in my other comment that you’re not using the word “concept” quite right. Or, hmm, maybe we can distinguish (A) “concept” = a latent variable in a specific human brain’s world-model, versus (B) “concept” = some platonic Natural Abstraction™ or whatever, whether or not any human is actually tracking it. Maybe I was confused because you’re using the (B) sense but I (mis)read it as the (A) sense? In AI alignment, we care especially about getting a concept in the (A) sense to be explicitly desired, because that’s likelier to generalize out-of-distribution, e.g. via out-of-the-box plans. (Arguably.) There are indeed situations where the desires bestowed by Approval Reward come apart from social status as normally understood (cf. this section, plus the possibility that we’ll all get addicted to sycophantic digital friends upon future technological changes), and I wonder whether the whole question of “is Approval Reward exactly creating social status desire, or something that overlaps it but comes apart out-of-distribution?” might be a bit ill-defined, via “painting the target around the arrow” in how we think about what social status even means.
(This is a narrow reply, not taking a stand on your larger points, and I wrote it quickly, sorry for errors.)
You might (or might not) have missed that we can simultaneously be in defer-to-predictor mode for valence, override mode for goosebumps, defer-to-predictor mode for physiological arousal, etc. It’s not all-or-nothing. (I just edited the text you quoted to make that clearer.)
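(If a concrete schematic helps, here’s a toy sketch of what I mean by “not all-or-nothing.” The channel names and numbers are made up for illustration, and this sweeps a ton of complexity under the rug, but the point is that each channel carries its own mode setting rather than there being one global switch.)

```python
# Toy schematic (made-up channels and values): each output channel has its
# own mode, so some channels can defer to the learned predictor while others
# are overridden at the same moment.
from dataclasses import dataclass

@dataclass
class Channel:
    name: str
    mode: str                    # "defer-to-predictor" or "override"
    predictor_value: float       # what the learned predictor suggests
    override_value: float = 0.0  # what the steering subsystem imposes instead

    def output(self) -> float:
        if self.mode == "defer-to-predictor":
            return self.predictor_value
        return self.override_value

channels = [
    Channel("valence", "defer-to-predictor", predictor_value=0.7),
    Channel("goosebumps", "override", predictor_value=0.1, override_value=1.0),
    Channel("physiological arousal", "defer-to-predictor", predictor_value=0.4),
]

for ch in channels:
    print(f"{ch.name}: {ch.mode} -> {ch.output()}")
```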
In "defer-to-predictor" mode, all of the informational content that directs thought rerolls is coming from the thought assessors in the Learned-from-Scratch part of the brain, even if if that information is neurologically routed through the steering subsystem?
To within the limitations of the model I’m putting forward here (which sweeps a bit of complexity under the rug), basically yes.
I feel like I see it pretty often. Check out “Unfalsifiable stories of doom”, for example.
Or really, anyone who uses the phrase “hypothetical risk” or “hypothetical threat” as a conversation-stopper when talking about ASI extinction, is implicitly invoking the intuitive idea that we should by default be deeply skeptical of things that we have not already seen with our own eyes.
Obviously I agree that The Spokesperson is not going to sound realistic and sympathetic when he is arguing for “Ponzi Pyramid Incorporated” led by “Bernie Bankman”. It’s a reductio ad absurdum, showing that this style of argument proves too much. That’s the whole point.