I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
If we put the emphasis on “simplest possible”, the most minimal that I personally recall writing is this one; here it is in its entirety:
- The path we’re heading down is to eventually make AIs that are like a new intelligent species on our planet, and able to do everything that humans can do—understand what’s going on, creatively solve problems, take initiative, get stuff done, make plans, pivot when the plans fail, invent new tools to solve their problems, etc.—but with various advantages over humans like speed and the ability to copy themselves.
- Nobody currently has a great plan to figure out whether such AIs have our best interests at heart. We can ask the AI, but it will probably just say “yes”, and we won’t know if it’s lying.
- The path we’re heading down is to eventually wind up with billions or trillions of such AIs, with billions or trillions of robot bodies spread all around the world.
- It seems pretty obvious to me that by the time we get to that point—and indeed probably much much earlier—human extinction should be at least on the table as a possibility.
(This is an argument that human extinction is on the table, not that it’s likely.)
This one will be unconvincing to lots of people, because they’ll reject it for any of dozens of different reasons. I think those reasons are all wrong, but you need to start responding to them if you want any chance of bringing a larger share of the audience onto your side. These responses include both sophisticated “insider debates”, and just responding to dumb misconceptions that would pop into someone’s head.
(See §1.6 here for my case-for-doom writeup that I consider “better”, but it’s longer because it includes a list of counterarguments and responses.)
(This is a universal dynamic. For example, the case for evolution-by-natural-selection is simple and airtight, but the responses to every purported disproof of evolution-by-natural-selection would be at least book-length and would need to cover evolutionary theory and math in way more gory technical detail.)
I bet that Steve Byrnes can point out a bunch of specific sensory evidence that the brain uses to construct the status concept (stuff like gaze length of conspecifics or something?), but the human motivation system isn't just optimizing for those physical proxy measures, or people wouldn't be motivated to get prestige on internet forums where people have reputations but never see each other's faces.
If it helps, my take is in Neuroscience of human social instincts: a sketch and its follow-up Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking.
Sensory evidence is definitely involved, but kinda indirectly. As I wrote in the latter: “The central situation where Approval Reward fires in my brain, is a situation where someone else (especially one of my friends or idols) feels a positive or negative feeling as they think about and interact with me.” I think it has to start with in-person interactions with other humans (and associated sensory evidence), but then there’s “generalization upstream of reward signals” such that rewards also get triggered in semantically similar situations, e.g. online interactions. And it’s intimately related to the fact that there’s a semantic overlap between “I am happy” and “you are happy”, via both involving a “happy” concept. It’s a trick that works for certain social things but can’t be applied to arbitrary concepts like inclusive genetic fitness.
I stand by my nitpick in other comment that you’re not using the word “concept” quite right. Or, hmm, maybe we can distinguish (A) “concept” = a latent variable in a specific human brain’s world-model, versus (B) “concept” = some platonic Natural Abstraction™ or whatever, whether or not any human is actually tracking it. Maybe I was confused because you’re using the (B) sense but I (mis)read it as the (A) sense? In AI alignment, we care especially about getting a concept in the (A) sense to be explicitly desired because that’s likelier to generalize out-of-distribution, e.g. via out-of-the-box plans. (Arguably.) There are indeed situations where the desires bestowed by Approval Reward come apart from social status as normally understood (cf. this section, plus the possibility that we’ll all get addicted to sycophantic digital friends upon future technological changes), and I wonder whether the whole question of “is Approval Reward exactly creating social status desire, or something that overlaps it but comes apart out-of-distribution?” might be a bit ill-defined via “painting the target around the arrow” in how we think about what social status even means.
(This is a narrow reply, not taking a stand on your larger points, and I wrote it quickly, sorry for errors.)
You might (or might not) have missed that we can simultaneously be in defer-to-predictor mode for valence, override mode for goosebumps, defer-to-predictor mode for physiological arousal, etc. It’s not all-or-nothing. (I just edited the text you quoted to make that clearer.)
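Here’s a toy sketch of what I mean by “not all-or-nothing” (purely illustrative; the channel names and values are made up, and this isn’t code from the series): each output channel gets its own mode, so the steering subsystem can defer to the learned predictor on some channels while overriding others at the same moment.

```python
# Toy sketch (illustrative only): per-channel defer-vs-override, so "defer to
# predictor" on valence can coexist with "override" on goosebumps, etc.

from enum import Enum

class Mode(Enum):
    DEFER_TO_PREDICTOR = "defer"
    OVERRIDE = "override"

def steering_output(predictor_outputs, override_values, modes):
    """Combine per-channel predictor outputs with per-channel overrides."""
    result = {}
    for channel, prediction in predictor_outputs.items():
        if modes[channel] is Mode.DEFER_TO_PREDICTOR:
            result[channel] = prediction                 # pass the learned prediction through
        else:
            result[channel] = override_values[channel]   # steering subsystem takes over
    return result

# e.g. defer on valence and arousal, override on goosebumps, all simultaneously:
modes = {"valence": Mode.DEFER_TO_PREDICTOR,
         "goosebumps": Mode.OVERRIDE,
         "arousal": Mode.DEFER_TO_PREDICTOR}
print(steering_output({"valence": 0.3, "goosebumps": 0.9, "arousal": 0.5},
                      {"valence": 0.0, "goosebumps": 0.0, "arousal": 0.0},
                      modes))
```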
In "defer-to-predictor" mode, all of the informational content that directs thought rerolls is coming from the thought assessors in the Learned-from-Scratch part of the brain, even if if that information is neurologically routed through the steering subsystem?
To within the limitations of the model I’m putting forward here (which sweeps a bit of complexity under the rug), basically yes.
The black border around your MacBook screen would be represented in some tiny subset of the cortex before you pay attention to it, and in a much larger subset of the cortex after you pay attention to it. In the before state (when it’s affecting a tiny subset of the cortex), I still want to declare it part of the “thought”, in the sense relevant to this post, i.e. (1) those bits of the cortex are still potentially providing context signals for the amygdala, striatum, etc., and (2) those bits are still interconnected with and compatible with what’s happening elsewhere in the cortex. If that tiny subset of the cortex doesn’t directly connect to the hippocampus (which it probably doesn’t), then it won’t directly impact your episodic memory afterwards, although it still has an indirect impact via needing to be compatible with the other parts of the cortex that it connects to (i.e., if the border had been different than usual, you would have noticed something wrong).
If we think in terms of attractor dynamics (as in Hopfield nets, Boltzmann machines, etc.), then I guess your proposal in this comment corresponds to the definitions: “thought” = “stable attractor state”, and “proto-thought” = “weak disjointed activity that’s bubbling up and might (or might not) eventually develop into a new stable attractor state”.
Whereas for the purposes of this series, I’m just using the simpler “thought” = “whatever the cortex is doing”. And “whatever the cortex is doing” might be (at some moment) 95% stable attractor + 5% weak disjointed activity, or whatever.
Is there a reason why these "proto-thoughts" don't have the problem cited above, that forces "thoughts" to be sequential?
Weak disjointed activity can be hyper-local to some tiny part of the cortex, and then it might or might not impact other areas and gradually (i.e. over the course of 0.1 seconds or whatever) spread into a new stable attractor for a large fraction of the cortex, by outcompeting the stable attractor which was there before.
(I’m exaggerating a bit for clarity; the ability of some local pool of neurons to explore multiple possibilities simultaneously is more than zero, but I really don’t think it gets very far at all before there has to be a “winner”.)
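If it helps, here’s a minimal Hopfield-style sketch of the attractor picture above (purely illustrative; the patterns and sizes are made up): a small network stores two patterns (“attractors” ≈ “thoughts”), and a perturbed starting state (“weak disjointed activity”) gets pulled into whichever stored pattern wins.

```python
# Minimal Hopfield-style sketch (illustrative only): two stored patterns act as
# attractors; a noisy / partial starting state ("proto-thought") settles into
# whichever stored pattern wins the competition.

import numpy as np

rng = np.random.default_rng(0)
patterns = np.array([
    [ 1,  1,  1,  1, -1, -1, -1, -1],   # attractor A ("Thought A")
    [ 1, -1,  1, -1,  1, -1,  1, -1],   # attractor B ("Thought B")
])

# Hebbian weights with zero diagonal (the standard Hopfield learning rule)
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0)

def settle(state, steps=50):
    state = state.copy()
    for _ in range(steps):
        i = rng.integers(len(state))              # asynchronous update of one unit
        state[i] = 1 if W[i] @ state >= 0 else -1
    return state

# Start from attractor A with a couple of units flipped (weak, disjointed activity)
proto = patterns[0].copy()
proto[[1, 5]] *= -1
print(settle(proto))   # typically settles back into attractor A
```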
…fish…
No, I was trying to describe sequential thoughts. First the fish has Thought A (well-established, stable attractor, global workspace) “I’m going left to my cave”, then for maybe a quarter of a second it has Thought B (well-established, stable attractor, global workspace) “I’m going right to the reef”, then it switches back to Thought A. I was not attempting to explain why those thoughts appeared rather than other possible thoughts, rather I was emphasizing the fact that these are two different thoughts, and that Thought B got discarded because it seemed bad.
I just reworded that section, hopefully that will help future readers, thanks.
FYI, I just revised the post, mainly by adding a new §5.2.1. Hopefully that will help you and/or future readers understand what I’m getting at more easily. Thanks for the feedback (and of course I’m open to further suggestions).
If memory serves, the journal Foundations of Physics was long known as a place for people to publish wild fringe theories that would never get accepted by more mainstream physics journals.
I remember back in 2007, this was common knowledge, so it was big news that (widely respected physicist) Gerard 't Hooft was due to take over as editor-in-chief, and people in the physics department were speculating about whether he would radically change the nature of the journal. I don’t know whether that happened or not. But anyway, 1997 is before that.
I feel like you omit the possibility that the trait of motivated reasoning is like the “trait” of not-flying. You don’t need an explanation for why humans have the trait of not-flying, because not-flying is the default. Why didn’t this “trait” evolve away? Because there aren’t really any feasible genomic changes that would “get rid” of not-flying (i.e. that would make humans fly), at least not without causing other issues.
RE “evolutionarily-recent”: I guess your belief is that “lots of other mammals engaging in motivated reasoning” is not the world we live in. But is that right? I don’t see any evidence either way. How could one tell whether, say, a dog or a mouse ever engages in motivated reasoning?
My own theory (see [Valence series] 3. Valence & Beliefs) is that planning and cognition (in humans and other mammals) works by an algorithm that is generally very effective, and has gotten us very far, but which has motivated reasoning as a natural and unavoidable failure mode. Basically, the algorithm is built so as to systematically search for thoughts that seem good rather than bad. If some possibility is unpleasant, then the algorithm will naturally discover the strategy of “just don’t think about the unpleasant possibility”. That’s just what the algorithm will naturally do. There isn’t any elegant way to avoid this problem, other than evolve an entirely different algorithm for practical intelligence / planning / etc., if indeed such an alternative algorithm even exists at all.
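Here’s a toy caricature of that dynamic (not the actual model from the valence series; the thoughts and valence numbers are made up): if candidate thoughts get kept or rerolled based on valence, then a search process that is usually very useful also systematically avoids dwelling on true-but-unpleasant possibilities.

```python
# Toy caricature of the failure mode (illustrative only): thoughts are rerolled
# when their valence is negative, so the unpleasant-but-important possibility
# almost never surfaces.

import random
random.seed(0)

# (thought, valence) pairs; valences are made-up numbers for illustration
candidate_thoughts = [
    ("plan the project timeline",            +0.6),
    ("imagine the launch going well",        +0.8),
    ("consider that the deadline may slip",  -0.7),  # unpleasant but important
    ("think about lunch",                    +0.2),
]

def next_thought(candidates, reroll_threshold=0.0, max_rerolls=10):
    """Sample thoughts; reroll (discard) any whose valence falls below threshold."""
    for _ in range(max_rerolls):
        thought, valence = random.choice(candidates)
        if valence >= reroll_threshold:
            return thought
    return thought  # give up and keep the last one

# Over many "thinking steps", the unpleasant possibility essentially never comes up:
history = [next_thought(candidate_thoughts) for _ in range(1000)]
print(history.count("consider that the deadline may slip"))   # ~0
```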
Our brain has a hack-y workaround to mitigate this issue, namely the “involuntary attention” associated with anxiety, itches, etc., which constrains your thoughts so as to make you unable to put (particular types of) problems out of your mind. In parallel, culture has also developed some hack-y workarounds, like Reading The Sequences, or companies that have a red-teaming process. But none of these workarounds completely solves the issue, and/or they come along with their own bad side-effects.
Anyway, the key point is that motivated reasoning is a natural default that needs no particular explanation.
(Thanks for the thought-provoking post.)
Couple nitpicks:
If you're going to join the mind and you don't care about paper clips and it cares about paper clips, that's not going to happen. But if it can offer some kind of compelling shared value story that everybody could agree with in some sense, then we can actually get values which can snowball.
I thought the “merge” idea was that, if the super-mind cares about paperclips and you care about staples, and you have 1% of the bargaining power of the super-mind, then you merge into a super+1-mind that cares 99% about paperclips and 1% about staples. And that can be a Pareto improvement for both. Right?
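To spell out why that can be a Pareto improvement, here’s a sketch with made-up numbers (the conflict-loss fraction and bargaining-power-as-win-probability are assumptions for illustration, not anything from the dialogue): if fighting over resources destroys some of them, both parties can expect more of what they care about under the 99/1 merge than under conflict.

```python
# Made-up numbers, just to spell out the Pareto-improvement claim.

RESOURCES = 100
CONFLICT_LOSS = 0.10          # fraction of resources destroyed by fighting (assumed)
P_SUPERMIND_WINS = 0.99       # bargaining power ~ probability of winning (assumed)

# Expected outcomes under conflict:
usable = RESOURCES * (1 - CONFLICT_LOSS)
expected_paperclips_conflict = P_SUPERMIND_WINS * usable          # 89.1
expected_staples_conflict    = (1 - P_SUPERMIND_WINS) * usable    #  0.9

# Outcomes under the 99/1 merge (no fighting; resources split by bargaining power):
paperclips_merge = 0.99 * RESOURCES   # 99.0
staples_merge    = 0.01 * RESOURCES   #  1.0

print(expected_paperclips_conflict, paperclips_merge)   # 89.1 < 99.0
print(expected_staples_conflict, staples_merge)         #  0.9 <  1.0
```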
For one thing, it doesn't really care about the actual von Neumann conditions like "not being money-pumped" because it's the only mind, so there's not an equilibrium that keeps it in check.
I think “not being money-pumped” is not primarily about adversarial dynamics, where there’s literally another agent trying to trick you, but rather about the broader notion of having goals about the future, and being effective in achieving those goals. Being dutch-book-able implies sometimes making bad decisions by your own light, and a smart agent should recognize that this is happening and avoid it, in order to accomplish more of its own goals.
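For concreteness, here’s the standard textbook money-pump spelled out with toy numbers (the fee and the specific cycle are made up): an agent with cyclic preferences A > B > C > A will happily pay a little for each “upgrade” around the cycle, losing money by its own lights, with no adversary required.

```python
# The standard money-pump with toy numbers: cyclic preferences A > B > C > A
# mean the agent keeps paying to "trade up" and ends where it started, poorer.

preference_cycle = ["A", "B", "C"]     # agent prefers A to B, B to C, and C to A
FEE_PER_TRADE = 1                      # what the agent will pay to swap up (assumed)

holding, wealth = "A", 100
for step in range(6):                  # go around the cycle twice
    # the item one step "up" in the agent's (cyclic) preference ordering
    upgrade = preference_cycle[(preference_cycle.index(holding) - 1) % 3]
    holding, wealth = upgrade, wealth - FEE_PER_TRADE
    print(f"step {step}: traded up to {holding}, wealth = {wealth}")

# After six trades the agent holds A again but is 6 units poorer.
```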
(TBC there are other reasons to question the applicability of VNM rationality, including Garrabrant’s fairness thing and the assumption that the agent has pure long-term consequentialist goals in the first place.)
In the original blog post, we think a lot about slack. It says that if you have slack, you can kind of go off the optimal solution and do whatever you want. But in practice, what we see is that slack, when it occurs, produces this kind of drift. It's basically the universe fulfilling its naturally entropic nature, in that most ways to go away from the optimum are bad. If we randomly drift, we just basically tend to lose fitness and produce really strange things which are not even really what we value.
My response to this gets at what Joe Carlsmith calls Deep Atheism. I think there just is no natural force that systematically produces goodness. I agree with you that slack is not a force that systematically produces goodness. But also, I feel much more strongly than you that competition is also not a force that systematically produces goodness. No such force exists. Too bad.
So I agree with this paragraph literally, but disagree with its connotation that competition would be better than slack.
Do it! Write a new “version 2” post / post-series! It’s OK if there’s self-plagiarism. Would be time well spent.