Do you think sociopaths are sociopaths because their approval reward is very weak? And if so, why do they often still seek dominance/prestige?
Do you think sociopaths are sociopaths because their approval reward is very weak?
Basically yes (+ also sympathy reward); see Approval Reward post §4.1, including the footnote.
And if so, why do they often still seek dominance/prestige?
My current take is that prestige-seeking comes mainly from Approval Reward, and is very weak in (a certain central type of) sociopath, whereas dominance-seeking comes mainly from a different social drive that I discussed in Neuroscience of human social instincts: a sketch §7.1, but mostly haven’t thought about too much, and which may be strong in some sociopathic people (and weak in others).
I guess it’s also possible to prestige-seek not because prestige seems intrinsically desirable, but rather as a means to an end.
My default mental model of an intelligent sociopath includes something like this:
You find yourself wandering around in a universe where there's a bunch of stuff to do. There's no intrinsic meaning, and you don't care whether you help or hurt other people or society; you're just out to get some kicks and have a good time, preferably on your own terms. A lot of neat stuff has already been built, which, hey, saves you a ton of effort! But it's got other people and society in front of it. Well, that could get annoying. What do you do?
Well, if you learn which levers to pull, sometimes you can get the people to let you in ‘naturally’. Bonus if you don't have to worry as much about them coming back to inconvenience you later. And depending on what you were after, that can turn out as prestige—‘legitimately’ earned or not, whatever was easier or more fun. (Or dominance; I feel like prestige is more likely here, but that might depend on what kind of society you're in and what your relative strengths are. Also, sometimes it's much more invisible! There are selection effects in which sociopaths become well-known versus which quietly prey somewhere they won't get caught.)
Beyond that, a lot of...
Nice post. Approval reward seems like it helps explain a lot of human motivation and behavior.
I'm wondering whether approval reward would really be a safe source of motivation in an AGI though. From the post, it apparently comes from two sources in humans:
In each case, it seems the person is generating behaviors and there is an equally strong/robust reward classifier, internal or external, so it's hard to game.
The internal classifier is hard to game because we can't edit our minds.
And other people are hard to fool. For example, there are fake billionaires, but they are usually found out and then get negative approval, so it's not worth it.
But I'm wondering whether an AGI with an approval reward would modify itself to reward hack, or figure out how to fool humans in clever ways (like the RLHF robot arm), to get more approval.
Though maybe implementing an approval reward in an AI gets you most of the alignment you need and it's robust enough.
I definitely have strong concerns that Approval Reward won’t work on AGI. (But I don’t have an airtight no-go theorem either. I just don’t know; I plan to think about it more.) See especially footnote 7 of this post, and §6 of the Approval Reward post, for some of my concerns, which overlap with yours.
(I hope I wasn’t insinuating that I think AGI with Approval Reward is definitely a great plan that will solve AGI technical alignment. I’m open to wording changes if you can think of any.)
Curated! I very much like the project of finding upstream cruxes to different intuitions regarding AI alignment. Oddly, such cruxes can be invisible until someone points them out. It’s also cool how Steven’s insight here isn’t a one-off post, but flows from his larger research project and models, kind of the project paying dividends. (To clarify, in curating this I’m not saying it’s definitely correct according to me, but I find it quite plausible.)
I also appreciate that most times when I or others try to do this mechanistic modeling of human minds, it ends up very dry and others don’t want to read it even when it feels compelling to the author; somehow Steven has escaped that, by dint of writing quality or idea quality, I’m not sure.
I really liked this, and when the relevant Annual Review comes around, I expect to give it at least a 4.
A complementary angle: we shouldn't be arguing over whether or not we're in for a rough ride, we should be figuring out how to not have that.
I suspect more people would be willing to (both empirically and theoretically) get behind 'ruthless consequentialist maximisers are one extreme of a spectrum which gets increasingly scary and dangerous; it would be bad if those got unleashed'.
Sure, skeptics can still argue that this just won't happen even if we sit back and relax. But I think then it's clearer that they're probably making a mistake (since origin stories for ruthless consequentialist maximisers are many and disjunctive). So the debate becomes 'which sources of supercompetent ruthless consequentialist maximisers are most likely and what options exist to curtail that?'.
I appreciate this post for working to distill a key crux in the larger debate.
Some quick thoughts:
1. I'm having a hard time understanding the "Alas, the power-seeking ruthless consequentialist AIs are still coming” intuition. It seems like a lot of people in this community have this intuition, and I feel very curious why. I appreciate this crux getting attention.
2. Personally, my stance is something more like, "It seems very feasible to create sophisticated AI architectures that don't act as scary maximizers." To me it seems like this is what we're d...
Personally, my stance is something more like, "It seems very feasible to create sophisticated AI architectures that don't act as scary maximizers." To me it seems like this is what we're doing now, and I see some strong reasons to expect this to continue. (I realize this isn't guaranteed, but I do think it's pretty likely)
We probably mostly disagree because you’re expecting LLMs forever and I’m not. For example, AlphaZero does act as a scary maximizer. Indeed, nobody knows any way to make an AI that’s superhuman at Go, except by techniques that produce scary maximizers. Is there a way to make an AI that’s superhuman at founding and running innovative companies, but isn’t a scary maximizer? That’s beyond present AI capabilities, so the jury is still out.
The issue is basically “where do you get your capabilities from?” One place to get capabilities is by imitating humans. That’s the LLM route, but (I claim) it can’t go far beyond the hull of existing human knowledge. Another place to get capabilities is specific human design (e.g. the heuristics that humans put into Deep Blue), but that has the same limitation. That leaves consequentialism as a third source of capabilities, and it de...
Do you think AI-empowered people / companies / governments also won't become more like scary maximizers? Not even if they can choose how to use the AI and how to train it? This seems a super strong statement and I don't know any reasons to believe it at all.
What's your take on why Approval Reward was selected for in the first place VS sociopathy?
I find myself wondering if non-behavioral reward functions are more powerful in general than behavioral ones, due to less tendency towards wireheading, etc. (consider the laziness & impulsivity of sociopaths). Especially ones such as Approval Reward, which can be "customized" depending on the details of the environment and what sort of agent it would be most useful to become.
In general, humans also tend to be satisficers/prediction-error-minimizers rather than utility maximizers. When a human behaves like a utility maximizer, we tend to regard it as addiction or other dysfunctional behavior. So we don't so much have "utility" as a collection of dimensions of satisfiable appetites, whose priorities depend on how strong the appetite is (i.e. how long since last fulfilled).
On top of that, some research (see Ainslie's Breakdown of Will) suggests that our appetites are conditioned on differential availability of opportunities to p...
There's a scientific field that studies the origins of human motivations: Evolutionary Psychology, and its subfield Evolutionary Moral Psychology. That clearly predicts Approval Reward: if you're part of a hunter-gatherer band, sooner or later you will need the help of other members (because their hunt succeeded today and yours failed, and you need them to lend you some food against the time the opposite will happen, or because you're better at making moccasins and they're better at chipping handaxes and you want to trade with them). Gaining their approva...
insofar as human capabilities are not largely explained by consequentialist planning (as suggested by the approval reward picture), this should make us more optimistic about human-level AGI alignment.
further, this picture might suggest that the cheapest way to human-level AGI might route through approval reward-like mechanisms, giving us a large negative alignment tax.
ofc you might think getting approval reward to work is actually a very narrow target, and even if early human-level AGIs aren't coherent consequentialists, they will use som...
The human intuition that treating other humans as a resource to be callously manipulated and exploited, just like a car engine or any other complex mechanism in their environment, is a weird anomaly rather than the obvious default
e.g. as is the typical human response to people who are far away (both physically and conceptually, such that their approval isn't salient or anticipated), i.e. 'the outgroup'
Even if AGI has Approval Rewards (i.e., from LLMs or somehow in RL/agentic scenarios), Approval Rewards only work if the agent actually values the approver's approval. Maybe sometimes that valuation is more or less explicit, but there needs to be some kind of belief that the approval is important, and therefore behaviors should align with approval reward-seeking / disapproval minimizing outcomes.
As a toy analogy: many animals have preferences about food, territory, mates, etc., but humans don't really treat those signals as serious guides to our behaviors....
I found this post very unintuitive, and I’m not sure Approval Reward is a precisely bounded concept. If it can be used to explain “saving money to buy a car” then it can really be stretched to explain a wide range of human actions that IMO can be better explained by drives other than social approval. Most importantly (and this is likely a skill issue on my part) it’s unclear to me how to operationalize what would be predicted by an Approval Reward driven framework vs some alternative.
what exactly is it about human brains that allows them to not always act like power-seeking ruthless consequentialists?
I propose that this question is flawed, because humans actually do act like power-seeking ruthless consequentialists, and to the extent it seems like they don't, that's because of an overly naive view of what effective power-seeking looks like.
I feel like a lot of these discussions are essentially about "if an entity were a power-seeking ruthless consequentialist, then it'd act like a Nazi", to which I observe that in fact humans...
The notion of "Approval Reward" reminded me of Adam Smith's The Theory of Moral Sentiments (https://en.wikipedia.org/wiki/The_Theory_of_Moral_Sentiments), where he says something like "we aren't motivated to be praised, we are motivated to be praiseworthy", and that this roundabout moral motivation depends on us (sociopaths excluded) generating a kind of inner impartial spectator.
Dipping into this neglected book by Adam Smith may help alignment! Thanks
This is a wonderful piece and it's so great to hear from somebody so deeply knowledgeable in the field. I wonder if approval reward might be an emergent property. When you're operating under radical uncertainty with more variables than you can possibly model, defaulting to "what would my community approve of" is a computationally efficient heuristic. It's not some holy pro-social module evolution baked into us; it's a deeply rational response to chaos. Even if I tip when no one is watching at a restaurant I will never return to, psychologically I know I am...
I think the reason humans care about other people's interests, and aren't power-seeking ruthless consequentialists, is because of evolution.
Evolutionary "group selection" meant each human cared about her tribe's survival a tiny bit: not enough to make sacrifices herself, but enough to reward/punish other humans to make sacrifices for the tribe (which was far more cost effective).
Evolution thus optimized our ability to evaluate other people's behaviour by how beneficial to the tribe (virtuous) or beneficial to themselves (evil) they were. Evolution also opt...
By contrast, in humans, self-reflective (meta)preferences mostly (though not exclusively) come from Approval Reward. By and large, our “true”, endorsed, ego-syntonic desires are approximately whatever kinds of desires would impress our friends and idols
Now that you said it, I have a strong urge to cut it out.
I guess you can frame it as "wanting to impress yourself by placing yourself in the place of an idol" or "the people who set the trends are cool, and everybody is impressed by them, but to do that you need to defy existing trend setters" or somet...
Responding to just the tl;dr, but will try to read the whole thing, apologies as usual for, well...
If your fixation remains solely on architecture, and you don't consider the fact that morality-shaped-stuff keeps evolving in mammals because the environment selects for it in some way, you are just setting yourself up for future problems when the superintelligent AI develops or cheats its way to whatever form of compartmentalization or metacognition lets it do the allegedly pure rational thing of murdering all other forms of intelligence. I literally d...
Humans reproduce sexually, and only sexually at present, and require a large number of friendly support personnel that they cannot afford to simply "pay". This produces the behavior you notice, when combined with the requirements of cognitive evolution. You cannot reproduce sexually if there is not a pool of people to reproduce with.
All species that became intelligent (Acorn Woodpeckers, Dolphins) developed some type of cooperative mating, not simple dominance-based mating. There is no advantage to intelligence without such cooperative n...
what exactly is it about human brains[1] that allows them to not always act like power-seeking ruthless consequentialists?
why focus only on the brains? it's a property of the mind, and I thought the standard take on why humans are not even approximating utility maximizers is that it's because of properties of the environment (priors), not a hardcoded function/software/architecture/wetware in the brain... or?
I think most people have positive views about some/most humans (and consequently about alignment) because they are implicitly factoring in their mortality. Would you feel safe picking a human that you thought was good and giving them a pill that gave them superintelligence? Maybe. Would you feel safe giving that same person a pill that made them both superintelligent AND immortal? I know I wouldn't trust me with that. An AGI/SGI would be potentially immortal and would know it. For that reason alone I would never trust it no matter how well I thought it seemed aligned in the short term (and compared to an immortal, any human timescale is short term).
consequentialist
You are talking about the unaligned AI that has good intentions for humanity? What about the self-serving paperclip maximizer? Isn't that a fairly large group too?
Thought-provoking article! But it likely confuses intrinsic "value" with "social reward" as in the current definition of "Approval Reward". The intrinsic "value function" that agents operate under is likely much more complicated than "social approval", which definitely plays an important role for humans, as we evolved to be social creatures.
e.g.,
"So saving the money is not doing an unpleasant thing now for a benefit later. Rather, the pleasant feeling starts immediately, thanks to (usually) Approval Reward."
Yes the pleasant feeling starts immedi...
This is not what normally happens with RL reward functions! For example, you might be wondering: “Suppose I surreptitiously[2] press a reward button when I notice my robot following rules. Wouldn’t that likewise lead to my robot having a proud, self-reflective, ego-syntonic sense that rule-following is good?” I claim the answer is: no, it would lead to something more like an object-level “desire to be noticed following the rules”, with a sociopathic, deceptive, ruthless undercurrent.[3]
I don't think we have considered how much increased self-awar...
will future powerful AGI / ASI “by default” lack Approval Reward altogether?
I'd say that pessimists are similar to LLM optimists in their conviction that it would be pretty easy to match and then greatly surpass general human intelligence, trusting their own intuitions far too much. Of course, once that assumption is made, everything else straightforwardly follows.
Tl;dr
AI alignment has a culture clash. On one side, the “technical-alignment-is-hard” / “rational agents” school-of-thought argues that we should expect future powerful AIs to be power-seeking ruthless consequentialists. On the other side, people observe that both humans and LLMs are obviously capable of behaving like, well, not that. The latter group accuses the former of head-in-the-clouds abstract theorizing gone off the rails, while the former accuses the latter of mindlessly assuming that the future will always be the same as the present, rather than trying to understand things. “Alas, the power-seeking ruthless consequentialist AIs are still coming,” sigh the former. “Just you wait.”
As it happens, I’m basically in that “alas, just you wait” camp, expecting ruthless future AIs. But my camp faces a real question: what exactly is it about human brains[1] that allows them to not always act like power-seeking ruthless consequentialists? I find existing explanations in the discourse—e.g. “ah but humans just aren’t smart and reflective enough”, or evolved modularity, or shard theory, etc.—to be wrong, handwavy, or otherwise unsatisfying.
So in this post, I offer my own explanation of why “agent foundations” toy models fail to describe humans, centering around a particular non-“behaviorist” part of the RL reward function in human brains that I call Approval Reward, which plays an outsized role in human sociality, morality, and self-image. And then the alignment culture clash above amounts to the two camps having opposite predictions about whether future powerful AIs will have something like Approval Reward (like humans, and today’s LLMs), or not (like utility-maximizers).
(You can read this post as pushing back against pessimists, by offering a hopeful exploration of a possible future path around technical blockers to alignment. Or you can read this post as pushing back against optimists, by “explaining away” the otherwise-reassuring observation that humans and LLMs don't act like psychos 100% of the time.)
Finally, with that background, I’ll go through six more specific areas where “alignment-is-hard” researchers (like me) make claims about what’s “natural” for future AI, that seem quite bizarre from the perspective of human intuitions, and conversely where human intuitions are quite bizarre from the perspective of agent foundations toy models. All these examples, I argue, revolve around Approval Reward. They are:
0. Background
0.1 Human social instincts and “Approval Reward”
As I discussed in Neuroscience of human social instincts: a sketch (2024), we should view the brain as having a reinforcement learning (RL) reward function, which says that pain is bad, eating-when-hungry is good, and dozens of other things (sometimes called “innate drives” or “primary rewards”). I argued that part of the reward function was a thing I called the “compassion / spite circuit”, centered around a small number of (hypothesized) cell groups in the hypothalamus, and I sketched some of its effects.
Then last month in Social drives 1: “Sympathy Reward”, from compassion to dehumanization and Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking, I dove into the effects of this “compassion / spite circuit” more systematically.
And now in this post, I’ll elaborate on the connections between “Approval Reward” and AI technical alignment.
“Approval Reward” fires most strongly in situations where I’m interacting with another person (call her Zoe), and I’m paying attention to Zoe, and Zoe is also paying attention to me. If Zoe seems to be feeling good, that makes me feel good, and if Zoe is feeling bad, that makes me feel bad. Thanks to these brain reward signals, I want Zoe to like me, and to like what I’m doing. And then Approval Reward generalizes from those situations to other similar ones, including where Zoe is not physically present, but I imagine what she would think of me. It sends positive or negative reward signals in those cases too.
As I argue in Social drives 2, this “Approval Reward” leads to a wide array of effects, including credit-seeking, blame-avoidance, and status-seeking. It also leads not only to picking up and following social norms, but also to taking pride in following those norms, even when nobody is watching, and to shunning and punishing those who violate them.
This is not what normally happens with RL reward functions! For example, you might be wondering: “Suppose I surreptitiously[2] press a reward button when I notice my robot following rules. Wouldn’t that likewise lead to my robot having a proud, self-reflective, ego-syntonic sense that rule-following is good?” I claim the answer is: no, it would lead to something more like an object-level “desire to be noticed following the rules”, with a sociopathic, deceptive, ruthless undercurrent.[3]
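To make the contrast concrete, here is a minimal toy sketch in Python (my own illustration with hypothetical names, not a claim about how any brain or actual RL system implements this). The difference is simply what the reward function is allowed to look at: external behavior only, versus the agent’s own thoughts as judged by an internal model of an approving observer.

```python
# Toy sketch (hypothetical): a "behaviorist" reward vs. an Approval-Reward-like
# reward differ in what the reward function is allowed to depend on.
from dataclasses import dataclass

@dataclass
class AgentState:
    observed_behavior: str   # what an outside observer (or reward button presser) sees
    current_thought: str     # the agent's internal plan / intention

def behaviorist_reward(state: AgentState) -> float:
    """Depends only on externally-visible behavior: rewards *being seen*
    following the rules, however that appearance is produced."""
    return 1.0 if state.observed_behavior == "follows_rules" else 0.0

def approval_like_reward(state: AgentState, imagined_judge) -> float:
    """Also depends on the agent's own thoughts, via an internal model of
    whether an imagined judge would approve of them."""
    return 1.0 if imagined_judge(state.current_thought) else -1.0

# A judge model that disapproves of deceptive intentions:
judge = lambda thought: "deceive" not in thought

sneaky = AgentState("follows_rules", "deceive the overseer later")
print(behaviorist_reward(sneaky))           # 1.0  -- looks fine from the outside
print(approval_like_reward(sneaky, judge))  # -1.0 -- the thought itself is penalized
```

On this toy picture, the surreptitious button-pressing setup above can only ever implement something like the first function (it has no access to the robot’s thoughts), which is one way to see why it would train a “desire to be noticed following the rules” rather than genuine pride in rule-following.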
I argue in Social drives 2 that Approval Reward is overwhelmingly important to most people’s lives and psyches, probably triggering reward signals thousands of times a day, including when nobody is around but you’re still thinking thoughts and taking actions that your friends and idols would approve of.
Approval Reward is so central to and ubiquitous in (almost) everyone’s world that it’s difficult and unintuitive to imagine its absence—we’re much like the proverbial fish who puzzles over what this alleged thing called “water” is.
…Meanwhile, a major school of thought in AI alignment implicitly assumes that future powerful AGIs / ASIs will almost definitely lack Approval Reward altogether, and therefore AGIs / ASIs will behave in ways that seem (to normal people) quite bizarre, unintuitive, and psychopathic.
The differing implicit assumption about whether Approval Reward will be present versus absent in AGI / ASI is (I claim) upstream of many central optimist-pessimist disagreements on how hard technical AGI alignment will be. My goal in this post is to clarify the nature of this disagreement, via six example intuitions that seem natural to humans but are rejected by “alignment-is-hard” alignment researchers. All these examples centrally involve Approval Reward.
0.2 Hang on, will future powerful AGI / ASI “by default” lack Approval Reward altogether?
This post is mainly making a narrow point that the proposition “alignment is hard” is closely connected to the proposition “AGI will lack Approval Reward”. But an obvious follow-up question is: are both of these propositions true? Or are they both false?
Here’s how I see things, in brief, broken down into three cases:
If AGI / ASI will be based on LLMs: Humans have Approval Reward (arguably apart from some sociopaths etc.). And LLMs are substantially sculpted by human imitation (see my post Foom & Doom §2.3). Thus, unsurprisingly, LLMs also display behaviors typical of Approval Reward, at least to some extent. Many people see this as a reason for hope that technical alignment might be solvable. But then the alignment-is-hard people have various counterarguments, to the effect that these Approval-Reward-ish LLM behaviors are fake, and/or brittle, and/or unstable, and that they will definitely break down as LLMs get more powerful. The cautious-optimists generally find those pessimistic arguments confusing (example).
Who’s right? Beats me. It’s out-of-scope for this post, and anyway I personally feel unable to participate in that debate because I don’t expect LLMs to scale to AGI in the first place.[4]
If AGI / ASI will be based on RL agents (or similar), as expected by David Silver & Rich Sutton, Yann LeCun, and myself (“brain-like AGI”), among others, then the answer is clear: There will be no Approval Reward at all, unless the programmers explicitly put it into the reward function source code. And will they do that? We might (or might not) hope that they do, but it should definitely not be our “default” expectation, the way things are looking today. For example, we don’t even know how to do that, and it’s quite different from anything in the literature. (RL agents in the literature almost universally have “behaviorist” reward functions.) We haven’t even pinned down all the details of how Approval Reward works in humans. And even if we do, there will be technical challenges to making it work similarly in AIs—which, for example, do not grow up with a human body at human speed in a human society. And even if it were technically possible, and a good idea, to put in Approval Reward, there are competitiveness issues and other barriers to it actually happening. More on all this in future posts.
If AGI / ASI will wind up like “rational agents”, “utility maximizers”, or related: Here the situation seems even clearer: as far as I can tell, under common assumptions, it’s not even possible to fit Approval Reward into these kinds of frameworks, such that it would lead to the effects that we expect from human experience. No wonder human intuitions and “agent foundations” people tend to talk past each other!
0.3 Where do self-reflective (meta)preferences come from?
This idea will come up over and over as we proceed, so I’ll address it up front:
In the context of utility-maximizers etc., the starting point is generally that desires are associated with object-level things (whether due to the reward signals or the utility function). And from there, the meta-preferences will naturally line up with the object-level preferences.
After all, consider: what’s the main effect of ‘me wanting X’? It’s ‘me getting X’. So if getting X is good, then ‘me wanting X’ is also good. Thus, means-end reasoning (or anything functionally equivalent, e.g. RL backchaining) will echo object-level desires into corresponding self-reflective meta-level desires. And this is the only place that those meta-level desires come from.
By contrast, in humans, self-reflective (meta)preferences mostly (though not exclusively) come from Approval Reward. By and large, our “true”, endorsed, ego-syntonic desires are approximately whatever kinds of desires would impress our friends and idols (see previous post §3.1).
Box: More detailed argument about where self-reflective preferences come from
The actual effects of “me wanting X” are
Any of these three pathways can lead to a meta-preference wherein “me wanting X” seems good or bad. And my claim is that (2B) is how Approval Reward works (see previous post §3.2), while (1) is what I’m calling the “default” case in “alignment-is-hard” thinking.
(What about (2A)? That’s another funny “non-default” case. Like Approval Reward, this might circumvent many “alignment-is-hard” arguments, at least in principle. But it has its own issues. Anyway, I’ll be putting the (2A) possibility aside for this post.)
(Actually, human Approval Reward in practice probably involves a dash of (2A) on top of the (2B)—most people are imperfect at hiding their true intentions from others.)
…OK, finally, let’s jump into those “6 reasons” that I promised in the title!
1. The human intuition that it’s normal and good for one’s goals & values to change over the years
In human experience, it is totally normal and good for desires to change over time. Not always, but often. Hence emotive conjugations like
…And so on. Anyway, openness-to-change, in the right context, is great. Indeed, even our meta-preferences concerning desire-changes are themselves subject to change, and we’re generally OK with that too.[5]
Whereas if you’re thinking about an AI agent with foresight, planning, and situational awareness (whether it’s a utility maximizer, or a model-based RL agent[6], etc.), this kind of preference is a weird anomaly, not a normal expectation. The default instead is instrumental convergence: if I want to cure cancer, then I (incidentally) want to continue wanting to cure cancer until it’s cured.
Why the difference? Well, it comes right from that diagram in §0.3 just above. For Approval-Reward-free AGIs (which I see as “default”), their self-reflective (meta)desires are subservient to their object-level desires.
Goal-preservation follows: if the AGI wants object-level-thing X to happen next week, then it wants to want X right now, and it wants to still want X tomorrow.
By contrast, in humans, self-reflective preferences mostly come from Approval Reward. By and large, our “true”, endorsed desires are approximately whatever kinds of desires would impress our friends and idols, if they could read our minds. (They can’t actually read our minds—but our own reward function can!)
This pathway does not generate any particular force for desire preservation.[7] If our friends and idols would be impressed by desires that change over time, then that’s generally what we want for ourselves as well.
2. The human intuition that ego-syntonic “desires” come from a fundamentally different place than “urges”
In human experience, it is totally normal and expected to want X (e.g. candy), but not want to want X. Likewise, it is totally normal and expected to dislike X (e.g. homework), but want to like it.
And moreover, we have a deep intuitive sense that the self-reflective meta-level ego-syntonic “desires” are coming from a fundamentally different place than the object-level “urges” like eating-when-hungry. For example, in a recent conversation, a high-level AI safety funder confidently told me that urges come from human nature while desires come from “reason”. Similarly, Jeff Hawkins dismisses AGI extinction risk partly on the (incorrect) grounds that urges come from the brainstem while desires come from the neocortex (see my Intro Series §3.6 for why he’s wrong and incoherent on this point).
In a very narrow sense, there’s actually a kernel of truth to the idea that, in humans, urges and desires come from different sources. As in Social Drives 2 and §0.3 above, one part of the RL reward function is Approval Reward, and is the primary (though not exclusive) source of ego-syntonic desires. Everything else in the reward function mostly gives rise to urges.
But this whole way of thinking is bizarre and inapplicable from the perspective of Approval-Reward-free AI futures—utility maximizers, “default” RL systems, etc. There, as above, the starting point is object-level desires; self-reflective desires arise only incidentally.
A related issue is how we think about AGI reflecting on its own desires. How this goes depends strongly on the presence or absence of (something like) Approval Reward.
Start with the former. Humans often have conflicts between ego-syntonic self-reflective desires and ego-dystonic object-level urges, and reflection allows the desires to scheme against the urges, potentially resulting in large behavior changes. If AGI has Approval Reward (or similar), we should expect AGI to undergo those same large changes upon reflection. Or perhaps even larger—after all, AGIs will generally have more affordances for self-modification than humans do.
By contrast, I happen to expect AGIs, by default (in the absence of Approval Reward or similar), to mainly have object-level, non-self-reflective desires. For such AGIs, I don’t expect self-reflection to lead to much desire change. Really, it shouldn’t lead to any change more interesting than pursuing its existing desires more effectively.
(Of course, such an AGI may feel torn between conflicting object-level desires, but I don’t think that leads to the kinds of internal battles that we’re used to from humans.[8])
(To be clear, reflection in Approval-Reward-free AGIs might still have “complications” of other sorts, such as ontological crises.)
3. The human intuition that helpfulness, deference, and corrigibility are natural
This human intuition comes straight from Approval Reward, which is absolutely central in human intuitions, and leads to us caring about whether others would approve of our actions (even if they’re not watching), taking pride in our virtues, and various other things that distinguish neurotypical people from sociopaths.
As an example, here’s Paul Christiano: “I think that normal people [would say]: ‘If we are trying to help some creatures, but those creatures really dislike the proposed way we are “helping” them, then we should try a different tactic for helping them.’”
He’s right: normal people would definitely say that, and our human Approval Reward is why we would say that. And if AGI likewise has Approval Reward (or something like it), then the AGI would presumably share that intuition.
On the other hand, if Approval Reward is not part of AGI / ASI, then we’re instead in the “corrigibility is anti-natural” school of thought in AI alignment. As an example of that school of thought, see Why Corrigibility is Hard and Important.
4. The human intuition that unorthodox consequentialist planning is rare and sus
Obviously, humans can make long-term plans to accomplish distant goals—for example, an 18-year-old could plan to become a doctor in 15 years, and immediately move this plan forward via sensible consequentialist actions, like taking a chemistry class.
How does that work in the 18yo’s brain? Obviously not via anything like RL techniques that we know and love in AI today—for example, it does not work by episodic RL with an absurdly-close-to-unity discount factor that allows for 15-year time horizons. Indeed, the discount factor / time horizon is clearly irrelevant here! This 18yo has never become a doctor before!
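To put a rough number on “absurdly-close-to-unity”, here is a quick back-of-the-envelope calculation (the one-decision-step-per-second assumption is my own toy choice, purely for illustration):

```python
# Back-of-the-envelope: what discount factor would a naive episodic-RL story
# need for a 15-year horizon? (Assumes ~1 decision step per second, a toy
# assumption for illustration only.)
import math

seconds_per_year = 365.25 * 24 * 3600
horizon_steps = 15 * seconds_per_year            # ~4.7e8 steps

# gamma such that a reward 15 years away still keeps 1% of its value today:
# gamma ** horizon_steps = 0.01
gamma = math.exp(math.log(0.01) / horizon_steps)

print(f"steps to horizon: {horizon_steps:.2e}")  # ~4.73e+08
print(f"required gamma:   {gamma:.10f}")         # ~0.9999999903, i.e. 1 minus ~1e-8
```

And of course, the point of the paragraph above is that nothing like this calculation is happening in the 18yo’s head in the first place.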
Instead, there has to be something motivating the 18yo right now to take appropriate actions towards becoming a doctor. And in practice, I claim that that “something” is almost always an immediate Approval Reward signal.
Here’s another example. Consider someone saving money today to buy a car in three months. You might think that they’re doing something unpleasant now, for a reward later. But I claim that that’s unlikely. Granted, saving the money has immediately-unpleasant aspects! But saving the money also has even stronger immediately-pleasant aspects—namely, that the person feels pride in what they’re doing. They’re probably telling their friends periodically about this great plan they’re working on, and the progress they’ve made. Or if not, they’re probably at least imagining doing so.
So saving the money is not doing an unpleasant thing now for a benefit later. Rather, the pleasant feeling starts immediately, thanks to (usually) Approval Reward.
Moreover, everyone has gotten very used to this fact about human nature. Thus, doing the first step of a long-term plan, without Approval Reward for that first step, is so rare that people generally regard it as highly suspicious. They generally assume that there must be an Approval Reward. And if they can’t figure out what it is, then there’s something important about the situation that you’re not telling them. …Or maybe they’ll assume that you’re a Machiavellian sociopath.
As an example, I like to bring up Earning To Give (EtG) in Effective Altruism, the idea of getting a higher-paying job in order to earn money and give it to charity. If you tell a normal non-nerdy person about EtG, they’ll generally assume that it’s an obvious lie, and that the person actually wants the higher-paying job for its perks and status. That’s how weird it is—it doesn’t even cross most people’s minds that someone is actually doing a socially-frowned-upon plan because of its expected long-term consequences, unless the person is a psycho. …Well, that’s less true now than a decade ago; EtG has become more common, probably because (you guessed it) there’s now a community in which EtG is socially admirable.
Related: there’s a fiction trope that basically only villains are allowed to make out-of-the-box plans and display intelligence. The normal way to write a hero in a work of fiction is to have conflicts between doing things that have strong immediate social approval, versus doing things for other reasons (e.g. fear, hunger, logic(!)), and to have the former win out over the latter in the mind of the hero. And then the hero pursues the immediate-social-approval option with such gusto that everyone lives happily ever after.[9]
That’s all in the human world. Meanwhile in AI, the alignment-is-hard thinkers like me generally expect that future powerful AIs will lack Approval Reward, or anything like it. Instead, they generally assume that the agent will have preferences about the future, and make decisions so as to bring about those preferences, not just as a tie-breaker on the margin, but as the main event. Hence instrumental convergence. I think this is exactly the right assumption (in the absence of a specific designed mechanism to prevent that), but I think people react with disbelief when we start describing how these AI agents behave, since it’s so different from humans.
…Well, different from most humans. Sociopaths can be a bit more like that (in certain ways). Ditto people who are unusually “agentic”. And by the way, how do you help a person become “agentic”? You guessed it: a key ingredient is calling out “being agentic” as a meta-level behavioral pattern, and indicating to this person that following this meta-level pattern will get social approval! (Or at least, that it won’t get social disapproval.)
5. The human intuition that societal norms and institutions are mostly stably self-enforcing
5.1 Detour into “Security-Mindset Institution Design”
There’s an attitude, common in the crypto world, that we might call “Security-Mindset Institution Design”. You assume that every surface is an attack surface. You assume that everyone is a potential thief and traitor. You assume that any group of people might be colluding against any other group of people. And so on.
It is extremely hard to get anything at all done in “Security-Mindset Institution Design”, especially when you need to interface with the real-world, with all its rich complexities that cannot be bounded by cryptographic protocols and decentralized verification. For example, crypto Decentralized Autonomous Organizations (DAOs) don’t seem to have done much of note in their decade of existence, apart from on-chain projects, and occasionally getting catastrophically hacked. Polymarket has a nice on-chain system, right up until the moment that a prediction market needs to resolve, and even this tiny bit of contact with the real world seems to be a problematic source of vulnerabilities.
If you extend this “Security Mindset Institution Design” attitude to an actual fully-real-world government and economy, it would be beyond hopeless. Oh, you have an alarm system in your house? Why do you trust that the alarm system company, or its installer, is not out to get you? Oh, the company has a good reputation? According to who? And how do you know they’re not in cahoots too?
…That’s just one tiny microcosm of a universal issue. Who has physical access to weapons? Why don’t those people collude to set their own taxes to zero and to raise everyone else’s? Who sets government policy, and what if those people collude against everyone else? Or even if they don’t collude, are they vulnerable to blackmail? Who counts the votes, and will they join together and start soliciting bribes? Who coded the website to collect taxes, and why do we trust them not to steal tons of money and run off to Dubai?
…OK, you get the idea. That’s the “Security Mindset Institution Design” perspective.
5.2 The load-bearing ingredient in human society is not Security-Mindset Institution Design, but rather good-enough institutions plus almost-universal human innate Approval Reward
Meanwhile, ordinary readers[10] might be shaking their heads and saying:
“Man, what kind of strange alien world is being described in that subsection above? High-trust societies with robust functional institutions are obviously possible! I live in one!”
The wrong answer is: “Security Mindset Institution Design is insanely overkill; rather, using checks and balances to make institutions stable against defectors is in fact a very solvable problem in the real world.”
Why is that the wrong answer? Well for one thing, if you look around the real world, even well-functioning institutions are obviously not robust against competent self-interested sociopaths willing to burn the commons for their own interests. For example, I happen to have a high-functioning sociopath ex-boss from long ago. Where is he now? Head of research at a major USA research university, and occasional government appointee wielding immense power. Or just look at how Donald Trump has been systematically working to undermine any aspect of society or government that might oppose his whims or correct his lies.[11]
For another thing, abundant “nation-building” experience shows that you cannot simply bestow a “good” government constitution onto a deeply corrupt and low-trust society, and expect the society to instantly transform into Switzerland. Institutions and laws are not enough. There’s also an arduous and fraught process of getting to the right social norms. Which brings us to:
The right answer is, you guessed it, human Approval Reward, a consequence of which is that almost all humans are intrinsically motivated to follow and enforce social norms. The word “intrinsically” is important here. I’m not talking about transactionally following norms when the selfish benefit outweighs the selfish cost, while constantly energetically searching for norm-violating strategies that might change that calculus. Rather, people take pride in following the norms, and in punishing those who violate them.
Obviously, any possible system of norms and institutions will be vastly easier to stabilize when, no matter what the norm is, you can get up to ≈99% of the population proudly adopting it, and then spending their own resources to root out, punish, and shame the 1% of people who undermine it.
In a world like that, it is hard but doable to get into a stable situation where 99% of cops aren’t corrupt, and 99% of judges aren’t corrupt, and 99% of people in the military with physical access to weapons aren’t corrupt, and 99% of IRS agents aren’t corrupt, etc. The last 1% will still create problems, but the other 99% have a fighting chance to keep things under control. Bad apples can be discovered and tossed out. Chains of trust can percolate.
5.3 Upshot
Something like 99% of humans are intrinsically motivated to follow and enforce norms, with the rest being sociopaths and similar. What about future AGIs? As discussed in §0.2, my own expectation is that 0% of them will be intrinsically motivated to follow and enforce norms. When those sociopathic AGIs grow in number and power, it takes us from the familiar world of §5.2 to the paranoid insanity world of §5.1.
In that world, we really shouldn’t be using the word “norm” at all—it’s just misleading baggage. We should be talking about rules that are stably self-enforcing against defectors, where the “defectors” are of course allowed to include those who are supposed to be doing the enforcement, and where the “defectors” might also include broad coalitions coordinating to jump into a new equilibrium that Pareto-benefits them all. We do not have such self-enforcing rules today. Not even close. And we never have. And inventing such rules is a pipe dream.[12]
The flip side, of course, is that if we figure out how to ensure that almost all AGIs are intrinsically motivated to follow and enforce norms, then it’s the pessimists who are invoking a misleading mental image if they lean on §5.1 intuitions.
6. The human intuition that treating other humans as a resource to be callously manipulated and exploited, just like a car engine or any other complex mechanism in their environment, is a weird anomaly rather than the obvious default
Click over to Foom & Doom §2.3.4—“The naturalness of egregious scheming: some intuitions” to read this part.
7. Conclusion
(Homework: can you think of more examples?)
I want to reiterate that my main point in this post is not
but rather
For my part, I’m obviously very interested in the question of whether we can and should put Approval Reward (and Sympathy Reward) into Brain-Like AGI, and what might go right and wrong if we do so. More on that in (hopefully) upcoming posts!
Thanks Seth Herd, Linda Linsefors, Charlie Steiner, Simon Skade, Jeremy Gillen, and Justis Mills for critical comments on earlier drafts.
…and by extension today’s LLMs, which (I claim) get their powers mainly from imitating humans.
I said “surreptitiously” here because if you ostentatiously press a reward button, in a way that the robot can see, then the robot would presumably wind up wanting the reward button to be pressed, which eventually leads to the robot grabbing the reward button etc. See Reward button alignment.
See Perils of under- vs over-sculpting AGI desires, especially §7.2, for why the “nice” desire would not even be temporarily learned, and if it were it would be promptly unlearned; and see “Behaviorist” RL reward functions lead to scheming for some related intuitions; and see §3.2 of the Approval Reward post for why those don’t apply to (non-behaviorist) Approval Reward.
My own take, which I won’t defend here, is that this whole debate is cursed, and both sides are confused, because LLMs cannot scale to AGI. I think the AGI concerns really are unsolved, and I think that LLM techniques really are potentially-safe, but they are potentially-safe for the very reason that they won’t lead to AGI. I think “LLM AGI” is an incoherent contradiction, like “square circle”, and one side of the debate has a mental image of “square thing (but I guess it’s somehow also a circle)”, and the other side of the debate has a mental image of “circle (but I guess it’s somehow also square)”, so no wonder they talk past each other. So that’s how things seem to me right now. Maybe I’m wrong!! But anyway, that’s why I feel unable to take a side in this particular debate. I’ll leave it to others. See also: Foom & Doom §2.9.1.
…as long as the meta-preferences-about-desire-changes are changing in a way that seems good according to those same meta-preferences themselves—growth good, brainwashing bad, etc.
Possible objection: “If the RL agent has lots of past experience of its reward function periodically changing, won’t it learn that this is good?” My answer: No. At least for the kind of model-based RL agent that I generally think about, the reward function creates desires, and the desires guide plans and actions. So at any given time, there are still desires, and if these desires concern the state of the world in the future, then the instrumental convergence argument for goal-preservation goes through as usual. I see no process by which past history of reward function changes should make an agent OK with further reward function changes going forward.
(But note that the instrumental convergence argument makes model-based RL agents want to preserve their current desires, not their current reward function. For example, if an agent has a wireheading desire to get reward, it will want to self-modify to preserve this desire while changing the reward function to “return +∞”.)
…At least to a first approximation. Here are some technicalities: (1) Other pathways also exist, and can generate a force for desire preservation. (2) There’s also a loopy thing where Approval Reward influences self-reflective desires, which in turn influence Approval Reward, e.g. by changing who you admire. (See Approval Reward post §5–§6.) This can (mildly) lock in desires. (3) Even Approval Reward itself leads not only to “proud feeling about what I’m up to right now” (Approval Reward post §3.2), which does not particularly induce desire-preservation, but also to “desire to actually interact with and impress a real live human sometime in the future”, which is on the left side of that figure in §0.3, and which (being consequentialist) does induce desire-preservation and the other instrumental convergence stuff.
If an Approval-Reward-free AGI wants X and wants Y, then it could get more X by no longer wanting Y, and it could get more Y by no longer wanting X. So there’s a possibility that AGI reflection could lead to “total victory” where one desire erases another. But I (tentatively) think that’s unlikely, and that the more likely outcome is that the AGI would continue to want both X and Y, and to split its time and resources between them. A big part of my intuition is: you can theoretically have a consequentialist utility-maximizer with utility function U = log(X) + log(Y), and it will generally split its time between X and Y forever, and this agent is reflectively stable. (The logarithm ensures that X and Y have diminishing returns. Or if that’s not diminishing enough, consider U = log(log(X)) + log(log(Y)), etc.)
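For reference, the standard calculation behind that “splits its time” claim (a textbook Lagrange-multiplier exercise, nothing specific to this post), allocating a budget B between X and Y:

```latex
% Maximize U = log x + log y subject to x + y = B:
\max_{x,y}\; \log x + \log y \quad \text{s.t.} \quad x + y = B
\qquad\Longrightarrow\qquad \frac{1}{x} = \lambda = \frac{1}{y}
\;\Rightarrow\; x = y = \frac{B}{2}
```

Since the marginal utility 1/x blows up as x approaches 0, abandoning either goal entirely is never optimal at any budget, which is the sense in which this maximizer keeps splitting its resources between X and Y forever.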
To show how widespread this is, I don’t want to cherry-pick, so my two examples will be the two most recent movies that I happen to have watched, as I’m sitting down to write this paragraph. These are: Avengers: Infinity War & Ant-Man and the Wasp. (Don’t judge, I like watching dumb action movies while I exercise.)
Spoilers for the Marvel Cinematic Universe film series (pre-2020) below:
The former has a wonderful example. The heroes can definitely save trillions of lives by allowing their friend Vision to sacrifice his life, which by the way he is begging to do. They refuse, instead trying to save Vision and save the trillions of lives. As it turns out, they fail, and both Vision and the trillions of innocent bystanders wind up dead. Even so, this decision is portrayed as good and proper heroic behavior, and is never second-guessed even after the failure. (Note that “Helping a friend in need who is standing right there” has very strong immediate social approval for reasons explained in §6 of Social drives 1 (“Sympathy Reward strength as a character trait, and the Copenhagen Interpretation of Ethics”).) (Don’t worry, in a sequel, the plucky heroes travel back in time to save the trillions of innocent bystanders after all.)
In the latter movie, nobody does anything quite as outrageous as that, but it’s still true that pretty much every major plot point involves the protagonists risking themselves, or their freedom, or the lives of unseen or unsympathetic third parties, in order to help their friends or family in need—which, again, has very strong immediate social approval.
And @Matthew Barnett! This whole section is based on (and partly copied from) a comment thread last year between him and me.
…in a terrifying escalation of a long tradition that both USA parties have partaken in. E.g. if you want examples of the Biden administration recklessly damaging longstanding institutional norms, see 1, 2. (Pretty please don’t argue about politics in the comments section.)
Superintelligences might be able to design such rules amongst themselves, for all I know, although it would probably involve human-incompatible things like “merging” (jointly creating a successor ASI then shutting down). Or we might just get a unipolar outcome in the first place (e.g. many copies of one ASI with the same non-indexical goal), for reasons discussed in my post Foom & Doom §1.8.7.