I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
I definitely agree that people are capable of doing things for other reasons besides Approval Reward. I think Approval Reward is just one of many dozens of human innate drives. I also agree that many of those other human innate drives can also lead to long-term planning towards goals. For example, if I’m hungry, maybe I’ll drive to buy food (thus executing an hour-long foresighted consequentialist plan), even if I’m embarrassed to be doing that, rather than proud. (And if I’m embarrassed, not proud, of my plan, then I’m planning despite Approval Reward, not because of it.) Sorry if any of that wasn’t clear from what I wrote.
I have a bunch of nitpicky disagreements with your comment, but I agree with the broader point that I could write a follow-up post, “[N] MORE reasons why ‘alignment-is-hard’ discourse seems alien to human intuitions…”, where none of those N things have anything to do with human Approval Reward. E.g. maybe I could respond to the school of thought that says “AIs will have laziness and akrasia, like humans do” (cf here, here), and to the school of thought that says “technical alignment is moot because AIs are tools not agents” (cf here, here), and maybe other things too. Yeah, sure, that’s true. I did not mean to imply that the 6 things in this post are the ONLY 6 things :)
Thanks!
I think you’re raising two questions, one about how human brains actually work, and one about how future AIs could or should work. Taking them in order:
Q1: IN HUMANS, is Approval Reward in fact innate (as I claim) or do people learn those behaviors & motivations from experience, means-end reasoning, etc.?
I really feel strongly that it’s the former. Some bits of evidence would be: how early in life these kinds of behaviors start, how reliable they are, the person-to-person variability in how much people care about fitting in socially, and the general inability of people to not care about other people admiring them, even in situations where it knowably has no other downstream consequences, e.g. see Approval Reward post §4.1: “the pity play” as a tell for sociopaths.
I think there’s a more general rule that, if a person wants to do X, then either X has a past and ongoing history of immediately (within a second or so) preceding a ground-truth reward signal, or the person is doing X as a means-to-an-end of getting to Y, where Y is explicitly, consciously represented in their own mind as they start to do X. An example of the former is wanting to eat yummy food; an example of the latter is wanting to drive to the restaurant to eat yummy food—you’re explicitly holding the idea of the yummy restaurant food in your mind as you decide to go get in the car. I believe in this more general rule based on how I think reinforcement learning and credit assignment work in the brain. If you buy it, then it would follow that most Approval-Reward-related behavior has to lead to immediate brain reward signals, since people are not generally explicitly thinking about the long-term benefits of social status, like what you brought up in your comment.
Q2: If you agree on the above, then we can still wonder: IN FUTURE AGIs, can we make AGIs that lack anything like innate Approval Drive, i.e. they’re “innately sociopathic”, but they develop similar Approval-Reward-type behaviors from experience, means-end reasoning, etc.?
This is worth considering—just as, by analogy, humans have innate fear of heights, but a rational utility maximizer without any innate fear of heights will nevertheless display many of the same behaviors (e.g. not dancing near the edge of a precipice), simply because it recognizes that falling off a cliff would be bad for its long-term goals.
…But I’m very skeptical that it works in the case at hand. Yes, we can easily come up with situations where a rational utility maximizer will correctly recognize that Approval-Reward-type behaviors (pride, blame-avoidance, prestige-seeking, wanting-to-be-helpful, etc.) are the best way of accomplishing its sociopathic goals. But we can also come up with situations where they aren’t, even accounting for unknown unknowns, etc.
Smart agents will find rules-of-thumb that are normally good ideas, but they’ll also drop those rules-of-thumb in situations where they no longer make sense for accomplishing their goals. So it’s not enough to say that a rule-of-thumb would generally have good consequences; it has to outcompete the conditional policy of “follow the rule-of-thumb by default, but also understand why the rule-of-thumb tends to be a good idea, and then drop the rule-of-thumb in the situations where it no longer makes sense for my selfish goals”.
Humans do this all the time. I have a rule-of-thumb that it’s wise to wear boots in the snow, but as I’ve gotten older I now understand why it’s wise to wear boots in the snow, and given that knowledge, I will sometimes choose to not wear boots in the snow. And I tend to make good decisions in that regard, such that I far outperform the alternate policy of “wear boots in the snow always, no matter what”.
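To make that “conditional policy” structure concrete, here’s a toy sketch in code; the specific conditions and numbers are invented purely for illustration, not anything from the post or this thread:

```python
# Toy sketch (illustrative only; conditions and thresholds are made up) of
# "follow the rule-of-thumb by default, but drop it when you understand why it
# exists and it clearly doesn't apply here".

def unconditional_policy(snowy: bool) -> bool:
    # "Wear boots in the snow always, no matter what."
    return snowy

def conditional_policy(snowy: bool, walk_minutes: float, feet_must_stay_dry: bool) -> bool:
    # Follow the rule-of-thumb by default...
    if not snowy:
        return False
    # ...but override it when the underlying reason (keeping feet warm and dry
    # over a long exposure) clearly doesn't apply, e.g. a 30-second dash to the car.
    if walk_minutes < 1 and not feet_must_stay_dry:
        return False
    return True
```

The point is just that the second policy weakly dominates the first with respect to the agent’s own goals, which is why a smart agent isn’t going to stick with the unconditional rule.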
I’m very confused about what you’re trying to say. In my mind:
Maybe I’m finding your comments confusing because you’re adopting the AI’s normative frame instead of the human programmer’s? …But you used the word “interpretation”. Who or what is “interpreting” the reward function? The AI? The human? If the latter, why does it matter? (I care a lot about what some piece of AI code will actually do when you run it, but I don’t directly care about how humans “interpret” that code.)
Or are you saying that some RL algorithms have a “constitutive” reward function while other RL algorithms have an “evidential” reward function? If so, can you name one or more RL algorithms from each category?
I read your linked post but found it unhelpful, sorry.
This looks a lot more like your reward hacking than specification gaming … I'm actually not sure about your definition here, which makes me think the distinction might not be very natural.
It might be helpful to draw out the causal chain and talk about where on that chain the intervention is happening (or if applicable, where on that chain the situationally-aware AI motivation / planning system is targeted):
(image copied from here, ultimately IIRC inspired from somebody (maybe leogao?)’s 2020-ish tweet that I couldn’t find.)
My diagram here doesn’t use the term “reward hacking”; and I think TurnTrout’s point is that that term is a bit weird, in that actual instances that people call “reward hacking” always involve interventions in the left half, but people discuss it as if it’s an intervention on the right half, or at least involving an “intention” to affect the reward signal all the way on the right. Or something like that. (Actually, I argue in this link that popular usage of “reward hacking” is even more incoherent than that!)
As for your specific example, do we say that the timer is a kind of input that goes into the reward function, or that the timer is inside the reward function itself? I vote for the former (i.e. it’s an input, akin to a camera).
(But I agree in principle that there are probably edge cases.)
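For concreteness, here’s a minimal sketch of the two framings; the toy task, field names, and 60-second threshold are entirely made up, not details of the example under discussion:

```python
import time

# Minimal sketch (toy, hypothetical) of the two framings: the timer as an
# *input* to the reward function, vs. the timer living *inside* the reward
# function itself.

def reward_timer_as_input(obs: dict) -> float:
    # Framing I favor: the timer reading arrives as part of the observation,
    # just like a camera frame would, and the reward function merely reads it.
    return 1.0 if obs["task_done"] and obs["timer_seconds"] < 60.0 else 0.0

class RewardTimerInside:
    # Alternative framing: the reward function keeps its own clock, so the
    # timer is part of the reward function's internals rather than an input.
    def __init__(self) -> None:
        self._start = time.monotonic()

    def __call__(self, obs: dict) -> float:
        elapsed = time.monotonic() - self._start
        return 1.0 if obs["task_done"] and elapsed < 60.0 else 0.0
```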
This might be a dumb question, but did you try anything like changing the prompt from:
…
After the problem, there will be filler tokens (counting from 1 to {N}) to give you extra space to process the problem before answering.…
to:
…
After the problem, there will be distractor tokens (counting from 1 to {N}) to give you extra space to forget the problem before answering.…
I’m asking because AFAICT the results can be explained by EITHER your hypothesis (the extra tokens allow more space / capacity for computation during a forward pass) OR an alternate hypothesis more like “the LLM interprets this as more of a situation where the correct answer is expected” or whatever, i.e. normal sensitivity of LLMs to details of their prompt.
(Not that I have anything against the first hypothesis! Just curious.)
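Roughly the kind of A/B check I have in mind is sketched below; the `ask_model` stub, the value of N, and the prompt layout are all placeholders rather than details of your actual experiment:

```python
# Sketch of a prompt-framing control (all details hypothetical): identical
# filler tokens, but the framing sentence suggests either "process" or "forget".

N = 50  # arbitrary placeholder for the number of filler tokens

PROCESS_FRAMING = (
    f"After the problem, there will be filler tokens (counting from 1 to {N}) "
    "to give you extra space to process the problem before answering."
)
FORGET_FRAMING = (
    f"After the problem, there will be distractor tokens (counting from 1 to {N}) "
    "to give you extra space to forget the problem before answering."
)

def build_prompt(problem: str, framing: str, n: int = N) -> str:
    counting = " ".join(str(i) for i in range(1, n + 1))
    return f"{problem}\n\n{framing}\n\n{counting}\n\nAnswer:"

# def ask_model(prompt: str) -> str: ...  # stand-in for whatever API was used
#
# If accuracy holds up under the "forget" framing, that favors the
# extra-computation hypothesis; if it drops a lot, prompt-framing sensitivity
# is doing at least some of the work.
```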
I have lots of disagreements with evolutionary psychology (as normally practiced and understood, see here), but actually I more-or-less agree with everything you said in that comment.
You mean, if I’m a guy in Pleistocene Africa, then why is it instrumentally useful for other people to have positive feelings about me? Yeah, basically what you said; I’m regularly interacting with these people, and if they have positive feelings about me, they’ll generally want me to be around, and to stick around, and also they’ll tend to buy into my decisions and plans, etc.
Also, Approval Reward leads to norm-following, which is probably also adaptive for me, because many of those social norms probably exist for good and non-obvious reasons, cf. Henrich.
this might not be easily learned by a behaviorist reward function
I’m not sure what the word “behaviorist” is doing there; I would just say: “This won’t happen quickly, and indeed might not happen at all, unless it’s directly in the reward function. If it’s present only indirectly (via means-end planning, or RL back-chaining, etc.), that’s not as effective.”
I think “the reward function is incentivizing (blah) directly versus indirectly” is (again) an orthogonal axis from “the reward function is behaviorist vs non-behaviorist”.
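A toy sketch of that direct-vs-indirect axis, using a made-up “tidying up” example that isn’t from either of our comments:

```python
# Toy sketch (hypothetical example) of "directly in the reward function" vs.
# "present only indirectly, via means-end planning or RL back-chaining".

def reward_direct(state: dict) -> float:
    # Direct: the behavior itself is a term in the reward function, so it gets
    # an immediate reward signal whenever it happens.
    return 1.0 if state["just_tidied_up"] else 0.0

def reward_indirect(state: dict) -> float:
    # Indirect: only a downstream outcome is rewarded; the agent has to work
    # out (or slowly learn) that tidying up earlier helps cause that outcome.
    return 10.0 if state["big_task_completed"] else 0.0
```

Either of those reward functions could be behaviorist or non-behaviorist, which is why I see the two axes as orthogonal.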
Is the claim that evolutionarily early versions of behavioral circuits had approximately the form...
Yeah, I think there’s something to that; see my discussion of run-and-tumble in §6.5.3.
It’s true that human moral drives (such as they are) came from evolution in a certain environment. Some people notice that and come up with a plan: “hey, let’s set up AI in a carefully-crafted evolutionary environment such that it will likewise wind up moral”. I have discussed that plan in my Intro series §8.3, where I argued both that it was a bad plan, and that it is unlikely to happen even if it were a good plan. For example, AIs may evolve to be cruel to humans just as humans are cruel to factory-farmed animals. Humans are often cruel to other humans too.
But your argument is slightly different (IIUC): you’re saying that we need not bother to carefully craft the evolutionary environment, because, good news, the real-world environment is already the kind of environment in which mammal-like species will evolve to be kind. I’m even more skeptical of that. Mammals eat each other all the time, and kill their conspecifics, etc. And why are we restricting to mammals here anyway? More importantly, I think there are very important disanalogies between a world of future AGIs and a world of mammals, particularly that AGIs can “reproduce” by instantly creating identical (adult) copies. No comment on whether this and other disanalogies should make us feel optimistic vs pessimistic about AGI kindness compared to mammal kindness. But it should definitely make us feel like it’s a different problem. I.e., we have to think about the AGI world directly, with all its unprecedented weird features, instead of unthinkingly guessing that its evolutionary trajectory will be similar to humans’ (let alone hamsters’).
I’m unclear on your position here. There’s a possible take that says that sufficiently smart and reflective agents will become ruthless power-seeking consequentialists that murder all other forms of intelligence. Your comment seems to be mocking this take as absurd (by using the words “allegedly pure rational”), but your comment also seems to be endorsing this take as correct (by saying that it’s a real failure mode that I will face by not considering evolutionary pressures). Which is it?
For my part, I disagree with this take. I think it’s possible (at least in principle) to make an arbitrarily smart and reflective ASI agent that wants humans and life to flourish.
But IF this take is correct, it would seem to imply that we’re screwed no matter what. Right? We’d be screwed if a human tries to design an AGI, AND we’d be screwed if an evolutionary environment “designs” an AGI. So I’m even more confused about where you’re coming from.
(Much of my response to this part of your comment amounts to “I don’t actually think what you think I think”.)
First, I dislike your description “RL based decision making vs social reward function decision making”. “Reward function” is an RL term. Both are RL-based. All human motivations are RL-based, IMO. (But note that I use a broad definition of “RL”.)
Second, I guess you interpreted me as having a vibe of “Yay Approval Reward!”. I emphatically reject that vibe, and in my Approval Reward post I went to some length to emphasize that Approval Reward leads to both good things and bad things, with the latter including blame-avoidance, jockeying for credit, sycophancy, status competitions, “Simulacrum Level 3”, and more.
Third, I guess you also assumed that I was also saying that Approval Reward would be a great idea for AGIs. I didn’t say that in the post, and it's not a belief I currently hold. (But it might be true, in conjunction with a lot of careful design and thought; see other comment.)
Next: I’m a big fan of understanding the full range of human neurotypes, and if you look up my neuroscience writing you’ll find my detailed opinions about schizophrenia, depression, mania, BPD, NPD, ASPD, DID, and more. As for autism, I’ve written loads about autism (e.g. here, here and links therein), and read tons about it, and have talked to my many autistic friends about their experiences, and have a kid with an autism diagnosis. That doesn’t mean my takes are right, of course! But I hope that, if I’m wrong, I’m wrong for more interesting reasons than “forgetting that autism exists”. :)
I guess your model is that autistic people, like sociopathic people, lack all innate social drives? And therefore a social-drive-free RL agent AGI, e.g. one whose reward signals are tied purely to a bank account balance going up, would behave generally like an autistic person, instead of (or in addition to?) like a sociopath? If so, I very strongly disagree.
I think “autism” is an umbrella term for lots of rather different things, but I do think it’s much more likely to involve social drives set to an unusually intense level rather than “turned off”. Indeed, I think they get so intense that they often feel overwhelming and aversive.
For example, many autistic people strongly dislike making eye contact. If someone had no innate social reactions to other people, then they wouldn’t care one way or the other about eye contact; looking at someone’s eyes would be no more aversive or significant than looking at a plant. So the “no social drives” theory is a bad match to this observation. Whereas “unusually intense social drives” theory does match eye contact aversion.
Likewise, “autism = no social drives” theory would predict that an autistic person would be perfectly fine if his frail elderly parents, parents who are no longer able to directly help or support him, died a gruesome and painful death right now. Whereas “unusually intense social drives” theory would predict that he would not be perfectly fine with that. I think the latter tends to be a better fit!
Anyway, I think if you met a hypothetical person whose innate human social drive strengths were set to zero, they would look wildly different from any autistic person, but only modestly different from a sociopathic (ASPD) person.