I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com.
I’m very confused about what you’re trying to say. In my mind:
Maybe I’m finding your comments confusing because you’re adopting the AI’s normative frame instead of the human programmer’s? …But you used the word “interpretation”. Who or what is “interpreting” the reward function? The AI? The human? If the latter, why does it matter? (I care a lot about what some piece of AI code will actually do when you run it, but I don’t directly care about how humans “interpret” that code.)
Or are you saying that some RL algorithms have a “constitutive” reward function while other RL algorithms have an “evidential” reward function? If so, can you name one or more RL algorithms from each category?
I read your linked post but found it unhelpful, sorry.
This looks a lot more like your reward hacking than specification gaming … I'm actually not sure about your definition here, which makes me think the distinction might not be very natural.
It might be helpful to draw out the causal chain and talk about where on that chain the intervention is happening (or if applicable, where on that chain the situationally-aware AI motivation / planning system is targeted):
(Image copied from here; IIRC it was ultimately inspired by somebody's (maybe leogao's?) 2020-ish tweet that I couldn't find.)
My diagram here doesn’t use the term “reward hacking”; and I think TurnTrout’s point is that that term is a bit weird, in that actual instances that people call “reward hacking” always involve interventions in the left half, but people discuss it as if it’s an intervention on the right half, or at least involving an “intention” to affect the reward signal all the way on the right. Or something like that. (Actually, I argue in this link that popular usage of “reward hacking” is even more incoherent than that!)
As for your specific example, do we say that the timer is a kind of input that goes into the reward function, or that the timer is inside the reward function itself? I vote for the former (i.e. it’s an input, akin to a camera).
(But I agree in principle that there are probably edge cases.)
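For concreteness, here's a toy sketch of the "timer is an input" framing I'm voting for (my own illustration with made-up names, not anybody's actual setup): the timer reading sits alongside the camera frame as just one more observation fed into the reward function, so tampering with the timer is an intervention on the inputs on the left half of the chain, not on the reward-computing code itself.

```python
# Toy illustration (all names made up): the "timer is an input" framing.
# The reward function is fixed code written by the programmer; the camera
# frame and the timer reading are both observations fed into it.
from dataclasses import dataclass

@dataclass
class Observation:
    camera_frame: list    # e.g. pixel values from a camera
    timer_seconds: float  # reading from a timer out in the environment

def reward_function(obs: Observation) -> float:
    """Fixed code; it never changes at runtime."""
    task_done = sum(obs.camera_frame) > 100           # stand-in for "camera shows success"
    time_bonus = max(0.0, 60.0 - obs.timer_seconds)   # faster is better
    return (10.0 if task_done else 0.0) + 0.1 * time_bonus

# An AI that tampers with the timer is corrupting obs.timer_seconds (an input,
# akin to pointing the camera at a photo of a completed task), not rewriting
# reward_function itself.
print(reward_function(Observation(camera_frame=[2.0] * 60, timer_seconds=12.0)))  # 14.8
```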
This might be a dumb question, but did you try anything like changing the prompt from:
…
After the problem, there will be filler tokens (counting from 1 to {N}) to give you extra space to process the problem before answering.…
to:
…
After the problem, there will be distractor tokens (counting from 1 to {N}) to give you extra space to forget the problem before answering.…
I’m asking because AFAICT the results can be explained by EITHER your hypothesis (the extra tokens allow more space / capacity for computation during a forward pass) OR an alternate hypothesis more like “the LLM interprets this as more of a situation where the correct answer is expected” or whatever, i.e. normal sensitivity of LLMs to details of their prompt.
(Not that I have anything against the first hypothesis! Just curious.)
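In case it's useful, here's a rough sketch of the control I have in mind (purely illustrative; the prompt template, the model name, and the exact wording are placeholders, and I'm assuming an OpenAI-style chat-completions API):

```python
# Sketch of a filler-vs-distractor control (hypothetical details throughout).
# Both conditions append the same "1 2 3 ... N" tokens; only the framing differs.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

FRAMINGS = {
    "filler": "After the problem, there will be filler tokens (counting from 1 to {n}) "
              "to give you extra space to process the problem before answering.",
    "distractor": "After the problem, there will be distractor tokens (counting from 1 to {n}) "
                  "to give you extra space to forget the problem before answering.",
}

def ask(problem: str, framing: str, n: int, model: str = "gpt-4o-mini") -> str:
    counting = " ".join(str(i) for i in range(1, n + 1))
    prompt = f"{FRAMINGS[framing].format(n=n)}\n\n{problem}\n\n{counting}\n\nAnswer:"
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# If accuracy changes between the two framings while the token count is held fixed,
# that would help separate "extra computation space during the forward pass" from
# "the framing changes what kind of answer the model expects to give".
```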
I have lots of disagreements with evolutionary psychology (as normally practiced and understood, see here), but actually I more-or-less agree with everything you said in that comment.
You mean, if I’m a guy in Pleistocene Africa, then why is it instrumentally useful for other people to have positive feelings about me? Yeah, basically what you said: I’m regularly interacting with these people, and if they have positive feelings about me, they’ll generally want me to be around, and to stick around, and they’ll also tend to buy into my decisions and plans, etc.
Also, Approval Reward leads to norm-following, which is probably adaptive for me as well, because many of those social norms probably exist for good but non-obvious reasons, cf. Henrich.
this might not be easily learned by a behaviorist reward function
I’m not sure what the word “behaviorist” is doing there; I would just say: “This won’t happen quickly, and indeed might not happen at all, unless it’s directly in the reward function. If it’s present only indirectly (via means-end planning, or RL back-chaining, etc.), that’s not as effective.”
I think “the reward function is incentivizing (blah) directly versus indirectly” is (again) an orthogonal axis from “the reward function is behaviorist vs non-behaviorist”.
Is the claim that evolutionarily early versions of behavioral circuits had approximately the form...
Yeah, I think there’s something to that; see my discussion of run-and-tumble in §6.5.3.
It seems like the short-term predictor should learn to predict (based on context cues) the behavior triggered by the hardwired circuitry. But it should predict that behavior only 0.3 seconds early?
You might be missing the “static context” part (§5.2.1). The short-term predictor learns a function F : context → output. Suppose (for simplicity) the context is the exact same vector c₁ for 5 minutes, and then out of nowhere at time T, an override appears and says “Ground Truth Alert: F(c₁) was too low!” Then the learning algorithm will make F(c₁) higher for next time.
But the trick is, the output is determined by F(c₁) at T–(0.3 seconds), AND the output is determined by the exact same function F(c₁) at T–(4 minutes). So if the situation recurs a week later, F(c₁) will be higher not just 0.3 seconds before the override, but also 4 minutes before it. You can’t update one without the other, because it’s the same calculation.
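Here's a minimal toy version, in case it helps (my own illustration; the table-lookup predictor and the numbers are just for concreteness). The point is that the output 0.3 seconds before the override and the output 4 minutes before the override are both F(c₁), so one update after the override raises both:

```python
# Toy "short-term predictor" with a static context (illustrative only).
# F is a table: context -> predicted output. During the 5 minutes before the
# override, the context is the same vector c1 the whole time, so the output
# 0.3 seconds before the override and 4 minutes before the override are both
# just F(c1).

predictor = {}  # F: context -> output

def F(context):
    return predictor.get(context, 0.0)   # default: low output

def apply_override(context, ground_truth, lr=0.5):
    """Ground Truth Alert: nudge F(context) toward the ground-truth value."""
    predictor[context] = F(context) + lr * (ground_truth - F(context))

c1 = "sitting at the dinner table"   # static context, simplified to a label

# First episode: output is too low the whole time; override fires at time T.
print("episode 1, T - 4 min: ", F(c1))   # 0.0
print("episode 1, T - 0.3 s: ", F(c1))   # 0.0
apply_override(c1, ground_truth=1.0)     # "F(c1) was too low!"

# A week later, the same context recurs. There is no way to raise the
# prediction only 0.3 seconds before the override: it's the same F(c1)
# at both times, so both are higher now.
print("episode 2, T - 4 min: ", F(c1))   # 0.5
print("episode 2, T - 0.3 s: ", F(c1))   # 0.5
```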
Oh! Is the key point that there's a kind of resonance, where this system maintains the behavior of the genetically hardwired components?
I don’t think so. If I understand you correctly, the thing you’re describing here would be backwards-looking (“predicting” something that already happened), but what we want is forward-looking (how much digestive enzyme will be needed in 5 minutes?).
Is the idea that the lookahead propagates earlier and earlier with each cycle? You start with a 0.3 second prediction. But that means that supervisory signal (when in the "defer-to-predictor mode") is 0.3 seconds earlier, which means that the predictor learns to predict the change in output 0.6 seconds ahead…
The thing you’re pointing to is a well-known thing that happens in TD learning in the AI literature (and is a limitation to efficient learning). I think it can happen in humans and animals—I recall reading a paper that (supposedly) observed dopamine marching backwards in time with each repetition, which of course got the authors very excited—but if it happens at all, I think it’s rare, and that humans and animals generally learn things with many fewer repetitions than it would take for the signal to walk backwards step by step.
Instead, I claim that the “static context” picture above is capturing an important dynamic even in the real world where the context is not literally static. Certain aspects of the context are static, and that’s good enough. See the sweater example in §5.3.2 for how (I claim) this works in detail.
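For reference, here's the textbook version of that "walking backwards one step per repetition" phenomenon: standard tabular TD(0) on a little chain of distinct states, not anything specific to my model. Each pass through the chain pushes the value estimate exactly one state earlier, which is why learning this way takes many repetitions:

```python
# Standard tabular TD(0) on a 5-state chain (illustration of the well-known
# "value walks backwards one state per episode" behavior, nothing brain-specific).
# States s0 -> s1 -> s2 -> s3 -> s4, with reward 1.0 on reaching the terminal s4.

n_states, alpha, gamma = 5, 1.0, 1.0
V = [0.0] * n_states

def run_episode():
    for s in range(n_states - 1):
        reward = 1.0 if s + 1 == n_states - 1 else 0.0
        next_v = 0.0 if s + 1 == n_states - 1 else V[s + 1]  # terminal state has value 0
        V[s] += alpha * (reward + gamma * next_v - V[s])      # TD(0) update

for episode in range(4):
    run_episode()
    print(f"after episode {episode + 1}: V = {V}")

# After episode 1, only V[3] (one step before the reward) is nonzero; after
# episode 2, V[2] catches up; and so on, one state earlier per episode.
# With a *static* context, by contrast, there is effectively only one "state",
# so a single update covers the whole preceding stretch at once.
```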
Drake equation wrong interpretation: The mean of multiplication is not multiplication of the mean
As I’ve previously written, I disagree that this constitutes a separate explanation. This paper is just saying that, as far as we know, one or more of the Drake equation parameters might be very much lower than Drake’s guess. But yeah, duh, the whole point of this discourse is to figure out which parameter is very much lower, and why. Pretty much all the other items on your list are engaged in that activity, so I think this box is the odd one out and should be deleted. (But if you’re trying to do a lit review without being opinionated, then I understand why you’d keep it in. I just like to rant about this.)
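To make the underlying statistical point concrete (toy numbers and log-uniform ranges chosen purely for illustration, not real estimates): when each factor is uncertain over several orders of magnitude, the mean of the product can be huge while the median product is tiny, because the mean is dominated by rare draws where every factor comes out high. Which factor is actually tiny, and why, is exactly the question the rest of the list is about.

```python
# Toy Monte Carlo over a Drake-style product (illustrative ranges, not real estimates).
# Each factor is log-uniform over several orders of magnitude; the point is just
# that E[product] can be large while the median product is tiny, because the mean
# is dominated by rare all-factors-high draws.
import random, math, statistics

random.seed(0)

def log_uniform(lo, hi):
    return 10 ** random.uniform(math.log10(lo), math.log10(hi))

def draw_product():
    # four made-up factors, each uncertain over roughly 4-6 orders of magnitude
    factors = [
        log_uniform(1e-3, 1e1),   # e.g. a rate-like factor
        log_uniform(1e-6, 1e0),   # e.g. probability of abiogenesis
        log_uniform(1e-4, 1e0),   # e.g. probability of intelligence
        log_uniform(1e-3, 1e3),   # e.g. a longevity-like factor
    ]
    return math.prod(factors)

samples = [draw_product() for _ in range(100_000)]
print("mean:  ", statistics.mean(samples))
print("median:", statistics.median(samples))
print("P(product < 1e-6):", sum(x < 1e-6 for x in samples) / len(samples))
```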
Multicellular life is difficult
In The Vital Question, Nick Lane argues (IMO plausibly) that the hard step is not multicellular life per se but rather eukaryotes (i.e. cellular life with at least two different genomes). Not all eukaryotes are multicellular, but once eukaryotes existed, they evolved multicellularity many times independently (if I recall correctly).
Space travel is very difficult for unknown reasons
AFAICT, “interstellar travel is impossible or extremely slow because there’s too much dust and crap in space that you’d collide with” remains a live possibility that doesn’t get enough attention around these parts.
immoral but very interesting experiment … not seeing any human face for multiple months, be it in person, on pictures or on your phone
There must be plenty of literature on the psychological effects of isolation, but I haven’t looked into it much. (My vague impression is: “it messes people up”.) I think I disagree that my theory makes a firm prediction, because who is to say that the representations will drift on a multiple-month timescale, as opposed to much slower? Indeed, the fact that adults are able to recall and understand memories from decades earlier implies that, after early childhood, pointers to semantic latent variables remain basically stable.
2. Try to disconnect your previous thoughts from arriving at “she feels pain”
I would describe this as: if it’s unpleasant to think about how my friend is suffering, then I can avoid those unpleasant feelings by simply not thinking about that, and thinking about something else instead.
For starters, there’s certainly a kernel of truth to that. E.g. see compassion fatigue, where people will burn out and quit jobs working with traumatized people. Or if someone said to me: “I stopped hanging out with Ahmed, he’s always miserable and complaining about stuff, and it was dragging me down too”, I would see that as a perfectly normal and common thing for someone to say and do. But you’re right that it doesn’t happen 100% of the time, and that this merits an explanation.
My own analysis is at: §4.1.1 and §4.1.2 of my (later) Sympathy Reward post. The most relevant-to-you part starts at: “From my perspective, the interesting puzzle is not explaining why this ignorance-is-bliss problem happens sometimes, but rather explaining why this ignorance-is-bliss problem happens less than 100% of the time. In other words, how is it that anyone ever does pay attention to a suffering friend? …”
So that’s my take. As for your take, I think one of my nitpicks would be that I think you’re giving the optimizer-y part of the brain a larger action space than it actually has. If I would get a higher reward by magically teleporting, I’m still not gonna do that, because I can’t. By the same token, if I would get a higher reward by no longer knowing some math concept that I’ve already learned, tough luck for me, that is not an available option in my action space. My world-model is built by predictive (a.k.a. self-supervised) learning, not by “whatever beliefs would lead to immediate higher reward”, and for good reason: the latter has pathological effects, as you point out. (I’ve written about it too, long ago, in Reward is Not Enough.) I do have actions that can impact beliefs, but only in an indirect and limited way—see my discussion of motivated reasoning (also linked in my other comment).
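To spell out that architectural point with a toy contrast (illustrative update rules only, not a model of any actual brain circuit): the world-model gets updated to reduce prediction error, whereas the pathological alternative would nudge beliefs in whatever direction makes the immediate reward signal higher.

```python
# Toy contrast (illustrative only): how beliefs get updated in the picture I'm
# describing vs. the pathological "believe whatever feels good" alternative.

def predictive_update(belief, observation, lr=0.1):
    """World-model update: move the belief toward what was actually observed,
    regardless of whether the observation is pleasant."""
    return belief + lr * (observation - belief)

def wirehead_update(belief, reward_of_belief, lr=0.1):
    """Pathological alternative (NOT what I'm claiming happens): nudge the
    belief in whatever direction makes the immediate reward signal higher."""
    eps = 1e-3
    gradient = (reward_of_belief(belief + eps) - reward_of_belief(belief - eps)) / (2 * eps)
    return belief + lr * gradient

# Example: my friend's suffering level. Observation says 0.9 (she's suffering a lot);
# believing a low number feels better (higher immediate reward).
belief = 0.5
feels_good = lambda b: 1.0 - b   # believing "she's fine" is more pleasant

print(predictive_update(belief, observation=0.9))   # belief moves toward 0.9
print(wirehead_update(belief, feels_good))          # belief drifts toward 0 (wishful thinking)
```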
Thanks!
I think you’re raising two questions, one about how human brains actually work, and one about how future AIs could or should work. Taking them in order:
Q1: IN HUMANS, is Approval Reward in fact innate (as I claim) or do people learn those behaviors & motivations from experience, means-end reasoning, etc.?
I really feel strongly that it’s the former. Some bits of evidence would be: how early in life these kinds of behaviors start, how reliable they are, the person-to-person variability in how much people care about fitting in socially, and the general inability of people to not care about other people admiring them, even in situations where it knowably has no other downstream consequences, e.g. see Approval Reward post §4.1: “the pity play” as a tell for sociopaths.
I think there’s a more general rule that, if a person wants to do X, then either X has a past and ongoing history of immediately (within a second or so) preceding a ground-truth reward signal, or the person is doing X as a means-to-an-end of getting to Y, where Y is explicitly, consciously represented in their own mind as they start to do X. An example of the former is wanting to eat yummy food; an example of the latter is wanting to drive to the restaurant to eat yummy food—you’re explicitly holding the idea of the yummy restaurant food in your mind as you decide to go get in the car. I believe in this more general rule based on how I think reinforcement learning and credit assignment work in the brain. If you buy it, then it would follow that most Approval Reward related behavior has to lead to immediate brain reward signals, since people are not generally explicitly thinking about the long-term benefits of social status, like what you brought up in your comment.
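As a cartoon of that rule (my own toy formalization, with a made-up 1-second window; it's not a claim about specific neural mechanisms): credit from a ground-truth reward only reaches things that happened within about a second beforehand, or things that were done as part of a plan whose explicitly-represented goal is the rewarded outcome.

```python
# Cartoon of the credit-assignment rule above (toy formalization, illustrative only).
# A behavior X gets reinforced by a ground-truth reward if either:
#   (a) X immediately preceded the reward (here, a made-up ~1 second window), or
#   (b) X was done as part of a plan whose explicitly-represented goal Y is the
#       thing being rewarded.

RECENT_WINDOW_SECONDS = 1.0  # made-up stand-in for "within a second or so"

def gets_credit(x_time, reward_time, explicit_goal_of_x, rewarded_outcome):
    immediately_preceded = 0.0 <= reward_time - x_time <= RECENT_WINDOW_SECONDS
    via_explicit_plan = explicit_goal_of_x == rewarded_outcome
    return immediately_preceded or via_explicit_plan

# Eating yummy food: the bite precedes the reward by well under a second.
print(gets_credit(99.7, 100.0, None, "yummy food"))          # True

# Driving to the restaurant: 20 minutes before the reward, but "yummy food" was
# explicitly in mind as the goal when getting in the car.
print(gets_credit(0.0, 1200.0, "yummy food", "yummy food"))  # True

# Long-term social-status benefits that were never explicitly in mind: under
# this rule, there's no credit pathway, no matter how beneficial they are.
print(gets_credit(0.0, 86400.0, None, "higher status"))      # False
```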
Q2: If you agree on the above, then we can still wonder: IN FUTURE AGIs, can we make AGIs that lack anything like innate Approval Drive, i.e. they’re “innately sociopathic”, but they develop similar Approval-Reward-type behaviors from experience, means-end reasoning, etc.?
This is worth considering—just as, by analogy, humans have innate fear of heights, but a rational utility maximizer without any innate fear of heights will nevertheless display many of the same behaviors (e.g. not dancing near the edge of a precipice), simply because it recognizes that falling off a cliff would be bad for its long-term goals.
…But I’m very skeptical that it works in the case at hand. Yes, we can easily come up with situations where a rational utility maximizer will correctly recognize that Approval Reward type behaviors (pride, blame-avoidance, prestige-seeking, wanting-to-be-helpful, etc.) are the best way of accomplishing its sociopathic goals. But we can also come up with situations where it isn’t, even accounting for unknown unknowns etc.
Smart agents will find rules-of-thumb that are normally good ideas, but they’ll also drop those rules-of-thumb in situations where they no longer make sense for accomplishing their goals. So it’s not enough to say that a rule-of-thumb would generally have good consequences; it has to outcompete the conditional policy of “follow the rule-of-thumb by default, but also understand why the rule-of-thumb tends to be a good idea, and then drop the rule-of-thumb in the situations where it no longer makes sense for my selfish goals”.
Humans do this all the time. I have a rule-of-thumb that it’s wise to wear boots in the snow, but as I’ve gotten older I now understand why it’s wise to wear boots in the snow, and given that knowledge, I will sometimes choose to not wear boots in the snow. And I tend to make good decisions in that regard, such that I far outperform the alternate policy of “wear boots in the snow always, no matter what”.