Heritability, Behaviorism, and Within-Lifetime RL

Steven Byrnes

I’m a firm subscriber to both:

(A) The theory that people’s personalities are significantly predictable from their genes, and mostly independent of how their parents raised them (at least within the typical distribution, i.e. leaving aside cases of flagrant abuse and neglect etc.). See e.g. popular expositions of this theory by Judith Harris or by Bryan Caplan for the fine print.
(B) The theory that we should think of people’s beliefs and goals and preferences developing via within-lifetime learning, and more specifically via within-lifetime Model-based Reinforcement Learning (details), with randomly-initialized (“learning-from-scratch”) world-model and value function.

I feel like there’s an idea in the air that these two beliefs are contradictory. For example, one time someone politely informed me that (A) is true and therefore obviously (B) must be false.

Needless to say, I don’t think they’re contradictory. Indeed, I think that (B) naturally implies (A).

But I admit that they sorta feel contradictory. Why do they feel that way? I think because:

(A) is sorta vaguely affiliated with cognitive science, evolutionary psychology, etc.
(B) is sorta vaguely affiliated with B.F. Skinner-style behaviorism,
…and those two schools-of-thought are generally considered to be bitter enemies.

In this short post I want to explain why we should put aside that baggage and see (A) & (B) as natural allies.

Two dubious steps to get from (B) to Behaviorism

Here’s the fleshed-out argument as I see it:

I’ll go through the two dubious steps in the opposite order.

Dubious step #1: “No more learning / unlearning after the kid grows up”

Here are two stories:

- “RL with continuous learning” story: The person has an internal reward function in their head, and over time they’ll settle into the patterns of thought & behavior that best tickle their internal reward function^[1].
  - If they spend a lot of time in the presence of their parents, they’ll gradually learn patterns of thought & behavior that best tickle their innate internal reward function in the presence of their parents.
  - If they spend a lot of time hanging out with friends, they’ll gradually learn patterns of thought & behavior that best tickle their innate internal reward function when they’re hanging out with friends.
  - As adults in society, they’ll gradually learn patterns of thought & behavior that best tickle their innate internal reward function as adults in society.
- “RL learn-then-get-stuck” story: The kid learns patterns of thought & behavior in childhood, and then sticks with those patterns for the rest of their life no matter what.

Claim: I think the “RL with continuous learning” story, not the “RL learn-then-get-stuck” story, is how we should generally be thinking about things. At least in humans. (Probably also in non-human animals, but that’s off-topic.)

I am not making a strong statement that the “RL learn-then-get-stuck” story is obviously and universally wrong and stupid nonsense. Indeed, I think there are edge cases where the “learn-then-get-stuck” story is true. For example, childhood phobias can sometimes persist into adulthood, and certainly childhood regional accents do. Some related discussion is at Scott Alexander’s blog post “Trapped priors”.

Instead, I think we should mainly believe the “RL with continuous learning” story for empirical reasons:

Heritability studies: See top. More specifically, note that (IIRC) parenting style can have some effect on what a kid believes and how they behave while a child, but these effects fade out when the kid grows up.
Culture shifts: Culture shifts are in fact possible, contrary to the “RL learn-then-get-stuck” story. For example, almost everybody in the USA used to oppose gay marriage, and now almost nobody does. Almost nobody used to use Facebook or cellphones, and now almost everyone does. And there’s Trumpism, and so on. These all seem generally inconsistent with the “RL learn-then-get-stuck” story.

Probably other things too, like maybe the fact that people can sometimes have very different personalities when hanging out with their families versus friends.

Here’s how I make sense of “RL with continuous learning” being the main story, rather than “RL learn-then-get-stuck”: Even if in principle learned patterns of thought & behavior can “get stuck”, I think there’s a lot of “shaking the jar” that happens in our long lives. I think that as people try living in different places, and talking to different people, and they change jobs, and they find romance, etc., they wind up “trying out” lots of different patterns of thought & behavior, at least a little bit. And thus they wind up discovering the patterns that feel most natural and appealing to them—even if those same patterns of thought & behavior were hidden from them when they were kids, or otherwise made to feel unappealing.
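To make the “shaking the jar” picture concrete, here is a minimal toy sketch, under heavy assumptions. It is a plain epsilon-greedy bandit, not the model-based RL described in (B), and every name and number in it (`adult_pattern`, `innate_reward`, `childhood_options`, the step counts, the learning and exploration rates) is made up for illustration. Childhood only exposes some patterns of thought & behavior; adulthood exposes all of them. The agent always learns from its own innate reward, never from what anyone else wants.

```python
import random

def adult_pattern(innate_reward, childhood_options, keep_learning,
                  seed=0, steps_per_phase=2000, lr=0.1, explore=0.1, noise=0.05):
    """Return the index of the pattern the agent ends up valuing most.

    innate_reward[i]   -- how much pattern i tickles this agent's reward function
    childhood_options  -- indices of the patterns available during childhood
    keep_learning      -- True: "RL with continuous learning";
                          False: "RL learn-then-get-stuck" (frozen after childhood)
    """
    rng = random.Random(seed)
    n = len(innate_reward)
    value = [0.0] * n  # learned estimate of how rewarding each pattern is

    # Childhood phase always happens; the adult phase (all patterns
    # available) only happens if the agent keeps learning.
    phases = [list(childhood_options)]
    if keep_learning:
        phases.append(list(range(n)))

    for options in phases:
        for _ in range(steps_per_phase):
            if rng.random() < explore:
                # "Shaking the jar": occasionally try out some available pattern.
                choice = rng.choice(options)
            else:
                # Otherwise do whatever currently seems most appealing.
                choice = max(options, key=lambda i: value[i])
            reward = innate_reward[choice] + rng.gauss(0.0, noise)
            value[choice] += lr * (reward - value[choice])

    return max(range(n), key=lambda i: value[i])


if __name__ == "__main__":
    genes_a = [0.2, 0.5, 1.0]  # innately prefers pattern 2
    genes_b = [1.0, 0.5, 0.2]  # innately prefers pattern 0
    for label, genes, childhood in [
        ("genes A, childhood {0,1}", genes_a, [0, 1]),
        ("genes A, childhood {0}", genes_a, [0]),
        ("genes B, childhood {1,2}", genes_b, [1, 2]),
    ]:
        print(label,
              "| continuous:", adult_pattern(genes, childhood, True),
              "| stuck:", adult_pattern(genes, childhood, False))
```

In this sketch, the continuous-learning agents end up wherever their innate reward function points, regardless of which childhood they had, while the learn-then-get-stuck agents end up wherever their childhood left them. That is only an illustration of the claim, not evidence for it.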

(Side note: Different people have different innate internal reward functions, and these reward-function differences are heritable, and I claim that that’s a major reason that adult personalities and preferences are heritable.)

Dubious step #2: “Parents effectively control the reward during childhood”

Even leaving aside what happens after the kid grows up, let’s zoom in on childhood. Above I wrote “If they spend a lot of time in the presence of their parents, they’ll gradually learn patterns of thought & behavior that best tickle their innate internal reward function in the presence of their parents.” This is definitely not the same as “they’ll gradually learn patterns of thought & behavior that their parents want them to have”! OK, here, I made a figure so that nobody will miss this point:

I think this has a lot to do with the fact that the parent can’t see inside the kid’s head and issue positive rewards when the kid thinks docile & obedient thoughts, and negative rewards when the kid thinks defiant thoughts. Indeed, the parents are at an extraordinary disadvantage, as shown in this comparison table:

| The kid finds it inherently rewarding to think defiant thoughts | Parents reward their kid with praise |
| --- | --- |
| The reward is laser-targeted at the exact thing to be incentivized. | The reward is poorly targeted: e.g., maybe the kid will think bad thoughts / do bad things, not get caught, and be rewarded. |
| The “reward” is always in fact rewarding. | The praise might be rewarding from the kid’s perspective … but it also might not. It might even be aversive. |
| This might happen 3000 times per day. | This might happen a few times per day. |
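As a rough back-of-the-envelope version of the comparison above, here is a small sketch that multiplies the three rows together. Only the 3000-per-day figure comes from the comparison; the class `RewardSource`, the value 3 standing in for “a few times per day”, and the 0.5 values for targeting and felt reward are placeholder assumptions chosen purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class RewardSource:
    name: str
    events_per_day: float  # how often the reward fires
    targeting: float       # fraction of events landing on the intended thought/behavior
    mean_value: float      # average felt reward per event (could be < 0 if aversive)

    def effective_per_day(self) -> float:
        """Expected reward per day that actually reinforces the intended target."""
        return self.events_per_day * self.targeting * self.mean_value


# 3000/day is from the comparison; 1.0 and 1.0 encode "laser-targeted"
# and "always in fact rewarding".
internal = RewardSource("innate reward for defiant thoughts", 3000, 1.0, 1.0)

# All three numbers here are made-up placeholders, not measurements.
praise = RewardSource("parental praise", 3, 0.5, 0.5)

ratio = internal.effective_per_day() / praise.effective_per_day()

if __name__ == "__main__":
    print(f"{internal.name}: {internal.effective_per_day()} per day")
    print(f"{praise.name}: {praise.effective_per_day()} per day")
    print(f"ratio: {ratio}")
```

The exact ratio is meaningless, since it depends entirely on the placeholder values. The point is only that the three disadvantages multiply rather than add.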

(Let’s leave aside the question of wtf kind of internal innate reward function could have the property that “thinking defiant thoughts is inherently rewarding”. I think that question is answerable, but that the answer is convoluted and currently-unknown-to-Science. I’m working on it; more discussion here.)

^[1]
I’m deliberately using the weird phrasing “tickle their reward function” instead of “maximize their reward”, because I find that everyone reads “maximize their reward” as “maximize the discounted sum of present and future rewards”, and I don’t think RL algorithms need to work that way in general, and more specifically I don’t think RL works that way in the brain. In the brain case I think it’s more like “maximize the reward of the thought that I’m thinking right now”, and meanwhile “thoughts” are these complicated things that can (but don’t necessarily!) involve plans with predicted future consequences. If we’re wondering whether Carol is going to do cocaine, the question to ask is: When the thought “I’m going to do cocaine” pops into Carol’s head, does she find that thought appealing or aversive? That’s a different question from “will doing cocaine lead to future rewards for Carol?”. It’s also a different question from “does Carol expect that doing cocaine will lead to future rewards?” More discussion here and here.
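One way to write down the two readings contrasted in this footnote is below. The notation is an assumption introduced here for illustration, not something from the linked discussions: $\gamma$ is a discount factor, $r_t$ is the reward at time $t$, $\Theta_t$ is the set of candidate thoughts available at time $t$, and $v(\theta)$ is how appealing or aversive thought $\theta$ feels right now.

```latex
\begin{align*}
\text{``maximize their reward'' (as usually read):}\quad
  & \pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}\right] \\
\text{``tickle their reward function'':}\quad
  & \theta_{t} \approx \arg\max_{\theta \in \Theta_{t}} v(\theta)
\end{align*}
```

In the second line, a thought $\theta$ may include a plan with predicted future consequences, in which case $v(\theta)$ depends on those predictions, but it need not. In the Carol example, the relevant quantity is the sign of $v(\text{“I’m going to do cocaine”})$, not the first line’s expected discounted sum.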

Gunnar_Zarncke (3y ago):

I want to remind everybody that “Reward is not the optimization target”. The child is not optimizing for its reward, and thinking only in terms of reward can be misleading. A lot depends on which areas of the reward landscape are visited. It is not exactly a lock-in, because high-dimensional spaces have few local maxima, but the reward landscape is big, and once you are in a certain area, even if you can in principle go everywhere else, it may take more than a lifetime to get there. This matches the long-term effects cited. Still, parents and environment influence the exploration of the reward space, and earlier movements may lock in some aspects: not only language but also concepts like ego and identity.

tailcalled (3y ago):

Nice.

A related point: the heritable effects might not need to be direct. For instance, suppose someone is genetically predisposed to be small and weak. In that case, it seems like physically dangerous things like big strong aggressive people or wild activities would be more dangerous for them, and they would therefore have more reason to fear them. We might therefore expect within-lifetime RL to lead to them being more neurotic/emotional about such things.

hold_my_fish (3y ago):

Your position seems obviously right, so I'd guess the confusion is coming from the internal reward vs external reward distinction that you discuss in the last section. When thinking of possible pathways for genetics to influence our preferences, internal reward seems like the most natural.

That said, there are certainly also cases where genetics influences our actions directly. Reflexes are unambiguous examples of this, and there are probably others that are harder to prove.