Protein Reinforcement and DNA Consequentialism


Eliezer_Yudkowsky

Followup to: Evolutionary Psychology

It takes hundreds of generations for a simple beneficial mutation to promote itself to universality in a gene pool.  Thousands of generations, or even millions, to create complex interdependent machinery.
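As a rough check on "hundreds of generations", here is a minimal sketch of a deterministic, one-locus haploid selection model. The 3% fitness advantage and the population of 100,000 are illustrative assumptions, not figures from this post, and the model ignores genetic drift (in particular, the chance that a new mutation is lost before it spreads).

```python
def generations_to_spread(advantage, population):
    """Count generations for a beneficial allele to go from one copy
    to all but one copy, under deterministic haploid selection."""
    freq = 1.0 / population           # a single mutant to start with
    target = 1.0 - 1.0 / population   # "universal" = all but one individual
    generations = 0
    while freq < target:
        # Carriers leave (1 + advantage) offspring for every 1 left by non-carriers.
        freq = freq * (1 + advantage) / (1 + freq * advantage)
        generations += 1
    return generations


# Assumed numbers, for illustration only.
gens = generations_to_spread(advantage=0.03, population=100_000)
print(gens)
```

Under these assumptions the count comes out at several hundred generations, and that is for a single mutation with a fixed advantage, not for complex interdependent machinery.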

That's some slow learning there.  Let's say you're building a squirrel, and you want the squirrel to know locations for finding nuts.  Individual nut trees don't last for the thousands of years required for natural selection.  You're going to have to learn using proteins.  You're going to have to build a brain.

Protein computers and sensors can learn by looking, much faster than DNA can learn by mutation and selection.  And yet (until very recently) the protein learning machines only learned in narrow, specific domains.  Squirrel brains learn to find nut trees, but not to build gliders - as flying squirrel DNA is slowly learning to do.  The protein computers learned faster than DNA, but much less generally.

How the heck does a double-stranded molecule that fits inside a cell nucleus, come to embody truths that baffle a whole damn squirrel brain?

Consider the high-falutin' abstract thinking that modern evolutionary theorists do in order to understand how adaptations increase inclusive genetic fitness.  Reciprocal altruism, evolutionarily stable strategies, deterrence, costly signaling, sexual selection - how many humans explicitly represent this knowledge?  Yet DNA can learn it without a protein computer.

There's a long chain of causality whereby a male squirrel, eating a nut today, produces more offspring months later:  Chewing and swallowing food, to digesting food, to burning some calories today and turning others into fat, to burning the fat through the winter, to surviving the winter, to mating with a female, to the sperm fertilizing an egg inside the female, to the female giving birth to an offspring that shares 50% of the squirrel's genes.

With the sole exception of humans, no protein brain can imagine chains of causality that long, that abstract, and crossing that many domains.  With one exception, no protein brain is even capable of drawing the consequential link from chewing and swallowing to inclusive reproductive fitness.

Yet natural selection exploits links between local actions and distant reproductive benefits.  In wide generality, across domains, and through levels of abstraction that confuse some humans.  Because - of course - the basic evolutionary idiom works through the actual real-world consequences, avoiding the difficulty of having a brain imagine them.

Naturally, this also misses the efficiency of having a brain imagine consequences.  It takes millions of years and billions of dead bodies to build complex machines this way.  And if you want to memorize the location of a nut tree, you're out of luck.

Gradually DNA acquired the ability to build protein computers, brains, that could learn small modular facets of reality like the location of nut trees. To call these brains "limited" implies that a speed limit was tacked onto a general learning device, which isn't what happened.  It's just that the incremental successes of particular mutations tended to build out into domain-specific nut-tree-mapping programs.  (If you know how to program, you can verify for yourself that it's easier to build a nut-tree-mapper than an Artificial General Intelligence.)

One idiom that brain-building DNA seems to have hit on, over and over, is reinforcement learning - repeating policies similar to policies previously rewarded.  If a food contains lots of calories and doesn't make you sick, then eat more foods that have similar tastes.  This doesn't require a brain that visualizes the whole chain of digestive causality.

Reinforcement learning isn't trivial:  You've got to chop up taste space into neighborhoods of similarity, and stick a sensor in the stomach to detect calories or indigestion, and do some kind of long-term-potentiation that strengthens the eating impulse.  But it seems much easier for evolution to hit on reinforcement learning, than a brain that accurately visualizes the digestive system, let alone a brain that accurately visualizes the reproductive consequences N months later.

(This efficiency does come at a price:  If the environment changes, making food no longer scarce and famines improbable, the organisms may go on eating food until they explode.)
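The taste-reinforcement loop described above can be sketched in a few lines. This is a toy illustration, not a model of any real nervous system: the two-dimensional "taste space", the Gaussian similarity function, and the names TasteLearner, eat, and appetite are all hypothetical.

```python
import math


def similarity(taste_a, taste_b, width=1.0):
    """Neighborhoods in taste space: nearby tastes count as similar."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(taste_a, taste_b))
    return math.exp(-dist_sq / (2 * width ** 2))


class TasteLearner:
    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.memories = []  # (taste, strengthened-or-weakened impulse) pairs

    def eat(self, taste, calories, made_sick):
        # The "stomach sensor": calories are rewarding, indigestion is punishing.
        reward = -1.0 if made_sick else calories
        self.memories.append((taste, self.learning_rate * reward))

    def appetite(self, taste):
        # Eating impulse: past rewards, weighted by similarity to this taste.
        return sum(similarity(taste, t) * r for t, r in self.memories)


learner = TasteLearner()
learner.eat((1.0, 0.0), calories=1.0, made_sick=False)  # sweet, caloric
learner.eat((0.0, 1.0), calories=0.2, made_sick=True)   # bitter, made it sick

sweetish = learner.appetite((0.9, 0.1))  # positive: eat more like this
bitterish = learner.appetite((0.1, 0.9))  # negative: avoid
```

Nothing in the learner represents digestion, fat storage, or offspring. It also has no term that would tell it to stop if calories ceased to be scarce.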

Similarly, a bird doesn't have to cognitively model the airflow over its wings.  It just has to track which wing-flapping policies cause it to lurch.

Why not learn to like food based on reproductive success, so that you'll stop liking the taste of candy if it stops leading to reproductive success?  Why don't birds wait and see which wing-flapping policies result in more eggs, not just more stability?

Because it takes too long.  Reinforcement learning still requires you to wait for the detected consequences before you learn.

Now, if a protein brain could imagine the consequences, accurately, it wouldn't need a reinforcement sensor that waited for them to actually happen.

Put a food reward in a transparent box.  Put the corresponding key, which looks unique and uniquely corresponds to that box, in another transparent box.  Put the key to that box in another box.  Do this with five boxes.  Mix in another sequence of five boxes that doesn't lead to a food reward.  Then offer a choice of two keys, one which starts the sequence of five boxes leading to food, one which starts the sequence leading nowhere.

Chimpanzees can learn to do this.  (Dohl 1970.)  So consequentialist reasoning, backward chaining from goal to action, is not strictly limited to Homo sapiens.
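Backward chaining on the box task can be sketched as follows. This shows only the reasoning pattern, and makes no claim about how a chimpanzee actually solves the task; the key labels and the dictionary layout are hypothetical.

```python
def first_key_for(goal, key_that_frees):
    """Chain backward from the goal to the first action needed.

    key_that_frees maps an item locked in a box to the key that opens that box.
    An item with no entry is not locked up, so it is the place to start.
    """
    item = goal
    while item in key_that_frees:
        item = key_that_frees[item]
    return item


# Five boxes leading to food: key A1 opens the box holding A2, and so on.
food_chain = {"food": "A5", "A5": "A4", "A4": "A3", "A3": "A2", "A2": "A1"}
# Five boxes leading nowhere.
decoy_chain = {"empty": "B5", "B5": "B4", "B4": "B3", "B3": "B2", "B2": "B1"}
key_that_frees = {**food_chain, **decoy_chain}

offered = {"A1", "B1"}
choice = first_key_for("food", key_that_frees)
print(choice in offered, choice)
```

The search starts at the goal and never has to try a key to see what happens, which is exactly what reinforcement learning cannot do.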

But as far as I know, no non-primate species can pull that trick.  And working with a few transparent boxes is nothing compared to the kind of high-falutin' cross-domain reasoning you would need to causally link food to inclusive fitness.  (Never mind linking reciprocal altruism to inclusive fitness.)  Reinforcement learning seems to evolve a lot more easily.

When natural selection builds a digestible-calorie-sensor linked by reinforcement learning to taste, then the DNA itself embodies the implicit belief that calories lead to reproduction.  So the long-term, complicated, cross-domain, distant link from calories to reproduction, is learned by natural selection - it's implicit in the reinforcement learning mechanism that uses calories as a reward signal.

Only short-term consequences, which the protein brains can quickly observe and easily learn from, get hooked up to protein learning.  The DNA builds a protein computer that seeks calories, rather than, say, chewiness.  Then the protein computer learns which tastes are caloric.  (Oversimplified, I know.  Lots of inductive hints embedded in this machinery.)

But the DNA had better hope that its protein computer never ends up in an environment where calories are bad for it... or where sexual pleasure stops correlating to reproduction... or where there are marketers that intelligently reverse-engineer reward signals...