... Or more specifically, a post on how, and why, to encode emotions to find out more about goals, rationality and safe alignment in general.


If one naively tries to map reinforcement learning’s reward functions back onto the human mind, the closest equivalent one finds is emotions. Delving deeper into the topic, though, one finds a mishmash of other “reward signals” and auxiliary mechanisms in the human brain (such as the face tracking reflex, which aids us in social development), and ends up hearing about affects, the official term in the study of emotions.

At least that is roughly what happened to me.


Affects and reward functions seem to serve a common functional purpose in agents: both direct the agent’s attention towards what is relevant. More specifically, both:


  • Evaluate the ‘goodness’ (valence) of a situation.
  • Are required for the ‘agent’ to perform any actions.
  • Define what the agents learn, what they value, and what goals they can create based on these values.

This means that if we can map and write all of the human affects into reward functions, we can compare various constellations of affects and see which ones produce which human-like behaviours. This in turn may lead not only to ways of inducing human-like biases in AI, but also to a new perspective on our own values and rationality.
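As a minimal sketch of what writing affects into reward functions and comparing constellations could look like, the Python below treats each affect as a weighted valence function over an observation and sums them into one scalar reward. The `Affect` class, the feature names (`energy_deficit`, `faces_in_view`) and the weights are hypothetical placeholders for illustration, not a claim about how any real affect works.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# An observation is a plain dict of named features (hypothetical format).
Observation = Dict[str, float]


@dataclass
class Affect:
    name: str
    weight: float
    # Signed "goodness" (valence) this affect assigns to an observation.
    valence: Callable[[Observation], float]


def total_reward(constellation: List[Affect], obs: Observation) -> float:
    """Combine a constellation of affects into a single scalar reward."""
    return sum(a.weight * a.valence(obs) for a in constellation)


# Two toy affects (assumptions, not measured quantities).
hunger = Affect("hunger", 1.0, lambda o: -o.get("energy_deficit", 0.0))
social = Affect("social_contact", 0.5, lambda o: o.get("faces_in_view", 0.0))

# Two constellations to compare on the same observation.
solitary = [hunger]
sociable = [hunger, social]

obs = {"energy_deficit": 0.2, "faces_in_view": 1.0}
solitary_reward = total_reward(solitary, obs)
sociable_reward = total_reward(sociable, obs)
```

Swapping one constellation for another changes what the same observation is worth to the agent, which is the kind of comparison proposed above.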


The purpose of this post is to introduce a few ways to proceed with the task of encoding affects. First, there will be a rudimentary definition of the components and some motivational points on what this all could mean. After that there will be an introduction to three distinct levels of representation for various use cases, from philosophical to ontological, and finally pseudocode.



This post is meant to act as a conversation starter, so many points might be alarmingly concise.

The formalities are meant to replicate the functionality of human behaviour, and are not claimed to be exact copies of the neurological mechanisms themselves. Tests should be performed to find out what emerges in the end. Some points might be controversial, so discussion is welcome.




Alright, let's define the two a bit more in depth and see how they compare. 

Reward functions are part of reinforcement learning, where a “computer program interacts with a dynamic environment in which it must perform a certain goal [sic: the goal here is the programmer’s] (such as driving a vehicle or playing a game against an opponent). As it navigates its problem space, the program is provided feedback that's analogous to rewards, which it tries to maximize.”

- Wikipedia, reinforcement learning

Specific points about reward functions that will conveniently compare well with my argument:

  • Reward functions are the handcrafted functions that measure the agent’s input and assign a score w.r.t. how well it is doing.
  • They are what connect the state of the world to the best actions to take, encoded into the memory of the agent.
  • Rewards are essential in generating values, policies, models and goals.


Affect theory is “the idea that feelings and emotions are the primary motives for human behaviour, with people desiring to maximise their positive feelings and minimise their negative ones. Within the theory, affects are considered to be innate and universal responses that create consciousness and direct cognition.” 

- APA, affect theory

Specific points about affects (citations pending):

  • Affects are the preprogrammed behavioural cues we got from the evolutionary bottleneck, our genes.
  • They direct what our cortical regions learn about the world, being the basis for our values and goals.
  • Without affects, we would have no values; without values, we would have no goals [1, 2].

Disclaimer: The claim here is not that all values and goals come directly from emotions later in life, when they can be based on other values and existing knowledge, but rather that our very first values came from affects during infancy and childhood, and thus the ultimate source of all values is, in the end, affects.


Further elaboration can also be found in appraisal theory and affective neuroscience.


So what is common? 

Both frameworks define what the agent can learn, what it values, and what goals it can create based on these values. I will posit even further that neither humans nor AI would “learn what to do” if there were no criteria to learn towards; they would perform only reflexive and random actions. We can see this clearly from the definition of RL agents: remove their reward function, and they cannot learn the "relevant" connections from the environment they work in. With humans we could study brain lesion patients and birth defects, but more on that later. What I have found thus far is inconclusive, but the search continues.
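The RL half of this claim can be illustrated with a toy sketch, assuming a three-armed bandit, a value-averaging learner and a fixed exploration schedule (all hypothetical choices made for brevity). With a reward function the learner ends up preferring one arm; with the reward function removed, every value estimate stays at zero and no preference can form.

```python
from typing import Callable, List, Optional


def run_learner(
    reward_fn: Optional[Callable[[int], float]],
    n_arms: int = 3,
    steps: int = 100,
    explore_every: int = 10,
) -> List[float]:
    """Return the learned value estimate per arm after `steps` pulls."""
    values = [0.0] * n_arms
    counts = [0] * n_arms
    for step in range(steps):
        if step % explore_every == 0:
            # Scheduled exploration: cycle through the arms in order.
            arm = (step // explore_every) % n_arms
        else:
            # Greedy choice; ties go to the lowest-numbered arm.
            arm = values.index(max(values))
        # No reward function means no evaluative signal at all.
        reward = reward_fn(arm) if reward_fn is not None else 0.0
        counts[arm] += 1
        # Incremental mean of the rewards seen for this arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values


ARM_REWARDS = [0.1, 0.5, 0.9]  # hypothetical payoffs per arm
with_reward = run_learner(lambda arm: ARM_REWARDS[arm])
without_reward = run_learner(None)
```

Here `without_reward` carries no information that distinguishes the arms, so whatever the agent does next is decided by its tie-breaking rule rather than by anything it learned.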


But what does it all mean?


Meanwhile, let’s discuss a number of beliefs I have regarding the topic; I am more certain of some than of others. Each of these could have a discussion of its own, but I will simply list them here for now.


  1. Turning affects into reward functions will enable agents to attain "human-like intelligence". Note, NOT human-level, but an intelligence with possibly the same biases, such as the bias to learn social interactions more readily.
  2. Affects are our prime example of an evaluation-system working in a general intelligence. Although they might not be optimal together, we can compare various constellations of affects and see which ones produce what human-like behaviours.
  3. We could align AI better for humans if we knew more about how we ourselves form our values.
  4. We could also formulate an extra layer for rationality if we better understood the birth and emergence of various value sets.
  5. We could communicate better with AI if its vocabulary were similar to ours. Introducing the same needs to form social connections and directing its learning towards speech could allow an AI to learn language as we do.
  6. If our goals are defined by years of experience on top of affects, we are hard pressed to define such goals for the AI. If we tell it to "not kill humans", it does not have the understanding of 'human', 'kill' or even 'not' until it has formulated those concepts into its knowledge base/cortical regions over years' worth of experience (e.g. GPT has 'centuries worth of experience' in its training data). At that point it is likely too late.
  7. If all goals and values stem from emotions, then:
    • There are no profound reasons to have a certain value set, as they all originate from the possibly arbitrary set of affects we have. But certain values can come as natural by-products of certain instrumental goals, such as survival.
    • Rationality can thus be exercised only w.r.t. a certain set of values.

Alright, back to business.


Examples on representation


Here are three different levels of representation for encoding affects and their interplay within consciousness, from philosophical ponderings to actual pseudocode. 


Surprise was already partially developed with TD-Lambda, but has been further refined by DeepMind with episodic curiosity.

In practice, the agent could be constantly predicting what it will perceive next, and if what it actually perceives is wildly different from this prediction, the amount of surprise could be directly proportional to the prediction error.
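A minimal sketch of this idea in Python, assuming observations are fixed-length vectors of floats and using the Euclidean distance as the prediction error. The `RunningMeanPredictor` is a hypothetical stand-in for whatever world model the agent actually learns, and `scale` is an assumed proportionality constant.

```python
import math
from typing import List, Sequence


def surprise(predicted: Sequence[float], observed: Sequence[float],
             scale: float = 1.0) -> float:
    """Surprise as a value directly proportional to the prediction error."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have equal length")
    error = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)))
    return scale * error


class RunningMeanPredictor:
    """Toy predictor: expects the next observation to equal the running mean."""

    def __init__(self, size: int) -> None:
        self.mean: List[float] = [0.0] * size
        self.count = 0

    def predict(self) -> List[float]:
        return list(self.mean)

    def update(self, observed: Sequence[float]) -> None:
        self.count += 1
        self.mean = [m + (o - m) / self.count
                     for m, o in zip(self.mean, observed)]


predictor = RunningMeanPredictor(size=2)
predictor.update([2.0, 2.0])

# A familiar observation versus a wildly different one.
familiar = surprise(predictor.predict(), [2.0, 2.0])
novel = surprise(predictor.predict(), [5.0, 6.0])
```

In an RL setting the returned value could be added to the reward as an intrinsic bonus, though whether surprise should be rewarding or aversive is left open here.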


Ontology card: Reciprocity violation

An example of a more fleshed out ontology with the theoretical fields addressed.

This is a card we designed with an affective psychologist some time ago; it basically outlines the interface for the affect within the whole system.


Pseudocode: Learning the concept ‘agent’ 

Affects also have implicit requirements for their functionality, such as the concept “another agent” to tie all of the social affects to. Sorry, this one is a bit of a mess, but the idea should be visible.

This pseudocode addresses a developmental window we humans have, which helps us generate the concept of "agents" faster and more reliably. It is a learned feature within our consciousness, but because many of our social affects are based on the concept of an agent, this is something the genes just have to be sure that the brain has learned (and then linked to via a preprogrammed pointer, meaning the 'affect'). We can see this mechanism breaking partially in some cases of severe autism, where the child doesn't look people in the eyes.
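One way the developmental-window idea could be sketched in Python, purely as an illustration: an innate bonus reward for attending to face-like input, whose gain decays linearly to zero as the agent ages. The function names, the `face_likeness` score (assumed to lie between 0 and 1) and the window length are all hypothetical, and the linear decay is an assumption rather than a claim about human development.

```python
def window_gain(age_steps: int, window_end: int) -> float:
    """Gain of the developmental window: 1.0 at age 0, 0.0 once it closes."""
    if window_end <= 0:
        raise ValueError("window_end must be positive")
    return max(0.0, 1.0 - age_steps / window_end)


def agent_attention_reward(face_likeness: float, age_steps: int,
                           window_end: int = 1000) -> float:
    """Innate bonus for attending to face-like input while the window is open.

    The bonus only biases what gets learned early on; the concept 'agent'
    itself would still have to be learned by the rest of the system.
    """
    return window_gain(age_steps, window_end) * face_likeness


# The same face-like stimulus at three different ages.
newborn_bonus = agent_attention_reward(0.8, age_steps=0)
infant_bonus = agent_attention_reward(0.8, age_steps=500)
adult_bonus = agent_attention_reward(0.8, age_steps=5000)
```

Once the window has closed the bonus vanishes, so any social affect that points at the learned concept has to rely on what was acquired while the window was open.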

These pseudocodes could eventually be combined into a framework of actual code, and may be tested in places such as OpenAI's multi-agent environment.


Alright, that's it for this one. I'll just end this with a repetition of the disclaimer:

Tests should be performed to find out what emerges in the end. Some points might be controversial, so discussion is welcome, and this material will eventually be revised.


10 comments

This reminds me of this post about the encoding of "fear" of snakes: https://www.lesswrong.com/posts/bgqmv8YF6HA3mvrPM/attention-to-snakes-not-fear-of-snakes-evolution-encoding

Does this fit? Do you have other examples?

It fits perfectly, thanks! 
Yes, there's a bunch of other mechanisms/phenomena, such as 
- the developmental windows for learning speech and language,
- the spectrum of reactions to distress (anger, fear, etc.),
- the palmar grasp reflex. 
Basically I'm interested in all biological mechanisms that control our learning, not just affects, and even if they seem irrelevant for AI purposes. As can be seen from Kaj's post there, the way to get these systems to work might be nonintuitive, so every little hint will help in the end.

I think another post might be in order to fully explore the list of all of these biological mechanisms at some point, maybe as a pitstop before going into the full deal. 

I have found a source of some more plausible mechanisms tied to common emotions here: Dares, costly signals, and psychopaths (which references The Psychopath Code, see raw text on Github). These sources are focused on psychopaths but give extremely well-suited descriptions of the following classes of emotions:

  • The predator emotions help us hunt and capture prey.
  • The defense emotions prepare us to detect and deal with predators and competitors.
  • The sexual emotions drive us to find sexual partners.
  • The family emotions let us talk to our parents and care for our offspring.
  • The group emotions let us form small social groups.
  • The social emotions let us form looser and larger social groups.

Some examples: 

Hunger [...] Your digestion slows. Your vision and hearing gets sharper and you focus on distinguishing prey from threats. You feel the need to move, yet you are careful to stay invisible. You walk without haste, and keep your posture relaxed. Your breathing is regular, slow.

Euphoria [...] Your hearing switches off and your vision tunnels in on your target. Your breathing and heartbeat accelerate. Blood flows to your muscles, and glucose feeds into your blood. Your eyes widen, your mouth opens, and you bare your teeth.

Surprise [...] "startle response." You flinch away from the threat, and raise your arms in self-defense. You lift your eyebrows and open your eyes wide to see better. Your hearing gets sharp. You exhale hard to clear your lungs of carbon dioxide. Your heart accelerates and you breathe in deep to oxygenate your body for action.

Love - [...] We establish "closeness" by mutual physical contact. The kinds of contact depend on the relationship. The closer you are to another person the more you feel the emotion. Your eyebrows rise, your pupils widen, you smile and laugh and feel happy. You use open and dominant body language. You are more childlike: playful and uninhibited. You seek more contact. You need less sleep.

All of the descriptions are like this, and I think they are an excellent source when looking for mechanisms that facilitate the recognition of the more abstract patterns.

Other things that are candidates: 

  • Fear of heights could work like the spider thing: the visual system detects "height" based on depth information and a downward look, and responds as in the fear-of-spiders case with increased heart rate and attention.
  • What we find beautiful could come from a heart-rate increase and/or other positively valued responses to low-complexity visual cues such as:
    • the smoothness of visual features or easy to predict patterns ("clear forms") - at least easy to predict for shallow neuronal networks
    • Same for sound patterns - or maybe a spectrum with many small peaks as in surf or wind sounds or cafe conversation. 
  • The visual cues for sexual attraction are relatively well-known. Obviously, the strength of the endocrine response is high. It is plausible that the high number of different fetishes can be explained by the brain learning to associate anything with such a strong signal, not just a single specific thing as in the spider response.

There's one issue that I don't have an answer to yet: how would the visual system detect "height"?
Could we presume there is a spatial engine that needs to be taught first, and then linked to this phobia?

Or would it make sense to have a direct link to a spatial predictive system instead, so that the fear triggers when the system predicts that the agent might suddenly need more space to maneuver, and that space is instead occupied by a void? At least *I* cannot look up when the fear of heights triggers; I get a sudden sensation of vertigo, and I need to know where the closest brace-point is when I know falling might be imminent.

The visual system wouldn't detect the abstract concept of height; that would be the brain's job to figure out by being primed on when the thing triggers and what else correlates with it.

I imagine the visual system would detect visual depth from binocular vision. Babies learn this in the first few months. It is one of the things that cause them distress when it gets activated in the brain. I don't know the research papers, but these might be starting pointers:

https://www.beltz.de/fileadmin/beltz/leseproben/978-3-621-27926-0.pdf (picture 2.2, German)   

https://www.thewonderweeks.com/babys-mental-leaps-first-year/ (week 26)

So you have visual depth without much learning (or with other priming steps ahead of that; I understand these are well researched). What is left is the vertical component, and I guess that it comes from the vestibular system. Looking down + visual depth = height trigger.

It is funny that you mention the need to grasp something, and maybe that is the hard-wired cue: Close the hand. 

If I understood correctly, babies cannot focus their eyes properly for the first two months, and this may indicate they are learning some universal 3D-spatial models into their heads, as a prerequisite for many of the other instincts they have as later developmental windows. So there has to be some thread of signals that string this system to the later affects/instincts, such as the fear of heights. 

It is also funny to relate this to the ability of many ungulate babies to walk immediately after birth, meaning there has to be some seriously robust set of instincts that coordinates this for them. This blurs the ... requirements... between instinctual and learned coordination, but I believe in the end all cortex-having brains would benefit from moving away from instincts and into learned models.

I'll have to read this one too, thanks.