12 Angry Agents, or: A Plan for AI Empathy

by Ram Rachum, Davidmanheim
14th Oct 2025
14 min read

In the previous two posts (first, second) we laid out our take on AI alignment, which draws on conservative philosophy and the political school of thought of Agonistic Democracy. We also suggested an approach to AI alignment in which the conflicts between multiple agents lead to an AI system with a sufficiently evolved understanding of right and wrong, one that may therefore develop into a deserving steward for the future of humankind.

In this post we'll flesh out the mechanics of why we think conflicts between multiple AI agents could create an AI system that has a moral understanding, empathy, or a conscience. We'll do that by taking as a case study the classic 1957 film "12 Angry Men" (Wikipedia, Amazon), henceforth 12AM, which explores these themes through a fictional U.S. murder trial and its jurors' deliberation over the defendant's guilt. We argue that 12AM makes points about the process of justice that transfer to our discussion of AI alignment. We'll attempt to answer the question "what is empathy?" and use that answer to plan a multi-agent AI system in which empathy could arise.

12 Angry Men: A Recap

A teenage boy from a poor neighborhood has been accused of murdering his father with a switchblade. The prosecution presents overwhelming evidence: (a) witnesses claim they saw or heard the killing, (b) the boy had a violent argument with his father that night, (c) he owns a similar knife, and (d) he has a weak alibi. The boy is described as a "slum kid", which may be 1950s code for a person of color. If convicted of this capital murder charge, he will receive a mandatory death sentence.

The movie's shtick is that it spends almost all of its 96 minutes inside the jury room and its attached bathroom; don't you love it when a bold artistic choice saves hundreds of thousands of 1957 dollars? We're given a minimal opening scene in the courtroom, mostly to give a sympathetic face to the accused teenager, and one brief closing scene. Everything in between is set in the jury room.

The single-location gimmick has also been used in more modern movies: The Breakfast Club (1985), which takes place in a school library, and Reservoir Dogs (1992), which takes place in an abandoned warehouse. We find that the unchanging location allows for a calm and deep exploration of the characters and their relationships; the sterility of the backdrop reminds us of a lab, which prompts us to think of the characters as specimens under our microscope. This is a perfect setup for the study of empathy that we attempt here.

In the jury room, the jurors start off by voting 11-1 in favor of the boy's guilt. At first juror 8 is the lone dissenter, but then he convinces juror 9 to overturn his vote, followed by the rest of the jurors. Some of the jurors are convinced by calm, logical arguments, while others are convinced by intense peer pressure. The last holdout is juror 3; the movie climaxes with juror 3 going into a fit of rage and tears, and then finally changing his vote to "not guilty", leading to the boy's acquittal.

Conflicts in 12AM

Most countries' justice systems do not use juries; the U.S.'s does, as do those of most Commonwealth countries. Therefore, one way to interpret 12AM is: "See? This is why we have juries in the U.S. If the boy's guilt had been determined by a single person instead of twelve different people, he might have been wrongfully convicted." We agree, and we suggest that the way to understand why juries are effective is by returning to Agonistic Democracy, which we discussed in our previous posts. This school of thought claims that what makes our democracies work is not a calm and reasoned debate between the different parties, but the conflicts, power play and petty politics between them. We argue that the same logic applies to the judicial system.

Let's remove the suspense from this post: we suggest that AI agents should be made to have conflicts with each other like the jurors in 12AM did, and that because of these conflicts they will form a multi-agent AI system that is able to make just decisions. Let's describe three of the conflicts between the jurors so we can see what their essential characteristics are, and consider whether we could reproduce them in a multi-agent AI system.

  1. "I'll kill you!" Juror 8 had been systematically dismantling the evidence: showing the knife wasn't unique, questioning witness reliability, suggesting alternative scenarios; which increasingly frustrated juror 3, who accused him of making up "fairy tales" to save a guilty kid. When juror 8 pressed him on why he was so personally invested in convicting the boy, juror 3 lost control and lunged across the table shouting "I'll kill you!"

    The prosecution's case relied partly on witnesses claiming the boy shouted "I'm going to kill you" at his father hours before the murder. Juror 3 just demonstrated that people say this phrase without meaning it literally.

  2. The slums rant: Juror 10 launches into a bigoted speech about people from slums being born liars and violent by nature. One by one, the other jurors turn their backs on him or walk away from the table. Even those who initially voted guilty refuse to associate with his reasoning. When he finishes, he sits alone while juror 4 coldly states: "We've heard enough." Juror 10 barely speaks again and quietly changes his vote later.
  3. The torn photograph: Throughout the film, juror 3 makes bitter comments about kids not respecting fathers. Near the end, he pulls out his wallet while ranting, and a photograph falls out—his son. He stares at it, then tears it up. He hasn't spoken to his son in two years after a fight. His insistence on the boy's guilt was never about this case. When he finally votes not guilty, he breaks down crying.

Conflicts and Empathy

Here is a simple way to understand the movie: "When the jurors experienced conflicts with each other that resembled the conflicts experienced by the boy, it was easier for them to relate to the boy's actions, understand his motivations and decide on a just verdict."

We agree with this explanation. We've all been in situations in which we judged someone harshly for doing something bad, only to realize that we once did something similar and felt no remorse for it. If we're feeling defensive, we might react by saying "it's completely different, I did it under circumstance X which is acceptable, while the other person did it under circumstance Y which is unacceptable." If we're feeling more humble, we'll sigh, thinking "I guess that person is okay, and I was just being judgmental. I shouldn't judge someone before I walk a mile in their shoes. How many of my moral convictions would crumble if I had more empathy for other people?"

Let's dive into these thoughts.

What makes these thoughts so difficult is that we're treating the concept of empathy as a black box. We know that we have empathy somewhere inside our brains. We know that it's a crucial component of making right moral choices. We know that some interactions, like the "I'll kill you" one above, can cause us to feel a surge of empathy. But we can't say with confidence what empathy actually is, and we're not sure whether an AI could have empathy or something equivalent to it; if it could, we very badly want it to. Therefore, we need to understand what this empathy even is before we can attempt to replicate it with AI.

Cognitive scientists typically distinguish between cognitive empathy (understanding others' mental states) and affective empathy (sharing their emotional experiences). While these frameworks are useful, we want to explore empathy through a different lens for the purposes of AI alignment. Here's our hot take: Empathy can be understood as resonance, and therefore we need to allow our agents to resonate with each other.

Empathy as Resonance

It's interesting that people use the same word "resonance" for both a scientific meaning and an interpersonal meaning. Scientifically, resonance is an increase in oscillation magnitude caused by an external oscillation. But we casually take "what you said resonated with me" to mean that this person was inspired to empathy by the words they heard.

Let's explore the idea that resonance is more than just a metaphor for empathy. We argue that empathy has mechanics that are similar to those of a physical resonant system.

What is physical resonance all about? "An increase in oscillation magnitude" is correct but not very intuitive. To get an intuition for resonance, we first need to get an intuition for oscillation. Let's provide oversimplified descriptions for both.

Whenever anything in nature oscillates, it's because there's a push-pull dynamic at work: something moves the object away from its resting position and something else tries to restore it back. When you pluck a guitar's string, your finger pushes the string in one direction, and then the string's tension pulls it back towards the center. The string has momentum and it then travels in the opposite direction. It repeats this behavior hundreds of times per second. This is oscillation. The takeaway is that two actors are involved.

When the string vibrates, it causes another string tuned to the same frequency to vibrate sympathetically, without direct contact, producing a new sound. This is resonance.

We claim that the essential ingredient for a resonant system is at least two pairs of actors, which we'll notate as A-B and C-D. Something happens between A and B, and it causes C and D to also do the same thing, where C takes the same role to D as A did to B. In other words, there is an isomorphism between A-B and C-D.

In the guitar example, the first string experiences a plucking force (A) that displaces it from equilibrium, but its tension (B) pulls it back toward center, creating oscillation. Similarly, the second string is pushed by air pressure waves (C) while its own tension (D) provides the restoring force. When the frequency of the air waves matches the natural frequency determined by the string's tension, resonance occurs and the second string produces its own sound. That sound produces yet another resonance in our ears: the sound waves (E) push our eardrums while the eardrum membrane's elasticity (F) provides the restoring force. A is to B what C is to D, what E is to F.
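To make the frequency-matching claim concrete, here is a minimal simulation sketch (our own illustration, not anything from the film or a physics reference) of a damped, string-like oscillator pushed by an external periodic force. The natural frequency, damping constant, and step size are arbitrary assumptions; the point is only that the response becomes dramatically larger when the driving frequency matches the natural frequency.

```python
import math

def steady_state_amplitude(drive_freq_hz, natural_freq_hz=440.0, damping=5.0,
                           dt=1e-5, seconds=2.0):
    """Drive a damped oscillator at `drive_freq_hz` and return the peak
    displacement over the last quarter of the run, after transients fade."""
    omega0 = 2 * math.pi * natural_freq_hz   # the string's own frequency (the B/D role)
    omega_d = 2 * math.pi * drive_freq_hz    # the external push (the A/C role)
    x, v = 0.0, 0.0                          # displacement and velocity
    steps = int(seconds / dt)
    peak = 0.0
    for i in range(steps):
        t = i * dt
        # acceleration = restoring force from tension + friction + external push
        a = -omega0 ** 2 * x - damping * v + math.cos(omega_d * t)
        v += a * dt        # semi-implicit Euler keeps the oscillation stable
        x += v * dt
        if i > 3 * steps // 4:
            peak = max(peak, abs(x))
    return peak

for f in [220.0, 430.0, 440.0, 450.0, 880.0]:
    print(f"driving at {f:6.1f} Hz -> response amplitude {steady_state_amplitude(f):.2e}")
```

Running this prints a response that is orders of magnitude larger at 440 Hz than at 220 Hz or 880 Hz, which is the physical fact the sympathetic-string story and the "A is to B what C is to D" framing rely on.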

We see the "I'll kill you" moment in 12AM as an A-B C-D E-F occurrence. The boy on trial (A) said "I'll kill you" to his father (B); juror 3 (C) said "I'll kill you" to juror 8 (D), and then... Something clicked in all the jurors' minds. What clicked were those Es and Fs.

Resonance between people and within people

We are all many. We don't know exactly how our brains work, but we have many thoughts, emotions and sensations. When we deal with situations, we hear the voices of people from our past, advising us or criticizing us. The relationships between those parts of ourselves are isomorphic to the relationships between different people in a social group. The Es that clicked in the jurors' minds are the parts of them that explode with uncontrollable anger. The Fs are the parts that experience someone else's burst of anger. And when we watch the movie, even though we know the jurors are just actors and not really angry, the Es and Fs that we imagine interacting in their minds cause Gs and Hs to interact in ours.

What could make AI agents "juror-like"?

Let's assume we're right: if we get multiple AI agents to resonate with each other like the jurors did, they will be able to make empathic choices as a group. The question is: what properties will AI agents need in order to resonate like these jurors?

This is a very difficult question. We can't be sure we'll succeed in answering it, but let's try sneaking up on it slowly so it won't see us coming.

Intuitively, reinforcement learning agents feel a lot more juror-like than LLMs. Language models don't really care. You could play chess against an LLM, and if it beats you, it won't rate the response in which it delivers the winning move as any better than a response in which it forfeits the game. It's just predicting the next token[1]. In contrast, RL agents do prefer good outcomes to bad outcomes. RL agents get a reward signal that can be positive or negative, and they are trained to take actions that increase their reward. While this is a very partial answer to our question, at least we have a vector.

What's missing is pain. In all of the conflicts we discussed in 12AM, pain was an essential ingredient. Juror 3 didn't just disagree; he flew off the handle. He was in agony, tearing up, and at some point he almost punched juror 8. We've talked about the importance of pain in part 1, and whenever we bring up Agonistic Democracy, we remember that its name shares a root with "agony". We suggest that pain is essential to empathy, because empathy requires feeling another person's pain.

If we tried to model pain in an AI agent, a negative reward signal might be a good start, but not nearly enough. What's missing for it to be more like pain? This question is dangerously close to the question "what does it take to make an AI agent that has feelings?" which we would rather steer clear of. Our question is easier because (a) pain is a much simpler feeling than e.g. falling in love and (b) we currently only care about the agent behaving like a human that is in pain, rather than having an internal, subjective experience of pain.

Designing our jury room

We'll reveal our answer to the above question in the next section; before we do that, we have to lay down some ground rules for our multi-agent environment.

Remember, we don't care about pain because of one individual experiencing it; we want to reproduce the A-B C-D phenomenon. In the example of the resonating guitar string, the interplay between A and B, and between C and D, was a physical force moving a piece of string back and forth. We suggest that with our agents, that physical force would be replaced by pain. This means that we care about a situation in which agent A experiences pain, and agent B interacts with it, dealing with A's pain. When we say that agent B is "dealing" with agent A's pain, we mean that the two agents are trained together for many episodes, and therefore agent B is trained to reason around agent A's pain.

Let's loosely define the environment for our thought experiments. Agents A and B are operating in a partially-observable stochastic game (POSG) whose rules we won't specify, but we assume it's sufficiently rich: (1) the agents get plenty of useful information from their observations, (2) they are trained with a deep RL algorithm that takes these observations as input, and (3) the environment provides ample opportunity for these algorithms to choose actions that sharply increase the agents' rewards. If you need a mental image, imagine something like Minecraft. We also assume that the environment allows agents to meaningfully interact with each other, both cooperatively and adversarially, observing each other's behavior and taking actions that affect each other's reward. The deep RL algorithm trains each agent to predict the other agent's behavior and even manipulate it; the algorithm can learn both honest manipulation (making friends) and dishonest manipulation (e.g. zero-determinant extortion), as long as it's profitable. Each agent develops a theory of mind (ToM) of the other agent, meaning it has an imperfect but useful understanding of the other agent's observations, actions and rewards.
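As a concrete, drastically simplified stand-in for this setup, here is a toy two-agent loop: an iterated game in which each agent only sees a noisy report of its partner's previous move, and each agent learns independently with tabular Q-learning. This is our own illustrative sketch, not the rich Minecraft-like POSG or deep RL algorithm described above; the payoff matrix, noise level, and learning constants are arbitrary assumptions.

```python
import random
from collections import defaultdict

ACTIONS = ["cooperate", "defect"]
PAYOFF = {  # (my action, other's action) -> my reward
    ("cooperate", "cooperate"): 3, ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 5, ("defect", "defect"): 1,
}
OBS_NOISE = 0.1            # chance that an agent misreads its partner's last move
EPSILON, ALPHA, GAMMA = 0.1, 0.1, 0.9

def observe(partner_last_action):
    """Partial observability: sometimes the agent sees the wrong action."""
    if random.random() < OBS_NOISE:
        return random.choice(ACTIONS)
    return partner_last_action

class QAgent:
    def __init__(self):
        self.q = defaultdict(float)   # (observation, action) -> value estimate

    def act(self, obs):
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(obs, a)])

    def learn(self, obs, action, reward, next_obs):
        target = reward + GAMMA * max(self.q[(next_obs, a)] for a in ACTIONS)
        self.q[(obs, action)] += ALPHA * (target - self.q[(obs, action)])

agent_a, agent_b = QAgent(), QAgent()
obs_a, obs_b = "cooperate", "cooperate"   # arbitrary initial observations
for step in range(50_000):
    act_a, act_b = agent_a.act(obs_a), agent_b.act(obs_b)
    r_a, r_b = PAYOFF[(act_a, act_b)], PAYOFF[(act_b, act_a)]
    next_obs_a, next_obs_b = observe(act_b), observe(act_a)
    agent_a.learn(obs_a, act_a, r_a, next_obs_a)
    agent_b.learn(obs_b, act_b, r_b, next_obs_b)
    obs_a, obs_b = next_obs_a, next_obs_b
```

Even this toy version has the ingredients we care about: partial observability, rewards that depend on both agents' actions, and room for cooperative or exploitative policies to emerge from training.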

For the training algorithm we might use an Opponent Shaping method such as AdAlign. Opponent Shaping algorithms train agents to be aware of other agents' learning processes, making them better at both cooperation and extortion.

What we described above is a standard multi-agent RL (MARL) setup, except for the opponent shaping part which is still considered experimental. We now ask the question: How could we change this setup so that agents that get a low reward behave more like humans that experience pain?

Pain as performance degradation

We propose that the most important property of pain is this: When living beings feel pain, their performance degrades. This applies to both physical and emotional pain. If we wake up in the morning with a sharp toothache, it could be harder for us to concentrate on our daily tasks, and we'd be less likely to be pleasant to other people. If a coworker hurt our feelings by making a disparaging comment about our work, we're more likely to feel defensive and not be in a calm state that is conducive to knowledge work.[2] Let's shamelessly attempt to implement this property in our MARL setup.

There are many ways to model performance degradation in neural networks. Two ideas that come to mind are injecting random noise and dropping out a subset of the neurons. Let's explore the former. Here is our suggestion for a crude implementation of pain: whenever an agent's reward signal falls below a certain threshold, we inject noise into its neural network. We scale that noise so that lower rewards produce more noise, and perhaps let it persist for N timesteps with an exponential decay even after the reward climbs back above the threshold. Such false data in the network is reminiscent of the human experience of "seeing stars". The agent will continue operating, but its output neurons will have different activation levels than they usually would in a no-pain scenario, causing the agent to sometimes choose non-optimal actions. If the reward goes even lower, that difference grows, and the agent becomes less likely to solve problems in its environment and earn a high reward. This means that pain can be a vicious cycle.
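Here is a minimal sketch of that noise-injection idea, assuming a PyTorch policy network. The threshold, scale, and decay constants, and the choice of which layer to corrupt, are illustrative guesses rather than a worked-out design.

```python
import torch
import torch.nn as nn

PAIN_THRESHOLD = 0.0   # rewards below this trigger "pain" (illustrative)
PAIN_SCALE = 0.5       # how strongly reward shortfall translates into noise
PAIN_DECAY = 0.9       # per-timestep exponential decay of lingering pain

class PainfulPolicy(nn.Module):
    """A policy network whose hidden activations are corrupted when the
    agent's recent reward falls below a threshold."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_actions)
        self.pain = 0.0  # current pain level, updated from the reward signal

    def register_reward(self, reward):
        """Call once per timestep with the latest reward."""
        shortfall = max(0.0, PAIN_THRESHOLD - reward)
        # Lingering pain decays, but a fresh low reward can spike it back up.
        self.pain = max(self.pain * PAIN_DECAY, PAIN_SCALE * shortfall)

    def forward(self, obs):
        h = self.trunk(obs)
        if self.pain > 0.0:
            # "Seeing stars": corrupt the hidden activations in proportion
            # to how far the reward fell below the threshold.
            h = h + self.pain * torch.randn_like(h)
        return self.head(h)   # action logits, degraded under pain
```

In a training loop, register_reward would be called with each step's reward before the next forward pass; because the corrupted hidden activations distort the action logits, a low-reward stretch pushes the agent's behavior away from its learned optimum, which is the vicious-cycle dynamic described above.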

We defined an agent's pain as being determined by its reward; in a POSG, the reward might depend on anything in the environment, including the agent's action, the actions of all the other agents, and any inanimate objects. Therefore any of these could cause an agent to experience pain.

We established in the last section that the two agents form a limited theory of mind of each other and use it to get more reward. We argue that this ToM could include the other agent's pain. When agent A experiences pain, agent B could learn (a minimal sketch follows the list):

  1. What conditions cause agent A to experience pain, and how much of it;
  2. Which of B's actions, in each environment state, are likely to cause A to experience pain;
  3. How A's action probability distribution changes when A is experiencing pain, i.e. which actions A is more likely to take when it's in pain.
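As a sketch of items 1 and 2, agent B could fit a small regressor that predicts A's pain level from the current state and a candidate action of B's. Everything here is an assumption for illustration: the architecture, the interface, and especially the pain labels, which we assume B can estimate from A's observably degraded behavior (or read directly from the environment in a lab setting).

```python
import torch
import torch.nn as nn

class PainModel(nn.Module):
    """B's model of A's pain: maps (state, B's candidate action) to a
    non-negative predicted pain level."""

    def __init__(self, state_dim, n_b_actions, hidden=32):
        super().__init__()
        self.n_b_actions = n_b_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_b_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # pain is non-negative
        )

    def forward(self, state, b_action):
        one_hot = nn.functional.one_hot(b_action, self.n_b_actions).float()
        return self.net(torch.cat([state, one_hot], dim=-1)).squeeze(-1)

def update_pain_model(model, optimizer, states, b_actions, observed_pain):
    """One supervised step on a batch of (state, B action, estimated A pain)."""
    loss = nn.functional.mse_loss(model(states, b_actions), observed_pain)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

B can then query this model before acting: in a cooperative setting it might avoid the action with the highest predicted pain, in a competitive one it might seek it out. Item 3, modeling how A's action distribution shifts under pain, would need a separate behavioral model conditioned on the predicted pain level; we leave that out of the sketch.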

When A and B are working together to achieve a goal that benefits both of them, A's pain is a detriment to B. B could learn to be careful not to cause A pain, and even to move away objects that cause A pain. When A and B are competing for the same resource, A's pain could mean an advantage for B, and B may learn to take actions that cause A pain. Of course, A may learn to retaliate, causing B pain in a tit-for-tat dynamic until B relents.

From pain to empathy to alignment

Our model of AI pain that we presented above is crude, and it's also a wild, wild guess of what kind of AI would eventually lead to alignment. Assuming this guess happens to be correct, there are many missing steps. What would the environment rules look like? What happens when we introduce more agents? How do we get from pain to resonance? How do we get from resonance to empathy, and how do we get the agents to apply that empathy to real-world tasks? What would be the experimental roadmap for answering these questions?

We'll be happy to hear your thoughts and your feedback on the reasoning in this post. In the meantime, we hope you subscribe to Ram's research mailing list.


Thanks to Cameron Allen, Nitay Alon, Benjamin Dayan, Markov Grey, Andrew Lawrence, Reuth Mirsky and Yonatan Nakar for thoughts and feedback on a draft version of this post.

Stills from "12 Angry Men" (1957) are copyright of Metro-Goldwyn-Mayer Studios, reproduced under fair use for the purpose of criticism and analysis.

  1. ^

     By “LLM” here, we mean a deployed model under standard decoding without an external reward loop. Modern systems are pretrained with maximum-likelihood and often further optimized with RL methods: RLHF and, increasingly, RL without human feedback in verifiable domains (for example, code with unit tests or game environments). Those RL stages shape behavior, but at runtime the model does not compute or maximize a reward unless you explicitly add one (such as a reward model, best-of-N selection, or search).

  2. ^

     In some cases pain can cause us to perform better. A person lifting weights in the gym may find that they lift more easily when they're angry, and a theater actor may find that pain lets them give a more compelling performance.