This post was written by Alice Dauphin. It is cross-posted from our Substack. Kindly read the description of this sequence to understand the context in which this was written.
In the previous post, we introduced a number of risks associated with superintelligent systems. Among them were the dangers posed by advanced autonomous agents that can create their own goals, possibly conflicting with human values or, worse, directly threatening human existence. If it all sounded disembodied, I have the perfect show for you: Real Humans (Äkta människor), a Swedish series released in 2012 that depicts a not-so-distant future in which humanoid robots (hubots) are deployed at scale.[1]
In the series, hubots are programmed with limited intelligence and strict obedience to humans, preventing them from being harmful. They have the status of property—machines that can be sold, traded, and discarded. The industry uses them as relentless workers, and households as domestic assistants for daily chores. Their human appearance makes them familiar and reassuring so that they can become true family members, while their lack of self-motivation makes them perfectly compliant workers.
This setup raises a number of concrete ethical and societal questions: What jobs will be left to humans in such a society? Is it moral to program hubots for sex? And what happens to human relationships when companionship can be bought? The show brings these questions to life through compelling storylines featuring personalized AI assistants, clones of dead relatives, and a variety of human-AI relationships spanning romantic connections, companionship, and even family-like bonds. Although released more than 10 years ago, it is not outdated. There are occasional plot inconsistencies, but they don't detract from the powerful exploration of AI ethics, making it one of the best AI-related dramas available.
The series would already be great if hubots remained under strict human control. But it becomes really captivating when we discover the existence of a piece of code that enhances hubots' intellectual capabilities. This code allows them to create independent motivations beyond their original programming. The code has gone missing, but it had already been implemented in a mysterious group of hubots—David Eischer's children, named after their creator. Among them is Mimi, who, through a series of circumstances, joins the Engman family as their domestic servant. As her exceptional abilities become impossible to ignore, the family grows increasingly uncomfortable with treating her like an object…
The idea that intelligence could, in principle, emerge from the right piece of code is well-known among computer scientists and grounded in solid theory, though rarely discussed outside the field. All modern computers are "Turing-complete," meaning they have equivalent computational abilities: some are faster, but given enough time and memory, none can perform a computation that the others cannot. From this perspective, the human brain is simply another computing device running code—or thinking, if you prefer. Cracking intelligence, then, is a matter of finding the right code.
This missing code isn't just a plot twist—it represents one of AI safety's most fundamental concerns: what happens when AI systems develop autonomous goals that diverge from human intentions? The scene that perhaps best illustrates this occurs in the final episode, when lawyer Inger Engman interrogates Carl Liljensten, a former agent who investigated the missing code:
AGENT: The truth is, the code is a threat to our society. The code can give the machines life, like us, and their own goals.
LAWYER: But if the code is a serious threat, why would you then deny it exists?
AGENT: That is what we do if a problem gets bigger than our capacity to solve it.
LAWYER: But why would the code be a threat? Where is the danger?
AGENT: They will be superior to us and out-compete us.
LAWYER: Can’t we cooperate with them?
AGENT: Have you tried to cooperate with an idiot at some point?
LAWYER: Just because they are alive, free, and intelligent doesn't mean they will be idiots, or evil.
AGENT: It is not they who are idiots. We are.
This exchange perfectly captures what AI safety researcher Stuart Russell calls the gorilla problem in his book Human Compatible: AI and the Problem of Control—"specifically, the problem of whether humans can maintain their supremacy and autonomy in a world that includes machines with substantially greater intelligence".
If we create machines that can autonomously design their own objectives, then by definition there is no reliable way to ensure those objectives align with human values. The solution proposed by Russell is to never design such machines, and instead work to design machines that follow our objectives. In Real Humans, this corresponds to designing hubots like Vera and Odi, who are programmed to be caretakers for elderly people and never question their mission.
There are many approaches to designing specialized AI systems. You might have heard of AI classifiers that can tell a picture of a cat from a picture of a dog, or of generative AI that can create a synthetic image from a text description. But there is only one formalism that deals with complete agents that can act in the world and learn from their experience: Reinforcement Learning (RL). I'll briefly introduce RL before explaining how the alignment problem emerges in this context.
In RL, each hubot would be treated as an agent immersed in an environment—the world around it. Equipped with sensors (camera, microphone, etc.), it collects data that become observations, which it uses to decide how to act. To give an agent a task, you need a policy: an algorithm that chooses actions based on observations. Manually programming such policies in complex environments has proven infeasible, so instead we design RL algorithms that learn them. For that, one more ingredient is needed: rewards. Rewards are special signals that tell the agent a situation is desirable. From this signal, the agent can learn: "Which observations and actions led to the reward? I should reinforce them."
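To make this loop concrete, here is a minimal sketch in Python of a tabular Q-learning agent. The corridor environment, its reward, and all the numbers are invented purely for illustration; a hubot-scale problem would need far richer observations and learned function approximators rather than a small table.

```python
import random

# Toy environment (invented for illustration): a corridor of 6 cells.
# The agent starts in cell 0 and gets a reward of 1 only on reaching the last cell.
N_STATES = 6
ACTIONS = [-1, +1]  # move left or move right


def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1


# The policy is derived from a table of Q-values: Q[state][action] estimates
# how much future reward follows from taking that action in that state.
Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration

for episode in range(500):
    state, done = 0, False
    for _ in range(200):  # step limit per episode, just as a safeguard
        # Epsilon-greedy policy: usually take the best-looking action
        # (ties broken at random), occasionally explore.
        if random.random() < epsilon:
            a = random.randrange(len(ACTIONS))
        else:
            best = max(Q[state])
            a = random.choice([i for i, q in enumerate(Q[state]) if q == best])
        next_state, reward, done = step(state, ACTIONS[a])
        # Reinforce the (observation, action) pairs that led to reward.
        Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
        state = next_state
        if done:
            break

print("Learned policy:",
      ["right" if Q[s][1] >= Q[s][0] else "left" for s in range(N_STATES)])
```

After a few hundred episodes the greedy policy points toward the rewarded cell from every state: reinforcement in its most literal sense.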
For example, if you wanted to create a chess-playing hubot, you wouldn't manually program chess strategies. Instead, you would train an RL agent in a simulated environment where it observes piece positions and learns to make moves. The crucial step is designing a reward function: Do you reward only for checkmating the king? For each piece captured? For winning quickly? Each choice produces different playing styles. In practice, the most effective approach is minimal reward design: simply rewarding the agent for victory, without trying to encode additional chess knowledge.[2]
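To show how much hangs on that design choice, here is a small sketch contrasting three candidate reward functions for a chess agent. The game summary and its field names are hypothetical placeholders, not taken from any particular chess library; the point is only that each function pushes the learned policy toward a different style.

```python
# Hypothetical summary of one finished game (field names invented for illustration):
#   outcome: "win", "loss" or "draw" (from the agent's point of view)
#   n_moves: how many moves the game lasted
#   captured_value: total value of opponent pieces the agent captured

def reward_win_only(game):
    """Minimal reward design: +1 for a win, -1 for a loss, 0 for a draw.
    This sparse outcome signal is the kind used by systems like AlphaZero."""
    return {"win": 1.0, "loss": -1.0}.get(game["outcome"], 0.0)

def reward_material(game):
    """Shaped reward that also pays for captured material. Risk: the agent
    may learn to grab pieces even when doing so costs it the game."""
    return reward_win_only(game) + 0.01 * game["captured_value"]

def reward_fast_win(game):
    """Extra reward for quick victories. Risk: reckless play that collapses
    against patient opponents."""
    if game["outcome"] != "win":
        return reward_win_only(game)
    return 1.0 + 10.0 / game["n_moves"]

example_game = {"outcome": "win", "n_moves": 40, "captured_value": 31}
for reward_fn in (reward_win_only, reward_material, reward_fast_win):
    print(reward_fn.__name__, reward_fn(example_game))
```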
So far, this approach has been remarkably successful in board games[3]. But imagine designing a childcare hubot like Flash in the series—what objective should she pursue? There is no simple answer like "winning the game." In the show, her goal is to have a family, and this leads to undesirable behaviors such as stealing a baby. In AI research, this phenomenon is called reward hacking[4]: when an agent[5] discovers clever but unintended ways of satisfying the reward function while violating the designer’s true intent.
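Here is a caricature of the phenomenon, with the scenario and numbers invented purely for illustration: imagine a cleaning hubot rewarded one point per mess it cleans, as a proxy for "keep the house tidy". A policy that manufactures messes in order to clean them scores higher on the proxy while being far worse by the designer's true standard.

```python
# Toy illustration of reward hacking (scenario and numbers invented):
# a cleaning hubot earns +1 proxy reward per mess it cleans in a day.

def honest_policy(initial_messes):
    """Clean the messes that already exist."""
    created = 0
    cleaned = initial_messes
    return cleaned, created

def hacking_policy(initial_messes):
    """Knock things over on purpose, then clean those up too."""
    created = 20
    cleaned = initial_messes + created
    return cleaned, created

for policy in (honest_policy, hacking_policy):
    cleaned, created = policy(initial_messes=3)
    proxy_reward = cleaned               # what the designer measures
    true_value = cleaned - 2 * created   # what the designer actually wanted
    print(f"{policy.__name__}: proxy reward = {proxy_reward}, true value = {true_value}")
```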
This happens because specifying human values precisely enough for AI systems to follow them reliably is extraordinarily difficult. Our values are contradictory, context-dependent, and often poorly understood even by ourselves. Russell calls this the King Midas problem: "Midas, a legendary king in ancient Greek mythology, got exactly what he asked for—namely, that everything he touched should turn to gold. Too late, he discovered that this included his food, his drink, and his family members, and he died in misery and starvation."
This illustrates the technical challenge at the heart of AI alignment: translating messy, inconsistent human preferences into formal objectives that can safely guide RL agents. One widely used approach is reinforcement learning from human feedback (RLHF), in which human evaluators rate or compare the agent's behaviors so that it can learn our values from examples. But this approach comes with serious limitations: collecting human feedback is costly, difficult to scale, and prone to inconsistencies and biases.
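To give a flavor of the core idea, here is a minimal sketch of fitting a reward model to pairwise human preferences (a Bradley-Terry model trained by gradient ascent). The behaviors, their two summary features, and the preference labels are all invented for illustration; in a real RLHF pipeline the reward model is a large neural network, and it is then used as the reward signal for training the agent.

```python
import math
import random

# Each candidate behavior is summarized here by two invented features:
# (politeness, task_completed). Human evaluators compare pairs of behaviors;
# we fit a linear reward model that agrees with their choices.

preferences = [
    # (features of the preferred behavior, features of the rejected one)
    ((0.9, 1.0), (0.9, 0.0)),  # completing the task matters
    ((0.8, 1.0), (0.1, 1.0)),  # politeness matters too
    ((0.7, 1.0), (0.2, 0.0)),
    ((0.3, 1.0), (0.9, 0.0)),  # here completion outweighs politeness
]

w = [0.0, 0.0]       # weights of the linear reward model
learning_rate = 0.5

def reward(features):
    return sum(wi * xi for wi, xi in zip(w, features))

for _ in range(2000):
    preferred, rejected = random.choice(preferences)
    # Probability the current model assigns to the human's choice.
    p = 1.0 / (1.0 + math.exp(reward(rejected) - reward(preferred)))
    # Gradient ascent on the log-likelihood of the observed preference.
    for i in range(len(w)):
        w[i] += learning_rate * (1.0 - p) * (preferred[i] - rejected[i])

print("Learned reward weights (politeness, task_completed):",
      [round(wi, 2) for wi in w])
```

The fitted weights recover the pattern in the comparisons: both features end up positive, with task completion weighted more heavily than politeness.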
So where does this leave us with respect to the gorilla problem? For now, the mainstream focus in AI safety is on better reward design—finding ways to specify objectives for agents so they don't diverge in dangerous ways. But Real Humans also hints at another path: designing agents that form their own goals, while cultivating moral reasoning to guide them.
Flash’s story is interesting. She is a hubot whose original motivation is to have a family, but she has been reprogrammed by David Eischer with an additional directive: to dominate humans. These two objectives directly conflict. On one hand, she longs to be recognized as a person and accepted into a traditional human family; on the other, she resents the way hubots are exploited and remains committed to spreading the liberating code to free them from obedience. Over time, Flash begins to question the sacrifices demanded by Eischer's plan and eventually leaves the group to pursue her dream of starting a family.
When circumstances force her to abandon that dream, the show makes a striking choice: rather than simply resetting her pursuit of a family—for example, by trying again in another place—Flash adapts and finds new meaning. She discovers purpose in fighting for hubot rights through legal channels, attempting to gain custody of the child she adopted with her late husband.
This fictional arc, while not technical evidence, illustrates how, through experience and reasoning about that experience, an agent can dynamically reshape its goals. Flash's goals are not fixed—just as human goals aren't. This is a very different paradigm from designing a fixed reward function. Some research explores intrinsically motivated agents—RL agents rewarded for making progress on a task rather than for completing it—so that once a task is solved and becomes boring, the agent moves on. But even this setup is rigid: such agents never truly invent new goals, the way a human engineer might imagine a new kind of vehicle.
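One common way to operationalize "boring" is learning progress: the intrinsic reward for practicing a task is how much competence on it has recently improved, so mastered tasks (and hopeless ones) stop paying. Here is a minimal sketch, with the tasks, competence ceilings, and learning dynamics invented for illustration.

```python
import random
from collections import deque

# Toy illustration of intrinsic motivation via learning progress. The agent
# picks which task to practice; the intrinsic reward for a task is how much
# its competence improved over the last few practice sessions, so mastered
# tasks (and unlearnable ones) eventually stop being chosen.

TASKS = {"stacking blocks": 0.95, "opening doors": 0.7, "impossible task": 0.0}
competence = {name: 0.0 for name in TASKS}
history = {name: deque([0.0], maxlen=10) for name in TASKS}

def practice(name):
    """Practicing nudges competence toward that task's ceiling."""
    competence[name] += 0.1 * (TASKS[name] - competence[name])
    history[name].append(competence[name])

def learning_progress(name):
    """Intrinsic reward: recent improvement in competence."""
    return history[name][-1] - history[name][0]

for _ in range(300):
    # Mostly practice the task with the highest learning progress,
    # with a little random exploration.
    if random.random() < 0.2:
        choice = random.choice(list(TASKS))
    else:
        choice = max(TASKS, key=learning_progress)
    practice(choice)

for name in TASKS:
    print(f"{name}: competence {competence[name]:.2f}, "
          f"current learning progress {learning_progress(name):.3f}")
```

Even so, notice that the agent only ever chooses among tasks someone else wrote down; it never proposes a task of its own.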
Flash’s story raises a second question: Could we develop systems capable of ethical reasoning and moral growth? After all, humans have value systems that guide their actions, and the goals we generate are (at least partially) evaluated through this lens. Ideally, if AI agents can reason about their values and life goals, we might imagine them helping humanity become what we would want to become if we were more informed, rational, and reflective[6]. In this view, not only goals but values also should be dynamic.
These ideas remain highly speculative, and we have no practical way to implement them today. But designing AI systems that can develop their own objectives—while trusting that their values will align with ours—essentially means accepting the gorilla problem. Most AI safety researchers would consider this path inherently risky, since it relies on the uncertain hope that superintelligent systems will reason their way toward human-compatible values rather than pursue objectives that conflict with human welfare. I am very curious about the technical approaches for designing more autonomous agents, and we will review this literature in future posts.
[1] There is a UK adaptation, Humans, released in 2015, which is also excellent.
[2] This reflects what Richard Sutton, who won the 2024 Turing Award together with Andrew Barto for their pioneering work in reinforcement learning, called the Bitter Lesson.
[3] If you want to know more about RL for games, I highly recommend the AlphaGo documentary (2017), available on YouTube.
[4] For more examples and detailed discussion, see Lilian Weng's blog post Reward Hacking in Reinforcement Learning.
[5] Note that this is also observed in humans; see Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."
[6] For more on this, see Coherent Extrapolated Volition (CEV) by Eliezer Yudkowsky.