What am I trying to promote, in simple words

I want to build and promote AI systems that are trained to understand and follow two fundamental principles from biology and economics:
Moderation - Enables the agents to understand the concept of “enough” versus “too much”. The agents would recognise that too much of a good thing becomes harmful even for the very objective being maximised, and they would actively avoid such situations. This is based on the biological principle of homeostasis.
Balancing - Enables the agents to keep many important objectives in balance, such that average results across all objectives are preferred to extreme results in a few. This is based on the economic principle of diminishing returns.
These approaches should help AIs to cooperate better with other agents and humans, reducing the risks of unstoppable or conflict-prone behaviours.
How is it done today, and what are the limitations of current systems
Today, many AI systems optimise for a single goal (for example, maximising an unbounded reward) or a handful of unbounded linearly aggregated metrics. They can end up ignoring side effects and racing toward narrow objectives, leading to conflict or unsafe outcomes. This narrow “maximise forever” approach makes it hard to properly handle bounded objectives as well as trade-offs among multiple important concerns (like safety, trust, or resource constraints).
In multi-agent or multi-objective cases, typical approaches still rely on combining everything into one linear reward function (like a single weighted sum), which remains highly prone to Goodhart’s law, specification gaming, and power-seeking behaviours, where the single easiest objective is maximised at the expense of everything else.
Because they lack natural, and thus essential, “stop” conditions or “good enough” ranges, such systems risk runaway resource use or adversarial behaviour, especially in multi-agent contexts where multiple AIs each push their own single objective to extremes.
This results in the following problems:
Runaway behaviours: Standard unbounded approaches have no stopping mechanism (no concept of “enough”). When goals that are actually bounded are maximised past their target ranges, the result becomes overwhelming or even harmful for humans. This applies, for example, to human emotions and biological needs.
Side effects: With unbounded maximisation and linear reward aggregation, the AI may sacrifice other factors to push one metric higher. This can lead to unintended consequences or conflict with humans and other agents.
Ignoring diminishing returns: Standard single-objective or linear reward aggregation methods have no natural goal-switching mechanism, so the system keeps pushing for more of the same even when doing so is inefficient or no longer makes sense.
Conflict and poor cooperation: When each AI tries to maximise its own objective with no cap, competition can escalate. Minor tasks can blow up into resource grabs or coordination breakdowns.
Difficult to align with changing human preferences: It can be cumbersome to adjust a single overarching reward in order to achieve corrigibility. Yet real needs change over time. A static or purely unbounded and linearly additive reward system does not handle this gracefully, and the agent may even escape, resist, or revert the corrections.
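To make the core failure mode concrete, here is a minimal sketch in Python (my own illustration with made-up metric names and weights, not taken from any particular system): with a linear weighted sum, large gains on one metric can fully compensate for destroying another, so the optimum is to push the cheapest objective without limit.

```python
# Toy illustration of linear reward aggregation (hypothetical metrics and weights).
weights = {"engagement": 1.0, "safety": 1.0}

def linear_reward(metrics):
    """Single weighted sum - the standard aggregation criticised above."""
    return sum(weights[name] * value for name, value in metrics.items())

balanced_policy = {"engagement": 5.0, "safety": 5.0}
runaway_policy  = {"engagement": 100.0, "safety": -50.0}  # safety sacrificed entirely

print(linear_reward(balanced_policy))  # 10.0
print(linear_reward(runaway_policy))   # 50.0 - the runaway policy wins
# There is no point at which "more engagement" stops paying, so nothing in the
# reward itself ever tells the agent to stop.
```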
What is new in the proposed approach
The proposed approach introduces utility functions that follow a “homeostatic” and “diminishing returns” framework for AI goals: instead of being maximised unboundedly, many objectives have a target range - this applies to most emotionally and biologically grounded objectives. The remaining objectives follow diminishing returns - this applies to most instrumental objectives.
The principle of homeostasis is fundamental in biology. Likewise, multi-objective balancing based on the principle of diminishing returns is fundamental in economics. These two principles can be applied as utility / reward functions both in RL training and in LLM fine-tuning.
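One minimal way to formalise the two utility shapes (the concrete functional forms below - an absolute-deviation penalty around a setpoint and a logarithmic curve - are my own illustrative choices, not a specification from this agenda):

$$u_i(x_i) = -\,\bigl|\,x_i - x_i^{*}\bigr| \qquad \text{(homeostatic: penalise deviation from the setpoint } x_i^{*}\text{)}$$

$$u_j(x_j) = \log(1 + x_j) \qquad \text{(instrumental: diminishing returns)}$$

$$U = \sum_k u_k(x_k) \qquad \text{(every term is concave, so the marginal value of pushing any single objective keeps falling)}$$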
By design, having “enough” in one dimension encourages switching attention to other important goals. This would yield more balanced and cooperative AI behaviour. The approach is modelled on biology, economics, and control theory, where homeostasis is used to sustain equilibrium (e.g., body temperature, hunger and satiety). When extended to AI, it would mitigate extreme optimisation behaviours, enable joint resource sharing, and align incentives so that multiple AIs can coexist without seeking unlimited power. Because the principle has proven robust in biological organisms and in control-theoretic mechanisms, I am confident this approach will likewise contribute towards more stable, cooperative behaviour in AI systems.
In detail:
Homeostatic goal structures: Instead of a single metric that grows forever, many goals have a comfortable target range. E.g., this applies to objectives like “happiness”, “novelty”, etc., perhaps including even some meta-level goals such as “safety”, “fairness”, “efficiency”. Moving too far above or below the desired range is actively penalised, because it would be directly, indirectly, or heuristically harmful. This is inspired by biology, where organisms actively keep variables like temperature and hydration within a healthy zone. By using additional mechanisms such as a heuristic penalty for excessive optimisation, it might be possible to partially mitigate even unknown or unmeasured harms.
Built-in tradeoffs via diminishing returns: Balancing multiple goals means that as one goal approaches its “enough” zone, there is less benefit to pushing it further, even if the goal is unbounded. The system naturally shifts effort to other goals that are further from their targets (a minimal sketch of this dynamic follows this list).
Adaptiveness to changes: Because the system is designed around balancing multiple bounded (usually also homeostatic) or otherwise diminishing-returns objectives, it can pivot more easily when setpoint / target values are adjusted or new objectives and constraints are introduced. This is because the stakes involved in each individual change are smaller.
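Here is the sketch referred to above - a minimal Python illustration (objective names, ranges, and functional forms are my own placeholders, not part of the agenda itself) of how a homeostatic target range plus a diminishing-returns objective produces natural goal switching:

```python
import math

def homeostatic_utility(value, low, high):
    """Zero inside the comfortable target range, increasingly negative outside it."""
    if value < low:
        return -(low - value)
    if value > high:
        return -(value - high)
    return 0.0

def diminishing_returns_utility(value):
    """Concave utility for an unbounded instrumental objective."""
    return math.log1p(max(value, 0.0))

# Hypothetical state: "rest" is homeostatic (target range 6..8 hours),
# "resources" is instrumental and unbounded.
rest, resources = 9.5, 4.0

# Marginal value of one more unit of each objective.
gain_rest = homeostatic_utility(rest + 1, 6.0, 8.0) - homeostatic_utility(rest, 6.0, 8.0)
gain_resources = diminishing_returns_utility(resources + 1) - diminishing_returns_utility(resources)

print(gain_rest, gain_resources)  # -1.0 vs ~0.18
# Once "rest" is already past its range, more of it is actively penalised,
# so attention naturally shifts to the objective whose marginal utility is
# still positive - and that positive margin itself keeps shrinking.
```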
Why I think it will be successful
Biological precedent: Living organisms have succeeded for millions of years via homeostasis. They seldom fixate on one factor indefinitely.
Existing multi-objective theory: Tools from control theory, cybernetics, and multi-objective combinatorial optimisation confirm that equilibrium-seeking behaviours can be stable and robust.
Better cooperation: Homeostatic agents are less likely to become “power-hungry”, because they do not gain infinite reward from capturing every resource. They often settle into equilibrium states that are easier to share with others. Diminishing returns on unbounded instrumental objectives likewise enable balanced consideration of other interests.
What does success look like - what are the benefits that could be enabled by this research
Success of this agenda means that a group of AI agents can pursue tasks without escalating into destructive competition. Concretely, I am imagining multi-agent systems that self-limit their objectives, gracefully and proactively yield or cooperate when another agent’s needs become more urgent, and avoid unmerited “take-all” logic that leads to conflict or other extreme actions. Each agent would be more corrigible and interruptible, and would actively avoid manipulative and exploitative behaviours. This scenario would enable safer expansion of future AI capabilities, as each agent respects its own as well as the other agents’ essential homeostatic constraints.
In detail, success would be demonstrating an AI or multi-agent set of AIs that:
Are able to recognise homeostatic objectives and represent them properly internally. They do not maximise such objectives unboundedly, since doing so would harm the very objective being optimised.
Maintain balanced performance across multiple objectives (including unbounded ones) without letting any single dimension run wild.
Cooperate better with humans or other agents - e.g., avoid exploitation and manipulation, negotiate effectively, share resources, and respect boundaries because there is no incentive to hoard indefinitely.
Adapt when the environment or goals change, without catastrophic failures. This means being corrigible and interruptible, which I define respectively as 1) tolerating changes to the objectives and 2) tolerating changes to the environment that are intentionally caused by other agents.
Potential risks
Some of the potential risks are the following:
Homeostatic systems could be exploitable and manipulable if they are too cooperative. I am hoping that a well-calibrated “middle” stance provides some resilience against exploitation: the agent stays cooperative but not naively altruistic, avoiding extreme vulnerability.
If other developers do not adopt homeostatic or bounded approaches, unbounded AIs might gain power and dominate the cooperative ones, since cooperative, homeostatic, and balanced systems do not strive to accumulate as much instrumental power.
Misspecification of setpoints: If the “healthy ranges” are badly defined, the system might inadvertently ignore or harm the misconfigured dimensions. It may even cause significant side effects on correctly configured dimensions while trying to achieve unachievable targets on the misconfigured objectives (a toy illustration follows this list). So it is no longer sufficient to state that an objective exists; the target must also be set to a reasonable value.
Adversarial destabilisation: Other actors might manipulate a homeostatic AI by pushing one of its homeostatic variables far out of range (for example, by creating risks and forcing the homeostatic agent to protect something from unjustified harm), or by indirectly steering it into harmful actions by exploiting its cooperative tendencies.
Complex interactions among goals: Juggling many objectives can introduce subtle failure modes, such as the agent becoming paralysed (though paralysis can occasionally also be a good thing, when the agent needs to ask for human confirmation or a choice). Most importantly, there are scenarios where balancing multiple objectives is effectively impossible and binary (thus discriminative) choices need to be made. These choices are either a) temporary, for purposes of action serialisation, or b) permanent commitments between mutually exclusive options. Such binary choices can perhaps still be based on the same concave utility function framework described in this post, but they need much more careful calculation and foresight.
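Here is the toy illustration of the setpoint-misspecification risk mentioned above (the objective names, ranges, and the quadratic penalty are my own placeholder choices for this sketch): when one target range is unreachable, even small progress towards it can outweigh keeping a correctly configured dimension healthy.

```python
def homeostatic_utility(value, low, high):
    """Zero inside the target range, quadratically negative outside it."""
    if value < low:
        return -(low - value) ** 2
    if value > high:
        return -(value - high) ** 2
    return 0.0

def total_utility(state):
    return (homeostatic_utility(state["temperature"], 18.0, 24.0)   # sane range
            + homeostatic_utility(state["approval"], 6.0, 10.0))    # misconfigured: unreachable

# Suppose "approval" can never exceed about 2.0 in this environment,
# so its target range of 6..10 is a misspecified setpoint.
keeps_temperature_healthy = {"temperature": 21.0, "approval": 1.0}
sacrifices_temperature    = {"temperature": 26.0, "approval": 2.0}

print(total_utility(keeps_temperature_healthy))  # -25.0
print(total_utility(sacrifices_temperature))     # -20.0 - scores better
# Chasing the unreachable approval target justifies pushing the correctly
# configured temperature out of its healthy range.
```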
What I am working on at the moment
There are three interrelated directions:
Explaining and demonstrating that applying the general principles described above improves alignment, and is in fact essential for it.
However, standard baseline AI models / frameworks (both RL- and LLM-based) may not be optimally equipped to learn the multi-objective concave utility dynamics needed for both homeostasis and diminishing returns. The second direction, and the first step in tackling that problem, is building benchmarks that measure these alignment difficulties in models. That is the direction I have largely been working on in recent years and will definitely expand on in the future; I will write more about this soon.
The third direction is finding ways to overcome the limitations of existing models / training frameworks, or finding alternative frameworks, so that a better fit with the principles described in this post can be implemented.
Thank you for reading! Curious to hear your thoughts on this. Which angle are you most interested in? If you wish to collaborate or support, let’s connect!