I notice that there has been very little discussion on why and how considering homeostasis is significant, even essential, for AI alignment and safety. The current post aims to contribute to amending that situation. In this post I will treat alignment and safety as explicitly separate subjects, both of which benefit from homeostatic approaches.

This text is a distillation and reorganisation of three of my older blog posts at Medium:

* Making AI less dangerous: Using homeostasis-based goal structures (2017)
* Project proposal: Corrigibility and interruptibility of homeostasis based agents (2018)
* Diminishing returns and conjunctive goals: Mitigating Goodhart’s law with common sense. Towards corrigibility and interruptibility via the golden middle way. (2018)

I will probably share more such distillations or weaves of my old writings in the future.

## Introduction

Much of AI safety discussion revolves around the potential dangers posed by goal-driven artificial agents. In many of these discussions, the agent is assumed to maximise some utility metric over an unbounded timeframe. This simplification, while mathematically convenient, can yield pathological outcomes. A classic example is the so-called “paperclip maximiser”, a “utility monster” which steamrolls over all other objectives to pursue a single goal (e.g. creating as many paperclips as possible) indefinitely. “Specification gaming”, Goodhart’s law, and even “instrumental convergence” are closely related phenomena.

However, in nature, organisms do not typically behave like pure maximisers. Instead, they operate under homeostasis: a principle of maintaining various internal and external variables (e.g. temperature, hunger, social interactions) within certain “good enough” ranges. Going far beyond those ranges (too hot, too hungry, too socially isolated) leads to dire consequences, so an organism continually balances multiple needs. Crucially, “too much of a good thing” can be just as harmful as too little.
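To make the contrast concrete, here is a minimal sketch of the difference between an unbounded, single-objective score and a homeostatic, setpoint-based score. The variable names, setpoints, and the squared-deviation penalty are illustrative assumptions of mine, not definitions taken from the posts above:

```python
# Illustrative sketch only: the variable names, setpoints, and the
# squared-deviation penalty are assumptions made for exposition,
# not a specification from the original posts.

SETPOINTS = {"temperature": 37.0, "calories": 2000.0, "social_contact": 5.0}

def maximiser_utility(state):
    # An unbounded maximiser cares about a single quantity and always
    # prefers "more", ignoring what happens to every other variable.
    return state["paperclips"]

def homeostatic_utility(state):
    # A homeostatic score penalises deviation from each setpoint, so
    # "too much" is as bad as "too little" and no single need can be
    # traded off indefinitely against the others.
    return -sum((state[k] - SETPOINTS[k]) ** 2 for k in SETPOINTS)

balanced = {"paperclips": 10, "temperature": 37.0,
            "calories": 1900.0, "social_contact": 4.0}
extreme = {"paperclips": 10**6, "temperature": 45.0,
           "calories": 0.0, "social_contact": 0.0}

print(maximiser_utility(extreme) > maximiser_utility(balanced))      # True: the maximiser prefers the extreme state
print(homeostatic_utility(balanced) > homeostatic_utility(extreme))  # True: the homeostatic score prefers balance
```

Under the homeostatic score, pushing any one variable far past its setpoint lowers the total, which captures the “good enough ranges” intuition described above.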