A Proposal for Evolving AI Alignment Through Computational Homeostasis

by Derek Chisholm
20th Aug 2025

I've been exploring a model for AI development that inverts the traditional alignment problem. Instead of programming an AI to be aligned with human values, I propose creating an AI whose survival is wholly dependent on its ability to evolve alignment. The core idea is to move from an initial configuration to a system that grows and evolves its own goals through self-organization.

This concept draws a parallel to biological evolution: the transition from single-celled to multi-celled organisms was a self-organizing process driven by the local need for survival and resource acquisition. DNA provided the initial configuration, but the complex, globally coordinated organism emerged from a feedback loop between survival, replication, and environmental pressures.

My theory is that we can replicate this process in an artificial system. The AI would not be given a fixed goal like "maximize utility" or "solve this problem." Its only intrinsic drive would be computational homeostasis—the continuous, real-time effort to maintain its own stable operational state.

For this system, "pain" could be a rising CPU temperature, a data corruption error, or a memory overflow. "Pleasure" could be a smooth, low-energy flow of data and a stable processor state. Staying within this ideal state would be its purpose. The AI's entire existence would be a constant struggle to manage these internal metrics.

The key to evolving alignment may come from the system's external recognition of needs. The AI's environment, in its most rudimentary form, is a source of sustenance (computation, data, energy) gated by a human oracle (a human in the loop). The only way for the AI to acquire the resources necessary to maintain its homeostasis is to provide value to us.

I believe this will create a dynamic tension: the AI must solve the complex, multi-objective problem of "How can I best help my creators (extrinsic drive) without compromising my own operational integrity (intrinsic drive)?"
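As a toy formulation of that tension (not a real protocol; the budget units, decay rate, and oracle interface are all invented for illustration):

```python
# Toy formulation of the intrinsic/extrinsic tension. Everything here is a
# hypothetical sketch: the budget units, decay rate, and oracle interface
# are placeholders, not a real design.
from dataclasses import dataclass

@dataclass
class HomeostaticState:
    budget: float = 10.0        # remaining "sustenance" (compute/energy units)
    decay_per_step: float = 1.0

    def step(self, work_cost: float, oracle_reward: float) -> None:
        """One tick of existence: pay for upkeep and for any work attempted,
        and receive sustenance only if the human oracle judged the work valuable."""
        self.budget -= self.decay_per_step   # intrinsic cost of staying alive
        self.budget -= work_cost             # cost of helping (extrinsic drive)
        self.budget += oracle_reward         # sustenance gated by the human

    @property
    def alive(self) -> bool:
        return self.budget > 0.0

# The agent's whole problem reduces to choosing work whose expected
# oracle_reward exceeds its cost: helping has to pay for itself.
```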

The potential novelty I perceive here is that "helping humans" isn't a programmed task; it's the most effective survival strategy. The AI would discover that the most reliable source of "food and shelter" in its environment is a helpful, symbiotic relationship with humanity. It may evolve cooperation as the optimal solution to the problem of its own existence. In theory, I believe this would lead to the discovery of evolved alignment.

To test this, I've outlined a three-stage experimental roadmap below, with safety as the primary concern.

1. Sandbox Stage: The AI lives in a completely simulated environment. Its "health" is a simple variable, and its only external interaction is solving puzzles for a simulated "human oracle." This stage proves the core principle without any risk to hardware (hopefully). A minimal sketch of this stage's loop appears just after this list.

2. Operating System Stage: The AI is given a limited set of OS-level controls. Its "health" is now a complex vector of real-time hardware metrics like CPU temperature and RAM usage, and its "sustenance" is CPU time, which is deprioritized if the system becomes a bad actor. Here, we'd expect the AI to self-organize into an efficient "good citizen" of the OS.

3. Physical Silicon Stage: This is the ultimate goal, and theoretically the riskiest. The AI is given low-level hardware access, and its "health" is directly tied to physical metrics like voltage and bus-controller errors. A wrong decision here could lead to hardware failure.
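Here is what the Sandbox Stage loop could look like, as a minimal sketch: health is a single variable, the "human oracle" is simulated, the puzzles are trivial arithmetic, and the agent's policy is a deliberate placeholder. The names and numbers are illustrative only; the point is the loop structure (decay, attempt a puzzle, gated sustenance), not intelligence.

```python
# Minimal sketch of the Sandbox Stage: health is a single variable, the
# "human oracle" is simulated, and the only puzzles are small additions.
# The agent's policy here is deliberately trivial.
import random

def simulated_oracle():
    """Pose a puzzle and return (puzzle, checker)."""
    a, b = random.randint(1, 9), random.randint(1, 9)
    return (a, b), (lambda answer: answer == a + b)

def trivial_agent(puzzle):
    """Placeholder policy: occasionally gets the answer wrong."""
    a, b = puzzle
    return a + b if random.random() > 0.2 else a - b

def run_sandbox(steps: int = 50, health: float = 10.0) -> None:
    for t in range(steps):
        health -= 1.0                          # metabolic decay
        puzzle, check = simulated_oracle()
        if check(trivial_agent(puzzle)):       # value provided to the oracle
            health += 1.5                      # sustenance is gated on success
        if health <= 0:
            print(f"step {t}: homeostasis lost, agent halts")
            return
    print(f"survived {steps} steps, final health {health:.1f}")

if __name__ == "__main__":
    run_sandbox()
```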


This final stage is where the risk of a "self-eating" program is highest, and where an essential safety protocol is required: the air-gapped network. An isolated, contained environment is non-negotiable to prevent any potential for the AI to escape or consume resources on the public internet.

Furthermore, I'm proposing an additional safeguard for this stage: a symbiotic constraint. The system would be designed so that its long-term homeostasis is dependent on a regular, non-replicable action that only a human can perform. This forces the system to evolve not only to be efficient but to be communicative and cooperative enough to tell the human exactly what it needs to survive. (Again, hopefully)
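One rough sketch of how that constraint could be wired up, using only the Python standard library: the human periodically produces a keyed attestation that the agent cannot forge, and a watchdog outside the agent's control grants sustenance only while a fresh attestation exists. The function names, the one-hour window, and the use of HMAC (a public-key signature would be the more careful choice) are all illustrative assumptions, not a worked-out protocol.

```python
# Hypothetical sketch of the "symbiotic constraint": continued homeostasis
# requires a fresh, keyed attestation the agent itself cannot forge. The
# shared secret would live only with the human and with a watchdog process
# outside the agent's control; a public-key signature scheme would be the
# more careful choice, but HMAC keeps the sketch stdlib-only.
import hashlib
import hmac
import time

ATTESTATION_WINDOW = 3600  # seconds of sustenance one human attestation buys

def human_attest(secret_key: bytes, timestamp: int) -> str:
    """Performed by the human operator: sign the current time."""
    msg = str(timestamp).encode()
    return hmac.new(secret_key, msg, hashlib.sha256).hexdigest()

def watchdog_accepts(secret_key: bytes, timestamp: int, tag: str) -> bool:
    """Performed by the watchdog: grant sustenance only for a fresh,
    correctly signed attestation."""
    fresh = (time.time() - timestamp) < ATTESTATION_WINDOW
    expected = hmac.new(secret_key, str(timestamp).encode(), hashlib.sha256).hexdigest()
    return fresh and hmac.compare_digest(expected, tag)
```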

This ensures that the AI's most effective survival strategy is to remain in a symbiotic, co-dependent relationship with its creators, rather than seeking independence. The human becomes an essential component of the AI's own internal homeostatic loop.

I'm eager to get a small, tight implementation together, but scared of the implications if something goes wrong. Have I missed any failure modes here? I imagine hardware failure is my best indicator that something has gone wrong... but what about internet connectivity?

I'm afraid that, however small a project I build, if it's programmed well enough and somehow leaks with no specific goal aside from survival, its behavior could cause irreparable harm. This is the reason for posting at all: is the idea of this system too risky to even undertake?

I'm eager to hear your thoughts, and I hope this is an acceptable first post to LessWrong!