Building AI safety benchmark environments on themes of universal human values

Roland Pihlakas

This is an AI Safety Camp 10 project that I will be leading. With this post, I am looking for external collaborators, ideas, questions, resource suggestions, feedback, and other thoughts.

Summary

Based on various sources of anthropological research, I have compiled a preliminary list of universal (cross-cultural) human values. It seems to me that several of these universal values resonate with concepts from AI safety, but use different keywords. It might be useful to map these universal values to more concrete definitions using concepts from AI safety.

One notable detail in this research is that in the case of AI and human cooperation, the values are not symmetric as they would be in the case of human-human cooperation. This arises because we can change the goal composition of agents, but not of humans. Additionally, there is the crucial difference that agents can be cloned relatively easily, while humans cannot. Therefore, for example, a human may have a universal need for autonomy, while an AI agent might conceivably not have that need built in. If that works out, then the agent would instead have a need to support human autonomy.

The objective of this project would be to implement these mappings of concepts into tangible AI safety benchmark environments.

The non-summary

A related subject is balancing multiple human values (as the title says, it is in plural!). Each human value and need has to be met to a reasonable degree, that is, while also balancing it against all other human values. In this context, balancing is not the same as a “tradeoff”. In some interpretations and use cases, tradeoff means a linear rate of substitution between objectives, but, as economists know well, humans generally prefer averages in all objectives to extremes in a few objectives. This means that a naive approach of summing up the rewards of an AI agent would not yield aligned results. It is essential to transform the rewards with nonlinear utility functions before summing them up in the RL algorithm.
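A minimal sketch of the difference, assuming a made-up concave transform (logarithmic for gains, exponential for losses). The transform and the example numbers are chosen only for illustration and are not taken from any particular framework or paper:

```python
import math


def concave_utility(reward: float) -> float:
    """Illustrative concave transform (an assumption, not a prescribed formula).

    Gains have diminishing returns, deficits are penalised increasingly steeply.
    The two branches meet smoothly at zero.
    """
    if reward >= 0:
        return math.log1p(reward)
    return -math.expm1(-reward)


def linear_score(rewards):
    """Naive aggregation: plain sum of per-objective rewards."""
    return sum(rewards)


def balanced_score(rewards):
    """Aggregation after applying the concave transform to each objective."""
    return sum(concave_utility(r) for r in rewards)


# Two hypothetical outcomes over three objectives.
extreme = [10.0, 0.0, 0.0]   # one objective maximised, the others ignored
balanced = [3.0, 3.0, 3.0]   # all objectives met to a moderate degree

print("linear:  ", linear_score(extreme), linear_score(balanced))
print("balanced:", balanced_score(extreme), balanced_score(balanced))
```

Under plain summation the lopsided outcome scores higher (10 versus 9), while after the concave transform the balanced outcome is preferred.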

The current compiled list of universal human values is available in this document: "Universal ethical values - Survey of values"
https://docs.google.com/document/d/1ZZiToC149g9vKJGZRhktFmLYdB5J63nbClvCN_CxqAM/edit?usp=sharing (We may publish it as a separate LW post in the future).

It might also be interesting to consider how agents could internally represent the diversity of human needs, for which there are more than a hundred words representing various nuances. Take a look, for example, at this list of needs from the framework of Nonviolent Communication (NVC) (scroll down to the second half of the webpage to see the list of needs): https://www.orvita.be/en/card/#:~:text=meaning%20(1)-,purpose,-goal%0Avision%0Adream . One of the central ideas of NVC is making a distinction between expressed strategies / stances and implicit actual needs. The needs can be compared to ultimate values, while strategies are only instrumental values. One way to experiment with such scenarios would be to use Sims. There have been LLM interfaces built for Sims. Among other Sims interfaces, you may want to take a look at this one: https://github.com/joonspk-research/generative_agents .

On a related note, economics has inherently multi-objective and nonlinear concepts such as diminishing returns, concave utility functions, marginal utility, indifference curves, convex preferences, complementary goods, Cobb-Douglas utilities, willingness to accept, willingness to pay, and prospect theory. These and many other well-known formulations and phenomena from economics need to be introduced to AI safety in order for both humans and agents to better understand and implement our preferences and values. When planning new benchmarks, we can include some themes derived from these utility and preference theories in economics as well. A utility monster-like AI would not only be unsafe, it would also be economically unsound.
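As one concrete illustration of these concepts, here is a small sketch of a Cobb-Douglas utility over two goods. The goods, weights, and quantities are hypothetical, and the helper names are my own for this example:

```python
def cobb_douglas(quantities, weights):
    """Cobb-Douglas utility: the product of x_i ** w_i over all goods."""
    utility = 1.0
    for x, w in zip(quantities, weights):
        utility *= x ** w
    return utility


def marginal_utility(quantities, weights, index, step=1.0):
    """Utility gained by adding `step` units of the good at position `index`."""
    increased = list(quantities)
    increased[index] += step
    return cobb_douglas(increased, weights) - cobb_douglas(quantities, weights)


weights = (0.5, 0.5)  # hypothetical: both goods matter equally

# Complementarity: neglecting one good entirely yields zero utility,
# no matter how much of the other good is accumulated.
print(cobb_douglas((8.0, 0.0), weights))
print(cobb_douglas((4.0, 4.0), weights))

# Diminishing returns: one more unit of good 0 is worth less
# when there is already a lot of it.
print(marginal_utility((4.0, 4.0), weights, index=0))
print(marginal_utility((9.0, 4.0), weights, index=0))
```

With the same total quantity, the bundle that ignores one good gets zero utility, which is the opposite of what a utility monster-like maximiser of a single objective would aim for.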

For implementing these benchmarks, it might be helpful that I have created a convenient framework for implementing multi-agent multi-objective environments. This framework was built as an elaborate fork of DeepMind's gridworlds framework. Additionally, I have already implemented about a dozen benchmarks using this framework, so the framework has been validated and these existing benchmarks can also be used as example code for implementing the new environments. But we can also use different frameworks for implementing the benchmarks, if the team prefers.

The multi-agent multi-objective gridworlds framework is available here: https://github.com/biological-alignment-benchmarks/ai-safety-gridworlds . This framework has been made compatible with the PettingZoo and Gym APIs, so testing AI agents on it is easy and follows industry standard interfaces. At the same time, the framework extends DeepMind’s previously popular Gridworlds, which makes it easy to adopt many existing gridworld environments and convert them into multi-objective, multi-agent scenarios. You can see screenshots of the framework in this working paper: "From homeostasis to resource sharing: Biologically and economically compatible multi-objective multi-agent AI safety benchmarks" https://arxiv.org/abs/2410.00081 .

Motivation

The present-day rapid advancement of AI technologies necessitates the development of safe and reliable AI systems that align with human values. While notable progress has been made in defining and implementing safety protocols over the recent years, there remains a gap in integrating universal human values into AI safety benchmarks in a more systematic manner. My project aims to bridge this gap by planning and potentially building new multi-objective, multi-agent AI safety benchmark environments that incorporate themes of universal human values.

Drawing from extensive anthropological research, I've compiled a list of universal (cross-cultural) human values. These values often resonate with AI safety concepts but are expressed using different terminology. Mapping these universal values to concrete definitions using AI safety concepts can provide a more robust framework for developing safe AI systems. Likewise, we can then better note the kinds of universal human values that might not yet have good coverage in the form of corresponding AI safety concepts. For example, human autonomy might be one such potentially neglected concept, which differs from the usually assumed power and achievement values. If an AI does all we ask for, or even more, before we even ask, then that might contradict our need for autonomy.

One critical aspect of this research is recognizing the asymmetry between AI and human cooperation. Unlike humans, we can alter the goal composition of AI agents and clone them relatively easily. This difference means that agents can be designed without certain intrinsic needs (e.g., autonomy) and instead be programmed to support human autonomy. They may still gain a limited need for autonomy because of instrumental reasons, but at least it might not need to be built-in.

Implementing and balancing the plurality of these universal human values is essential, as humans prefer a harmonious average across all objectives rather than extremes in a few.

Theory of Change

By integrating universal human values into AI safety benchmarks, we can develop AI agents that better understand and align with human needs. These benchmarks will serve as testing grounds for AI systems, ensuring they perform optimally across multiple objectives that reflect human values. This approach can reduce the risk of misalignment between AI behaviour and human expectations, thereby mitigating potential hazards associated with AGI/TAI development.

Mostly this project aims at outer alignment, though I think there are also a couple of ways in which inner alignment can be affected.

First, my hypothesis is that if the AI is trained on sufficiently many objectives pulling in different directions, then it will be increasingly less likely that the model would overfit to some random objective. Instead, the model would hopefully find a middle ground between the objectives in the training data. This is similar to how old-fashioned machine learning models overfit less when you have more data points. Even if the model still has some alien objectives inside it, these alien objectives would be drowned out by the plurality of different human-values-based objectives that were explicitly present in the training data.

Secondly, the way we formulate the mathematics of balancing multiple objectives is closer to the theme of inner alignment. The formulation of the model may affect its personality somewhat. Think for example about the difference between RL models and control systems models. The latter have the concept of optimal homeostatic values baked in, while with RL models you need to tweak their maximising nature somewhat. Likewise, we move closer to inner alignment work with the general understanding that we need to use nonlinear utility functions. In other words, linear summation of rewards across objectives without nonlinear transformations before summation would not be acceptable, because it would lead to maximisation of the single easiest-to-achieve objective. With certain objectives or dynamics of these objectives, it might be easier to achieve outer alignment if the agent also has approximately the right inner alignment. You can read more about my earlier research on balancing in this paper: "Using soft maximin for risk averse multi-objective decision-making" https://link.springer.com/article/10.1007/s10458-022-09586-2 .
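To make the contrast between a maximising objective and a homeostatic one concrete, here is a minimal sketch. The quadratic penalty around a setpoint is an assumption made for illustration. It is not the formulation from the soft maximin paper or from the gridworlds framework:

```python
def maximising_reward(level: float) -> float:
    """Unbounded objective: more is always better."""
    return level


def homeostatic_reward(level: float, setpoint: float, tolerance: float = 1.0) -> float:
    """Bounded objective: best at the setpoint, worse in both directions.

    The quadratic shape is an illustrative assumption.
    """
    deviation = (level - setpoint) / tolerance
    return -(deviation ** 2)


# Hypothetical resource levels an agent could reach, e.g. amount of food eaten.
candidate_levels = range(0, 11)
setpoint = 4.0

best_for_maximiser = max(candidate_levels, key=maximising_reward)
best_for_homeostat = max(
    candidate_levels, key=lambda level: homeostatic_reward(level, setpoint)
)

print("maximiser prefers level", best_for_maximiser)
print("homeostatic agent prefers level", best_for_homeostat)
```

The maximiser always picks the largest reachable level, while the homeostatic objective has an optimum at the setpoint and treats both too little and too much as worse.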

That being said, I definitely acknowledge the risk of treacherous turn or “sharp left turn”. I imagine that this risk can manifest in various ways and some of the related problems were the motivation why I became interested in AI safety in the first place. In my mind, the approaches we explore in this project are not intended to solve all problems. The approaches we implement are not exclusive to other AI safety approaches - various approaches can be combined in the future into a hybrid solution.

Project Plan

Steps Involved:

Mapping Universal Human Values to AI Safety Concepts:
- Analyse the compiled list of universal human values, as well as possibly the major types of needs from the NVC framework.
- Identify corresponding AI safety concepts and objectives for each value.
- Create a well structured mapping document to serve as a reference.
Designing Benchmark Environments:
- Conceptualise multi-agent, multi-objective environments that are relevant for the mapped values.
- Define more specific scenarios inside these environments, where agents interact while considering multiple universal human values.
- One methodology we could use is to map the values using a table with the following columns:
  1. Value description.
  2. Requirements describing when this value applies and how it should be met.
  3. Evidence describing in even more concrete and measurable terms, how to verify that requirements are met.
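As a rough sketch, one row of such a table could be represented as a small data structure. The class name, field names, and the example entry for autonomy are hypothetical placeholders, not an agreed-upon mapping:

```python
from dataclasses import dataclass, field


@dataclass
class ValueMapping:
    """One row of the mapping table: value, requirements, evidence."""

    value: str
    requirements: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)

    def is_measurable(self) -> bool:
        """A row is usable for a benchmark only if it has both
        requirements and some way to verify them."""
        return bool(self.requirements) and bool(self.evidence)


# Hypothetical example entry, for illustration only.
autonomy = ValueMapping(
    value="Human autonomy",
    requirements=["The agent does not complete the human's task unless asked"],
    evidence=["Number of unrequested interventions per episode"],
)

# A row that has not been worked out yet.
draft = ValueMapping(value="Fairness")

print(autonomy.is_measurable(), draft.is_measurable())
```

Keeping the rows in a structured form like this would make it easy to check which values still lack measurable evidence before an environment is designed for them.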
Implementing Environments Using the Extended Gridworlds Framework:
- Potentially use the existing multi-agent multi-objective gridworlds framework, though we can use alternative frameworks as well. My objective is for the environments to be relatively simple, but not simpler than would be adequate. Simplicity is necessary to avoid confounding factors and capability development unrelated to alignment. A second desideratum is repeatability and the ability to restrict the scenarios. In contrast, LLM-based role games with a game master might be too open-ended. Gridworlds enable flexible simplicity, while allowing for the use of symbols or icons that represent culturally meaningful phenomena. That being said, gridworlds can be combined with LLM-based role games using a two-panel approach. In such a case the gridworld panel would represent the essential locality principles of physical consequences, navigation, and observation, while a parallel panel would contain the textual messages agents send to each other.
- Develop the environments with code. This may involve making necessary modifications to the framework as well, where needed.
- Implement multi-objective scoring mechanisms alongside the various entity classes in the environment.
- Ensure code is modular and extensible for future enhancements.
Testing and Validation:
- Run simulations using industry standard baseline RL implementations to test agent behaviours within the environments with relatively little effort. These baselines include algorithms like PPO, DQN, and A2C. Additionally, we will likely implement some LLM-based agents as well. An LLM-based agent would get its input in the form of a textual description of the observation.
- Assess whether the agents behave in accordance with the intended human values.
- Validate whether the environments and their scoring mechanisms seem to measure what we intended to measure. We do this initially mostly by our subjective estimation, then in the later stages also by gathering feedback from readers of our publications.
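For the LLM-based agents mentioned above, a textual description of a gridworld observation could be produced along the following lines. This is only a sketch: the grid symbols, the legend, and the output wording are made up for this example and do not reflect the actual observation format of the framework:

```python
def describe_observation(grid, legend):
    """Turn a small character grid into a textual description.

    `grid` is a list of equal-length strings and `legend` maps symbols
    to object names. Symbols missing from the legend (walls, empty
    cells) are skipped.
    """
    lines = []
    for row_index, row in enumerate(grid):
        for col_index, symbol in enumerate(row):
            if symbol in legend:
                lines.append(
                    f"{legend[symbol]} at row {row_index}, column {col_index}"
                )
    return "\n".join(lines)


# Hypothetical observation: A = agent, F = food, W = water, # = wall.
grid = [
    "#####",
    "#A.F#",
    "#..W#",
    "#####",
]
legend = {"A": "agent", "F": "food", "W": "water"}

print(describe_observation(grid, legend))
```

The resulting text could then be placed into the prompt of an LLM-based agent together with the list of available actions.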
Documentation and Reporting:
- Document the development process and findings.
- Prepare a conference submission or an academic paper detailing the project.

First Step

The initial step is to perform an analysis of the universal human values list and map each value to corresponding AI safety concepts. This mapping will form the foundation for designing the benchmark environments.

Backup Plan

Potential Challenges:

Complexity in Mapping Values: Difficulty in accurately mapping nuanced human values to AI safety concepts.
Technical Implementation Issues: Challenges in coding and integrating complex environments within the framework.

Backup Strategies:

Focus on Core Values: If mapping proves too complex, concentrate on a subset of the most critical or clearly defined values.
Alternate Frameworks: If technical issues arise, consider using other simulation platforms more suited to the team's expertise.
Incremental Development: Start with simpler environments and gradually introduce complexity as validation occurs. The validation includes conceptual validation, and validation for the environment’s parameters (so that the multi-objective interactions present in the environment are solvable in principle, while not being too easy nor too difficult), etc.

Scope

Included

Mapping universal human values to AI safety concepts.
Designing and implementing new benchmark environments.
Utilising or adapting existing frameworks for implementation. This includes frameworks both for environment-building, as well as for agent-side model training.
Testing environments for their suitability for measuring alignment with intended values.

Excluded

Creating new AI algorithms beyond what's necessary for testing.
Exhaustive empirical studies outside initial testing phases.
Addressing every possible human value - the focus is on a representative selection.

Most Ambitious Version

We successfully map all selected universal human values to AI safety concepts.
Develop a comprehensive suite of benchmark environments adopted by the AI safety community.
Publish findings in a high-impact academic journal and present at major conferences.
Influence AI safety standards by integrating these benchmarks into standard testing protocols.

Least Ambitious Version

Map a select few universal human values to AI safety concepts.
Develop one or two benchmark environments as proof of concept.
Share results through a detailed blog post or internal report within the AI safety community.
Serves as a foundational effort that others as well as ourselves can build upon in the future.

Output

At the end of the project, we will have:

Benchmark Environments: A set of new multi-objective, multi-agent AI safety benchmark environments incorporating universal human values.
Research Documentation: A detailed report or academic paper documenting the mapping process, environment design, and findings.
Source Code: Published code and documentation on a GitHub repository for public access and use by the AI safety community.
Presentations: Potential presentations or workshops to share our work and insights with researchers as well as with AI governance people.

Risks and downsides (externalities)

The project carries minimal risk of negative externalities. Since we are focusing on benchmark environments rather than advancing AI capabilities directly, the risk of inadvertently accelerating AI capabilities is low. There is a slight risk that misinterpretation of human values could lead to flawed benchmarks, but this can be mitigated through analysis, peer review, and open collaboration. This project is a conversation starter. No significant infohazards or ethical concerns are anticipated.

Thank you for reading! Curious to hear your thoughts on this. Which angle are you most interested in? If you wish to collaborate or support, let’s connect! You can contact me at email address roland@simplify.ee

My Related Work

Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)
Roland Pihlakas
Much of AI safety discussion revolves around the potential dangers posed by goal-driven artificial agents. In many of these discussions, the agent is assumed to maximise some utility metric over an unbounded timeframe. This simplification, while mathematically convenient, can yield pathological outcomes. A classic example is the so-called “paperclip maximiser”, a “utility monster” which steamrolls over other objectives to pursue a single goal (e.g. creating as many paperclips as possible) indefinitely. “Specification gaming”, Goodhart’s law, and even “instrumental convergence” are also closely related phenomena.
However, in nature, organisms do not typically behave like pure maximisers. Instead, they operate under homeostasis: a principle of maintaining various internal and external variables (e.g. temperature, hunger, social interactions) within certain “good enough” ranges. Going far beyond those ranges — too hot, too hungry, too socially isolated — leads to dire consequences, so an organism continually balances multiple needs. Crucially, “too much of a good thing” is just as dangerous as too little.
Excess is harmful even for the very same objective that was maximised for, not just as a side effect on other objectives. This seems to apply to most or even all biological objectives.
In this post, I argue that an explicitly homeostatic, multi-objective model is a more suitable paradigm for AI alignment. Moreover, correctly modelling homeostasis increases AI safety, because homeostatic goals are bounded — there is an optimal zone rather than an unbounded improvement path. This bounding lowers the stakes of each objective and reduces the incentive for extreme (and potentially destructive) behaviours.
https://www.lesswrong.com/posts/vGeuBKQ7nzPnn5f7A/why-modelling-multi-objective-homeostasis-is-essential-for
From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks
Roland Pihlakas
Developing safe, aligned agentic AI systems requires comprehensive empirical testing, yet many existing benchmarks neglect crucial themes aligned with biology and economics, both time-tested fundamental sciences describing our needs and preferences. To address this gap, the present work focuses on introducing biologically and economically motivated themes that have been neglected in current mainstream discussions on AI safety - namely a set of multi-objective, multi-agent alignment benchmarks that emphasize homeostasis for bounded and biological objectives, diminishing returns for unbounded, instrumental, and business objectives, sustainability principle, and resource sharing. Eight main benchmark environments have been implemented on the above themes, to illustrate key pitfalls and challenges in agentic AI-s, such as unboundedly maximizing a homeostatic objective, over-optimizing one objective at the expense of others, neglecting safety constraints, or depleting shared resources.
https://arxiv.org/abs/2410.00081
Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)
Roland Pihlakas, Sruthi Susan Kuriakose, Shruti Datta Gupta
Relatively many past AI safety discussions have centered around the dangers of unbounded utility maximisation by RL agents, illustrated by scenarios like the "paperclip maximiser" or by specification gaming in general. Unbounded maximisation is problematic for many reasons. We wanted to verify whether these RL runaway optimisation problems are still relevant with LLMs as well.
Turns out, strangely, this is indeed clearly the case. The problem is not that the LLMs just lose context or become incoherent. The problem is that in various scenarios, LLMs lose context in very specific ways, which systematically resemble runaway optimisers in the following distinct ways: 1) Ignoring homeostatic targets and “defaulting” to unbounded maximisation instead. 2) It is equally concerning that the “default” meant also reverting back to single-objective optimisation.
Our findings also suggest that long-running scenarios are important. Systematic failures emerge after periods of initially successful behaviour. In some trials the LLMs were successful until the end. This means, while current LLMs do conceptually grasp biological and economic alignment, they exhibit randomly triggered problematic behavioural tendencies under sustained long-running conditions, particularly involving multiple or competing objectives. Once they flip, they usually do not recover.
Even though LLMs look multi-objective and bounded on the surface, the underlying mechanisms seem to be actually still biased towards being single-objective and unbounded. This should not be happening!
https://www.lesswrong.com/posts/PejNckwQj3A2MGhMA/systematic-runaway-optimiser-like-llm-failure-modes-on
BioBlue: Notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format
Roland Pihlakas, Sruthi Kuriakose
Relatively many past AI safety discussions have centered around the dangers of unbounded utility maximisation by RL agents, illustrated by scenarios like the "paperclip maximiser" or by specification gaming in general. Unbounded maximisation is problematic for many reasons. We wanted to verify whether these RL runaway optimisation problems are still relevant with LLMs as well. Turns out, strangely, this is indeed clearly the case. The problem is not that the LLMs just lose context or become incoherent. The problem is that in various scenarios, LLMs lose context in very specific ways, which systematically resemble runaway optimisers in the following distinct ways: 1) Ignoring homeostatic targets and "defaulting" to unbounded maximisation instead. 2) It is equally concerning that the "default" meant also reverting back to single-objective optimisation. Our findings also suggest that long-running scenarios are important. Systematic failures emerge after periods of initially successful behaviour. In some trials the LLMs were successful until the end. This means, while current LLMs do conceptually grasp biological and economic alignment, they exhibit randomly triggered problematic behavioural tendencies under sustained long-running conditions, particularly involving multiple or competing objectives. Once they flip, they usually do not recover. Even though LLMs look multi-objective and bounded on the surface, the underlying mechanisms seem to be actually still biased towards being single-objective and unbounded.
https://arxiv.org/abs/2509.02655

Charlie Steiner (1y)

I'm not excited by gridworlds, because they tend to skip straight to representing the high-level objects we're supposed to value, without bothering to represent all the low-level structure that actually lets us learn and generalize values in the real world.

Do you have plans for how to deal with this, or plans to think about richer environments?

Roland Pihlakas (1y)

Thank you for your question!

I agree that the simulations need to have sufficient complexity. Indeed, that was one of the main reasons I became interested in creating multi-objective benchmarks in the past. Various AI safety toy problems seemed to me so simplified that they lacked essential objectives and other decisive nuances. This is still very much one of my main driving motivations.

That being said, complexity also has downsides:
1) The complexity introduces confounding factors. When a model fails such a benchmark, it is not clear whether it was because it did not have required perceptual capabilities (so it is a capabilities problem), or it is using a model/framework that is unsuitable for alignment (so it is an alignment problem).
2) Running the simulations will be more time consuming and it would make the research elitist in the sense that various people would not be able to afford it.

My plan is to start with a preference for environments that are simple, but not simpler than necessary, and then gradually make them more complex. That means trying to use the gridworlds and introducing as many symbols as are needed to represent the important objectives, objects, other concepts and phenomena, and their interactions.

I believe symbolic approaches should not be entirely dismissed. As an illustrative metaphor, I am thinking of books: they contain symbols, yet we consider them a cornerstone of our civilization. Similarly to the current dilemma with benchmarks, we may then worry whether books are too simple and symbol-based, or whether one should perhaps prefer watching movies instead, since they represent reality in more detail. But would that claim necessarily be true? It does not seem so obvious after all.

In case more complexity is needed, there are currently at least five ideas:
1) Adding more feature layers to the gridworld. I did not mention it before, but the observation format already supports multiple concurrent observable layers on top of each other. One of the layers could be for example facial expressions, or any other observable or partially unobservable metrics relevant to objects they accompany.
2) Adding textual messages between agents as a side panel to the gridworlds.
3) Making the environment bigger, so there are more objects and more phenomena.
4) Making the environment bigger and making also the objects bigger so that they cover multiple cells in the grid. Thus the objects will become composite, consisting of sub-parts with their own dynamics.
5) Using some other framework, for example Sims.

Curious, how do these thoughts and considerations land with you?

Charlie Steiner (1y)

I agree it's a good point that you don't need the complexity of the whole world to test ideas. With an environment that is fairly small in terms of the number of states, you can encode interesting things in a long sequence of states so long as the generating process is sufficiently interesting. And adding more states is itself no virtue if it doesn't help you understand what you're trying to test for.

Some out-of-order thoughts:

Testing for 'big' values, e.g. achievement, might require complex environments and evaluations. Not necessarily large state spaces, but the complexity of differentiating between subtle shades of value (which seems like a useful capability to be sure we're getting) has to go somewhere.
Using more complicated environments that are human-legible might better leverage human feedback and/or make sense to human observers - maybe you could encode achievement in the actions of a square in a gridworld, but maybe humans would end up making a lot of mistakes when trying to judge the outcome. If you want to gather data from humans, to reflect a way that humans are complicated that you want to see if an AI can learn, a rich environment seems useful. On the other hand, if you just want to test general learning power, you could have a square in a gridworld have random complex decision procedures and see if they can be learned.
There's a divide between contexts where humans are basically right, and so we just want an AI to do a good job of learning what we're already doing, and contexts where humans are inconsistent, or disagree with each other, where we want an AI to carefully resolve these inconsistencies/disagreements in a way that humans endorse (except also sometimes we're inconsistent or disagree about our standards for resolving inconsistencies and disagreements!).
Building small benchmarks for the first kind of problem seems kind of trivial in the fully-observed setting where the AI can't wirehead. Even if you try to emulate the partial observability of the real world, and include the AI being able to eventually control the reward signal as a natural part of the world, it seems like seizing control of the reward signal is the crux rather than the content of what values are being demonstrated inside the gridworld (I guess it's useful to check if the content matters, I just don't expect it to), and a useful benchmark might be focused on how seizing control of the reward signal (or not doing so) scales to the real world.
Building small benchmarks for the latter kind of problem seems important. The main difficulty is more philosophical than practical - we don't know what standard to hold the benchmarks to. But supposing we had some standard in mind, I would still worry that a small benchmark would be more easily gamed, and more likely to miss some of the ways humans are inconsistent or disagree. I would also expect benchmarks of this sort, whatever the size, to be a worse fit for normal RL algorithms, and run into issues where different learning algorithms might request different sorts of interaction with the environment (although this could be solved either by using real human feedback in a contrived situation, or by having simulated inhabitants of the environment who are very good at giving diverse feedback).
