The Preference Fulfillment Hypothesis (PFH) asserts that "Humans have an innate motivation ('preference fulfillment', PF) to fulfill the preferences of those they care about". For the rest of this post, I model PF as a relationship between:

  • Some model of an external utility function (exists independently of the primary agent)
  • An external agent
  • An action that the primary agent interprets as granting positive utility in accordance with the external agent's utility function
  • The primary agent
  • Some shared environment

To provide an example: Agent A assigns positive utility to the outcome of Agent B experiencing positive utility (due to PF), and thus comes to associate actions that result in positive utility for Agent B with positive utility for itself. Agent A engaging in PF implies that it has access to some model of Agent B's utility function (a model that isn't necessarily correct), which in practice is both learned and internal (i.e. Agent A is simulating Agent B). Humans converge upon abstractions of the utility functions of agents whose preferences they fulfill (as opposed to perfect low-level models of them), suggesting that some agent specifications might abstract well. As stated in the original PFH post, this has interesting implications for corrigibility.[1]
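
As a toy illustration of this relationship, here is a minimal sketch in which Agent A's utility function includes a term for its imperfect internal model of Agent B's utility. All function names and utility values are illustrative assumptions, not anything specified by the PFH.

```python
# A minimal toy sketch of the PF relationship described above. All function
# names and utility values are illustrative assumptions, not part of the PFH.

def b_true_utility(outcome: str) -> float:
    """Agent B's actual utility function (not directly accessible to A)."""
    return {"tea": 1.0, "coffee": 0.2}.get(outcome, 0.0)

def a_model_of_b(outcome: str) -> float:
    """Agent A's learned, imperfect internal model of B's utility function."""
    return {"tea": 0.8, "coffee": 0.4}.get(outcome, 0.0)  # noisy estimate

def a_utility(outcome: str, pf_weight: float = 1.0) -> float:
    """Under PF, A gains utility from outcomes its *model* says B values."""
    own_term = 0.1  # A is otherwise indifferent between these outcomes
    return own_term + pf_weight * a_model_of_b(outcome)

# A picks the action its model predicts B prefers. The model's magnitudes are
# wrong, but the ranking transfers, so A still fulfills B's actual preference.
choice = max(["tea", "coffee"], key=a_utility)
print(choice)  # tea
```

Note that A's model disagrees with B's true utilities in magnitude; PF only requires the model to be good enough that A's chosen actions actually benefit B.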

Select Agent Specifications as Natural Abstractions

Why Might Some Agent Specifications be Natural Abstractions?

Some agent specifications might be natural abstractions because:

  • All humans appear to converge upon abstractions of the people they are trying to simulate
  • Cooperation requires simulation, and might be a convergently useful capability

Neither of these is a standalone argument for why some agent specifications should be considered to abstract well; they are merely seed evidence. For example, all agent specifications could abstract incredibly poorly, but simulating others could be such an essential skill in the human context that we learn it regardless. To counter this: any complex game in which an agent is trying to optimize for some outcome likely involves both an interaction with another agent and some outcome affected by the behavior of that agent. As a result, simulating other agents is a logical capability to develop under selection pressures like natural selection, as it can enhance an agent's ability to forecast outcomes.

Regardless of whether or not all agent specifications abstract well, humans conclusively simulate other humans, and this is why we do not need to express our moral preferences in formal ways. For example, in most societies death is considered a net negative, and yet we do not actively pursue containing all members of a society in small boxes fitted with feeding tubes such that their lives are maximally extended. The caveat of not wanting to live in a feeding-tube box is implied, and thus from a notion as abstract as "dying is bad", we deduce a wealth of useful information. Importantly, human specifications abstracting well would increase the viability of this as an alignment strategy, but leveraging the simulation of other agents as an alignment technique isn't dependent on it.

Which Agent Specifications Could be Natural Abstractions?

Another aspect to consider is that just because some agent specifications might abstract well does not mean that all do. For example, whilst human specifications may abstract well, the same does not necessarily go for alien specifications. In keeping with the Universality Hypothesis, most intelligences should form similar abstractions of the same specifications, but not every specification need be naturally abstractable by most intelligences.

This does not necessarily undermine the agenda stated in the introduction: regardless of whether or not all agent specifications abstract well, the primary concern from a safety researcher's perspective is whether human specifications do. I am interested in questions like "What if agents are more likely to form abstractions of specifications similar to their own?", and I believe that this hypothesis could again be tested with existing technology. I'm unsure of how a result like "agents are more likely to form abstractions of specifications similar to their own" would conform to the Universality Hypothesis, as it is unclear to me whether an object of abstraction like "specifications similar to the agent's own" can be treated as an object in the same manner "tree" is. This confusion stems from the fact that the former (although a static concept) varies from agent to agent in its implementation.

Testing for Select Agent Specifications as Natural Abstractions

A successful method should attempt to answer one or more of the following questions:

  • Do non-human intelligences engage in the simulation of other agents?
  • Do some agent specifications abstract well?
  • Can these abstractions be modified? If so, how?
  • How can we interpret these abstractions?
  • How can we estimate these abstractions? (What information about the external agent might we need to make these predictions? What about information about the primary agent?)

My first idea:

  1. Approximate a low-level model of some external agent as well as that model's complexity (e.g. as a measure of how many bits it would take to encode that probability distribution using some encoding scheme)
  2. Obtain probability distributions over actions for the same external agent specifications from non-human sources (e.g. LLMs, RL agents)
  3. Forecast the behavior of the external agent using the produced probability distribution
  4. Calculate the divergence of the non-human and low-level distributions using some distance measure, as well as difference in complexity approximations
  5. Using the complexity measures, make inferences regarding the degree of abstraction applied to the external agent specification (e.g. if some agent produces a model with significantly lower estimated complexity than another but maintains a similar forecasting proficiency, it can be assumed that its abstraction contains less redundant information)
  6. Based on the complexity and distance discrepancies between non-human and low-level abstractions, produce information that could be helpful when answering "Do non-human intelligences engage in the simulation of other agents?" and "Do some agent specifications abstract well?" (e.g. "most agents converged on a similar abstraction of x but not of y; they might do this because…")
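
The comparisons in steps 1, 4, and 5 can be sketched as follows. This is a minimal illustration with made-up distributions; Shannon entropy is used as one crude stand-in for the encoding-cost complexity measure that step 1 leaves open, and KL divergence as one possible distance measure for step 4.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in bits -- one candidate distance measure for step 4."""
    return sum(pi * math.log2((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def shannon_complexity(p, eps=1e-12):
    """Entropy in bits: a crude stand-in for the encoding-cost measure in step 1."""
    return -sum(pi * math.log2(pi + eps) for pi in p if pi > 0)

# Step 1: low-level model of the external agent's distribution over 3 actions.
low_level = [0.70, 0.20, 0.10]
# Step 2: distribution elicited from a non-human source (e.g. an LLM) for the
# same agent specification. Both distributions here are invented for illustration.
llm_model = [0.60, 0.25, 0.15]

divergence = kl_divergence(low_level, llm_model)  # step 4: distance measure
complexity_gap = shannon_complexity(low_level) - shannon_complexity(llm_model)

# Step 5: a model with lower estimated complexity but similar forecasting
# proficiency (small divergence) suggests a more compressive abstraction.
print(f"divergence: {divergence:.4f} bits, complexity gap: {complexity_gap:.4f} bits")
```

In a real experiment the low-level model would be fitted to the agent's actual behavior and the complexity measure would need to account for the model itself, not just its output distribution; this sketch only shows the shape of the comparison.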

Cons for this approach:

  • It seems difficult to construct an agent with specifications simple enough to analyze, yet complicated enough to remain distinct from examples seen during training
    • In this case, the experimental data becomes useless, because it tells us nothing about the architecture of the language model or the abstractability of the concept we are testing
    • Imagine applying this method to an LLM trained on tens of thousands of discussions and pieces of literature about game theory: if your scenario resembles one mentioned there, your experimental data is invalid
  • The game in which the external agent exists needs to be very simple, so that developing a low-level approximation of the agent is feasible

Some pros:

  • Easy (not necessarily accurate) means of quantifying the degree of abstraction as well as the similarity between those abstractions and some lower-level model of the external agent
  • Easily extended to answer questions like “Can these abstractions be modified?” as you would already have comparative data for some baseline unmodified abstraction, as well as a framework for quantifying the distance between the modified and unmodified abstractions
  • Provides useful information regarding the questions “Do non-human intelligences engage in the simulation of other agents?” and “Do some agent specifications abstract well?”

A robust proof that human specifications abstract well probably looks like a selection theorem, and might describe a phenomenon like "under the selection pressure of PF, a system will simulate external agents". I would then try to generalize this to describe agent simulation under less niche selection pressures, which I would assume is possible, as humans appear to simulate external agents, a feature that emerged as a result of the selection pressures we are under.

Model of PF

In the context of the example given in the introduction, I find the following a useful representation of my model of PF:

Where E_t is the environment at time t, Ê_t is some internal representation of the environment at time t, a is an action, g_X is some goal for agent X, and α(X) refers to an abstraction of X. A successful formalization of PF should describe how the model above relates to emergent simulative behavior.

Graphically, the selection theorem mentioned earlier should describe “???”.

Implications for Corrigibility

If future models naturally converge upon useful abstractions of human specifications, and these abstractions can be expressed in an interpretable manner, perhaps shaping these abstractions is possible. RLHF could already be doing this, and if we could estimate these abstractions we might be able to apply it in a significantly more targeted manner, increasing its scalability. I am skeptical of RLHF being the only way to shape abstractions of human specifications, and if a PF-esque selection pressure holds true for non-human intelligence, more advanced applied alignment strategies likely become a lot easier to devise.

Distribution Shift

"The hardest part of alignment is getting the AGI to generalize the values we give it to new and different environments." Aligning an AGI to abstractions of those values could result in better generalization. If those values can be estimated via simulating other agents, and the subject of that simulation has specifications that abstract well, distribution shift could become a somewhat smaller hurdle. This seems intuitive, as it appears to be how humans account for distribution shift: in place of complex morality formalizations, we use convenient abstractions, and can act on them under pressures like PF.


It seems almost certain that humans simulate other humans, e.g. through PF. It is probable that alien intelligences do the same (although not necessarily through PF), and that this would entail an abstraction of the specifications of the agent being simulated. By being able to estimate alien abstractions of some general human preferences, inferences could be made regarding whether or not these abstractions are robust or safe.

  1. ^

    Note that the term "positive utility" in this hypothetical refers to positive utility in a relative sense. If Agent B were to pick lottery numbers that Agent A knew would result in Agent B attaining $100, and Agent A then suggested numbers that would result in Agent B attaining $10, this would not conform to this definition of PF (assuming Agent B is trying to optimize for fiscal reward), as Agent A is lowering the utility of the outcome for Agent B, despite that outcome still bearing positive utility relative to not having picked lottery numbers at all.


3 comments

Note: I skimmed before bed, so I might have missed something. With that said -

Dang, I would not say this is "classic first alignment idea" material. It seems, while not qualitatively fundamentally new, a nicely distilled framing of disparate ideas with some insights as a result. I think it carries some severe issues with the foundation it builds on, but it seems to me to reduce them nicely, and might even plug into ideas I've been thinking are plausible as a way to build reliable "coprotection", the word I've been using for a while to refer to an innate others'-preference-fulfillment drive - that is, a network of overlapping utility functions. @carado tells me that relying on networks of utility functions is unreliable if we suddenly see a drastically more powerful singleton, but we'll see, I guess - it doesn't seem totally hopeless, and being able to detect the network of overlapping utility functions would still do a lot for us.

(Obligatory note: if you're not already familiar with how weird they are, utility functions are not even slightly reward functions, due to the fact that utility functions do not include any discounting whatsoever, and allow arbitrary amounts of finesse in the choice between options, both of which are not practical when ranking reward under finite amounts of human capability. Humans do not encode them directly anywhere, and neither do any current AIs, but for theoretical reasons I still suspect we'd like to encode them formally if we could.)

My core worries are about whether your system can be trusted to reliably simulate the actual beings in an acceptable level of relative detail as intelligence increases, and retain appropriate levels of uncertainty without becoming hyperdesperate, and then additionally we would still need to figure out how to make it naturally generate defensible objectives. But it seems to me to be an interesting and promising overall approach.

Have you looked at or the papers mentioned? They seem relevant to your interests. Finite factored sets might also be of interest if you're at all curious to try to design for things that scale to arbitrary levels of superintelligence.

Thank you for the feedback; I have repaired the post introduction in accordance with your commentary on utility functions. I challenge the assumption that a system not being able to reliably simulate an agent with human specifications is worrying, and I would like to make clear that the agenda I am pushing is not:

  1. Capabilities and understanding through simulation scale proportionately
  2. More capable systems can simulate, and therefore comprehend the goals of other systems to a greater extent
  3. By dint of some unknown means we align AGI to this deep understanding of our goals

I agree that in the context of a plan like this a failure to establish robust abstractions of human values could be catastrophic, but when applied to a scenario like trying to align a powerful LLM, being able to estimate and interpret even incorrect abstractions could be vitally important. This could look like:

  1. Estimate the LLM's abstraction of, say, "what do humans want me to refuse to reply to"
  2. Compare this to some desired abstraction
  3. Apply some technique like RLHF accordingly

Of course, an actual implementation probably wouldn't look like that ("what do humans want me to refuse to reply to" isn't necessarily one unified concept that can be easily abstracted and interpreted), but it is a high-level overview of why pursuing questions like "do some specifications abstract well?" could still be useful even if the answer is no.
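
As one hedged sketch of what the three-step loop above might look like, suppose (hypothetically) that an abstraction could be summarized as a feature vector extracted by some interpretability tool; the post specifies no such extraction method, and the vectors below are invented for illustration.

```python
import math

# Hypothetical sketch: compare an estimated abstraction to a desired one,
# assuming both can be summarized as feature vectors. The extraction method
# and all numbers here are assumptions, not anything established in the post.

def cosine_similarity(u, v):
    """Similarity in [-1, 1] between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Step 1: (hypothetical) estimate of the LLM's abstraction of the refusal concept.
estimated = [0.9, 0.1, 0.4]
# Step 2: the abstraction we would want it to hold.
desired = [1.0, 0.0, 0.5]

similarity = cosine_similarity(estimated, desired)
# Step 3: a low similarity flags the concept for targeted feedback (e.g. RLHF).
needs_intervention = similarity < 0.95
```

The point is only the shape of the loop: estimate, compare against a target, and intervene where the gap is largest, rather than applying feedback uniformly.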

I hadn't come across the relative abstracted agency post, but I find its insights incredibly useful. Over the next few days I will update this post to include its terminology. I find it likely that testing whether or not some specifications abstract well would provide useful information as to how targets are modeled as agents, but the usefulness of being able to test this in existing systems depends strongly on how the current LLM paradigm scales as we approach superintelligence. Regardless, I'm sure any indication as to how targets are modeled as agents could be valuable, even in systems incapable of scaling to superintelligence.

Hey, welcome! I am very skeptical that we can get good models of humans out just by finding simple ones according to some simple metric. Instead, the metric we use to evaluate which models are good is itself a complicated bit of human preferences.

Hopefully we can still get the AI to do most of the hard work for us by learning it, but it does mean that that solution is un-"natural" in the way natural abstractions are natural, and would require us to have a better understanding of how to bootstrap value learning from a kernel of labeled data or human feedback.

Anyhow, I still found this post interesting enough to upvote, stick around.