The Preference Fulfillment Hypothesis (PFH) asserts that “humans have an innate motivation ('preference fulfillment', PF) to fulfill the preferences of those they care about”. For the rest of this post, I model PF as a relationship between:
To provide an example: Agent A assigns positive utility to the outcome of Agent B experiencing positive utility due to PF, and thus associates actions that result in positive utility for Agent B with positive utility for itself. Agent A engaging in PF implies that it has access to some model of Agent B’s utility function (a model that isn’t necessarily correct), which in practice is both learned and internal (i.e. Agent A is simulating Agent B). Humans converge upon abstractions of the utility functions of agents whose preferences they fulfill (as opposed to perfect low-level models of them), implying that some agent specifications might abstract well. As stated in the original PFH post, this has interesting implications for corrigibility.
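The relationship above can be sketched in a few lines of code. This is only an illustrative toy, assuming a linear weighting between A's own utility and A's model of B's utility; the function names, the `care_weight` parameter, and the example values are mine, not part of the PFH post.

```python
# Toy sketch of PF: Agent A's utility over an outcome includes a weighted
# term for the utility A *believes* the outcome gives Agent B. The model
# of B is learned and internal (A simulating B), so it may be wrong.

def pf_utility(outcome, base_utility_a, model_of_b_utility, care_weight=0.5):
    """Agent A's total utility for an outcome under PF.

    `model_of_b_utility` stands in for A's possibly-incorrect internal
    model of B's utility function.
    """
    return base_utility_a(outcome) + care_weight * model_of_b_utility(outcome)

# Example: A mildly dislikes sharing its food (-1), but models B as
# enjoying it (+4). Under PF, sharing comes out net-positive for A.
base_a = lambda o: -1 if o == "share_food" else 0
model_b = lambda o: 4 if o == "share_food" else 0
print(pf_utility("share_food", base_a, model_b))  # 1.0
```

The point of the sketch is only to show where the simulation enters: A never consults B's true preferences, only its own model of them.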
Some agent specifications might be natural abstractions because:
Neither of these is a standalone argument for why some agent specifications should be considered to abstract well; they are mere seed evidence. For example, all agent specifications could abstract incredibly poorly, yet simulating other agents could be such an essential skill in the human context that we learn it regardless. To counter this: any complex game in which an agent is trying to optimize for some outcome likely involves both an interaction with another agent and an outcome affected by that agent's behavior. Simulating other agents can therefore enhance an agent's ability to forecast outcomes, making it a logical capability to develop under selection pressures like natural selection.

Regardless of whether all agent specifications abstract well, humans evidently simulate other humans, and this is why we do not need to express our moral preferences formally. For example, in most societies death is considered a net negative, and yet we do not pursue confining all members of a society to small boxes fitted with feeding tubes so that their lives are maximally extended. The caveat of not wanting to live in a feeding-tube box is implied, and thus from a notion as abstract as “dying is bad” we deduce a wealth of useful information. Importantly, human specifications abstracting well increases the viability of this as an alignment strategy, but leveraging the simulation of other agents as an alignment technique isn’t dependent on this.
Another aspect to consider is that just because some agent specifications might abstract well does not mean that all do. For example, whilst human specifications may abstract well, the same does not necessarily go for alien specifications. In keeping with the Universality Hypothesis, most intelligences should form similar abstractions of the same specifications, but not every specification need be naturally abstractable by most intelligences.
This need not influence the agenda stated in the introduction: regardless of whether all agent specifications abstract well, the primary concern from the perspective of a safety researcher is whether human specifications do. I am interested in questions like “What if agents are more likely to form abstractions of specifications similar to their own?”, and I believe that this hypothesis could again be tested with existing technology. I’m unsure how a result like “agents are more likely to form abstractions of specifications similar to their own” would conform to the Universality Hypothesis, as it is unclear to me whether an object of abstraction like “specifications similar to the agent’s own” could be treated as an object in the same manner “tree” is. This confusion stems from the fact that the former (although a static concept) varies from agent to agent in its implementation.
A successful method should attempt to answer one or more of the following questions:
A robust proof that human specifications abstract well probably looks like a selection theorem, and might describe a phenomenon like “under the selection pressure of PF, a system will simulate external agents”. I would then try to generalize this to describe agent simulation under less niche selection pressures, which I assume is possible, as humans appear to simulate external agents, a feature that emerged as a result of the selection pressures we are under.
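To make the shape of such a theorem concrete, here is a hedged sketch of what the statement might look like, in the post's own terms. The symbols, the linear objective, and the form of the implication are my guesses at a template, not an established result:

```latex
% Hypothetical shape of the claim, not a proven theorem.
% Suppose agent A is selected to maximize a PF objective that weights
% A's estimate of B's utility, which must be inferred from observing B:
%   U_A(a) = u_A(a) + c \cdot \widehat{u}_B(a), \qquad c > 0.
% The hoped-for selection theorem would then have the form:
\[
\pi^* \in \arg\max_{\pi} \; \mathbb{E}\!\left[\, u_A + c \cdot \widehat{u}_B \mid \pi \,\right]
\;\implies\;
\pi^* \text{ contains an internal model } \mathrm{abs}(g_B) \text{ of agent } B .
\]
```

That is, any policy optimal under the PF selection pressure would be forced to carry an abstraction of B's goal, because the \(\widehat{u}_B\) term cannot be maximized without one.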
In the context of the example given in the introduction, I find this a useful representation of my model of PF:
Where E_t is the environment at time t, iE_t is some internal representation of the environment at time t, a is an action, and g_N is some goal for agent N; abs(n) refers to an abstraction of n. A successful formalization of PF should describe how the model above relates to emergent simulative behavior.
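A single timestep of this model can be sketched directly in the post's symbols. The function names, the action set, and the additive scoring are illustrative assumptions on my part:

```python
# One timestep for a PF agent A, using the post's symbols:
# E_t (environment), iE_t (internal representation), a (action),
# g_N (agent N's goal), abs(g_B) (A's abstraction of B's goal).

def step(E_t, perceive, g_A, abs_g_B):
    """A perceives E_t into iE_t, then picks the action that best serves
    its own goal g_A plus its abstraction abs(g_B) of B's goal --
    i.e. A scores candidate actions partly by simulating B."""
    iE_t = perceive(E_t)  # E_t -> iE_t
    actions = ["wait", "help_B"]
    return max(actions, key=lambda a: g_A(iE_t, a) + abs_g_B(iE_t, a))

# Toy usage: A is indifferent on its own account, but models B as
# preferring to be helped, so PF tips the choice toward helping.
choice = step(
    E_t={"B_needs_help": True},
    perceive=lambda E: E,
    g_A=lambda iE, a: 0.0,
    abs_g_B=lambda iE, a: 1.0 if a == "help_B" else 0.0,
)
print(choice)  # help_B
```

The simulative behavior the post cares about lives entirely in `abs_g_B`: drop that term and A's choice no longer depends on any model of B at all.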
Graphically, the selection theorem mentioned earlier should describe “???”.
If future models naturally converge upon useful abstractions of human specifications, and these abstractions can be expressed in an interpretable manner, perhaps shaping these abstractions is possible. RLHF could already be doing this, and if we could estimate these abstractions we might be able to apply it in a significantly more targeted manner, increasing its scalability. I am skeptical of RLHF being the only way to shape abstractions of human specifications, and if a PF-esque selection pressure holds true for non-human intelligence, more advanced applied alignment strategies likely become a lot easier to devise.
"The hardest part of alignment is getting the AGI to generalize the values we give it to new and different environments." Aligning an AGI to abstractions of those values could result in better generalization. If those values can be estimated via simulating other agents, and the subject of that simulation has specifications that abstract well, solving distribution shift could be a mildly smaller hurdle. It seems intuitive that this is so due to it appearing to be how humans account for distribution shift. In place of complex morality formalizations, we use convenient abstractions and can use them to make decisions under pressures like PF.
It seems almost certain that humans simulate other humans, e.g. through PF. It is probable that alien intelligences do the same (although not necessarily through PF), and that this would entail an abstraction of the specifications of the agent being simulated. By estimating alien abstractions of some general human preferences, inferences could be made as to whether these abstractions are robust or safe.
Note that the term "positive utility" in this hypothetical refers to positive utility in a relative sense. If Agent B were about to pick lottery numbers that Agent A knew would result in Agent B attaining $100, and Agent A then suggested numbers that would result in Agent B attaining $10, this would not conform to this definition of PF (assuming Agent B is trying to optimize for fiscal reward), as Agent A lowers the utility of the outcome for Agent B, even though that outcome still bears positive utility relative to not having picked lottery numbers at all.
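The relative reading above reduces to a one-line predicate. This is just the footnote's lottery example restated; the function name is hypothetical:

```python
# The footnote's "relative" sense of positive utility: A's intervention
# counts as PF only if B ends up better off than in the counterfactual
# where A did not intervene -- merely leaving B above zero is not enough.

def is_pf(utility_with_intervention, utility_counterfactual):
    """True only if A's intervention raises B's utility relative to
    what B would have attained without it."""
    return utility_with_intervention > utility_counterfactual

# B would have picked numbers worth $100; A's suggestion yields $10.
# Positive in absolute terms, but not PF under the relative definition.
print(is_pf(10, 100))   # False
print(is_pf(100, 10))   # True
```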
Note: I skimmed before bed, so I might have missed something. With that said -
Dang, I would not say this is "classic first alignment idea" material; while not qualitatively fundamentally new, it seems a nicely distilled framing of disparate ideas, with some insights as a result. I think it carries some severe issues with the foundation it builds on, but seems to me to reduce them nicely, and might even plug into ideas I've been thinking are plausible as a way to build reliable "coprotection", the word I've been using for a while to refer to an innate drive to fulfill others' preferences - that is, a network of overlapping utility functions. @carado tells me that relying on networks of utility functions is unreliable if we suddenly see a drastically more powerful singleton, but we'll see, I guess - it doesn't seem totally hopeless, and being able to detect the network of overlapping utility functions still seems like it would do a lot for us.
(Obligatory note: if you're not already familiar with how weird they are, utility functions are not even slightly reward functions: utility functions do not include any discounting whatsoever, and they allow arbitrary amounts of finesse in the choice between options, neither of which is practical when ranking reward under finite amounts of human capability. Humans do not encode them directly anywhere, and neither do any current AIs, but for theoretical reasons I still suspect we'd like to encode them formally if we could.)
My core worries are about whether your system can be trusted to reliably simulate the actual beings in an acceptable level of relative detail as intelligence increases, and retain appropriate levels of uncertainty without becoming hyperdesperate, and then additionally we would still need to figure out how to make it naturally generate defensible objectives. But it seems to me to be an interesting and promising overall approach.
Have you looked at https://www.lesswrong.com/posts/JqWQxTyWxig8Ltd2p/relative-abstracted-agency or the papers mentioned? They seem relevant to your interests. Finite factored sets might also be of interest if you're at all curious to try to design for things that scale to arbitrary levels of superintelligence.
Thank you for the feedback; I have repaired the post introduction in accordance with your commentary on utility functions. I challenge the assumption that a system not being able to reliably simulate an agent with human specifications is worrying, and I would like to make clear that the agenda I am pushing is not:
I agree that in the context of a plan like this a failure to establish robust abstractions of human values could be catastrophic, but when applied to a scenario like trying to align a powerful LLM, being able to estimate and interpret even incorrect abstractions could be vitally important. This could look like:
Of course, an actual implementation probably wouldn't look like that ("what do humans want me to refuse to reply to" isn't necessarily one unified concept that can be easily abstracted and interpreted), but it is a high-level overview of why pursuing questions like "do some specifications abstract well?" could still be useful even if they do not.

I hadn't come across the relative abstracted agency post, but I find its insights incredibly useful. Over the next few days I will update this post to include its terminology. I find it likely that testing whether or not some specifications abstract well would provide useful information as to how targets are modeled as agents, but the usefulness of being able to test this in existing systems depends strongly on how the current LLM paradigm scales as we approach superintelligence. Regardless, I'm sure any indication as to how targets are modeled as agents could be valuable, even in systems incapable of scaling to superintelligence.
Hey, welcome! I am very skeptical that we can get good models of humans out just by finding simple ones according to some simple metric. Instead, the metric we use to evaluate which models are good is itself a complicated bit of human preferences.
Hopefully we can still get the AI to do most of the hard work for us by learning it, but it does mean that that solution is un-"natural" in the way natural abstractions are natural, and would require us to have a better understanding of how to bootstrap value learning from a kernel of labeled data or human feedback.
Anyhow, I still found this post interesting enough to upvote, stick around.