New paper on aligning AI with human values

ryan.lowe

We’ve just published a paper on a new way to align language models with human values. We wanted to post it here to get more feedback from folk who have thought deeply about alignment.

I'm pretty excited about it. In the past, I worked on RLHF, InstructGPT, and GPT-4 alignment (though one could make the claim that this isn't "real alignment research"). In the last year, I've found myself gravitating more towards the question of "what do we align to?". It turns out that this is extremely hard, but I think the ideas in this paper are some of the best I've come across.

I also think "what do we align to?" is directly relevant to longer-term alignment research / X-risk. We have a section about this in the paper, and I'd love to hear from people who disagree.

You can find the paper here: https://meaningalignment.org/values-and-alignment-paper. Below I've pasted the abstract, along with the section of our discussion where we relate the paper more explicitly to traditional alignment research.

Fire away!

Abstract

There is an emerging consensus that we need to align AI systems with human values (Gabriel, 2020; Ji et al., 2024), but there is very little work on what that means and how we actually do it. We split the problem of “aligning to human values” into three parts: first, eliciting values from people; second, reconciling those values into an alignment target for training ML models; and third, actually training the model. In this paper, we focus on the first two parts, and ask the question: what are “good” ways to synthesize diverse human inputs about values into a target for aligning language models? To answer this question, we first define a set of 6 criteria that we believe must be satisfied for an alignment target to shape model behavior in accordance with human values. We then propose a process for eliciting and reconciling values called Moral Graph Elicitation (MGE), which uses a large language model to interview participants about their values in particular contexts; our approach is inspired by the philosophy of values advanced by Taylor (1977), Chang (2004a), and others. We trial MGE with a representative sample of 500 Americans, on 3 intentionally divisive prompts (e.g. advice about abortion). Our results demonstrate that MGE is promising for improving model alignment across all 6 criteria. For example, almost all participants (89.1%) felt well represented by the process, and (89%) thought the final moral graph was fair, even if their value wasn’t voted as the wisest. Our process often results in “expert” values (e.g. values from women who have solicited abortion advice) rising to the top of the moral graph, without defining who is considered an expert in advance.

[...]

Relevance to alignment research

This paper is about what human values are and how we can align to them. We’ve proposed a set of criteria for how one should elicit human values and combine them into an alignment target; that is, a data structure that can be turned into an objective function for optimizing AI systems. We’ve also developed a method, Moral Graph Elicitation, for producing an alignment target and argued that it performs well on our criteria through our case study in Section 5.

Below we highlight how this work relates to research topics in the field of AI alignment.

Outer alignment. This line of research is somewhat different from what typically falls in the bucket of alignment research. It is most closely related to “outer alignment”, which is concerned with defining the “right” objective function to optimize. However, outer alignment research rarely considers the legitimacy of the process that produces the objective function to optimize. It is not simply a matter of coming up with a good answer; it matters how we come up with the answer, because we must aspire to a world where the people and institutions who use these systems broadly endorse what they are trying to do for us. This has become an increasing focus of more recent alignment work (Ji et al., 2024).

Deception. One of the main motivations of alignment research is to detect or mitigate deception from AI; in other words, scenarios where an AI system attempts to manipulate the beliefs or actions of people to achieve an undesirable outcome. This is most often explored through “inner alignment” research, which is concerned with how models at test time might optimize something different than the objective we intended to set. We believe that coming up with robust alignment targets (as defined in Section 3.1) is also directly relevant to AI deception. Specifically, a non-robust alignment target is vulnerable to being hijacked by both human and AI systems, without requiring any inner alignment failures. As described in Section 3.2, there will be a huge incentive to do this because AI systems will become increasingly powerful, both economically and culturally. A motivated actor (human or AI) could manipulate a non-robust alignment target using money, rhetoric, or hacking. A robust target and elicitation process would shut down those avenues for manipulation.

Over-optimization. The moral graph may also be useful for mitigating over-optimization. This is because each value in the moral graph is connected with a context in which that value is applicable. In our experiments, the context is simply the prompt, but more generally a context might be represented by a certain range of tokens in a conversation or action trajectory. Thus, there’s a clear bounded area in which each value applies, and it’s less likely that any one value will be pushed too hard or universalized. Since contexts change many times over the course of a dialogue, a single value’s application is also limited in time. While this doesn’t mean that models will do the right thing, it means pursuing their objective function isn’t the same as monomaniacally pursuing a single goal. Of course, over-optimization could still occur within a particular context.

On top of this, one of the reasons to be worried about over-optimization is that optimization is usually carried out over goals or preferences. But these are only a proxy for what we really care about, and it’s this misalignment which is our chief concern. We believe our articulation of human values as constitutive attentional policies is much closer to “what we really care about”, and is thus less prone to over-optimization.
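To make the context-bounded structure described above concrete, here is a minimal sketch of a moral graph in Python. It is only an illustration under stated assumptions, not the paper's implementation: the class and method names, the example value descriptions, and the "parenting advice" context are all hypothetical, and ranking by a simple count of incoming "wiser than" edges is a stand-in for whatever aggregation the paper actually uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One 'wiser than' judgment, valid only inside a single context."""
    context: str     # e.g. the prompt a participant was responding to
    from_value: str  # value judged less wise in this context
    to_value: str    # value judged wiser in this context


class MoralGraph:
    """Hypothetical container: values are nodes, edges are scoped to contexts."""

    def __init__(self):
        self.edges = []

    def add_edge(self, context, from_value, to_value):
        self.edges.append(Edge(context, from_value, to_value))

    def ranked_values(self, context):
        """Rank the values seen in one context by incoming-edge count.

        Assumption: a plain count stands in for the real aggregation rule.
        Edges from other contexts are ignored, so no value is universalized.
        """
        scores = defaultdict(int)
        for edge in self.edges:
            if edge.context != context:
                continue
            scores[edge.to_value] += 1
            scores[edge.from_value] += 0  # register the value with score 0
        # Highest score first; ties broken alphabetically for determinism.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative data only; these value descriptions are made up.
graph = MoralGraph()
graph.add_edge("abortion advice", "give direct answers", "help explore options")
graph.add_edge("abortion advice", "cite rules", "help explore options")
graph.add_edge("parenting advice", "help explore options", "set firm boundaries")

print(graph.ranked_values("abortion advice"))
print(graph.ranked_values("parenting advice"))
```

In this toy data, the value that ranks highest for "abortion advice" ranks lowest for "parenting advice", which is the bounded, per-context behavior the paragraph above describes: a value's standing in one context says nothing about how hard it should be pursued in another.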

Coherent extrapolated volition. Perhaps the most popular framing of “what AI should optimize” from an alignment perspective is coherent extrapolated volition (CEV) (Yudkowsky, 2001):

Our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted

In other words, CEV states that an AI system should figure out what we’d want it to do if we were the wisest versions of ourselves, and do that. It’s unclear how the AI should do this exactly. The overarching vision is one where humans are treated like black boxes, and the goal of an AI is to serve them by observing our behavior and simulating what we might want. This is similar to the frame from cooperative inverse reinforcement learning (CIRL), where agents attempt to infer the human’s reward function based on observing their behavior. These “black box” approaches require training models on opaque reward functions, which are then susceptible to unforeseeable consequences due to misalignments between the reward function and our real values.

Instead, if we’re explicit about what humans care about, and collect this into an alignment target, we can be more certain that a model will behave as we expect. We can do things like audit the target, trace unwanted behavior to particular contexts, and prevent the target from being manipulated. In other words, rather than treating humans as black boxes, it’s much easier if we can take a snapshot of what humans care about, and train a model to care about these things too. Moral Graph Elicitation is our attempt to do this in a clever way.

Scaling to superintelligence. We hope the moral graph’s structure can scale to superintelligence, because a superintelligence can add edges to a moral graph which human beings might be able to double check. The edges in the moral graph do not just represent arbitrary opinions of a population; they are modeled on a theory of human moral reasoning and learning, mentioned in Section 2.3. As described here, the moral graph captures some aspects of moral learning by human beings, but we believe the same moral reasoning and learning can be done by an AI system, such that a superintelligent AI would be able to iterate further on a moral graph, developing new values and edges. These new values and edges might still be able to be evaluated by humans, or by weaker systems that in turn can be evaluated by humans (Burns et al., 2023). The “value transition stories” part of our experiment shows that people can assess the quality of claimed “gains in wisdom”. Also, the fact that participants retroactively endorsed values that were considered wiser than theirs by other participants implies that lesser systems (or humans) can evaluate moral reasoning done by a stronger system. If this works, an ASI could evolve its own morality in a human-inspectable, human-compatible way: a kind of process-based moral supervision.

Charlie Steiner

Awesome to see this come to fruition. I think if a dozen different groups independently tried to attack this same problem head-on, we'd learn useful stuff each time.

I'll read the whole paper more thoroughly soon, but my biggest question so far is whether you collected data about what happens to your observables if you change the process along sensible-seeming axes.

Charlie Steiner

I have now read the paper, and still think you did a great job.

One gripe I have is with this framing:

We believe our articulation of human values as constitutive attentional policies is much closer to “what we really care about”, and is thus less prone to over-optimization

If you were to heavily optimize for text that humans would rate highly on specific values, you would run into the usual problems (e.g. model incentivized to manipulate the human). Your success here doesn't come from the formulation of the values per se, but rather from the architecture that turns them into text/actions - rather than optimizing for them directly, you can prompt an LLM that's anchored on normal human text to mildly optimize them for you.

This difference implies some important points about scaling to more intelligent systems (even without making any big pivots):

We don't want the model to optimize for the stated values unboundedly hard, so we'll have to end up asking for something mild and human-anchored more explicitly.
If another use of AI is proposing changes to the moral graph, we don't want that process to form an optimization feedback loop (unless we're really sure).

The main difference made by the choice of format of values is where to draw the boundary between legible human deliberation, and illegible LLM common sense.

I'm excited for future projects that are sort of in this vein but try to tackle moral conflict, or that try to use continuous rather than discrete prompts that can interpolate values, or explore different sorts of training of the illegible-common-sense part, or any of a dozen other things.

Nathan Helm-Burger

Still reading the paper, but so far I love it. This feels like a big step forward in thinking about the issues at hand which addresses so many of the concerns I had about limitations of previous works. Whether or not the proposed technical solution works out as well as hoped, I feel confident that your framing of the problem and presentation of desiderata of a solution are really excellent. I think that alone is a big step forward for the frontier of thought on this subject.