By Ben Smith, Roland Pihlakas, and Robert Klassert
Thanks to Linda Linsefors, Alex Turner, Richard Ngo, Peter Vamplew, JJ Hepburn, Tan Zhi-Xuan, Remmelt Ellen, Kaj Sotala, Koen Holtman, and Søren Elverlin for their time and kind remarks in reviewing this essay. Thanks to the organisers of the AI Safety Camp for incubating this project from its inception and for connecting our team.
For the last 9 months, we have been investigating the case for a multi-objective approach to reinforcement learning in AI Safety. Based on our work so far, we’re moderately convinced that multi-objective reinforcement learning should be explored as a useful way to help us understand ways in which we can achieve safe superintelligence. We’re writing this post to explain why, to inform readers of the work we and our colleagues are doing in this area, and to invite critical feedback about our approach and about multi-objective RL in general.
We were first attracted to the multi-objective space because human values are inherently multi-objective, in any number of frames: deontological, utilitarian, and virtue ethics; egotistical vs. moral objectives; maximizing life values including hedonistic pleasure, eudaemonic meaning, or the enjoyment of power and status. AGI systems aiming to solve for human values are likely to be multi-objective themselves: if not by explicit design, then because multi-objective systems would emerge from learning about human preferences.
As a first pass at technical research in this area, we took a commonly-used example, the “BreakableBottles” problem, and showed that for low-impact AI, an agent could solve this toy problem more quickly if it uses a conservative but flexible trade-off between alignment and performance values, compared to using a thresholded alignment system that first secures a certain amount of alignment and only then maximizes performance. Such tradeoffs will be critical for understanding the conflicts between more abstract human objectives that a human-preference-maximizing AGI would encounter.
To send feedback you can (a) contribute to discussion by commenting on this forum post; (b) send feedback anonymously; or (c) directly send feedback to Ben (email@example.com), Roland (firstname.lastname@example.org), or Robert (email@example.com).
In reinforcement learning, an agent learns which actions lead to reward, and selects them. Multi-objective RL typically describes games in which an agent selects an action based on its ability to fulfill more than one objective. In a low-impact AI context, objectives might be “make money” and “avoid negative impacts on my environment”. At any point in time, an agent can assess the value of each action by its ability to fulfill each of these objectives. The values of each action in terms of each objective make up a value vector. An area of focus for research in multi-objective RL is how to combine that vector into a single scalar value representing the overall value of an action, to be compared to its alternatives, or sometimes, how to compare the vectors directly. Agents learn the consequences of each action in terms of each of their objectives, and actions are evaluated based on their consequences with respect to each objective. It has previously been argued (Vamplew et al., 2017) that human-aligned artificial intelligence is a multi-objective problem. Objectives can be combined through various means, such as achieving a thresholded minimum for one objective and maximizing another, or through some kind of non-linear weighted combination of each objective into a single reward function. At a high level, this is as simple as combining the outputs from possibly transformed individual objective rewards and selecting an action based on that combination. Exploring ways to combine objectives in ways that embed principles we care about, like conservatism, is a primary goal for multi-objective RL research and a key reason we value multi-objective RL research distinct from single-objective RL.
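To make the idea of a value vector and its combination concrete, here is a minimal Python sketch. The action names, the numbers, and the helper functions (`linear_scalarize`, `select_action`, `thresholded_select`) are all hypothetical and chosen purely for illustration; they are not taken from any particular paper or library.

```python
from typing import Callable, Dict, List

# Hypothetical value estimates per action, one entry per objective:
# index 0 = "make money", index 1 = "avoid negative impacts".
# The numbers are made up for illustration.
value_vectors: Dict[str, List[float]] = {
    "aggressive": [10.0, -8.0],
    "moderate": [6.0, -1.0],
    "idle": [0.0, 0.0],
}


def linear_scalarize(values: List[float], weights: List[float]) -> float:
    """Combine a value vector into one scalar with a weighted sum."""
    return sum(w * v for w, v in zip(weights, values))


def select_action(
    vectors: Dict[str, List[float]],
    scalarize: Callable[[List[float]], float],
) -> str:
    """Pick the action whose scalarized value is highest."""
    return max(vectors, key=lambda a: scalarize(vectors[a]))


def thresholded_select(vectors: Dict[str, List[float]], threshold: float) -> str:
    """Require the impact objective to meet a threshold, then maximize money."""
    acceptable = {a: v for a, v in vectors.items() if v[1] >= threshold}
    if not acceptable:
        # No action is acceptable: fall back to the least harmful one.
        return max(vectors, key=lambda a: vectors[a][1])
    return max(acceptable, key=lambda a: acceptable[a][0])


# Equal weights favour the balanced action ...
equal = select_action(value_vectors, lambda v: linear_scalarize(v, [1.0, 1.0]))
# ... while down-weighting the impact objective favours the aggressive one.
money_first = select_action(value_vectors, lambda v: linear_scalarize(v, [1.0, 0.1]))
print(equal, money_first, thresholded_select(value_vectors, -2.0))
```

The same value vectors lead to different choices depending on how they are combined, which is why the choice of combination rule is a research question in its own right.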
It seems to us there are some good reasons to explore multi-objective RL for applications in aligning superintelligence. The three main reasons are the potential to reduce Goodharting, the parallels with biological human intelligence and broader philosophical and societal objectives, and the potential to better understand systems that may develop multiple objectives even from a single final goal. There are also some fundamental problems that need to be solved if it is to be useful, and although this essay primarily addresses potential opportunities, we touch on a few challenges we’ve identified in a section below. Overall, we think using pluralistic value systems in agents has so far received too little attention.
Low-impact RL aims to set two objectives for an RL agent. The first, the Primary objective, is to achieve some goal. This could be any reasonable goal we want to achieve, such as maximizing human happiness, maximizing GDP, or just making a private profit. The second, the Safety objective, is to have as little impact as possible on things unrelated to the primary objective while the primary objective is achieved. Despite the names, the Safety objective usually has a higher priority than the Primary objective.
The low-impact approach lessens the risk of undesirable changes by punishing any change that is not an explicit part of the primary objective. There are many proposals for how to define a measure of low-impact, e.g., deviations from the default course of the world, irreversibility, relative reachability, or attainable utility preservation.
Like low-impact RL, multi-objective RL balances an agent’s objectives, but its aims are more expansive and it balances more than two objectives. We might configure a multi-objective RL with ‘Safety’ objectives as one or more objectives among many. But the aim, rather than to constrain a single objective with a Safety objective, is to constrain a range of objectives with each other. Additionally, multi-objective RL work often explores methods for non-linear scalarization, whereas low-impact RL work to date has typically used linear scalarization (see Vamplew et al., 2021).
This essay primarily makes the case for exploring multi-objective RL in the context of AI Alignment, so we haven’t aspired to present a fully objective list of possible pros and cons. With that said, we have identified several potential problems that we are concerned could threaten the relevance or usefulness of a multi-objective RL agenda. In particular, these might be best seen as plausibility problems. They could conceivably prevent us from actually implementing a system that is capable of intelligently balancing multiple objectives.
Multi-objective RL is already an ongoing field of research in academia. Its focus is not primarily on AGI Alignment (although we’ll highlight a few researchers within the alignment community below), and we believe that if applied further in AGI Alignment, multi-objective RL research is likely to yield useful insight. Although the objective scale calibration problem, the wireheading problem, and others, are currently unsolved and are relevant to AGI Alignment, we see opportunities to make progress in these critical areas, including existing work that, in our view, makes progress on various aspects of the calibration problem (Vamplew 2021, Turner, Hadfield-Menell, Tadepalli, 2020). Peter Vamplew has been exploring multi-attribute approaches to low-impact AI and has demonstrated novel impact-based ways to trade off primary and alignment objectives. Alexander Turner and colleagues, working in low-impact AI research, use a multi-objective space to build a conservative agent that prefers to preserve attainable rewards by avoiding actions that close off options. A key area of interest is exploring how to balance, in non-linear fashion, a set of objectives such that some intuitively appealing outcome is achieved, and our own workshop paper is one example of this.
Even if AGI could derive all useful human objectives through a single directive to “satisfy human preferences” as a single final goal, better understanding multi-objective RL will be useful for understanding how such an AGI might balance competing priorities. That is because human preferences are multi-objective, and so even a human-preference-maximizing agent will, in an emergent sense, become a multi-objective agent, developing multiple sub-objectives to fulfill. Consequently, studying explicitly multi-objective systems is likely to provide insight into how those objectives are likely to play off against one another.
There are a number of questions within multi-objective reinforcement learning that are interesting to explore: this is our first attempt at sketching out a research agenda for the area. Some of these questions, like the potential problems mentioned above, could represent risks that multi-objective RL turns out to be less relevant to AI Alignment. Others are important for knowing how to build and apply multi-objective RL, but are not decisive for its relevance to AI Alignment.
Which of these problems seems particularly important to you?
At the Multi-objective Decision-Making Workshop 2021, we recently presented work on multi-objective reinforcement learning that describes a concave non-linear transform, which achieves a conservative outcome by magnifying possible losses more than possible gains. A number of researchers presented various projects on multi-objective decision-making. Many of these could have broader relevance for AGI Alignment, and we believe the implications of work like this for AGI Alignment should be more explicitly explored. One particularly important relevant paper was “Multi-Objective Decision Making for Trustworthy AI” by Mannion, Heintz, Karimpanal, and Vamplew. The authors explore why multi-objective work makes an AI trustworthy; we believe their arguments likely apply as much for transformative AGI as they do for present-day AI systems.
In writing up our work, “Soft maximin approaches to Multi-Objective Decision-making for encoding human intuitive values”, we were interested in multi-objective decision-making because of the potential for an agent to balance conflicting moral priorities. To do this, we wanted to design an agent that would prioritize avoiding ‘moral losses’ over seeking ‘moral gains’, without being paralysed by inaction if all options involved tradeoffs, as moral choices so often do. So, we explored a conservative transformation function that prioritizes the avoidance of losses more than accruing gains, imposing diminishing returns on larger gains but computing exponentially larger negative utilities as costs grow larger.
This model incentivizes an agent to balance each objective conservatively. Past work had designed agents that use a thresholded value for their alignment objective and only optimize for performance once alignment has become satisfactory. In many circumstances it might be desirable for agents to learn to optimize for both objectives simultaneously, and our method provides a way to do that, while actually yielding superior performance on alignment in some circumstances.
Our group, along with many of the other presenters from that workshop, is publishing its ideas in a special issue of Autonomous Agents and Multi-Agent Systems, which comes out in April 2022.
We are currently exploring appropriate calibration for objectives in a set of toy problems introduced by Vamplew et al. (2021). In particular, we’re interested in the relative performance of a continuous non-linear transformation function compared to a discrete, thresholded transformation function on each of the tasks, as well as how robust each function’s performance is to variance in the task and its reward structure.
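As a rough illustration of the kind of transformation described above, the following Python sketch uses one function with the stated shape: logarithmic (diminishing) returns on gains and exponentially growing penalties on losses. The exact form, and the names `conservative_transform` and `conservative_value`, are assumptions made for this sketch and may differ from the function used in the paper.

```python
import math
from typing import List


def conservative_transform(x: float) -> float:
    """Illustrative loss-averse transform (an assumed form, not the paper's).

    Gains (x >= 0) get diminishing returns via log(x + 1).
    Losses (x < 0) are magnified exponentially via 1 - exp(-x).
    Both branches equal 0 at x = 0, so the function is continuous.
    """
    if x >= 0:
        return math.log(x + 1.0)
    return 1.0 - math.exp(-x)


def conservative_value(values: List[float]) -> float:
    """Transform each objective's value separately, then sum the results."""
    return sum(conservative_transform(v) for v in values)


# Two hypothetical options with the same plain sum (2.0):
balanced = [1.0, 1.0]    # a small gain on both objectives
lopsided = [5.0, -3.0]   # a large gain on one, a loss on the other

print(sum(balanced), sum(lopsided))
print(conservative_value(balanced), conservative_value(lopsided))
```

A plain sum cannot tell the two options apart, whereas the transformed sum strongly prefers the balanced one, because the loss on the second objective is penalized far more than the extra gain on the first is rewarded.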
We invite critical feedback about our approach to this topic, about our potential research directions, and about the broad relevance of multi-objective reinforcement learning to AGI Alignment. We will be very grateful for any comments you provide below! Which of the open questions in multi-objective AI do you think are most compelling or important for AGI Alignment research? Do some seem irrelevant or trivial? Are there others we have missed that you believe are important?
Great post, thanks for writing it!!
The links to http://modem2021.cs.nuigalway.ie/ are down at the moment, is that temporary, or did the website move or something?
Is it fair to say that all the things you're doing with multi-objective RL could also be called "single-objective RL with a more complicated objective"? Like, if you calculate the vector of values V, and then use a scalarization function S, then I could just say to you "Nope, you're doing normal single-objective RL, using the objective function S(V)." Right?
(Not that there's anything wrong with that, just want to make sure I understand.)
…this pops out at me because the two reasons I personally like multi-objective RL are not like that. Instead they're things that I think you genuinely can't do with one objective function, even a complicated one built out of multiple pieces combined nonlinearly. Namely, (1) transparency/interpretability [because a human can inspect the vector V], and (2) real-time control [because a human can change the scalarization function on the fly]. Incidentally, I think (2) is part of how brains work; an example of the real-time control is that if you're hungry, entertaining a plan that involves eating gets extra points from the brainstem/hypothalamus (positive coefficient), whereas if you're nauseous, it loses points (negative coefficient). That's my model anyway, you can disagree :) As for transparency/interpretability, I've suggested that maybe the vector V should have thousands of entries, like one for every word in the dictionary … or even millions of entries, or infinity, I dunno, can't have too much of a good thing. :-)
You can apply the nonlinear transformation either to the rewards or to the Q values. The aggregation can occur only after transformation. When the transformation is applied to Q values, the aggregation takes place quite late in the process: as Ben said, during action selection.
Both the approach of transforming the rewards and the approach of transforming the Q values are valid, but they have different philosophical interpretations and also lead to different agent behaviour in experiments. I think both approaches need more research.
For example, I would say that transforming the rewards instead of the Q values is more risk-averse as well as "fair" towards individual timesteps, since it does not average out the negative outcomes across time before exponentiating them. But it also results in slower learning by the agent.
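The difference between the two orderings can be seen in a small Python sketch. It uses a plain discounted return over a made-up reward sequence as a stand-in for a learned Q value, and an assumed loss-averse transform (`transform`); neither is the exact setup from our experiments.

```python
import math
from typing import List


def transform(x: float) -> float:
    """Assumed loss-averse transform: log gains, exponential losses."""
    return math.log(x + 1.0) if x >= 0 else 1.0 - math.exp(-x)


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Discounted sum of a reward sequence (a stand-in for a Q value)."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


# Hypothetical rewards for ONE objective, alternating gains and losses.
rewards = [3.0, -3.0, 3.0, -3.0]
gamma = 0.9

# Option A: accumulate raw rewards first, transform the resulting value.
# Losses are offset by gains across time before the transform sees them.
transformed_value = transform(discounted_return(rewards, gamma))

# Option B: transform each timestep's reward first, then accumulate.
# Every individual loss is magnified before any averaging can happen.
value_of_transformed = discounted_return([transform(r) for r in rewards], gamma)

print(transformed_value, value_of_transformed)
```

With these numbers, Option A ends up slightly positive while Option B is strongly negative, which matches the intuition that transforming rewards is the more risk-averse choice.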
Finally there is a third approach which uses lexicographical ordering between objectives or sets of objectives. Vamplew has done work on this direction. This approach is truly multi-objective in the sense that there is no aggregation at all. Instead the vectors must be compared during RL action selection without aggregation. The downside is that it is unwieldy to have many objectives (or sets of objectives) lexicographically ordered.
I imagine that the lexicographical approach and our continuous nonlinear transformation approaches are complementary. There could be for example two main sets of objectives: one set for alignment objectives, the other set for performance objectives. Inside a set there would be nonlinear transformation and then aggregation applied, but between the sets there would be lexicographical ordering applied. In other words there would be a hierarchy of objectives. By having only two sets in lexicographical ordering the lexicographical ordering does not become unwieldy.
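A minimal Python sketch of such a two-level hierarchy follows. It relies on the fact that Python compares tuples lexicographically. The action names, the numbers, and the transform are hypothetical, and exact tuple comparison is a simplification: a practical version would probably need a tolerance or threshold when comparing alignment scores.

```python
import math
from typing import Dict, List, Tuple


def transform(x: float) -> float:
    """Assumed loss-averse transform: log gains, exponential losses."""
    return math.log(x + 1.0) if x >= 0 else 1.0 - math.exp(-x)


def aggregate(values: List[float]) -> float:
    """Nonlinear transformation followed by aggregation within one set."""
    return sum(transform(v) for v in values)


def hierarchical_key(
    alignment_values: List[float], performance_values: List[float]
) -> Tuple[float, float]:
    """Alignment score first, performance score second.

    Tuples compare lexicographically, so performance only matters
    between actions whose alignment scores are equal.
    """
    return (aggregate(alignment_values), aggregate(performance_values))


# Hypothetical actions: (alignment objective values, performance objective values)
actions: Dict[str, Tuple[List[float], List[float]]] = {
    "fast_but_risky": ([-2.0, 0.0], [9.0, 4.0]),
    "careful": ([0.0, -0.5], [3.0, 2.0]),
    "careful_and_slow": ([0.0, -0.5], [1.0, 1.0]),
}

best = max(actions, key=lambda a: hierarchical_key(*actions[a]))
print(best)
```

Here the risky action is ruled out on alignment grounds regardless of its performance, and performance only breaks the tie between the two careful actions.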
This approach would be somewhat analogous to the approach used in constraint programming, though more flexible. The safety objectives would act as a constraint on the performance objectives. It is almost absurd that this approach is missing from classical naive RL, even though it is essential, widely known, and technically well developed in practical applications, namely in constraint programming! In the hybrid approach proposed above, the difference from classical constraint programming would be that among the safety objectives there would still be flexibility and the ability to trade (in a risk-averse way).
Finally, when we say "multi-objective", it does not just refer to the technical details of the computation. It also stresses the importance of researching and making more explicit the inherent presence, and even structure, of multiple objectives inside any abstract top objective, so as to encode knowledge in a way that constrains incorrect solutions but not correct ones. It also acknowledges the potential existence of even more complex, nonlinear interactions between these multiple objectives. We have not yet focused on nonlinear interactions between the objectives, but these interactions may be relevant in the future.
I totally agree that in a reasonable agent the objectives or target values / set-points do change, as is also exemplified by biological systems.
Until the Modem website is back up, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing
That's right. What I mainly have in mind is a vector of Q-learned values V and a scalarization function that combines them in some (probably non-linear) way. Note that in our technical work, the combination occurs during action selection, not during reward assignment and learning.
I guess whether one calls this "multi-objective RL" is semantic. Because objectives are combined during action selection, not during learning itself, I would not call it "single objective RL with a complicated objective". If you combined objectives during reward, then I could call it that.
re: your example of real-time control during hunger, I think yours is a pretty reasonable model. I haven't thought about homeostatic processes in this project (my upcoming paper is all about them!). Definitely am not suggesting that our particular implementation of "MORL" (if we can call it that) is the only or even the best sort of MORL. I'm just trying to get started on understanding it! I really like the way you put it. It makes me think that perhaps the brain is a sort of multi-objective decision-making system with no single combinatory mechanism at all except for the emergent winner of whatever kind of output happens in a particular context--that could plausibly be different depending on whether an action is moving limbs, talking, or mentally setting an intention for a long term plan.
Thanks for writing this up! I support your call for more alignment
research that looks more deeply at the structure of the
objective/reward function. In general I feel that the reward function part of the
alignment problem/solution space could use much more attention,
especially because I do not expect the traditional ML research community to look
Traditional basic ML research tends to abstract away from the problem
of writing an aligned reward function: it is all about investigating
improvements to general-purpose machine learning, machine learning
that can optimize for any possible 'black box' reward function R.
In the work you did, you show that this black box view of the reward
function is too narrow. Once you open up the black box and treat the
reward function as a vector, you can define additional criteria about
how machine learning performance can be aligned or unaligned.
In general, I found that once you take the leap and start
contemplating reward function design, certain problems of AI alignment
can become much more tractable. To give an example: the management of
self-modification incentives in agents becomes kind of trivial if you
can add terms to the reward function which read out some physical
sensors, see for example section 5 of my paper
So I have been somewhat
puzzled by the question of why there is so little alignment research
in this direction, or why so few people step up and point out that this
kind of stuff is trivial. Maybe this is because improving the reward function
is not considered to be a part of ML research. If I try to manage
self-modification incentives with my hands tied behind my back,
without being allowed to install physical sensors coupled to reward
function terms, the whole problem becomes much less tractable. Not
completely intractable, but the solutions I then find (see
this earlier paper) are mathematically much more
complex, and less robust under mistakes of
I sometimes have the suspicion that there are whole non-ML conferences
or bodies of literature devoted to alignment related reward function
design, but I am just not seeing them. Unfortunately, it looks like
the modem2021 workshop website with the papers you linked to is
currently down. It was working two weeks ago.
So a general literature search related question: while doing your
project, did you encounter any interesting conferences or papers that
I should be reading, if I want to read more work on aligned reward
function design? I have already read Human-aligned artificial
intelligence is a multiobjective
Until the Modem website is back up, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing
The paper is now published with open access here:
The only resource I'd recommend, beyond MODEM, when that's back up, and our upcoming JAAMAS special issue, is to check out Elicit, Ought's GPT-3-based AI lit search engine (yes, they're teaching GPT-3 about how to create a superintelligent AI. hmm). It's in beta, but if they waitlist you and don't let you in, email me and I'll suggest they add you. I wouldn't say it'll necessarily show you research you're not aware of, but I found it very useful for getting into the AI Alignment literature for the first time myself.