# Ω 20

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

There have been multiple practical suggestions for methods that we could use to extract the values of a given human. Here are four common classes of such methods:

• Methods that put high weight on human (bounded) quasi-rationality, or revealed preferences. For example, we can assume that Kasparov was actually trying to win against Deep Blue, not trying desperately to lose while inadvertently playing excellent chess.
• Methods that pay attention to our explicitly stated values.
• Methods that use regret, surprise, joy, or similar emotions, to estimate what humans actually want. This could be seen as a form of human temporal-difference (TD) learning.
• Methods based on an explicit procedure for constructing the values, such as CEV and Paul's indirect normativity.

## Divergent methods

The first question is why we would expect these methods to point even vaguely in the same direction. They all take very different approaches - why do we think they're measuring the same thing?

The answer is that they roughly match up in the situations we encounter every day. In such typical situations, people who feel regret are likely to act to avoid that situation again, to express displeasure about the situation, etc.

By analogy, consider a town where there are only two weather events: bright sunny days and snow storms. In that town there is a strong correlation between barometric pressure, wind speed, cloud cover, and temperature. All four indicators track different things, but, in this town, they are basically interchangeable.

But if the weather grows more diverse, this correlation can break down. Rain storms, cloudy days, meteor impacts: all these can disrupt the alignment of the different indicators.
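The analogy can be made concrete with a small simulation. This is only an illustrative sketch: the weather types `overcast_calm` and `windy_clear`, and all the numeric values, are made-up assumptions chosen to show the effect, not real meteorological data. It measures how well two indicators (pressure and cloud cover) track each other in a two-weather town, and then again once the weather becomes more diverse.

```python
import random
from math import sqrt

# Hypothetical (pressure in hPa, cloud-cover fraction) centres per weather type.
# These numbers are invented for illustration only.
CENTRES = {
    "sunny": (1025.0, 0.10),
    "snow_storm": (990.0, 0.95),
    "overcast_calm": (1025.0, 0.90),  # high pressure, yet cloudy
    "windy_clear": (990.0, 0.15),     # low pressure, yet clear
}

def sample_day(rng, kinds):
    """Draw one noisy (pressure, cloud cover) reading for a random weather type."""
    pressure, cloud = CENTRES[rng.choice(kinds)]
    noisy_cloud = min(1.0, max(0.0, cloud + rng.gauss(0, 0.05)))
    return pressure + rng.gauss(0, 3.0), noisy_cloud

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def indicator_correlation(kinds, n_days=2000, seed=0):
    """Correlation between the two indicators over n_days simulated days."""
    rng = random.Random(seed)
    days = [sample_day(rng, kinds) for _ in range(n_days)]
    pressures, clouds = zip(*days)
    return pearson(pressures, clouds)

two_regime = indicator_correlation(["sunny", "snow_storm"])
diverse = indicator_correlation(list(CENTRES))
print(f"two-weather town: r = {two_regime:.2f}")
print(f"diverse weather:  r = {diverse:.2f}")
```

Under these assumptions, the two indicators are almost perfectly (negatively) correlated in the two-weather town, so either one could stand in for the other. Once the extra weather types appear, the correlation falls close to zero, even though neither indicator has changed what it measures.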

Similarly, we expect that an AI could remove us from typical situations and put us into extreme situations - at least "extreme" from the perspective of the everyday world where we forged the intuitions that those methods of extracting values roughly match up. Not only do we expect this, but we desire this: a world without absolute poverty, for example, is the kind of world we would want the AI to move us into, if it could.

In those extreme and unprecedented situations, we could end up with revealed preferences pointing one way, stated preferences another, while regret and CEV point in different directions entirely. In that case, we might be tempted to ask "should we follow regret or stated preferences?" But that would be the wrong question to ask: our methods are no longer correlated with each other, let alone with some fundamental measure of human values.

We are thus in an undefined state; in order to continue, we need a meta-method that decides between the different methods. But what criteria could such a meta-method use for deciding (note that simply getting human feedback is not generically an option)? Well, it would have to select the method which best matches up with human values in this extreme situation. To do that, it needs a definition - a theory - of what human values actually are.

## Underdefined methods

The previous section understates the problems with purely practical ways of assessing human values. It pointed out divergences between the methods in "extreme situations". Perhaps we were imagining these extreme situations as the equivalent of a meteor impact on a weather system: bizarre edge cases where reasonable methods finally break down.

But all those methods actually fail in typical situations as well. If we interpret the methods naively, they fail often. For example, in 1919, some of the Chicago White Sox baseball team were actually trying to lose. If we ask someone their stated values in a political debate or a courtroom, we don't expect an honest answer. Emotion-based approaches fail in situations where humans deliberately expose themselves to nostalgia, or fear, or other "negative" emotions (e.g. through scary movies). And there are failure modes for the explicit procedures, too.

This is true if we interpret the methods naively. If we were more "reasonable" or "sophisticated", we would point out that we don't expect those methods to be valid in every typical situation. In fact, we can do better than that: we have a good intuitive understanding of when the methods succeed and when they fail, and different people have similar intuitions (we all understand that people are more honest in relaxed private settings than in stressful public ones, for example). It's as if we lived in a town with either sunny days or snow storms except on weekends. Then everyone could agree that the different indicators correlate during the week. So the more sophisticated methods would include something like "ignore the data if it's Saturday or Sunday".

But there are problems with this analogy. Unlike for the weather, there is no clear principle for deciding when it's the equivalent of the weekend. Yes, we have an intuitive grasp of when stated preferences fail, for instance. But as Moravec's paradox shows, an intuitive understanding doesn't translate into an explicit, formal definition - and it's that kind of formal definition that we need if we want to code up those methods. Even worse, we don't all agree as to when the methods fail. For example, some economists deny the very existence of mental illness, while psychiatrists (and most laypeople) very much feel these exist.

## Human judgement and machine patching

So figuring out whether the methods apply is an exercise in human judgement. Figuring out whether the methods have gone wrong is a similar exercise (see the Last Judge in CEV). And figuring out what to do when they don't apply is also an exercise in human judgement - if we judge that someone is lying about their stated preferences, we could just reverse their statement to get their true values.

So we need to patch the methods using our human judgement. And probably patch the patches and so on. Not only is the patching process a terrible and incomplete way of constructing a safe goal for the AI, but human judgements are not consistent - we can be swayed in things as basic as whether a behaviour is rational, let alone all the situational biases that cloud our assessments of more complicated issues.

So obviously, the solution to these problems is to figure out which human is best in their judgements, and then to see under what circumstances these judgements can be least biased, and how to present the information to them in the most impartial way and then automate that judgement...

Stop that. It's silly. The correct solution is not to assess the rationality of human judgements of methods of extracting human values. The correct solution is to come up with a better theoretical definition of what human values are. Armed with such a theory, we can resolve or ignore the above issues in a direct and principled way.

# Building a theory of human values

Just because we need a theory of human values doesn't mean that it's easy to find one - the universe is cruel like that.

A big part of my current approach is to build such a theory. I will present an overview of my theory in a subsequent post, though most of the pieces have appeared in past posts already. My approach uses three key components:

1. A way of defining the basic preferences (and basic meta-preferences) of a given human, even if these are under-defined or situational.
2. A method for synthesising such basic preferences into a single utility function or similar object.
3. A guarantee we won't end up in a terrible place, due to noise or different choices in the two definitions above.

Wei Dai:

> In those extreme and unprecedented situations, we could end up with revealed preferences pointing one way, stated preferences another, while regret and CEV point in different directions entirely. In that case, we might be tempted to ask “should we follow regret or stated preferences?” But that would be the wrong question to ask: our methods are no longer correlated with each other, let alone with some fundamental measure of human values.

The last part of this doesn't make sense to me. CEV is rather underdefined, but Paul’s indirect normativity (which you also cited in the OP as being in the same category as CEV) essentially consists of a group of virtual humans in some ideal virtual environment trying to determine the fundamental measure of human values as best as they can. Why would you not expect the output of that to be correlated with some fundamental measure of human values? If their output isn't correlated with that, how can we expect to do any better?

Indirect normativity has specific failure modes - e.g. siren worlds, or social pressures going bad, or humans getting very twisted in that ideal environment in ways that we can't yet predict. More to the point, these failure modes are ones that we can talk about from outside - we can say things like "these precautions should prevent the humans from getting too twisted, but we can't fully guarantee it".

That means that we can't use indirect normativity as a definition of human values, as we already know how it could fail. A better understanding of what values are could result in being able to automate the checking as to whether it failed or not, which would mean that we could include that in the definition.

> More to the point, these failure modes are ones that we can talk about from outside

So can the idealized humans inside a definition of indirect normativity, which motivates them to develop some theory and then quarantine parts of the process to examine their behavior from outside the quarantined parts. If that is allowed, any failure mode that can be fixed by noticing a bug in a running system becomes anti-inductive: if you can anticipate it, it won't be present.

Yes, that's the almost fully general counterargument: punt all the problems to the wiser versions of ourselves.

But some of these problems are issues that I specifically came up with. I don't trust that idealised non-mes would necessarily have realised these problems even if put in that idealised situation. Or they might have come up with them too late, after they had already altered themselves.

I also don't think that I'm particularly special, so other people can and will think up problems with the system that hadn't occurred to me or anyone else.

This suggests that we'd need to include a huge amount of different idealised humans in the scheme. Which, in turn, increases the chance of the scheme failing due to social dynamics, unless we design it carefully ahead of time.

So I think it is highly valuable to get a lot of people thinking about the potential flaws and improvements for the system before implementing it.

That's why I think that "punting to the wiser versions of ourselves" is useful, but not a sufficient answer. The better we can solve the key questions ("what are these 'wiser' versions?", "how is the whole setup designed?", "what questions exactly is it trying to answer?"), the better the wiser ourselves will be at their tasks.

> The better we can solve the key questions ("what are these 'wiser' versions?", "how is the whole setup designed?", "what questions exactly is it trying to answer?"), the better the wiser ourselves will be at their tasks.

I feel like this statement suggests that we might not be doomed if we make a bunch of progress, but not full progress, on these questions. I agree with that assessment, but on reading the post it felt like the post was making the claim "Unless we fully specify a correct theory of human values, we are doomed".

I think that I'd view something like Paul's indirect normativity approach as requiring that we do enough thinking in advance to get some critical set of considerations known by the participating humans, but once that's in place we should be able to go from this core set to get the rest of the considerations. And it seems possible that we can do this without a fully-solved theory of human value (but any theoretical progress in advance we can make on defining human value is quite useful).

I currently agree with this view. But I'd add that a theory of human values is a direct way to solve some of the critical considerations.

> Yes, that's the almost fully general counterargument: punt all the problems to the wiser versions of ourselves.

It's not clear what the relevant difference is between then and now, so the argument that it's more important to solve a problem now is as suspect as the argument that the problem should be solved later.

How are we currently in a better position to influence the outcome? If we are, then the reason for being in a better position is a more important feature of the present situation than object-level solutions that we can produce.

We have a much clearer understanding of the pressures we are under now than of the pressures that simulated versions of ourselves would be under in the future. Also, we agree much more strongly with the values of our current selves than with the values of possible simulated future selves.

Consequently, we should try and solve early the problems with value alignment, and punt technical problems to our future simulated selves.

> How are we currently in a better position to influence the outcome?

It's not particularly a question of influencing the outcome, but of reaching the right solution. It would be a tragedy if our future selves had great influence, but pernicious values.

Promoted to curated: You've written a lot of good posts about AI Alignment and related problems, but most of them are a bit too inaccessible to curate, or only really make sense when read together with a lot of other posts. And besides that, I think this post is one of the best and most important ones that I think you've written in the last year, and that I've found to clarify some of my thoughts on this topic, so I think it made sense to curate it.

Thanks a lot for all the work that you are doing, and I hope this post can serve as a good starting point for people who are interested in reading more of your research.

Cheers!

Maybe I'm reading your post wrong, but it seems that you're assuming that a coherent approach is needed in a way that could be counter-productive. I think that a model of an individual's preferences is likely to be better represented by taking multiple approaches, where each fails differently. I'd think that a method that extends or uses revealed preferences would have advantages and disadvantages that none of, say, stated preferences, TD Learning, CEV, or indirect normativity share, and the same would be true for each of that list. I think that we want that type of robust multi-model approach as part of the way we mitigate over-optimization failures, and to limit our downside from model specification errors.

(I also think that we might be better off building AI to evaluate actions on the basis of some moral congress approach using differently elicited preferences across multiple groups, and where decisions need a super-majority of some sort as a hedge against over-optimization of an incompletely specified version of morality. But it may be over-restrictive, and not allow any actions - so it's a weakly held theory, and I haven't discussed it with anyone.)

> I think that a model of an individual's preferences is likely to be better represented by taking multiple approaches, where each fails differently.

I agree. But what counts as a failure? Unless we have a theory of what we're trying to define, we can't define failure beyond our own vague intuitions. But once we have a better theory, defining failure becomes a lot easier.

I agree, and think work in the area is valuable, but would still argue that unless we expect a correct and coherent answer, any single approach is going to be less effective than an average of (contradictory, somewhat unclear) different models.

As an analogue, I think that effort into improving individual prediction accuracy and calibration is valuable, but for most estimation questions, I'd bet on an average of 50 untrained idiots over any single superforecaster.
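Whether that bet pays off depends on how the individual errors are related. The following is a rough sketch under loudly stated assumptions, none of which come from the comment itself: forecasts are the truth plus Gaussian noise, each novice is four times noisier than the expert, and `shared_bias_sd` (a hypothetical parameter) controls an error common to the whole crowd.

```python
import random

def mean_abs_errors(n_questions=2000, crowd_size=50, expert_sd=1.0,
                    novice_sd=4.0, shared_bias_sd=0.0, seed=0):
    """Return (expert MAE, crowd-average MAE) over simulated estimation questions.

    All noise levels are illustrative assumptions, not empirical values.
    """
    rng = random.Random(seed)
    expert_total = 0.0
    crowd_total = 0.0
    for _ in range(n_questions):
        truth = rng.uniform(0, 100)
        expert_guess = truth + rng.gauss(0, expert_sd)
        # An error shared by every novice on this question (zero by default).
        bias = rng.gauss(0, shared_bias_sd)
        guesses = [truth + bias + rng.gauss(0, novice_sd)
                   for _ in range(crowd_size)]
        crowd_guess = sum(guesses) / crowd_size
        expert_total += abs(expert_guess - truth)
        crowd_total += abs(crowd_guess - truth)
    return expert_total / n_questions, crowd_total / n_questions

expert_mae, crowd_mae = mean_abs_errors()
print(f"independent errors: expert {expert_mae:.2f}, crowd {crowd_mae:.2f}")

expert_mae_b, crowd_mae_b = mean_abs_errors(shared_bias_sd=2.0)
print(f"shared bias:        expert {expert_mae_b:.2f}, crowd {crowd_mae_b:.2f}")
```

With independent errors, averaging shrinks the crowd's noise by roughly the square root of the crowd size, so the crowd beats the single expert in this toy setup. When the novices share a common bias, averaging cannot remove it and the expert comes out ahead. This is the same issue as with the value-extraction methods: combining them helps only to the extent that they fail differently.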

Imagine you have 10 sensors measuring different things. The environment (including things you can affect) permutes their values over time. Let's say they operate like thermometers and you're trying to keep them within specified ranges. If the sensors are trinary (too low, too high, just right) you already have 3^10 = 59,049 states to navigate in your tradeoff space. The higher the resolution of the sensors, the faster the combinatorial explosion. So a small number of parameters leads to a very complex-seeming situation.
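The arithmetic behind that figure is easy to check. Below is a minimal sketch, where the helper name `count_joint_states` is purely illustrative and the sensors are assumed to vary independently.

```python
from itertools import product

def count_joint_states(num_sensors: int, levels_per_sensor: int) -> int:
    """Number of distinct joint readings for independent discrete sensors."""
    return levels_per_sensor ** num_sensors

# The case described above: 10 sensors, each too low / just right / too high.
trinary_states = count_joint_states(10, 3)
print(trinary_states)  # 59049

# Cross-check the formula by brute-force enumeration on a smaller case.
levels = ("too low", "just right", "too high")
assert len(list(product(levels, repeat=4))) == count_joint_states(4, 3)

# Same 10 sensors, but with 10 distinguishable levels each.
print(count_joint_states(10, 10))  # 10000000000
```

Going from 3 to 10 levels per sensor takes the state count from about sixty thousand to ten billion, without adding a single sensor.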

I strongly agree, but I think the format of the thing we get, and how to apply it, are still going to require more thought.

Human values as they exist inside humans are going to exist natively as several different, perhaps conflicting, ways of judging human internal ways of representing the world. So first you have to make a model of a human, and figure out how you're going to locate intentional-stance elements like "representation of the world." Then you run into ontological crises from moving the human's models and judgments into some common, more accurate model (that an AI might use). Get the wrong answer in one of these ontological crises, and the modeled utility function may assign high value to something we would regard as deceptive, or as wireheading the human (such reactions might give some hints towards how we want to resolve such ontological crises).

Once we're comparing human judgments on a level playing field, we can still run into problems of conflicts, problems of circularity, and other weird meta-level conflicts where we don't value some of our own values, which I'm not sure how to address in a principled way. But suppose we compress these judgments into one utility function within the larger model. Are we then done? I'm not sure.