MIRI recently blogged about the workshop paper that I presented at AAAI.

My abstract:

Hypothetical “value learning” AIs learn human values and then try to act according to those values. The design of such AIs, however, is hampered by the fact that there exists no satisfactory definition of what exactly human values are. After arguing that the standard concept of preference is insufficient as a definition, I draw on reinforcement learning theory, emotion research, and moral psychology to offer an alternative definition. In this definition, human values are conceptualized as mental representations that encode the brain’s value function (in the reinforcement learning sense) by being imbued with a context-sensitive affective gloss. I finish with a discussion of the implications that this hypothesis has for the design of value learners.

Their summary:

Economic treatments of agency standardly assume that preferences encode some consistent ordering over world-states revealed in agents’ choices. Real-world preferences, however, have structure that is not always captured in economic models. A person can have conflicting preferences about whether to study for an exam, for example, and the choice they end up making may depend on complex, context-sensitive psychological dynamics, rather than on a simple comparison of two numbers representing how much one wants to study or not study.

Sotala argues that our preferences are better understood in terms of evolutionary theory and reinforcement learning. Humans evolved to pursue activities that are likely to lead to certain outcomes — outcomes that tended to improve our ancestors’ fitness. We prefer those outcomes, even if they no longer actually maximize fitness; and we also prefer events that we have learned tend to produce such outcomes.

Affect and emotion, on Sotala’s account, psychologically mediate our preferences. We enjoy and desire states that are highly rewarding in our evolved reward function. Over time, we also learn to enjoy and desire states that seem likely to lead to high-reward states. On this view, our preferences function to group together events that lead on expectation to similarly rewarding outcomes for similar reasons; and over our lifetimes we come to inherently value states that lead to high reward, instead of just valuing such states instrumentally. Rather than directly mapping onto our rewards, our preferences map onto our expectation of rewards.
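The claim that preferences come to track the *expectation* of reward, rather than reward itself, is exactly what the temporal-difference family of reinforcement learning algorithms formalizes. As a minimal sketch (my own illustration, not from the paper; the state names and reward scheme are invented), tabular TD(0) assigns value to a state that never directly yields reward but reliably precedes a rewarding one:

```python
# Tabular TD(0): states that merely *precede* reward come to carry
# value of their own, mirroring the claim that preferences map onto
# expected reward rather than reward directly.

def td_learn(episodes, alpha=0.1, gamma=0.9):
    """Update rule: V(s) += alpha * (r + gamma * V(s') - V(s))."""
    V = {}
    for episode in episodes:
        for (s, r, s_next) in episode:
            v_next = V.get(s_next, 0.0) if s_next is not None else 0.0
            V[s] = V.get(s, 0.0) + alpha * (r + gamma * v_next - V.get(s, 0.0))
    return V

# "dancing" yields zero immediate reward, but always leads to a
# rewarding state; after enough experience it accumulates value itself.
episode = [("dancing", 0.0, "reward_state"), ("reward_state", 1.0, None)]
V = td_learn([episode] * 200)
assert V["dancing"] > 0.5  # the precursor state has become valuable
```

The learned value of "dancing" approaches gamma times the value of the rewarding state it leads to, which is one concrete reading of "coming to inherently value states that lead to high reward."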

Sotala proposes that value learning systems informed by this model of human psychology could more reliably reconstruct human values. On this model, for example, we can expect human preferences to change as we find new ways to move toward high-reward states. New experiences can change which states my emotions categorize as “likely to lead to reward,” and they can thereby modify which states I enjoy and desire. Value learning systems that take these facts about humans’ psychological dynamics into account may be better equipped to take our likely future preferences into account, rather than optimizing for our current preferences alone.

Would be curious to hear whether anyone here has any thoughts. This is basically a "putting rough ideas together and seeing if they make any sense" kind of paper, aimed at clarifying the hypothesis and seeing whether others can find any obvious holes in it, rather than presenting a mature scientific theory.





Okay, so let's use some concrete examples to see if I understand this abstract correctly.

You say that the chain of causation is from fitness (natural selection) ---> outcomes ---> activities

So, for example: reproduction ---> sex ---> flirting/dancing/tattooing/money/bodybuilding.

Natural selection programs us to have a terminal goal of reproduction. HOWEVER, it would be a bad idea for an AI to conclude, "OK, humans want reproduction? I'll give them reproduction. I'll help the humans reproduce 10 quadrillion people. The more reproduction, the better, right?"

The AI would need to look ahead and see, "OK, the programmed goal of reproduction has caused humans to prefer a specific outcome, sex, which tended to lead to reproduction in the original (ancestral) programming environment, but might no longer do so. Humans have, in other words, come to cherish sex as a terminal goal in its own right through their affective responses to its reward payoff. So, let's make sure that humans can have as much sex as possible, regardless of whether it will really lead to more reproduction. That will make humans happy, right?"

But then the AI would need to look ahead one step further and see, "OK, the preferred outcome of sex has, in turn, caused humans to enjoy, for their own sake, specific activities that, in the experience and learning of particular humans in their singular lifetimes (we are no longer talking about instinctual programming here, but rather culture), have tended in their particular circumstances to lead to this preferred outcome of sex. In one culture, humans found that flirting tended to lead to sex, and so they formed a positive affective connotation with flirting and came to view flirting as a terminal goal in its own right. In another culture, dancing appeared to be the key to sex, and so dancing became a terminal goal in that culture. In other cultures, bodybuilding, accumulation of money, etc. seemed to lead to sex, and so humans became attached to those activities for their own sake, even beyond the extent to which those activities continued to lead to more sex. So really, the way to make these humans happy would be to pay attention to their particular cultures and psychologies and see which activities they have come to develop a positive affective bond with...because THESE activities have become the humans' new conscious terminal goals. So we AI robots should work hard to make it easy for the humans to engage in as much flirting/dancing/bodybuilding/money accumulation/etc. as possible."

Would this be an accurate example of what you are talking about?

So to first note a few things:

  • The model is currently mostly focused on modeling the values of just a single person. I do talk a bit about how to figure out the values of entire societies of people, but that's mostly outside the scope of this paper.
  • Evolution's "original intent" doesn't really matter for its own sake. That sexual attraction has evolved for the purpose of furthering reproduction isn't considered particularly important; what's important is just that humans experience sexual attraction and find it pleasant (or not, as the case may be).

Those things said, your final step does sound reasonably close to the kind of thing I was thinking of. We can look at some particular individual, note that a combination of the surrounding culture and their own sexuality made them try out flirting and dancing and find both rewarding, and then come to value those things for their own sake. And conclude that their ideal future would probably include fair amounts of both.

Though of course there are also all kinds of questions about, for example, exactly how rewarding and enjoyable they find those things. Maybe someone feels positive about the concept of being the kind of person who'd enjoy dance, but isn't actually the kind of person who'd enjoy dance. Resolving that kind of conflict would probably mean either helping them learn to enjoy dance, or helping them give up the ideal of needing to be that kind of person. The correct action would depend on exactly how deeply their reasons for not enjoying dance ran, and on what their other values were.

Also it's possible that upon an examination of the person's psychology, the AI would conclude that while they did enjoy flirting and dance, there were other things that they would enjoy more - either right now, or given enough time and effort. The AI might then work to create a situation where they could focus more on those more rewarding things.

Now with these kinds of questions there is the issue of exactly what kinds of interventions are allowed by the AI. After all, the most effective way of making somebody have maximally rewarding experiences would be to rewire their brain to always receive maximal reward. Here's where I take a page from Paul Christiano's book and his suggestion of approval-directed agents, and propose that the AI is only allowed to do the kinds of things to the human that the human's current values would approve of. So if the human doesn't want to have their brain rewired, but is okay with the AI suggesting new kinds of activities that they might enjoy, then that's what happens. (Of course, "only doing the kinds of things that the human's current values would approve of" is really vague and hand-wavy at this point, and would need to be defined a lot better.)
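The approval constraint described above can be sketched as a simple filter-then-optimize rule. This is a toy illustration of my own (the intervention names, the approval predicate, and the reward numbers are all invented, and it glosses over everything hard about defining "approval"): candidate interventions are first filtered by the human's current values, and only then ranked by expected reward, so a reward-maximizing but disapproved option like brain-rewiring is never chosen.

```python
# Toy version of "only do what the human's current values approve of":
# filter candidates through an approval predicate first, then pick the
# approved option with the highest expected reward.

def choose_intervention(candidates, approves, expected_reward):
    """Return the highest-expected-reward approved intervention,
    or None if the human's current values approve of nothing."""
    approved = [c for c in candidates if approves(c)]
    if not approved:
        return None
    return max(approved, key=expected_reward)

# Hypothetical example: rewiring would maximize reward but is vetoed
# by the human's current values, so a milder suggestion wins.
candidates = ["rewire_brain", "suggest_dance_class", "do_nothing"]
approves = lambda c: c != "rewire_brain"
expected_reward = {"rewire_brain": 100, "suggest_dance_class": 5,
                   "do_nothing": 0}.get
print(choose_intervention(candidates, approves, expected_reward))
# → suggest_dance_class
```

The design choice here is that approval acts as a hard constraint rather than a penalty term, so no amount of expected reward can buy a disapproved intervention.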

Let's assume such an AI could be created perfectly.

Wouldn't there be a danger of freezing human values forever to the values of the society which created it?

Imagine somehow the Victorian people (using steampunk or whatever) managed to build such an AI, and that AI would forever enforce their values. Would you be happy with every single value it enforced?

So in this formulation, human values are explicitly considered to be dynamic, in constant change as people accumulate new experiences and their environment changes. Say that the Victorians invent a steampunk version of the Internet; that's going to give them new kinds of experiences, which will cause changes in their values.

Both individuals and societies also have lots of different value conflicts that they will want to resolve; see e.g. the last three paragraphs of this comment. Resolving those conflicts and helping people find the most rewarding things will naturally change their values.

Now there is still a bit of a risk of value lock-in, in that the AI is postulated to use the society's existing values as the rule that determines what kinds of adjustments to values are acceptable. But I think there's an inevitable tradeoff: we want both to allow for value evolution and to make sure that we don't end up in a future that contains nothing of value (as judged by us current-day humans). Unless we are prepared to just let anything happen (in which case why bother with Friendly AI stuff in the first place?), we need to have our existing values guide some of the development process.

I don't want to speak for the original author, but I imagine that presumably the AI would take into account that the Victorian society's culture was changing based on its interactions with the AI, and that the AI would try to safeguard the new, updated values...until such a time as those new values became obsolete as well.

In other words, it sounds like under this scheme the AI's conception of human values would not be hardcoded. Instead, it would observe our affect to see what sorts of new activities had become terminal goals in their own right, making us intrinsically happy to participate in them, and the AI would adapt to this change in human culture to facilitate those new activities.

That said, I'm still unsure about how one could guarantee that the AI could not hack its own "human affect detector" to make it very easy for itself by forcing smiles on everyone's face under torture and defining torture as the preferred human activity.

I endorse this comment.

That said, I'm still unsure about how one could guarantee that the AI could not hack its own "human affect detector" to make it very easy for itself by forcing smiles on everyone's face under torture and defining torture as the preferred human activity.

That's a valid question, but note that it's asking a different question than the one that this model is addressing. (This model asks "what are human values and what do we want the AI to do with them", your question here is "how can we prevent the AI from wireheading itself in a way that stops it doing the things that we want it to do". "What" versus "how".)
