Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a very rough intuition pump for possible alternatives to value learning.

In broad strokes, the goal of (ambitious) value learning is to define and implement a notion of cooperation (or helpfulness) in terms of two activities: (1) figuring out what humans value, (2) working to optimize that.

I'm going to try to sketch an alternative notion of cooperation/helpfulness. This intuition is based on libertarian or anarcho-capitalist ideas, but in some ways seems closer to what humans do when they try to help.


I was talking to Andrew Forrester about the suffering golem thought experiment. I'm not sure who came up with this thought experiment, but the idea is:

Suffering Golem: A golem suffers terribly in every moment of its existence, but it says it wants to keep living. Do you kill it?

The idea is that if you think it is good to kill it, you're a hedonic utilitarian: your altruistic motives have to do with maximizing pleasure / minimizing suffering. If you think it should not be killed, then you're more likely to be a preference utilitarian: your altruistic motives have to do with what the person values for themselves, rather than some other thing you think would be good for them. (I tend to lean toward preference utilitarianism myself, but don't think the question is obvious.)

Andrew Forrester was against killing it, but justified his answer with something like "it's none of your business. You could provide convenient means of suicide if you wanted..." This wasn't an expression of not caring about the welfare of the golem. Rather, it was a way of saying that you want to preserve the golem's autonomy.

I realized that although his answer was consistent with preference utilitarianism on the surface, it went beyond it. I think he would likely have a similar response to the following thought experiment:

Confused Golem: A golem hates every moment of its existence, and would prefer to die, but it is unable to admit the fact to itself. It thinks that it loves life and wants to continue living. Perhaps it could eventually realize that it preferred not to exist if it thought about the question for long enough, but that day is a long way off. Do you kill it?

The autonomy-preserving move is to not kill the confused golem. You might talk to the golem about what it wants, but you wouldn't actively optimize for convincing it that it actually wants to die. (That would subtract from its autonomy.)

Informed Consent?

If I imagine something which is only motivated to help, where "help" is interpreted in the autonomy-centric way, the idea seems to be an entity which only acts on your informed consent. It will sit and do nothing until it is confident that there is something which you want it to do and whose consequences you understand.

Imagine you buy a robot which runs on these rules. The robot sits patiently in your house. Sitting there is not a violation of its directive, because it did not place itself there; whatever the consequences may be, they are a result of your autonomous action. However, it does watch you, which may violate informed consent. It has a high prior probability that you understand it is watching and consent to this, because the packaging had prominent warnings about this. Watching you is necessary for the robot to function, since it must infer your consent. It may shut off if it infers that you do not understand this.

The robot will continue doing nothing until it has gained confidence that you have fully read the instruction booklet, which contains the basic facts about how the robot functions. You may then issue commands to the robot. The instruction booklet recommends that, before you try this, you say "I consent to discuss the meaning of my commands with you, the robot." This clears the robot to ask clarifying questions about commands and to tell you about the likely consequences of commands if it does not think you understand them. Without giving consent to this, the robot will often fail to do anything without offering any explanation for its failure.

Another recommended command is "Let's discuss how we can work together." This clears the robot to make general inquiries about how it should behave toward you. Once you issue this command, for the duration of the conversation, the robot will formulate intelligent questions about what you might want, what you like and dislike, where you struggle in your life, and so on. It will make suggestions about how it can help, many invented on the spot. At some point during the conversation it will likely ask if it should maintain this level of candor going forward, or if it should only discuss its tasks in such an open-ended way upon request.

Gentle Value Extraction

What the robot will absolutely not do during this initial interview is pry into your personal life with questions optimized to extract the maximum useful information about your values and life difficulties. Although that might be the most useful thing it could do during its initial interview with you, it would break your autonomy, because many humans are uncomfortable discussing certain topics, and breaking these norms is not a reasonable consequence to expect from the command you've issued. Since humans may not even wish to consent to the robot knowing various personal details (and may accidentally reveal enough information for the robot to figure things out), the robot has to tread lightly in its inferences, too. Even asking directly whether a certain topic is OK may be an unwanted and unexpected act, making it impossible to go there unless the human brings it up on their own initiative.

The robot might not even try to gently move the discussion in the direction of greater openness about private details, because "trying to get the human to open up more" is not an obvious consequence of discussing potential tasks. But this isn't clear-cut; maybe trying to get people to open up is normal enough for a conversation that this is fine. The instruction booklet could warn users about it, making it an expected consequence and therefore part of what is consented to.

Explicit Consent vs Inferred Consent

At this point, someone might be thinking "Why are you talking about the robot inferring that the human consents to certain things as reasonable expectations of giving certain commands? Why give so much leeway? We could just require explicit consent instead."

Explicit consent is so impractical as to border on meaninglessness. We want the robot to have some autonomy in how it executes commands. If it knows we like cream in our coffee, it makes sense for it to just put the cream in without asking every time, or us issuing a general rule. Cream in the coffee is a reasonable expectation. The way I think about it, an explicit consent requirement would force us to approve every motor command precisely; the freedom to intelligently carry out complex tasks in response to commands requires a certain amount of freedom to infer consent.

Another way of thinking about the problem is that explicit consent places dictionary-definition English in too much of a special position. We can convey our meaning in any number of ways. In a sufficiently information-rich context, a glance might be sufficient.

Turning things the other way around, there are also cases when explicit consent doesn't make for inferred consent. If someone is made to consent under duress, consent should not be inferred.

The biggest argument I see in favor of explicit consent is that it makes for a much lower risk of misunderstanding. Misunderstanding is certainly a serious concern, and one reason why humans often require explicit consent in high-stakes situations. However, in the context of consent-based robotics, there are likely better ways of addressing the concern:

  • Requiring higher confidence in inferred consent. This might be modulated by the inferred importance of the question in a situation, so that explicit consent is required in practice for anything of importance, due to the high confidence it establishes. Measuring "importance" in this way creates its own potential safety concerns, of course.
  • Using highly robust machine-learning techniques, so that spuriously high confidence in inferred consent is very unlikely.
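
As a rough illustration of the first bullet, the sketch below scales the required confidence with the inferred importance of an action. Everything in it is hypothetical: the names `required_confidence` and `may_act`, the linear interpolation, and the numbers 0.8 and 0.999 are assumptions made for the example. It also assumes that an importance score in [0, 1] and a well-calibrated consent probability are already available, which is where the hard problems actually live.

```python
def required_confidence(importance: float,
                        base: float = 0.8,
                        ceiling: float = 0.999) -> float:
    """Confidence in inferred consent needed before acting.

    `importance` is a hypothetical score in [0, 1]; values outside that
    range are clamped. The threshold rises linearly from `base` to `ceiling`.
    """
    clamped = min(max(importance, 0.0), 1.0)
    return base + (ceiling - base) * clamped


def may_act(consent_probability: float, importance: float) -> bool:
    """True if inferred consent clears the importance-scaled threshold."""
    return consent_probability >= required_confidence(importance)


# Adding cream to the coffee: low importance, moderate confidence suffices.
print(may_act(consent_probability=0.9, importance=0.05))
# A high-stakes action: the same confidence does not clear the bar.
print(may_act(consent_probability=0.9, importance=0.95))
```

With a high enough ceiling, only something like an explicit statement would push the consent probability over the bar for important actions, which is the effect the first bullet describes.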

What Does Informed Consent Mean?

There's a conceptual problem here which feels very similar to impact measures. An impact measure is supposed to quantify the extent to which an action changes things in general. Informed consent seems to require that we quantify the degree to which a change fits within certain expectations. The notion of "change" seems to be common between the two.

For example, at least according to my intuition, an impact measure should not penalize an AI much for the butterfly-effect changes in weather patterns which any action implies. The future will include hurricanes destroying cities in a broad variety of circumstances, and small actions may create large changes in which hurricanes / which cities. If a particular action foreseeably changes the overall impact of the hurricane/city pattern on other important variables in a significant way, then an impact measure should penalize it.

Similarly, a human can have informed consent as to the consequences of a robot going to the grocery store and buying bananas without understanding all the consequences on future weather patterns, even though this will involve some large changes to which hurricanes destroy which cities at some point later. On the other hand, if the robot walks to the grocery store in just the right way so as to cause a series of severe hurricanes to tip the right dominoes to cause a severe economic collapse which would otherwise not have happened, then this is a significant unexpected consequence of going to the grocery store which the human would need to consent to separately.
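
To make the bananas example concrete, here is a minimal sketch of the filtering intuition. It is only an illustration under strong assumptions: the `Consequence` class, its `foreseeable` and `significant` flags, and the function `needs_separate_consent` are all hypothetical names, and deciding what counts as foreseeable or significant is exactly the unsolved conceptual problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consequence:
    description: str
    foreseeable: bool  # can the robot predict this specific outcome in advance?
    significant: bool  # does it shift variables the human cares about overall?


def needs_separate_consent(consequences, expected):
    """Return consequences that fall outside what the human consented to.

    A consequence is flagged only if it is foreseeable, significant, and
    not among the descriptions the human already expects.
    """
    return [
        c for c in consequences
        if c.foreseeable and c.significant and c.description not in expected
    ]


expected = {"bananas are bought"}

ordinary_trip = [
    Consequence("bananas are bought", True, True),
    Consequence("which hurricane hits which city shifts", False, False),
]
engineered_trip = ordinary_trip + [
    Consequence("economic collapse via steered hurricanes", True, True),
]

print(needs_separate_consent(ordinary_trip, expected))
print([c.description for c in needs_separate_consent(engineered_trip, expected)])
```

In this toy model the ordinary trip flags nothing, because the butterfly-effect reshuffling of hurricanes is neither foreseeable nor significant in aggregate, while the engineered trip flags the collapse as something the human would have to consent to separately.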

Human Rationality Assumptions

The bad news is that this approach seems likely to run into essentially all the same conceptual difficulties as value learning, if not more. Although the conceptual framework is not as strongly tied to VNM-style utility functions as value learning is, the robot still needs to infer what the human believes and wants: belief for the "informed" part, and want for the "consent" part. This still sounds like it is most naturally formulated in a VNM-ish framework, although there may be other ways.

As such, it doesn't seem like it helps any with the difficulties of assuming human rationality.

Helping Animals

My friend mentioned that the suffering golem scenario depends a great deal on whether the golem is sentient. Mercy-killing suffering animals is OK, even good, without any consent. More generally, there are lots of acceptable ways of helping animals which break their autonomy in significant ways, such as taking them to the vet against protests. One might say the same thing of children.

It isn't obvious what makes the difference, but one idea might be: where there is no capacity for informed consent, other principles may apply. But what would this imply for humans? There may be consequences of actions which we lack the capacity to understand. Should the robot simply try to optimize for our preferences on those issues, without constraining acceptable consequences by consent?

How should an autonomy-respecting robot interact with children? Respecting human autonomy absolutely might make it impossible to help with certain household tasks like changing diapers. If so, the approach might not result in very capable agents.

Respecting All Humans

So far, I've focused on a thought experiment of a robot respecting the autonomy of a single designated user. Ultimately, it seems like an approach to alignment has to deal with all humans. However, getting "consent" from all humans seems impossible. How can a consent-based approach approve any actions, then?

One idea is to only require consent from humans who are impacted by an action. Any action which impacts the whole future would require consent from everyone (?), but low-impact actions could be carried out with only consent from those involved.

It's not clear to me how to approach this question.
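
For concreteness, the impacted-parties idea can be written as a simple subset check. This is a hypothetical sketch, not a proposal: the function name `approved` and the string identifiers for people are assumptions, and it takes the set of impacted people as given, even though working out that set is the difficult part.

```python
from typing import AbstractSet


def approved(impacted: AbstractSet[str], consenting: AbstractSet[str]) -> bool:
    """An action is approved only if every impacted person has consented."""
    return impacted <= consenting


everyone = {"alice", "bob", "carol"}

# A low-impact action involving only Alice needs only Alice's consent.
print(approved(impacted={"alice"}, consenting={"alice"}))
# An action that impacts the whole future needs consent from everyone.
print(approved(impacted=everyone, consenting={"alice", "bob"}))
```

Note that an action with no impacted people is vacuously approved under this check, which suggests that the notion of "impacted" carries most of the weight.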

Connections to Other Problems

  • As I mentioned, this seems to connect to impact measures.
  • The agent as described also may be a mild optimizer, because (1) it has to avoid thinking about things when those things are not understood consequences of carrying out commands, (2) plans are constrained by the human probability distribution over plans, somehow (I'm not sure how it works, but there's definitely an aspect of "unexpected plans are not allowed" in play here).
  • There is a connection to transparency, in that impacts of actions have to be described/understood (and approved) in order to be allowed.
  • The agent as I've described it sounds potentially corrigible, in that resistance to shutdown or modification would have to be an understood and approved consequence of a command.


15 comments

Autonomy is a value and can be expressed as a part of a utility function, I think. So ambitious value learning should be able to capture it, so an aligned AI based on ambitious value learning would respect someone's autonomy when they value it themselves. If they don't, why impose it upon them?

(This assumes that we manage to solve the general problems with ambitious value learning. Is the point here that you expect we can't solve those problems and therefore need an alternative? The idea doesn't help with "the difficulties of assuming human rationality" though so what problems does it help with?)

ETA: Is the idea that even trying to do ambitious value learning constitutes violating someone's autonomy (in other words someone could have a preference against having ambitious value learning done on them) and by the time we learn this it would be too late?

Autonomy is a value and can be expressed as a part of a utility function, I think. So ambitious value learning should be able to capture it, so an aligned AI based on ambitious value learning would respect someone's autonomy when they value it themselves. If they don't, why impose it upon them?

One could make a similar argument for corrigibility: ambitious value learning would respect our desire for it to behave corrigibly if we actually wanted that, and if we didn't want that, why impose it?

Corrigibility makes sense as something to ensure in its own right because it is good to have in case the value learning is not doing what it should (or something else is going wrong).

I think respect for autonomy is similarly useful. It helps avoid evil-genie (perverse instantiation) type failures by requiring that we understand what we are asking the AI to do. It helps avoid preference-manipulation problems which value learning approaches might otherwise have, because regardless of how well expected-human-value is optimized by manipulating human preferences, such manipulation usually involves fooling the human, which violates autonomy.

(In cases where humans understand the implications of value manipulation and consent to it, it's much less concerning -- though we still want to make sure the AI isn't prone to pressure humans into that, and think carefully about whether it is really OK.)

Is the point here that you expect we can't solve those problems and therefore need an alternative? The idea doesn't help with "the difficulties of assuming human rationality" though so what problems does it help with?

It's less an alternative in terms of avoiding the things which make value learning hard, and more an alternative in terms of providing a different way to apply the same underlying insights, to make something which is less of a ruthless maximizer at the end.

In other words, it doesn't avoid the central problems of ambitious value learning (such as "what does it mean for irrational beings to have values?"), but it is a different way to try to put those insights together into a safe system. You might add other safety precautions to an ambitious value learner, such as [ambitious value learning + corrigibility + mild optimization + low impact + transparency]. Consent-based systems could be an alternative to that agglomerated approach, either replacing some of the safety measures or making them less difficult to include by providing a different foundation to build on.

Is the idea that even trying to do ambitious value learning constitutes violating someone's autonomy (in other words someone could have a preference against having ambitious value learning done on them) and by the time we learn this it would be too late?

I think there are a couple of ways in which this is true.

  • I mentioned cases where a value-learner might violate privacy in ways humans wouldn't want, because the overall result is positive in terms of the extent to which the AI can optimize human values. This is somewhat bad, but it isn't X-risk bad. It's not my real concern. I pointed it out because I think it is part of the bigger picture; it provides a good example of the kind of optimization a value-learner is likely to engage in, which we don't really want.
  • I think the consent/autonomy idea actually gets close (though maybe not close enough) to something fundamental about safety concerns which follow an "unexpected result of optimizing something reasonable-looking" pattern. As such, it may be better to make it an explicit design feature, rather than trust the system to realize that it should be careful about maintaining human autonomy before it does anything dangerous.
  • It seems plausible that, interacting with humans over time, a system which respects autonomy at a basic level would converge to different overall behavior than a value-learning system which trades autonomy off with other values. If you actually get ambitious value learning really right, this is just bad. But, I don't endorse your "why impose it on them?" argument. Humans could eventually decide to run all-out value-learning optimization (without mild optimization, without low-impact constraints, without hard-coded corrigibility). Preserving human autonomy in the meantime seems

One could make a similar argument for corrigibility: ambitious value learning would respect our desire for it to behave corrigibly if we actually wanted that, and if we didn’t want that, why impose it?

There's a disanalogy in that autonomy is probably a terminal value whereas corrigibility is only an instrumental one. In other words, I don't want a corrigible AI for the sake of having a corrigible AI, I want one so it will help me reach my other goals. I do (probably) want autonomy, and not only because it would help me reach other goals. So in fact ambitious value learning will not learn to behave corrigibly, I think, because the AI will probably think it has a better way of giving me what I ultimately want.

Oh, I think I see a different way of stating your argument that avoids this disanalogy: we're not concerned about autonomy as a terminal value here, but as an instrumental one like corrigibility. If ambitious value learning works perfectly, then it would learn autonomy as a terminal value, but we want to implement autonomy-respecting AI mainly because that would help us get what we want in case ambitious value learning fails to work perfectly.

I think I understand the basic idea and motivation now, and I'll just point out that autonomy-respecting AI seems to share several problems with other non-goal-directed approaches to AI safety.

I like this framework, but it also reminds me of a case of informed consent failure: "This site uses cookies. By continuing to browse the site, you are agreeing to our use of cookies. Find out more" - and other user agreements which nobody reads.

Anyway, making a robot which is able to discern different types of consent is an AI-safety-complete task, so AI safety should be solved before this robot arrives at the home of the user. I explored a similar model in "Dangerous value learners."

Rephrasing a command is a good way to ensure understanding and to establish consent, as in this case: Alice: "I want coffee in bed"; Robot: "Do you want it to be poured in bed?"

This seems like an interesting idea for how to build an AI system in practice, along the same lines as corrigibility. We notice that value learning is not very robust: if you aren't very good at value learning, then you can get very bad behavior, and human values are sufficiently complex that you do need to be very capable in order to be sufficiently good at value learning. With (a particular kind of) corrigibility, we instead set the goal to be to make an AI system that is trying to help us, which seems more achievable even when the AI system is not very capable. Similarly, if we formalize or learn informed consent reasonably well (which seems easier to do since it is not as complex as "human values"), then our AI systems will likely have good behavior (though they will probably not have the best possible behavior, since they are limited by having to respect informed consent).

However, this also feels different from corrigibility, in that it feels more like a limitation put on the AI system, while corrigibility seems more like a property of the AI's "motivational system". This might be fine, since the AI might just not be goal-directed. One other benefit of corrigibility is that if you are "somewhat" corrigible, then you would like to become more corrigible, since that is what the human would prefer; informed-consent-AI doesn't seem to have an analogous benefit.

Preference utilitarianism really falls apart when you can't trust that expressed preferences are true and durable.

And I'd like a little more definition of "autonomy" as a value - how do you operationally detect whether you're infringing on someone's autonomy? Is it just the right to make bad decisions (those which contradict stated goals and beliefs)? Is it related to having a non-public (or non-consistent) utility function?

I wouldn't say that preference utilitarianism "falls apart"; it just becomes much harder to implement.

And I'd like a little more definition of "autonomy" as a value - how do you operationally detect whether you're infringing on someone's autonomy?

My (still very informal) suggestion is that you don't try to measure autonomy directly and optimize for it. Instead, you try to define and operate from informed consent. This (maybe) allows a system to have enough autonomy to perform complex and open-ended tasks, but not so much that you expect perverse instantiations of goals.

My proposed definition of informed consent is "the human wants X and understands the consequences of the AI doing X", where X is something like a probability distribution on plans which the AI might enact. (... that formalization is very rough)

Is it just the right to make bad decisions (those which contradict stated goals and beliefs)?

This is certainly part of respecting an agent's autonomy. I think more generally respecting someone's autonomy means not taking away their freedom, not making decisions on their behalf without having prior permission to do so, and avoiding operating from assumptions about what is good or bad for a person.

The Suffering Golem is no thought experiment. There are actual people who live with great suffering. Some of them wish to die, but some do not. Should you kill someone who is in untreatable pain, against their definitely expressed, compos mentis wishes? Should such an act be legally not murder but justifiable homicide, justified by the amount of suffering thereby prevented?

I say no. What do others say?

What you describe does not want to be a thought experiment, because it doesn't abstract away relevant confounders (moral value of human life). The setup in the post is better at being a thought experiment for the distinctions being discussed (moral value of golem's life more clearly depends on a moral framework). In this context, it's misleading to ask whether something should be done instead of whether it's the action that's hedonistic utilitarian / preference utilitarian / autonomy-preserving.

Leave legality out of it - laws and enforcement are about really generic social behaviors, and are always going to encode a different set of expectations than a nuanced morality. I assert that it's perfectly moral (noble even) to be legally punished for making a correct moral choice.

Also, separate "correct action in face of high uncertainty" from "correct action if you can read the source code / detect and measure the experiences". I bias strongly against killing when there's significant uncertainty about current or future preferences/experiences. I think it's probably right to kill if you can somehow know the remainder of their life is negative value to them.

In fact, I'm not sure that any human in constant (or even frequent) deep pain can be considered compos mentis on this topic. By the time the pain is known to be constant, the reaction to and anticipation of the pain have altered the person's cognitive approach.

That said, I try to remain humble in my demands of others. I won't kill a sentient being for pure altruism[1], and will in fact put barriers in place to suicide so that someone needs to maintain the desire and expend thought and further pain to achieve it. I don't actually judge suicide as wrong, or even as a mistake, but I don't understand the universe or others' experiences well enough to want to make it easy.

Really, human experience is so short already (a century at most, less for most of us), and it's going to end regardless of my or the sufferer's intent. Exactly when it ends is far less important than what I can do to make the remaining time slightly less unpleasant.

[1] Meaning I don't think I'll ever have sufficient evidence that killing them would benefit them more than other actions I can take. There are other utilitarian reasons I might be willing to kill, such as preventing 3^^3 dust specks. That's not what this post is about though - it's altruistic, but not toward the killed victim.

Let me babble some nearby strategies that are explicitly not judged on their wisdom:

Do not what the user wants you to do, but what he expects you to do.

If the animal/user would consent to your help eventually, help it then. If it wouldn't, help it now.

It seems to me both strategies articulated here reflect three values.

  1. All else equal, implementing a helpful act is better than not doing so, and what makes that act good is its outcome rather than meeting some stated desire of a consenting user.
  2. There can be additional value to strategies where consent is obtained first.
  3. Helping sooner is better than helping later when a time delay would not change the outcome of the help or the extent to which the help matched stated desires.

Obviously you were very clear in not explicitly judging the strategies you mentioned, but reading them did make me think about how someone who did find them wise might respond to situations where the net impact of the helpful act decreases between when it could first be implemented and when the receiver would offer explicit consent.

Suppose that, for whatever reason, the helpful act would cease to be beneficial shortly before the user would consent, but up until that moment the net impact would be constant. Can I conceive of a sense in which there would be benefit to waiting until the last possible moment, as close as possible in time to when the user would consent? Or is the notion of receiving consent all or nothing in this sense?

Reasoning about utility functions, i.e. restricting deontological to consequentialist mindspace, seems a misstep, because slightly changing utility functions tends to change alignment a lot, and slightly changing deontological injunctions might not, making it easier for us to hillclimb mindspace.

Perhaps we should have some mathematical discussion of utility-function space, mindspace, its consequentialist subspace, the injection Turing machines -> mindspace, the function mindspace -> alignment, how well that function can be optimized, properties that make for good lemmata about the previous such as continuity, mindspace modulo equal utility functions, etc.

Aaand I've started it. What shape has mindspace?

My gut response is that hillclimbing is itself consequentialist, so this doesn't really help with fragility of value; if you get the hillclimbing direction slightly wrong, you'll still end up somewhere very wrong. On the other hand, Paul's approach rests on something which we could call a deontological approach to the hillclimbing part (IE, amplification steps do not rely on throwing more optimization power at a pre-specified function).

We are doing the hillclimbing, and implementing other object-level strategies does not help. Paul proposes something, we estimate the design's alignment, he tweaks the design to improve it. That's the hill-climbing I mean.
