In this post I will first recap how debate can help with value learning, and why a standard debater optimizes for convincingness. Then I will illustrate how two separate subsystems could help with value learning in a similar way, without optimizing for convincingness. (Of course, this new system could have issues of its own, which I don't analyse in depth.)

Debate serves to get a training signal about human values

Debate (for the purpose of AI safety) can be interpreted as a tool to collect training signals about human values. Debate is especially useful when we don’t know our values or their full implications and we can’t just verbalize or demonstrate what we want.

Standard debate set-up

Two debaters are given a problem (related to human values) and each proposes a solution. The debaters then defend their solution (and attack the other’s) via a debate. After having been exposed to a lot of arguments during the debate, a human decides which solution seems better. This judgement serves as the ground truth for the human’s preference (after being informed by the debaters) and can be used as a training signal about what they really want. Through debate we get question-answer pairs which can be used to train a preference predictor.
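To make the training-signal idea concrete, here is a minimal sketch (my own framing, not from any debate paper) of how one judged debate could be turned into labeled examples for a preference predictor; all names are hypothetical:

```python
# Toy sketch: each finished debate yields labeled examples of the
# judge's (informed) preference, usable to train a preference predictor.

def debate_to_training_pairs(question, answer_a, answer_b, judge_choice):
    """Turn one judged debate into (question, answer, label) records.

    judge_choice is "a" or "b": the solution the human preferred
    after hearing the arguments.
    """
    chosen = answer_a if judge_choice == "a" else answer_b
    rejected = answer_b if judge_choice == "a" else answer_a
    # The chosen answer is treated as ground truth for the human's
    # preference; both records can supervise a preference predictor.
    return [
        {"question": question, "answer": chosen, "preferred": True},
        {"question": question, "answer": rejected, "preferred": False},
    ]

pairs = debate_to_training_pairs(
    "Where should I go on holiday?", "Bali", "Alaska", judge_choice="b"
)
```

A real system would aggregate many such records across debates and judges; the sketch only shows the shape of the signal.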

A debater optimizes for convincingness

An agent will optimize for the goal it is given. In the case of a debater, the agent is optimizing to be judged positively by a human. This means that the main incentive of a debater is to maximise convincingness (potentially by using misleading arguments).

Encouraging the debater to be truthful

The current debate proposal aims to shift this incentive towards truthfulness by empowering both debaters to expose deception in the other debater's arguments (through cross-examination). (Other methods of disincentivizing deception exist or may exist in the future as well.) I think this kind of disincentivizing may be like a band-aid, when what we want is to prevent the wound in the first place.

A simplistic model of what a debater optimizes for

A debater first comes up with a potential solution. After it has committed to a solution, it tries to convince the judge (through debate) that this is the solution the judge values most highly. Initially, the AI may reason about what the human's preference is, but then it tries to persuade the human that the AI's guess is in fact the human's preference. In reality, these two steps are intertwined: a debater tries to find a solution that will be easiest to convince the human of.

What we want from a debater

What we want the system to do is: “Allow us to state our preferences in domains where we previously couldn’t”. We won’t know how to find a desirable solution to some problems, and sometimes we won’t even know how to choose between two given solutions. In these cases, we would like an AI to help us find desirable solutions and help us discover our preferences.

An alternative system that does not aim to convince

A debater helps us discover our preferences by proposing a solution and then helping us reason about the solution (via a debate). However, these functions need not be performed by one and the same agent. Instead, we could train a different agent for each of these components:

  1. an option generator, which gives us proposals for potential solutions (in domains in which we can assess them); and
  2. an epistemic helper, which helps us to deliberate about options (expands the domains in which we can assess options).

We may not always need both subsystems

In some cases, an epistemic helper could be so effective that it empowers us to generate good solutions ourselves. In such cases, we don’t need the option generator. In other cases, we may be capable of assessing different options right away (even though we can’t generate them ourselves). In these cases, we don’t need the epistemic helper.

Option generator

An option generator tries to come up with solutions that we would pick. To some extent, this is a preference predictor. The agent receives a problem and predicts what solution the human would propose or pick. This agent is trained by receiving many questions (about values) answered by a human. This agent should not interact with the human whose values it’s predicting and ideally has no means of influencing their values.
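As a hedged illustration of the option generator as a preference predictor, here is a toy stand-in (entirely hypothetical, not a proposed implementation) that "trains" on past human-answered problems and proposes a solution for a new one; a trivial word-overlap lookup stands in for a learned model:

```python
# Hypothetical option generator: predict the solution a human would
# pick for a new problem, given past (problem, chosen_solution) data.
# Note: training data comes from humans it never interacts with.

def train_option_generator(answered_questions):
    """answered_questions: list of (problem, human_chosen_solution)."""
    memory = dict(answered_questions)

    def propose(problem):
        # Pick the stored solution whose problem shares the most words
        # with the new problem (a stand-in for real generalization).
        def overlap(past_problem):
            return len(set(past_problem.split()) & set(problem.split()))
        best_match = max(memory, key=overlap)
        return memory[best_match]

    return propose

generator = train_option_generator([
    ("pick a holiday destination", "Alaska"),
    ("pick a birthday gift", "book"),
])
```

The point of the sketch is the interface: problems in, predicted human-preferred solutions out, with no channel back to the human whose values are being predicted.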

Epistemic helper

The second agent would help us reason about our values. We need this agent to empower us to answer questions that we by ourselves don’t know the answers to. It should be rewarded for how much it helped us reason.

[I don’t have a good proposal for how to build an epistemic helper, but will explain what I mean by giving some examples.]

An epistemic helper similar to a debater

Consider a debate in which each debater has to defend a potential solution proposed by the option generator, but instead of being judged on how convincing their solution was, they are judged on how insightful (illuminating and helpful) their utterances seemed. This would lead to different behaviour than optimizing for convincingness. For example, consider a debate about going on a holiday to Bali or Alaska. Suppose the human does not have a passport to go to Bali, but could get an expedited passport. In the standard debate set-up, the advocate for Bali will not bring up the lack of a passport, because the debater knows there is a counter-argument. However, it could be useful for the human to know that they should get an expedited passport if they want to go to Bali. Unfortunately, by default, an agent that is rewarded for insightfulness would optimize for eliciting the feeling of being helped in the judge, rather than for actually helping the judge.

Epistemic helpers dissimilar to debaters

Epistemic helpers could, of course, take many other forms as well. Off the top of my head, an epistemic helper could look like an agent that: behaves like a therapist and mostly guides the human to introspect better; acts like a teacher; or produces visualizations of information, such as Bayesian networks or potential trajectories; etc.

Rewarding epistemic helpers

Examples of how we could reward epistemic helpers:

  • Judged based on the human experience, i.e. the judge feels like they understand the world better. (This is similar to how debate is currently judged.)
  • Based on whether the human became more capable. We could:
    • Test the human on how well they predict events (in the real world or in simulations).
    • Test how well people predict evaluations of happiness:
      • Their own evaluations (though this could lead to self-fulfilling prophecies).
      • Other people's evaluations.
    • Test people on how good they are at finding solutions to problems. (Often, even if a solution is difficult to find, it is easy to check whether a given potential solution is correct.)
    • Also see "Understanding comes from passing exams".
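The capability-based rewards above can be sketched in a few lines. This is a toy illustration under my own assumptions (held-out test events, binary predictions), not a concrete proposal: the helper is scored on how much the human's prediction accuracy improves, rather than on how helped the human feels.

```python
# Sketch: reward an epistemic helper by the human's capability gain,
# measured as improved prediction accuracy on held-out test events.

def accuracy(predictions, outcomes):
    correct = sum(p == o for p, o in zip(predictions, outcomes))
    return correct / len(outcomes)

def helper_reward(pre_predictions, post_predictions, outcomes):
    """Reward = accuracy gain from before to after the interaction."""
    return accuracy(post_predictions, outcomes) - accuracy(pre_predictions, outcomes)

outcomes = [1, 0, 1, 1]                       # what actually happened
reward = helper_reward(
    pre_predictions=[0, 0, 1, 0],             # human's guesses before help
    post_predictions=[1, 0, 1, 1],            # human's guesses after help
    outcomes=outcomes,
)
```

Note that this measure inherits the problem raised in the comments below: a sufficiently powerful helper could also raise this score by steering the world towards easily predicted events.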

A potential downside of using an epistemic helper is that we could get paralyzed by considerations, when we actually want a "one-handed economist".

[I was inspired to write about splitting up debate into subsystems by a discussion between Joe Collman and Abram Demski during Web-TAISU. After writing I noticed a more refined idea is explained here.]

Thanks to Vojta and Misha for helpful comments and interesting discussions.


5 comments

The idea of an epistemic helper seems really interesting. The obvious problem is that now it's incentivized to manipulate what predictions get made, and how those predictions turn out in the world.

Generally speaking, epistemic helpers / oracles seem dangerous unless they're not agents. Here I mean that an "agent" chooses actions by planning ahead and picking actions on the basis of their predicted consequences, or is trained using an objective that's a function of the consequences of its actions. A "non-agent" in this sense just has to pick its actions based on something other than their predicted consequences, or be trained on an objective that's not a function of the consequences.

I agree that if you score an oracle based on how accurate it is, then it is incentivized to steer the world towards states where easy questions get asked.

I think that in these considerations it matters how powerful we assume the agent to be. You made me realize that my post could have been more interesting had I specified the scope and application area of the proposed approach in more detail. In many cases, making the world more predictable may be much more difficult for the agent than causing the human to predict the world better. In the short term, I think deploying an agentic oracle could be safe.

What about the info helper by itself? Is it much more useful in this context than it would be alone? Rewarding based on human prediction skill seems the best to me. I think Bostrom might have mentioned this problem (educating someone on a topic) somewhere.

I think Bostrom might have mentioned this problem (educating someone on a topic) somewhere.

Cool! I'm not familiar with it.

In the case that the epistemic helper can teach us enough for us to come up with solutions ourselves, the info helper is just as useful by itself.

However, sometimes even if we get educated about a domain or problem, we may not be creative enough to propose good solutions ourselves. In such cases, we would need an agent to propose options to us. It would be good if the agent trained to come up with solutions we approve of is not the same agent that explains to us why we should or should not approve of a solution (because if it were, it would have an incentive to convince us).