In this post I first recap how debate can help with value learning and why a standard debater optimizes for convincingness. Then I illustrate how two subsystems could help with value learning in a similar way without optimizing for convincingness. (Of course, this new system could have its own issues, which I don't analyse in depth.)
Debate serves to get a training signal about human values
Debate (for the purpose of AI safety) can be interpreted as a tool to collect training signals about human values. Debate is especially useful when we don’t know our values or their full implications and we can’t just verbalize or demonstrate what we want.
Standard debate set-up
Two debaters are given a problem (related to human values) and each proposes a solution. The debaters then defend their solution (and attack the other’s) via a debate. After having been exposed to a lot of arguments during the debate, a human decides which solution seems better. This judgement serves as the ground truth for the human’s preference (after being informed by the debaters) and can be used as a training signal about what they really want. Through debate we get question-answer pairs which can be used to train a preference predictor.
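As a rough illustration of the training data this set-up produces, here is a minimal Python sketch. The names `PreferenceRecord`, `run_debate_round`, and `Judge` are hypothetical and chosen only for this example; the `judge` callable stands in for the human who has read the debate, and nothing here reflects a specific debate implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for the human judge: reads the question, both
# proposed solutions and the debate transcript, then returns the solution
# they prefer.
Judge = Callable[[str, str, str, List[str]], str]


@dataclass(frozen=True)
class PreferenceRecord:
    """One training example for a preference predictor."""
    question: str
    chosen: str
    rejected: str


def run_debate_round(question: str, solution_a: str, solution_b: str,
                     transcript: List[str], judge: Judge) -> PreferenceRecord:
    """Turn one judged debate into a question-answer training pair."""
    winner = judge(question, solution_a, solution_b, transcript)
    if winner not in (solution_a, solution_b):
        raise ValueError("judge must pick one of the two proposed solutions")
    loser = solution_b if winner == solution_a else solution_a
    return PreferenceRecord(question, chosen=winner, rejected=loser)
```

The point of the sketch is that the only thing recorded is the judge's final choice, which is why each debater is pushed towards whatever makes that choice go its way.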
A debater optimizes for convincingness
An agent will optimize for the goal it is given. In the case of a debater, the agent is optimizing to be judged positively by a human. This means that the main incentive of a debater is to maximise convincingness (potentially by using misleading arguments).
Encouraging the debater to be truthful
The current debate proposal aims to shift this incentive towards truthfulness by empowering both debaters to expose deception in the other debater’s arguments (through cross-examination). (Other methods of disincentivizing deception exist or may exist in the future as well.) I think this disincentivizing may be like a bandaid, when what we want is to prevent the wound in the first place.
A simplistic model of what a debater optimizes for
A debater first comes up with a potential solution. After it has committed to a solution, it tries to convince the judge (through debate) that this is the solution that the judge values most highly. Initially, the AI may reason about what the human’s preference is, but then it tries to persuade the human that the AI’s guess is in fact the human’s preference. In reality, these two steps are intertwined: a debater tries to find the solution that will be easiest to convince the human of.
What we want from a debater
What we want the system to do is: “Allow us to state our preferences in domains where we previously couldn’t”. We won’t know how to find a desirable solution to some problems, and sometimes we won’t even know how to choose between two given solutions. In these cases, we would like an AI to help us find desirable solutions and help us discover our preferences.
An alternative system that does not aim to convince
A debater helps us discover our preferences by proposing a solution and then helping us reason about the solution (via a debate). However, these functions need not be performed by one and the same agent. Instead, we could train a different agent for each of these components:
- an option generator, which gives us proposals for potential solutions (in domains in which we can assess them); and
- an epistemic helper, which helps us to deliberate about options (expands the domains in which we can assess options).
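To make the split concrete, the two roles can be sketched as separate interfaces that are composed around the human. This is only a sketch under my own assumptions: the type aliases `OptionGenerator`, `EpistemicHelper`, and `Human`, and the function `elicit_choice`, are hypothetical names, not part of any existing system.

```python
from typing import Callable, List

# All three roles are hypothetical interfaces, not an existing API.
OptionGenerator = Callable[[str], List[str]]              # problem -> proposals
EpistemicHelper = Callable[[str, List[str]], List[str]]   # problem, proposals -> considerations
Human = Callable[[str, List[str], List[str]], str]        # problem, proposals, considerations -> choice


def elicit_choice(problem: str, generate: OptionGenerator,
                  assist: EpistemicHelper, human: Human) -> str:
    """Compose the two subsystems; neither agent argues for a particular proposal."""
    proposals = generate(problem)                 # the generator never talks to the human
    considerations = assist(problem, proposals)   # the helper only informs deliberation
    return human(problem, proposals, considerations)
```

In this arrangement the agent that proposes solutions has no channel to the human, and the agent that talks to the human has no stake in which proposal wins.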
We may not always need both subsystems
In some cases, an epistemic helper could be so effective that it empowers us to generate good solutions ourselves. In such cases, we don’t need the option generator. In other cases, we may be capable of assessing different options right away (even though we can’t generate them ourselves). In these cases, we don’t need the epistemic helper.
An option generator tries to come up with solutions that we would pick. To some extent, this is a preference predictor. The agent receives a problem and predicts what solution the human would propose or pick. This agent is trained by receiving many questions (about values) answered by a human. This agent should not interact with the human whose values it’s predicting and ideally has no means of influencing their values.
The second agent would help us reason about our values. We need this agent to empower us to answer questions that we by ourselves don’t know the answers to. It should be rewarded for how much it helped us reason.
[I don’t have a good proposal for how to build an epistemic helper, but will explain what I mean by giving some examples.]
An epistemic helper similar to a debater
Consider a debate in which each debater has to defend a potential solution proposed by the value predictor, but instead of being judged on how convincing their solution was, they are judged on how insightful (illuminating and helpful) their utterances seemed. This would lead to different behaviour than optimizing for convincingness. For example, consider a debate about going on a holiday to Bali or Alaska. Suppose the human does not have a passport to go to Bali, but could get an expedited passport. In the standard debate set-up, the advocate for Bali will not bring up the lack of passport because the debater knows there is a counter-argument. However, it could be useful for the human to know that they should get an expedited passport if they want to go to Bali. Unfortunately, by default, an agent that is rewarded for insightfulness would optimize for eliciting the emotion of feeling helped in the judge, rather than for helping the judge.
Epistemic helpers dissimilar to debaters
Epistemic helpers could, of course, take many other forms as well. Off the top of my head, an epistemic helper could look like an agent that: behaves like a therapist and mostly guides the human to introspect better; acts like a teacher; or produces visualizations of information such as Bayesian networks or potential trajectories; etc.
Rewarding epistemic helpers
Examples of how we could reward epistemic helpers:
- Judged based on the human experience, i.e. the judge feels like they understand the world better. (This is similar to how debate is currently judged.)
- Based on whether the human became more capable. We could:
    - Test the human on how well they predict events (in the real world or in simulations)
    - Test how well people predict personal evaluations of happiness
        - We could test how well people can predict their own evaluations of happiness. However, this could lead to self-fulfilling prophecies.
        - Predict other people’s evaluation.
    - Test people on how good they are at finding solutions to problems. (Often, even if a solution is difficult to find, given a potential solution it is easy to test whether it is correct or not.)
    - Also see “Understanding comes from passing exams”
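As one possible way to make the "predict events" option measurable, the helper's reward could be the improvement in the human's forecasting accuracy. The sketch below uses the Brier score (mean squared error between stated probabilities and 0/1 outcomes), which is my choice for illustration rather than something the proposal specifies. The function names are hypothetical, and it assumes the "before" and "after" forecasts are made on the same or comparable events.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("need equally many forecasts and outcomes, and at least one")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def helper_reward(before: Sequence[float], after: Sequence[float],
                  outcomes: Sequence[int]) -> float:
    """Positive when the human forecasts better after working with the helper."""
    return brier_score(before, outcomes) - brier_score(after, outcomes)
```

A reward like this depends on how events turn out rather than on how helped the human feels, which is the property the capability-based options are after.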
A potential downside of using an epistemic helper is that we could get paralyzed by considerations, when we actually want a “one-handed economist”.
[I was inspired to write about splitting up debate into subsystems by a discussion between Joe Collman and Abram Demski during Web-TAISU. After writing I noticed a more refined idea is explained here.]
Thanks to Vojta and Misha for helpful comments and interesting discussions.