
Paul has written about an open problem: can we use active learning to get a good predictor, if all we have is a mixture of a good predictor and some possibly bad predictors? I'd like to propose a refinement to the scheme.

A simplified version of Paul's problem is the following. Suppose we have n predictors. These predictors might, for example, predict how a human would answer a given question. One is known to be good; the rest may be arbitrarily evil. We don't know which predictor is good. How do we use this set of predictors to imitate a human almost as well as the good predictor?

Paul proposes using active learning to identify questions on which these predictors disagree, then asking those questions to the human. This allows the system to eliminate some predictors from consideration when they make bad predictions. However, if any of the predictors is evil, it might intentionally disagree with the good predictor on a "weird" question, one that causes bad results when asked to the human. This seems like a serious problem, since, due to simulation warfare, it does not seem unlikely for one of the predictors to actually be evil.

I propose the following refinement. The system should consider a stream of questions coming in and answer each one before seeing the next. It should examine each question and see if the predictors that have scored well in the past disagree significantly on their predictions of the answer to this question. If so, then it should ask the question to the human, output the human's answer, and give each predictor a score based on how good its prediction of the human's answer was. If not, then it can go ahead and output the consensus prediction. Ideally, we could prove that this algorithm will not ask the human very many questions (preferably, not much more than linear in n, the number of predictors). This setting is similar to the setting of online selective sampling, especially the KWIK ("knows what it knows") variant.
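To make the loop concrete, here is a minimal sketch under some illustrative assumptions that are not part of the proposal itself: answers come from a finite set, each predictor maps a question to a probability vector over that set, disagreement is measured by total-variation distance, and predictors are scored by cumulative log-loss with a fixed elimination margin. The names (`make_selective_sampler`, `ask_human`) and thresholds are hypothetical.

```python
import numpy as np

def make_selective_sampler(predictors, ask_human, tv_threshold=0.1, loss_margin=5.0):
    """KWIK-style selective sampling over a stream of questions (sketch).

    Assumptions (not specified in the post): each predictor maps a question
    to a probability vector over a finite answer set, ask_human(question)
    returns the index of the human's answer, and predictors are scored by
    cumulative log-loss.
    """
    scores = {i: 0.0 for i in range(len(predictors))}
    active = set(scores)  # predictors that have scored well so far

    def answer(question):
        preds = {i: np.asarray(predictors[i](question)) for i in active}
        dists = list(preds.values())

        # Disagreement = max total-variation distance among active predictors.
        disagreement = max(0.5 * np.abs(p - q).sum() for p in dists for q in dists)

        if disagreement <= tv_threshold:
            # Near-consensus: output the shared prediction without querying the human.
            return int(np.mean(dists, axis=0).argmax())

        # Controversial question: ask the human, output their answer, and score
        # each active predictor on how well it predicted that answer.
        human_answer = ask_human(question)
        for i in active:
            scores[i] += -np.log(max(preds[i][human_answer], 1e-12))

        # Drop predictors whose cumulative loss is far worse than the best.
        best = min(scores[i] for i in active)
        survivors = {i for i in active if scores[i] <= best + loss_margin}
        active.intersection_update(survivors)
        return human_answer

    return answer
```

Scoring by log-loss and pruning with a fixed margin is just one choice; any proper scoring rule with a sensible elimination rule fits the same interface. Nothing in the sketch establishes the desired bound on the number of human queries; it only fixes the interface we would want such a bound for.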

A single predictor cannot cause the system to ask the human an arbitrary question. It can bide its time and wait for a particular question to come up naturally, but this does not give it much power: a question that would come up naturally is unlikely to be bad to ask to the human, if the rest of the system is set up correctly.

One problem is that the predictors might disagree at an inopportune time, when we would really prefer for the system not to have to "pause" to ask the human a question. Paul discusses this problem as a general failure mode for counterfactual oversight. I think there's a partial solution: try to predict what questions will come up in the future, finding one that the predictors disagree on. This search should not be too powerful: in particular, it should avoid looking at the internals of the predictors, and it should avoid searching through a huge number of possibilities. If the search for the controversial question is not powerful, then the question will have low mutual information with the predictors' internals, preventing the predictors from exposing the human to highly optimized questions. KWIK-style algorithms generally continue to work in this setting.
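One way to picture that weak search, as a sketch: draw a small, fixed number of candidate questions from a model of the upcoming question distribution (which never reads the predictors' internals), and return the first candidate the currently active predictors disagree on. The generator `sample_question`, the candidate count, and the threshold are assumptions for illustration.

```python
import numpy as np

def find_controversial_question(active_predictors, sample_question,
                                n_candidates=100, tv_threshold=0.1):
    """Weak search for a likely-future question the predictors disagree on (sketch).

    sample_question() is assumed to draw from a model of the natural question
    distribution without access to the predictors' internals. Only a small,
    fixed number of candidates is examined, so the returned question carries
    little mutual information with any predictor's internals.
    """
    for _ in range(n_candidates):
        question = sample_question()
        dists = [np.asarray(p(question)) for p in active_predictors]
        disagreement = max(0.5 * np.abs(a - b).sum() for a in dists for b in dists)
        if disagreement > tv_threshold:
            return question   # ask this one to the human ahead of time
    return None               # no controversial question among the candidates
```

A question found this way can then be routed through the same ask-and-score step as above, at a convenient time rather than in the middle of an episode.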

I'm currently working on an algorithm that can handle this application better than other selective sampling algorithms can; see Paul's post for more details on what makes this application special. I previously made the mistake of attacking this problem without doing enough of a literature search to find the selective sampling literature. I think I will have a better chance of making progress on this problem after reading over this literature more.

Comments

The problem with applying this approach to the kinds of schemes I have been considering is that the behavior of the overall system implicitly depends on its answers to a very large number of questions. See the tree-structured case here.

So I'm not sure that we can avoid the problem you describe by selective sampling.

As you've pointed out, the underlying learning problem is also unsolved. So there are really two apparently distinct problems to be resolved.

It does seem bad if the tree is so large that it contains questions that are unsafe to ask a real human. But I'm not sure why you would want to do this. If the question is unsafe to ask a real human, then it seems like most ways of asking the question within the tree structure are also unsafe. Unless you're doing something like processing the answer to a question using a computer program instead of actually showing the answer to a human?

If none of the questions in (a sample of) the tree is unsafe to ask the human, then there's a simple recursive algorithm that will find where predictors disagree in the tree (whenever these disagreements propagate to the root node). You'd probably want to set it up so that the human takes as input a question X, and returns either an answer to X or a pair of questions Y and Z; Y is asked to a second HCH, and (Z ++ answer to Y) is asked to a third HCH to get the answer to X. To find a disagreement, start by looking at the root node X and seeing if the predictors disagree on its answer; if they do, then see whether they disagree on what the root human does; if they don't disagree on what the root human does, then see if they disagree on the answer to Y; and so on.
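A minimal sketch of that recursive search, assuming questions and answers are strings, "++" is string concatenation, and each predictor exposes two hypothetical methods: predict_answer(q), its predicted final answer to q, and predict_human(q), its prediction of the human's direct response, either ("answer", a) or ("split", Y, Z). These method names and representations are illustrative, not part of the comment.

```python
def find_disagreement(predictors, question):
    """Return a question the predictors disagree on at the human level, or None.

    Sketch only: predictors are assumed to expose predict_answer(q) and
    predict_human(q) as described above, and disagreements are assumed to
    propagate to the root of the tree.
    """
    # If the predictors agree on the final answer, there is nothing to ask.
    if len({p.predict_answer(question) for p in predictors}) == 1:
        return None

    # They disagree on the final answer; first check whether they disagree
    # on what the root human does with this question.
    human_steps = [p.predict_human(question) for p in predictors]
    if len(set(human_steps)) > 1:
        return question  # safe to ask: it already comes up inside the tree

    step = human_steps[0]
    if step[0] == "answer":
        # Agreement that the human answers directly would contradict the
        # root-level disagreement, so this branch should be unreachable.
        return None

    # They agree the human splits the question into (Y, Z), so the
    # disagreement must lie in one of the subquestions.
    _, y, z = step
    found = find_disagreement(predictors, y)
    if found is not None:
        return found

    # They agree on the answer to Y, so recurse into (Z ++ answer to Y).
    y_answer = predictors[0].predict_answer(y)
    return find_disagreement(predictors, z + y_answer)
```

Termination relies on the tree being finite and on the disagreement actually propagating to the root, as the comment assumes.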

EDIT: now that I've thought about it more, it seems like your original active learning proposal will work fine with 100,000 questions. I think I mentally replaced 100,000 with a much larger number at some point, and then criticized using active learning with that many questions.

> If the question is unsafe to ask a real human, then it seems like most ways of asking the question within the tree structure are also unsafe.

I agree, and retract my complaint. The hierarchical structure does make the problem more subtle, but doesn't rule out the approach you outlined.

A second potential problem is that I would really like to synthesize potentially problematic data in advance. This can't quite be done using the technique you suggest, though you could imagine somehow forcing the synthesized data to look much like data that will actually appear (and really you want to do something like that anyway). The situation seems tricky and pretty subtle.