Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary: in approximating a scheme like HCH, we would like some notion of "the best the prediction can be given available AI capabilities". There's a natural notion of "the best prediction of a human we should expect to get". In general this doesn't yield good predictions of HCH, but it does yield an HCH-like computation model that seems useful.


(thanks to Ryan Carey, Paul Christiano, and some people at the November veteran's workshop for helping me work through these ideas)

Suppose we would like an AI system to predict what HCH would do. The AI system is limited; it doesn't have a perfect prediction of a human. What's the best we should expect it to do?

As a simpler sub-question, we can ask what the best prediction for a single query to a human is. Let $H$ be the "true human": a stochastic function mapping a question to a distribution over answers (say, over quantum uncertainty). How good a prediction function $\hat{H}$ should we expect to get?

The short answer is that we should expect that, for any question $x$, the prediction $\hat{H}(x)$ should be within $\epsilon$ of some pretty good prediction of the true human's answer $H(x)$.

Why within $\epsilon$?

(feel free to skip this section if you're willing to buy the previous paragraph)

We will create an online prediction system that on each iteration $t$ takes in a question $x_t$ and outputs either a distribution over answers $\hat{H}_t(x_t)$, or $\bot$ to indicate ambiguity. If outputting $\bot$, the prediction system observes the human's actual answer $H(x_t)$. We will construct this online prediction system from a bunch of untrusted experts $P_1, \ldots, P_n$, each of whom is a probability distribution over the human $H$.

Suppose one expert $P_i$ is "correct" in that in fact $H \sim P_i$ for some $i$. Then KWIK learning will succeed in creating an online prediction system such that, with high probability, for each iteration $t$ in which a distribution (and not the ambiguity symbol $\bot$) is output, the prediction is within $\epsilon$ of the "correct prediction" that $P_i$ makes, measured by total variation distance. Furthermore, $\bot$ must be output only a bounded number of times; this bound measures the amount of training data required.
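A minimal Python sketch of the loop described above, under simplifying assumptions that are mine rather than the post's: each expert is a function from a question to an independent distribution over answers (rather than a full joint distribution over the human), and an expert counts as "credible" while its log-likelihood is within a fixed margin of the best expert so far. The names `KwikStyleLearner`, `log_odds_cutoff`, and `BOTTOM` are hypothetical. The likelihood cutoff is only a stand-in for a real KWIK elimination rule and does not carry its formal bound on how often ambiguity is reported.

```python
import math

BOTTOM = None  # stands for the "ambiguous, please show me the answer" output


def tv_distance(p, q):
    """Total variation distance between two {answer: probability} dicts."""
    answers = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in answers)


class KwikStyleLearner:
    def __init__(self, experts, epsilon, log_odds_cutoff=5.0):
        self.experts = experts                  # callables: question -> {answer: prob}
        self.epsilon = epsilon                  # allowed disagreement in TV distance
        self.log_odds_cutoff = log_odds_cutoff  # assumed credibility margin
        self.log_lik = [0.0] * len(experts)     # running log-likelihood per expert

    def credible(self):
        """Indices of experts whose log-likelihood is close to the best one."""
        best = max(self.log_lik)
        return [i for i, ll in enumerate(self.log_lik)
                if ll >= best - self.log_odds_cutoff]

    def predict(self, question):
        """Return a distribution if credible experts agree, else BOTTOM."""
        preds = [self.experts[i](question) for i in self.credible()]
        for p in preds:
            for q in preds:
                if tv_distance(p, q) > self.epsilon:
                    return BOTTOM
        # All credible experts are within epsilon of each other, so any one of
        # them is within epsilon of the correct expert if it is still credible.
        return preds[0]

    def observe(self, question, answer):
        """After BOTTOM, score every expert on the human's observed answer."""
        for i, expert in enumerate(self.experts):
            prob = expert(question).get(answer, 0.0)
            self.log_lik[i] += math.log(prob) if prob > 0 else float("-inf")


# Usage: two experts that disagree about one question.
learner = KwikStyleLearner(
    [lambda q: {0: 0.9, 1: 0.1}, lambda q: {0: 0.1, 1: 0.9}],
    epsilon=0.1,
)
first = learner.predict("q")   # BOTTOM: the experts disagree by more than epsilon
for _ in range(3):
    learner.observe("q", 0)    # three observed answers favour the first expert
later = learner.predict("q")   # the first expert's distribution
```

The pairwise agreement check is what makes the output safe to trust: a distribution is only returned when every still-credible expert would have said nearly the same thing.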

For the rest of this post we will assume that, after setting up the KWIK learner, we do active learning (finding inputs on which the learner outputs the ambiguity symbol $\bot$) until the KWIK learner no longer outputs $\bot$, then getting $\hat{H}$ using the current state of the KWIK learner. If we didn't do this, there would be no concrete stochastic function $\hat{H}$ because the state of the learner would keep changing over time.

The assumptions in this section (especially that one expert is correct) are pretty sketchy, but I expect the basic picture of "predictions should be good within $\epsilon$" to work out.

Predicting collections of humans is hard

Now that we have an approximate prediction of a human, we can use this to approximate a collection of humans. For example, we might want to predict $H(\text{"a"}) + H(\text{"b"})$, i.e. the result of asking the questions "a" and "b" and summing the answers. In general we can consider any function $f$ which computes something by querying some stochastic function $g$ a bunch of times (written $f^g$), and consider the problem of predicting $f^H$.

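The obvious way to predict $f^H$ is $f^{\hat{H}}$, the same computation with every call to the human $H$ replaced by a call to the prediction $\hat{H}$; in this case, $\hat{H}(\text{"a"}) + \hat{H}(\text{"b"})$. But this can be highly inaccurate even if $\hat{H}$ is accurate!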

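Let us say that either $H(\text{"a"}) = 0$ and $H(\text{"b"}) = 1$, or $H(\text{"a"}) = 1$ and $H(\text{"b"}) = 0$. The AI does not have enough information to distinguish these possibilities; under this uncertainty, it is reasonable to think they are equally likely, so we have $\hat{H}(\text{"a"}) = \hat{H}(\text{"b"}) = \mathrm{Uniform}(\{0, 1\})$.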

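The AI has enough information to conclude that $H(\text{"a"}) + H(\text{"b"}) = 1$. But the distribution $\hat{H}(\text{"a"}) + \hat{H}(\text{"b"})$, with the two answers sampled independently, will put 0.25 probability mass on 0, 0.5 on 1, and 0.25 on 2.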
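The arithmetic in this example can be checked directly. The sketch below enumerates the two equally likely worlds for the questions "a" and "b", then compares the distribution of the sum under the correlated answers with the distribution obtained by sampling each answer independently from its marginal. The function and variable names are illustrative only.

```python
from collections import defaultdict
from itertools import product


def sum_distribution(joint):
    """Distribution of a + b, given a joint distribution {(a, b): probability}."""
    out = defaultdict(float)
    for (a, b), p in joint.items():
        out[a + b] += p
    return dict(out)


# What the AI believes about the true human: the two answers always sum to 1.
correlated = {(0, 1): 0.5, (1, 0): 0.5}

# The per-question prediction: each answer is 0 or 1 with equal probability.
marginal = {0: 0.5, 1: 0.5}

# Querying the per-question prediction twice treats the answers as independent.
independent = {
    (a, b): marginal[a] * marginal[b] for a, b in product(marginal, marginal)
}

true_sum = sum_distribution(correlated)    # {1: 1.0}
naive_sum = sum_distribution(independent)  # {0: 0.25, 1: 0.5, 2: 0.25}
```

Both joint distributions have identical marginals for each question, which is why a predictor that is accurate question by question can still be wrong about the sum.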

In general we shouldn't expect replacing the human $H$ with the prediction $\hat{H}$ to work very well; it does not take into account any correlation between $H(x)$ and $H(y)$ for $x \neq y$.

Can we do better by taking into account the fact that the experts described in the previous section have a full joint distribution over all the $H(x)$ values? Not really; the good expert cannot be distinguished from the prediction $\hat{H}$ by more than $\epsilon$ with a single experiment (i.e. a single question-answering round). So if some other expert predicts that every answer is drawn independently from $\hat{H}$, this expert will not be proven wrong in a single experiment.

Now, the expert might be proven wrong if we expand the notion of "experiment" to include asking multiple questions. But this becomes more expensive, and actually doesn't help much when we're considering pretty big collections of $H$ calls (where correlations over a greater number of $H(x)$ values are important).
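To make the single-experiment point concrete, here is a small sketch using the earlier two-question example. It compares a "good" expert, whose answers to the two questions are anti-correlated, with a hypothetical expert that predicts the same marginals but treats the answers as independent. The names are illustrative; the only claim is about this toy case.

```python
def tv_distance(p, q):
    """Total variation distance between two {outcome: probability} dicts."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)


def marginal(joint, index):
    """Marginal distribution of one question's answer from a joint over pairs."""
    out = {}
    for answers, prob in joint.items():
        out[answers[index]] = out.get(answers[index], 0.0) + prob
    return out


# Good expert: the two answers always sum to 1.
good_expert = {(0, 1): 0.5, (1, 0): 0.5}

# Hypothetical rival: same per-question marginals, but independent answers.
independent_expert = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}

# Gap visible to an experiment that asks only one question.
single_question_gaps = [
    tv_distance(marginal(good_expert, i), marginal(independent_expert, i))
    for i in (0, 1)
]  # [0.0, 0.0]

# Gap visible to an experiment that asks both questions together.
joint_gap = tv_distance(good_expert, independent_expert)  # 0.5
```

Here a two-question experiment does separate the experts, but only because the relevant correlation spans two answers. Detecting a correlation spread across many answers would need an experiment that asks correspondingly many questions at once.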

I've basically given up on the problem of predicting large computations made out of calls to $H$ using untrusted expert advice. There's just too much room for an expert to mess with the predictions by changing the correlation structure in ways that are hard to detect with short experiments.

$f^{\hat{H}}$ is still useful

Despite the fact that predicting large computations made of calls to $H$ seems intractable, large computations made of calls to $\hat{H}$ are useful in their own right. We can think of $f^{\hat{H}}$ as a collaboration among many "clones" of a single human, each of whom has a personality sampled from the AI's distribution over that human's personality traits. That is, each call to $\hat{H}$ is considered to be asking a question to an independent sampling of the human's psychological parameters (sampled from the AI's information state).

For example, if the AI does not know Bob's favorite color, then $\hat{H}$'s answer to a question about his favorite color will be stochastic. If we consider the computation $f^g$ which asks for $g$'s favorite color twice and checks if the answers are equal, then $f^{\hat{H}}$ will return false a non-negligible percentage of the time.
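As a worked instance of the favorite-color example, the sketch below computes how often two independently sampled clones give different answers, given the AI's belief about Bob's favorite color. The belief values are made up for illustration and the function name is hypothetical.

```python
def prob_answers_differ(color_belief):
    """Probability that two independent draws from color_belief disagree.

    color_belief maps each color to the probability the AI assigns to it.
    Two draws agree on a given color with probability p * p, so they
    disagree with probability 1 - sum(p * p).
    """
    return 1.0 - sum(p * p for p in color_belief.values())


# Hypothetical belief: the AI is unsure about Bob's favorite color.
uncertain_belief = {"red": 0.5, "blue": 0.3, "green": 0.2}
differ_when_uncertain = prob_answers_differ(uncertain_belief)  # about 0.62

# If the AI knew the answer, the two clones would never disagree.
certain_belief = {"red": 1.0}
differ_when_certain = prob_answers_differ(certain_belief)  # 0.0
```

The real Bob would give the same answer both times, so the equality check is one of the places where the clone computation visibly departs from the true human.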

If we define $f$ so that it asks its argument how to spawn more copies of itself, and so on recursively, then $f^{\hat{H}}$ is the equivalent of HCH for clones sampled from the AI's information state. (See also the notation for HCH variants in this post.) The issue with psychological parameters is pretty weird but doesn't seem to present serious difficulties for most uses of HCH I can think of. I haven't thought about it a ton, but in general it seems like it should be possible to collaborate with clones of yourself that have slightly different psychological parameters (they'll only be slightly different if the AI knows a lot about you). I confirmed with Paul Christiano that he is optimistic about $f^{\hat{H}}$ being useful and pessimistic about predictions of HCH proper that take correlation into account.

When considering very large computations $f^{\hat{H}}$, we might be concerned that local errors could propagate throughout the computation. But it's possible to mitigate this by doing something like taking multiple samples of $\hat{H}(x)$ for some question $x$ and taking a majority vote, as described in this post.
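A rough illustration of why a majority vote helps, under assumptions that are mine and not taken from the linked post: the question has a yes/no answer, each sample is correct independently with the same probability, and an odd number of samples is used so that ties cannot occur. The function name is hypothetical.

```python
from math import comb


def majority_correct_probability(p_correct, n_samples):
    """Probability that a strict majority of n independent samples is correct.

    Assumes a binary question, independent samples, and an odd n_samples.
    """
    if n_samples % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    needed = n_samples // 2 + 1  # smallest number of correct samples that wins
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


one_sample = majority_correct_probability(0.9, 1)     # about 0.9
three_samples = majority_correct_probability(0.9, 3)  # about 0.972
five_samples = majority_correct_probability(0.9, 5)   # about 0.99144
```

The improvement depends on the independence assumption and on each sample being right more often than wrong; correlated errors across samples would not be voted away like this.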

A note on not overestimating probabilities

(feel free to skip this section)

Paul Christiano told me about an idea to get our predictions to not overestimate the probability of any action by more than a fixed multiplicative factor.

Roughly, this can be done by taking the minimum probability of each answer according to all the credible experts, then renormalizing. This seems useful if we're concerned about predicting rare bad things that the correct expert wouldn't predict. It doesn't change the nature of the analysis much, though.
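The min-then-renormalize step can be sketched as follows. This is only an illustration of that one step, with hypothetical names and made-up numbers; how the set of credible experts is chosen is left outside the sketch. In this version, the combined prediction overestimates any credible expert's probability by at most the renormalization factor, since each unnormalized value is no larger than that expert's own probability.

```python
def conservative_prediction(expert_predictions):
    """Combine credible experts by per-answer minimum, then renormalize.

    expert_predictions is a list of {answer: probability} dicts.
    Returns (prediction, factor), where factor is 1 / (sum of the minima).
    """
    answers = set().union(*expert_predictions)
    floor = {
        a: min(p.get(a, 0.0) for p in expert_predictions) for a in answers
    }
    total = sum(floor.values())
    if total == 0:
        raise ValueError("credible experts share no common support")
    prediction = {a: v / total for a, v in floor.items()}
    return prediction, 1.0 / total


# Hypothetical credible experts that disagree about a rare bad action.
experts = [
    {"safe": 0.90, "bad": 0.10},
    {"safe": 0.99, "bad": 0.01},
]
prediction, factor = conservative_prediction(experts)
# prediction["bad"] is about 0.011 and factor is about 1.099
```

The rare "bad" answer ends up with roughly the probability given by the expert who thinks it least likely, rather than an average over experts.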

2 comments

Note: I currently think that the basic picture of getting within $\epsilon$ of a good prediction is actually pretty sketchy. I wrote about the sample complexity here. In addition to the sample complexity issues, the requirement is for predictors to be Bayes-optimal, but Bayes-optimality is not possible for bounded reasoners. This is important because e.g. some adversarial predictor might make very good predictions on some subset of questions (because it's spending its compute on those specifically), causing other predictors to be filtered out (if those questions are used to determine who the best predictor is). I don't know what kind of analysis could get the $\epsilon$-accuracy result at this point.

Hasn't this been shown in complex search engine algorithms to be useful for common information (as if you could predict what a person would say; what makes it right or wrong anyway?), but to fail on complex answers?

I.e., you can predict the topic of discussion, but the other person is, or will be, annoyed by the "obvious answer" to their more complex question.