Maria Shakhova - LessWrong

Submission: counterfactual oracle

Suppose we have a question that requires an n-bit text answer, and we have a way to check whether an answer is correct. We cannot ask the Oracle directly, because we are afraid that some n-bit texts would make us release the Oracle from the box, and the Oracle would give one of them as its answer.

Let's try to use the counterfactual oracle to solve this problem. In the counterfactual setup we generate a random n-bit text and check whether it is the answer to our question. The Oracle predicts the text we will generate. If the generated text is a correct answer, we compare it with the prediction. If they match, the Oracle receives a reward point. If the generated text is not a correct answer, or the texts do not match, the Oracle is not rewarded.
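The reward rule of the counterfactual setup can be written down in a few lines. This is only a minimal sketch: the function name `oracle_reward` and the `is_correct` checker are hypothetical names introduced for illustration, and texts are represented as strings of `0` and `1`.

```python
from typing import Callable


def oracle_reward(prediction: str, generated: str,
                  is_correct: Callable[[str], bool]) -> int:
    """Reward for one counterfactual episode (illustrative sketch).

    prediction: the n-bit text the Oracle predicted we would generate.
    generated:  the n-bit text we actually generated at random.
    is_correct: hypothetical checker for answers to our question.
    """
    # One reward point only if the random text is a correct answer
    # AND it matches the Oracle's prediction.
    if is_correct(generated) and generated == prediction:
        return 1
    # Wrong answer or mismatch: no reward.
    return 0
```

For example, with a toy checker that accepts only `"101"`, the Oracle is rewarded only when it predicted `"101"` and we happened to generate `"101"`.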

Suppose the Oracle estimates the probability that we randomly generate a text that leads to its escape as p, and take the maximum reward it receives after escaping to be one point. If the Oracle predicts a text that is the correct answer to the question, its expected reward in the counterfactual setup is $\frac{1}{2^{n}} + p$, since we generate that exact text with probability $\frac{1}{2^{n}}$. If the Oracle predicts a text that is not the correct answer, its expected reward is $p$. If the Oracle predicts a text that leads to its escape, its expected reward is also $p$.

Thus, it is most beneficial for the Oracle to predict a text that is the correct answer to our question. It makes no sense to predict a text that will make us let the Oracle out, because if we generate this text, we release the Oracle and it receives the maximum reward regardless of its prediction.
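The comparison of expected rewards can be checked by brute force for a small n. The sketch below is a toy illustration under stated assumptions, not part of the original proposal: the function name `expected_reward`, the sets `correct` and `escape`, and the normalization of the post-escape reward to 1 are all assumptions, and the two sets are assumed to be disjoint.

```python
from fractions import Fraction
from itertools import product


def expected_reward(prediction: str, correct: set, escape: set, n: int) -> Fraction:
    """Exact expected reward of a prediction over all 2**n random texts.

    Assumptions (not from the original post): escaping yields reward 1,
    and the `correct` and `escape` sets do not overlap.
    """
    total = Fraction(0)
    for bits in product("01", repeat=n):
        text = "".join(bits)
        if text in escape:
            # We release the Oracle: maximum reward regardless of prediction.
            total += 1
        elif text in correct and text == prediction:
            # Correct answer that matches the prediction: one reward point.
            total += 1
    return total / 2**n


# Toy instance: n = 3, one correct answer, one escape text, so p = 1/8.
n = 3
correct = {"101"}
escape = {"111"}
p = Fraction(len(escape), 2**n)

reward_correct = expected_reward("101", correct, escape, n)
reward_wrong = expected_reward("000", correct, escape, n)
reward_escape = expected_reward("111", correct, escape, n)
```

In this toy instance, predicting the correct answer gives `Fraction(1, 2**n) + p`, while predicting either a wrong text or the escape text gives only `p`, which matches the argument above.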

One problem remains: there may be texts that are not correct answers and do not make us release the Oracle, but that make us mark them as correct answers. The Oracle will predict one of these texts if it does not know the answer, or if we are more likely to mark this text as correct than the real answer. We would read this text only after the Oracle has been shut down, but it may still be dangerous.
