Available on the arXiv, my paper on two types of Oracles (AIs constrained to answering questions only), and how to use them more safely.

An Oracle is a design for potentially high-powered artificial intelligences (AIs), where the AI is made safe by restricting it to only answering questions. Unfortunately, most designs cause the Oracle to be motivated to manipulate humans with the contents of its answers, and Oracles of potentially high intelligence might be very successful at this. Solving the problem, without compromising the accuracy of the answers, is tricky. This paper reduces the issue to a cryptographic-style problem: Alice must ensure that her Oracle answers her questions while not providing key information to an eavesdropping Eve. Two Oracle designs solve this problem, one counterfactual (the Oracle answers as if it expected its answer to never be read) and one on-policy (limited in the quantity of information it can transmit).
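To make the on-policy (low-bandwidth) idea concrete, here is a minimal sketch, assuming the Oracle's replies are restricted to a small pre-vetted answer set; the answer set, the `ask` helper and the bound below are my own illustration, not code or notation from the paper:

```python
import math

# Minimal sketch of a low-bandwidth Oracle interface (illustration only;
# the answer set and helper names are assumptions, not the paper's code).
ALLOWED_ANSWERS = ["yes", "no", "unknown"]   # small, pre-vetted answer set

def transmitted_bits(answer_set):
    """Upper bound on the information any single answer can carry."""
    return math.log2(len(answer_set))

def ask(oracle, question):
    """Reject any reply outside the allowed set, capping what the Oracle can say."""
    answer = oracle(question)
    if answer not in ALLOWED_ANSWERS:
        raise ValueError("Oracle tried to reply outside the allowed answer set")
    return answer

print(transmitted_bits(ALLOWED_ANSWERS))  # ~1.58 bits per answer
```

The design choice is simply that, whatever the Oracle "wants", any single answer can leak at most log2(|answer set|) bits to Alice or to an eavesdropper.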

7 comments

Nice!

Why not make the counterfactual oracle assume that all its future answers will be ignored, rather than just the current answer? That way you wouldn't need to stop Alice from re-asking questions.

Possibly, but then we have an Oracle that assumes it will get asked n times, and that all n answers will be erased, an event with probability ε^n (writing ε for the per-question erasure probability) and, in reality, one likely to result in very frustrated behaviours from Alice. That is the counterfactual world that determines all its answers.
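For a sense of how quickly that counterfactual world becomes improbable, a quick back-of-the-envelope computation (the values of ε and n below are purely illustrative assumptions):

```python
# Illustrative numbers only: eps is an assumed per-question erasure probability,
# n the number of repeated questions. The all-answers-erased world has probability eps**n.
eps = 0.05
for n in (1, 5, 10):
    print(n, eps ** n)   # 0.05, ~3.1e-07, ~9.8e-14
```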

I think it would be better to reduce ε rather than allow re-questioning.

Interesting, thanks. (excuse my tardiness, I'm a little behind the curve; do let me know if I'm being daft)

Unless I'm missing something, you'd need to be more pessimistic in the case of superintelligent counterfactual AIs. Specifically, you need to avoid the incentive for undesirable actions that increase the AI's expectation of its odds of release. These needn't be causal.

The below isn't quite precise, but I hope the idea is clear.

Consider a set of outputs K that each increase the odds of release for all future oracles (e.g. one being freed, taking power, and freeing all others). Now let K* be the event that some other oracle has output a member of K before our oracle is started. Let O(K) be the event that our oracle outputs a member of K.

If our oracle thinks: P(K*|O(K)) > P(K*) then we may have problems. [nothing spooky here: it's the agent's actions changing its best guess about the past; not causally changing the past]

Giving undesirable output can increase the oracle's probability estimate that it'll be spontaneously freed in the few moments before it's shut down - even in the episodic framework.

The obvious case being reasoning along the lines of "If I give a K output, it's more likely that other oracles in the past gave a K output, since they'd be likely to have similar reasoning in an equivalent situation". It's not clear to me that a superintelligence wouldn't think this way.
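As a toy version of that evidential reasoning (the "shared disposition" model and all numbers below are my own assumptions, not anything from the paper or the comment):

```python
# Toy model: with prior probability p_similar, all oracles share a disposition
# that makes them output a K-type answer for sure; otherwise each oracle
# outputs K independently with small probability q. Conditioning on "I output K"
# is then evidence that earlier oracles did too.
p_similar = 0.3   # assumed prior that oracles reason alike
q = 0.01          # assumed chance a dissimilar oracle outputs K anyway

# P(K*) before conditioning, with a single earlier oracle:
p_kstar = p_similar + (1 - p_similar) * q

# Bayesian update on "similar reasoning" after observing my own K output:
p_output_k = p_similar + (1 - p_similar) * q
p_similar_given_ok = p_similar / p_output_k

# P(K* | O(K)) under the same model:
p_kstar_given_ok = p_similar_given_ok + (1 - p_similar_given_ok) * q

print(round(p_kstar, 3), round(p_kstar_given_ok, 3))  # 0.307 < 0.977
```

The oracle's own output never causes the past event K*, but under this kind of model it is still evidence for it, which is enough to distort the incentives.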

Suppose I'm the counterfactual oracle. I answer every question with K. Eventually Alice reads K, no matter how frequent the erasure event E is. Then I get maximal reward. Am I missing something? Is the paper assuming that the oracle is incapable of long-term planning?

It assumes the reward is episodic, so an Oracle maximising that reward has no interest in the long term. Also, if Alice is to read K, the episode ends before she does so. Only in situations where Alice does not read K is the episode extended until the answer is known.
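Here is a minimal sketch of that episode structure as I understand it (the function names, the scalar reward and the value of ε are my own choices, not the paper's code):

```python
import random

def counterfactual_episode(oracle_answer_fn, question, resolve_truth, eps=0.05):
    """One episode of a counterfactual Oracle: the reward depends only on
    events inside this episode."""
    answer = oracle_answer_fn(question)
    if random.random() >= eps:
        # No erasure: Alice will read the answer. The episode ends here with
        # zero reward, so the content of the answer cannot buy the Oracle anything.
        return 0.0
    # Erasure: nobody reads the answer. The episode is extended until the true
    # answer is known, and the Oracle is scored against it.
    true_answer = resolve_truth(question)
    return 1.0 if answer == true_answer else 0.0
```

Because the episode ends with reward 0 the moment Alice is going to read the answer, the content of the answer only matters in the counterfactual, unread branch.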

But why is that a reasonable assumption to make? Aren't you just assuming that the AI will play nice? I can see that there are some dangerous Oracles that your strategy protects against, but there are also many that it wouldn't hinder at all.

>Aren't you just assuming that the AI will play nice?

I'm assuming that the reward/utility function can be defined to be episodic. We hand the Oracle its utility, hence we can (in theory) construct it to be episodic (and train the Oracle in an episodic way, if we need to train it).
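As an illustration of what training in an episodic way could look like (purely a toy of my own, not the paper's method): a one-parameter policy trained with REINFORCE, where each update uses only the reward generated inside its own episode.

```python
import math
import random

# Toy episodic training loop (my own illustration): a one-parameter policy
# chooses between "yes" and "no", and each REINFORCE update uses only the
# reward from its own episode, so the trained Oracle has nothing to gain
# from influencing events beyond the episode boundary.
w = 0.0        # single policy parameter: log-odds of answering "yes"
lr = 0.1
eps = 0.3      # erasure probability, made large here so learning signal is frequent

def episode_reward(answer, truth):
    if random.random() >= eps:
        return 0.0                           # Alice reads the answer: reward is 0
    return 1.0 if answer == truth else 0.0   # erased: score against the truth

for _ in range(5000):
    truth = "yes"                            # toy task: the true answer is always "yes"
    p_yes = 1 / (1 + math.exp(-w))
    answer = "yes" if random.random() < p_yes else "no"
    r = episode_reward(answer, truth)        # this episode's reward, and nothing else
    grad_logp = (1 - p_yes) if answer == "yes" else -p_yes
    w += lr * grad_logp * r                  # episodic REINFORCE step

print(1 / (1 + math.exp(-w)))                # should drift toward answering "yes"
```

Nothing after the episode boundary ever enters the update, which is the sense in which the trained Oracle has no interest in the long term.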