Oracle paper

by Stuart_Armstrong
13th Dec 2017
1 min read

Available on the arXiv, my paper on two types of Oracles (AIs constrained to answering questions only), and how to use them more safely.

An Oracle is a design for potentially high-powered artificial intelligences (AIs), where the AI is made safe by restricting it to only answering questions. Unfortunately, most designs cause the Oracle to be motivated to manipulate humans with the contents of its answers, and Oracles of potentially high intelligence might be very successful at this. Solving the problem without compromising the accuracy of the answers is tricky. This paper reduces the issue to a cryptographic-style problem: Alice must ensure that her Oracle answers her questions while not providing key information to an eavesdropping Eve. Two Oracle designs solve this problem, one counterfactual (the Oracle answers as if it expected its answer to never be read) and one on-policy (limited by the quantity of information it can transmit).
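
As a rough sketch of how the counterfactual design removes the manipulation incentive (toy code of my own, not the paper's construction; the function name, the loss, and the value of ϵ are placeholders): the Oracle is only ever scored in the low-probability erasure event where its answer is never read, so it gains nothing by crafting answers for human eyes.

```python
import random

EPSILON = 0.01  # probability of the erasure event E (placeholder value)

def counterfactual_episode(oracle_answer, ground_truth_later):
    """Toy reward for one episode of a counterfactual Oracle.

    The answer is only scored in the erasure event, where Alice never
    reads it; otherwise the episode ends immediately with zero reward.
    """
    if random.random() < EPSILON:
        # Erasure: nobody reads the answer. Once the true outcome is
        # known, the Oracle is rewarded for accuracy alone.
        return -abs(oracle_answer - ground_truth_later)
    # Alice reads the answer, but the episode is already over, so
    # manipulating her cannot increase this (zero) reward.
    return 0.0

print(counterfactual_episode(3.2, 3.0))
```

The on-policy design is not sketched here; it instead limits the quantity of information an answer can transmit.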

7 comments, sorted by top scoring

cousin_it · 8y · 4

Nice!

Why not make the counterfactual oracle assume that all its future answers will be ignored, rather than just the current answer? That way you wouldn't need to stop Alice from re-asking questions.

Stuart_Armstrong · 8y · 2

Possibly, but then we have an Oracle that assumes it will get asked n times and that all n answers will be erased: an event with probability ϵ^n, and one that, in reality, is likely to result in very frustrated behaviours from Alice. That is the counterfactual world that determines all its answers.

I think it would be better to reduce ϵ rather than allow re-questioning.
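
For scale, a quick calculation with invented numbers:

```python
# With an erasure probability of eps per question, n erased answers in a
# row is an event of probability eps**n: vanishingly unlikely even for
# modest n.
eps, n = 0.01, 5
print(eps ** n)  # 1e-10
```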

Joe Collman · 7y · 3

Interesting, thanks. (excuse my tardiness, I'm a little behind the curve; do let me know if I'm being daft)

Unless I'm missing something, you'd need to be more pessimistic in the case of superintelligent counterfactual AIs. Specifically, you need to avoid the incentive for undesirable actions that increase the AI's expectation of its odds of release. These needn't be causal.

The below isn't quite precise, but I hope the idea is clear.

Consider a set of outputs K that each increase the odds of release for all future oracles (e.g. one being freed, taking power, and freeing all others). Now let K* be the event that some other oracle has output a member of K before our oracle is started. Let O(K) be the event that our oracle outputs a member of K.

If our oracle thinks P(K*|O(K)) > P(K*), then we may have problems. [Nothing spooky here: it's the agent's actions changing its best guess about the past, not causally changing the past.]

Giving undesirable output can increase the oracle's probability estimate that it'll be spontaneously freed in the few moments before it's shut down - even in the episodic framework.

The obvious case being reasoning along the lines of "If I give a K output, it's more likely that other oracles in the past gave a K output, since they'd be likely to have similar reasoning in an equivalent situation". It's not clear to me that a superintelligence wouldn't think this way.
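
To make the worry concrete, here's a toy Bayesian calculation (all numbers invented) showing how P(K*|O(K)) can exceed P(K*) through purely evidential reasoning:

```python
# Toy model (invented numbers): R = "oracles like me tend to give K-outputs
# in this situation". My own output is evidence about R, and hence about
# whether an earlier oracle already gave a K-output. Nothing causal here.

prior_R = 0.5            # P(R)
p_K_given_R = 0.9        # P(an oracle outputs K | R)
p_K_given_not_R = 0.1    # P(an oracle outputs K | not R)

# Unconditional chance that one earlier oracle gave a K-output.
p_Kstar = prior_R * p_K_given_R + (1 - prior_R) * p_K_given_not_R      # 0.5

# P(I output K) has the same form, since I'm an oracle of the same type.
p_OK = prior_R * p_K_given_R + (1 - prior_R) * p_K_given_not_R         # 0.5

# Bayes: observing my own K-output raises my credence in R ...
posterior_R = prior_R * p_K_given_R / p_OK                             # 0.9

# ... and therefore my credence that the earlier oracle also output K.
p_Kstar_given_OK = (posterior_R * p_K_given_R
                    + (1 - posterior_R) * p_K_given_not_R)             # 0.82

print(p_Kstar, p_Kstar_given_OK)  # 0.5 0.82, so P(K*|O(K)) > P(K*)
```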

zulupineapple · 8y · 1

Suppose I'm the counterfactual oracle. I answer every question with K. Eventually Alice reads K, no matter how frequent E is. Then I get maximal reward. Am I missing something? Is the paper assuming that the oracle is incapable of long-term planning?

Stuart_Armstrong · 8y · 2

It assumes the reward is episodic, so an Oracle maximising that reward has no interest in the long term. Also, if Alice is to read K, the episode ends before she does so. Only in situations where Alice does not read K is the episode extended until the answer is known.

zulupineapple · 8y · 1

But why is that a reasonable assumption to make? Aren't you just assuming that the AI will play nice? I can see that there are some dangerous Oracles that your strategy can protect us from, but there are also many that it wouldn't hinder at all.

Stuart_Armstrong · 8y · 2

>Aren't you just assuming that the AI will play nice?

I'm assuming that the reward/utility functions can be defined to be episodic. We hand the Oracle its utility, hence we can (in theory) construct it to be episodic (and train the Oracle in an episodic way, if we need to train it).
