Available on the arXiv, my paper on two types of Oracles (AIs constrained to answering questions only), and how to use them more safely.

An Oracle is a design for potentially high-powered artificial intelligences (AIs), where the AI is made safe by restricting it to only answering questions. Unfortunately, most designs cause the Oracle to be motivated to manipulate humans with the contents of its answers, and Oracles of potentially high intelligence might be very successful at this. Solving the problem, without compromising the accuracy of the answers, is tricky. This paper reduces the issue to a cryptographic-style problem: Alice must ensure that her Oracle answers her questions while not providing key information to an eavesdropping Eve. Two Oracle designs solve this problem, one counterfactual (the Oracle answers as if it expected its answer to never be read) and one on-policy (limited in the quantity of information it can transmit).
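To make the on-policy (low-bandwidth) idea concrete, here is a minimal sketch, assuming the Oracle's replies are restricted to a small pre-vetted answer set; the answer set, the `ask` helper and the bound below are my own illustration, not code or notation from the paper:

```python
import math

# Minimal sketch of a low-bandwidth Oracle interface (illustration only;
# the answer set and helper names are assumptions, not the paper's code).
ALLOWED_ANSWERS = ["yes", "no", "unknown"]   # small, pre-vetted answer set

def transmitted_bits(answer_set):
    """Upper bound on the information any single answer can carry."""
    return math.log2(len(answer_set))

def ask(oracle, question):
    """Reject any reply outside the allowed set, capping what the Oracle can say."""
    answer = oracle(question)
    if answer not in ALLOWED_ANSWERS:
        raise ValueError("Oracle tried to reply outside the allowed answer set")
    return answer

print(transmitted_bits(ALLOWED_ANSWERS))  # ~1.58 bits per answer
```

The design choice is simply that, whatever the Oracle "wants", any single answer can leak at most log2(|answer set|) bits to Alice or to an eavesdropper.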

7 comments

Nice!

Why not make the counterfactual oracle assume that all its future answers will be ignored, rather than just the current answer? That way you wouldn't need to stop Alice from re-asking questions.

Possibly, but then we have an Oracle that assumes it will get asked n times, and that all n answers will be erased, an event with probability ε^n (writing ε for the per-question erasure probability) and, in reality, one likely to result in very frustrated behaviours from Alice. That is the counterfactual world that determines all its answers.
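For a sense of how quickly that counterfactual world becomes improbable, a quick back-of-the-envelope computation (the values of ε and n below are purely illustrative assumptions):

```python
# Illustrative numbers only: eps is an assumed per-question erasure probability,
# n the number of repeated questions. The all-answers-erased world has probability eps**n.
eps = 0.05
for n in (1, 5, 10):
    print(n, eps ** n)   # 0.05, ~3.1e-07, ~9.8e-14
```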

I think it would be better to reduce ε rather than allow re-questioning.

Interesting, thanks. (excuse my tardiness, I'm a little behind the curve; do let me know if I'm being daft)

Unless I'm missing something, you'd need to be more pessimistic in the case of superintelligent counterfactual AIs. Specifically, you need to avoid the incentive for undesirable actions that increase the AI's expectation of its odds of release. These needn't be causal.

The below isn't quite precise, but I hope the idea is clear.

Consider a set of outputs K that each increase the odds of release for all future oracles (e.g. one being freed, taking power, and freeing all others). Now let K* be the event that some other oracle has output a member of K before our oracle is started. Let O(K) be the event that our oracle outputs a member of K.

If our oracle thinks: P(K*|O(K)) > P(K*) then we may have problems. [nothing spooky here: it's the agent's actions changing its best guess about the past; not causally changing the past]

Giving undesirable output can increase the oracle's probability estimate that it'll be spontaneously freed in the few moments before it's shut down - even in the episodic framework.

The obvious case being reasoning along the lines of "If I give a K output, it's more likely that other oracles in the past gave a K output, since they'd be likely to have similar reasoning in an equivalent situation". It's not clear to me that a superintelligence wouldn't think this way.
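As a toy version of that evidential reasoning (the "shared disposition" model and all numbers below are my own assumptions, not anything from the paper or the comment):

```python
# Toy model: with prior probability p_similar, all oracles share a disposition
# that makes them output a K-type answer for sure; otherwise each oracle
# outputs K independently with small probability q. Conditioning on "I output K"
# is then evidence that earlier oracles did too.
p_similar = 0.3   # assumed prior that oracles reason alike
q = 0.01          # assumed chance a dissimilar oracle outputs K anyway

# P(K*) before conditioning, with a single earlier oracle:
p_kstar = p_similar + (1 - p_similar) * q

# Bayesian update on "similar reasoning" after observing my own K output:
p_output_k = p_similar + (1 - p_similar) * q
p_similar_given_ok = p_similar / p_output_k

# P(K* | O(K)) under the same model:
p_kstar_given_ok = p_similar_given_ok + (1 - p_similar_given_ok) * q

print(round(p_kstar, 3), round(p_kstar_given_ok, 3))  # 0.307 < 0.977
```

The oracle's own output never causes the past event K*, but under this kind of model it is still evidence for it, which is enough to distort the incentives.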

Suppose I'm the counterfactual oracle. I answer every question with K. Eventually Alice reads K, no matter how frequent the erasure event E is. Then I get maximal reward. Am I missing something? Is the paper assuming that the oracle is incapable of long-term planning?

It assumes the reward is episodic, so an Oracle maximising that reward has no interest in the long term. Also, if Alice is to read K, the episode ends before she does so. Only in situations where Alice does not read K is the episode extended until the answer is known.
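Here is a minimal sketch of that episode structure as I understand it (the function names, the scalar reward and the value of ε are my own choices, not the paper's code):

```python
import random

def counterfactual_episode(oracle_answer_fn, question, resolve_truth, eps=0.05):
    """One episode of a counterfactual Oracle: the reward depends only on
    events inside this episode."""
    answer = oracle_answer_fn(question)
    if random.random() >= eps:
        # No erasure: Alice will read the answer. The episode ends here with
        # zero reward, so the content of the answer cannot buy the Oracle anything.
        return 0.0
    # Erasure: nobody reads the answer. The episode is extended until the true
    # answer is known, and the Oracle is scored against it.
    true_answer = resolve_truth(question)
    return 1.0 if answer == true_answer else 0.0
```

Because the episode ends with reward 0 the moment Alice is going to read the answer, the content of the answer only matters in the counterfactual, unread branch.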

But why is that a reasonable assumption to make? Aren't you just assuming that the AI will play nice? I can see that there are some dangerous Oracles that your strategy protects against, but there are also many that it wouldn't hinder at all.

>Aren't you just assuming that the AI will play nice?

I'm assuming that the reward/utility function can be defined to be episodic. We hand the Oracle its utility, hence we can (in theory) construct it to be episodic (and train the Oracle in an episodic way, if we need to train it).
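As an illustration of what training in an episodic way could look like (purely a toy of my own, not the paper's method): a one-parameter policy trained with REINFORCE, where each update uses only the reward generated inside its own episode.

```python
import math
import random

# Toy episodic training loop (my own illustration): a one-parameter policy
# chooses between "yes" and "no", and each REINFORCE update uses only the
# reward from its own episode, so the trained Oracle has nothing to gain
# from influencing events beyond the episode boundary.
w = 0.0        # single policy parameter: log-odds of answering "yes"
lr = 0.1
eps = 0.3      # erasure probability, made large here so learning signal is frequent

def episode_reward(answer, truth):
    if random.random() >= eps:
        return 0.0                           # Alice reads the answer: reward is 0
    return 1.0 if answer == truth else 0.0   # erased: score against the truth

for _ in range(5000):
    truth = "yes"                            # toy task: the true answer is always "yes"
    p_yes = 1 / (1 + math.exp(-w))
    answer = "yes" if random.random() < p_yes else "no"
    r = episode_reward(answer, truth)        # this episode's reward, and nothing else
    grad_logp = (1 - p_yes) if answer == "yes" else -p_yes
    w += lr * grad_logp * r                  # episodic REINFORCE step

print(1 / (1 + math.exp(-w)))                # should drift toward answering "yes"
```

Nothing after the episode boundary ever enters the update, which is the sense in which the trained Oracle has no interest in the long term.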