Most people here probably already understand this by now, so this post is mainly meant to keep new people from getting confused about the point of Counterfactual Oracles (in the ML setting), because there's no top-level post that explains it clearly at a conceptual level. Paul Christiano does have a blog post titled Counterfactual oversight vs. training data, which talks about the same thing as this post, except that he uses the term "counterfactual oversight", which is just Counterfactual Oracles applied to human imitation (which he proposes to use to "oversee" some larger AI system). But the fact that he doesn't mention "Counterfactual Oracle" makes it hard for people to find that post or see the connection between it and Counterfactual Oracles. And as long as I'm writing a new top-level post, I might as well try to explain it in my own words. The second part of this post lists some remaining problems with oracles/predictors that are not solved by Counterfactual Oracles.
Without further ado, I think when the Counterfactual Oracle is translated into the ML setting, it has four essential characteristics:
- Supervised training - This is safer than reinforcement learning because we don't have to worry about reward hacking (i.e., reward gaming and reward tampering), and it eliminates the problem of self-confirming predictions (which can be seen as a form of reward hacking). In other words, if the only thing that ever sees the Oracle's output during a training episode is an automated system that computes the Oracle's reward/loss, and that system is secure because it's just computing a simple distance metric (comparing the Oracle's output to the training label), then reward hacking and self-confirming predictions can't happen. (ETA: See this comment for how I've changed my mind about this.)
- Independent labeling of data - This is usually taken for granted in supervised learning but perhaps should be explicitly emphasized here. To prevent self-confirming predictions, the labeling of data has to be done without causal influence from the Oracle. That is, the Oracle's output should be isolated in some protected system until it's time to compute the reward/loss so that it can't affect how the training label is generated. But note that it's perfectly fine for humans or other systems to look at the question/input that is given to the Oracle, in order to generate the training label.
- Online learning - The Oracle never stops learning, so it can eventually adjust to any distributional shift. (But note that an Oracle implemented using current ML techniques might perform very badly, possibly in a malign way, while it's still adjusting.)
- Random selection of training episodes - I'll just quote Paul here: "Counterfactual oversight consists of labelling a random subset of data and using it as online training data. The key difference is that any given data point may become a training data point, with the decision made after the learning system has made a decision about it. As long as the randomization is unpredictable to the learner, this gives us a formal guarantee that there can’t be any noticeable difference between the training and test data. And therefore if our learner behaves well on training data, it really must behave well on test data." (Note that what Stuart Armstrong calls "erasure" just means that the current episode has been selected as a training episode.)
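To make the interaction between these characteristics concrete, here is a minimal Python sketch of a single episode loop. Everything in it is an assumption made up for illustration and does not come from Paul's post or Stuart's paper: the toy `RunningMeanOracle`, the `run_episode` helper, the `get_label` and `train_prob` parameters, and the choice of squared error as the "simple distance metric". The points it tries to show are that the Oracle commits to its output before the training/test decision is made, that the label is generated from the question alone, and that the output is released to the user only on non-training episodes.

```python
import random


class RunningMeanOracle:
    """Hypothetical toy model: predicts the running mean of the labels it has seen."""

    def __init__(self):
        self.estimate = 0.0
        self.n_updates = 0

    def predict(self, question):
        # The toy model ignores the question; a real model would not.
        return self.estimate

    def update(self, label):
        # One online supervised step: move the estimate to the new running mean.
        self.n_updates += 1
        self.estimate += (label - self.estimate) / self.n_updates


def run_episode(oracle, question, get_label, train_prob, rng):
    """Run one episode and return (released_output, loss).

    Exactly one of the two returned values is None.
    """
    # 1. The Oracle commits to an output before knowing the episode type.
    prediction = oracle.predict(question)

    # 2. Only afterwards is the episode randomly selected (or not) for training.
    if rng.random() < train_prob:
        # Training ("erasure") episode: the output stays sealed.
        # The label is produced from the question alone, never from the output.
        label = get_label(question)
        loss = (prediction - label) ** 2  # assumed distance metric
        oracle.update(label)
        return None, loss

    # Non-training episode: the output is released and no learning happens.
    return prediction, None


# Illustrative usage with a made-up labeler that always answers 3.0.
rng = random.Random(0)
oracle = RunningMeanOracle()
for step in range(1000):
    released, loss = run_episode(oracle, step, lambda q: 3.0, 0.1, rng)
```

Note that `get_label` only ever receives the question, which is the "independent labeling" requirement, and that the loss is computed by plain arithmetic without any human or other system seeing the sealed output. The sketch says nothing about the remaining problems below, including how badly the model might behave while it is still adjusting to a distributional shift.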
Remaining Safety Problems
Counterfactual Oracles solve (or are a proposal to solve) some safety problems associated with predictors/oracles, but others remain. Besides the distributional shift problem mentioned above, here are a few more that come to mind. Note that these are not problems specific to Counterfactual Oracles, but Stuart Armstrong's Safe Uses of AI Oracles seems wrong or misleading when it says "This paper presented two Oracle designs which are both safe and useful."
- Benign Induction - This is a concern that the Oracle might come to believe (perhaps correctly in some sense) that we probably live in a simulation or Zoo Earth, and that at some specific point the alien simulation runners or zoo keepers will intervene in the simulation or zoo to change the training signal for the Oracle, causing the Oracle (on both fake and real Earths) to output something that for example causes our world to be taken over by the aliens.
- Reward Takeover on Real Earth - The Oracle may predict a future in which its training data / reward signal comes under the control of a hostile agent that will arise or already exists on the real Earth (e.g., an unaligned AI or an unfriendly government), and that prediction causes it to output something that benefits the hostile agent at the expense of its current user.
- Safe Approximation - If the Oracle is not able to exactly predict something, could its attempt at approximation cause safety problems? For example, if we use it to predict human behavior, could the approximate prediction be unsafe because it ends up predicting a human-like agent with non-human values?
- Inner alignment - The ML training process may not produce a model that actually optimizes for what we intend for it to optimize for (namely minimizing loss for just the current episode, conditional on the current episode being selected as a training episode).
(I just made up the names for #2 and #3, "Reward Takeover on Real Earth" and "Safe Approximation", so please feel free to suggest improvements for them.)