Submission. For the counterfactual oracle, ask it to provide a proof of an important mathematical theorem (perhaps one of the Millenium prizes) in a automated theorem prover format. Since the correctness of this proof should be verifiable by a theorem prover, the loss function in the counterfactual scenario is 1 if the prover did not validate the proof and 0 if it did validate the proof.
This assumes that we've already made progress in setting up automated theorem proving software that already has incorporated all of current mathematical knowledge. The Lean theorem prover seems most promising, but perhaps the counterfactual loss function could include randomly choosing from various different theorem proving software.
This idea has the benefit of having a concrete way of automatically scoring the response in the counterfactual world given the logical nature of mathematical statements. Also, given that we are already attempting to solve these problems and may eventually solve them, the main risk seems to be accelerating follow-on progress enabled by such proofs and not suggesting new actions that humanity is not already pursuing.
It may be susceptible to 1) providing proofs that are not human interpretable and 2) taking advantage of bugs in theorem provers. Having an ensemble of theorem provers may mitigate 2).
(See here for more info on Lean and here (tweeted about here and here) for an example of inputting all the definitions (not requisite proofs or lemmas) to define perfectoid spaces as a demonstration of its future potential.)
Submission. For the counterfactual oracle, ask it to output a game-playing software agent that can beat a specified opponent agent (or ensemble of agents). In the counterfactual scenario, have these two agents compete and give the oracle a reward 1 if it wins greater than 50% (or develops a higher ELO score, etc.) and reward 0 otherwise.
As with deep reinforcement learning, this requires a game or simulated environment where automated self-play can rapidly be conducted for the counterfactual scoring. This should be able to produce highly skilled game agents on such limited domains.
This could be augmented by asking for an agent that can play a suite of games and must beat the current state of the art approaches at all games. In the counterfactual scenario, either one of these games is randomly selected to check the oracle agent's performance, or all games are checked. Such a game suite may indirectly demonstrate principles for generally competent agents.
Idea. Since we may not want the oracle to suggest to us new actions to take or new experiments to conduct, it may be better for the oracle to reveal to us new patters that are already present in current experimental or observational datasets. For example, it could predict new astronomical phenomena that are already present in current astronomical datasets.
The problem I'm having is how to automate this scoring when in the counterfactual scenario. If we just ask the oracle to predict some pattern in the data, it may report trivial patterns. This could be ameliorated by randomly choosing from several different datasets that cover a similar area, but again I'm not sure how to automatically judge for novelty of predictions.
I think it is worth emphasizing this distinction between conducting new experiments and revealing phenomena (with new theories) that are already present in current data. The danger could be that one could be led astray by a false theory that seemed to match all current observations, but would suggest other dangerous actions.
This idea reminds me of the (fictional) evidence for an alien AI already present in SETI recordings that Wintermute/Neuromancer discovers.