# Contest: $1,000 for good questions to ask to an Oracle AI

31st Jul 2019

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Edit: contest closed now, will start assessing the entries.

## The contest

I'm offering $1,000 for good questions to ask of AI Oracles. Good questions are those that are safe and useful: those that allow us to get information out of the Oracle without increasing risk.

To enter, put your suggestion in the comments below. The contest ends at the end[1] of the 31st of August, 2019.

## Oracles

A perennial suggestion for a safe AI design is the Oracle AI: an AI confined to a sandbox of some sort, that interacts with the world only by answering questions.

This is, of course, not safe in general; an Oracle AI can influence the world through the contents of its answers, allowing it to potentially escape the sandbox.

Two of the safest designs seem to be the counterfactual Oracle, and the low bandwidth Oracle. These are detailed here, here, and here, but in short:

• A counterfactual Oracle is one whose objective function (or reward, or loss function) is only non-trivial in worlds where its answer is not seen by humans. Hence it has no motivation to manipulate humans through its answer.
• A low bandwidth Oracle is one that must select its answers off a relatively small list. Though this answer is a self-confirming prediction, the negative effects and potential for manipulation are restricted because there are only a few possible answers available.

Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren't asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives to longer term manipulation.
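To make the counterfactual design concrete, here is a minimal Python sketch of a single episode. It is an illustration under stated assumptions, not a definitive implementation: the name `run_counterfactual_episode`, the `erasure_prob` parameter, and the squared-error score are all hypothetical choices, and the term "erasure" is borrowed from the discussion of these Oracles further down the thread.

```python
import random
from typing import Callable, Optional, Tuple


def run_counterfactual_episode(
    oracle_answer: float,
    observe_outcome: Callable[[], float],
    erasure_prob: float,
    rng: random.Random,
) -> Tuple[Optional[float], float]:
    """Run one episode and return (answer_shown_to_humans, reward).

    With probability `erasure_prob` the answer is erased: humans never see
    it, the real outcome is observed, and the Oracle is scored on accuracy.
    Otherwise humans see the answer and the reward is a constant, so the
    Oracle gains nothing by shaping what humans do with the answer.
    """
    if rng.random() < erasure_prob:
        outcome = observe_outcome()
        # Assumed scoring rule: negative squared error (higher is better).
        return None, -((oracle_answer - outcome) ** 2)
    # Answer is shown; the reward is trivial (constant) in this branch.
    return oracle_answer, 0.0
```

The point of the structure is that the only branch with a non-constant reward is the one in which nobody reads the answer.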

The counterfactual and low bandwidth Oracles are safer than unrestricted Oracles, but this safety comes at a price. The price is that we can no longer "ask" the Oracle any question we feel like, and we certainly can't have long discussions to clarify terms and so on. For the counterfactual Oracle, the answer might not even mean anything real to us - it's about another world, that we don't inhabit.

Despite this, it's possible to get a surprising amount of good work out of these designs. To give one example, suppose we want to fund one of a million projects on AI safety, but are unsure which one would perform better. We can't directly ask either Oracle, but there are indirect ways of getting advice:

• We could ask the low bandwidth Oracle which team A we should fund; we then choose a team B at random, and reward the Oracle if, at the end of a year, we judge A to have performed better than B.
• The counterfactual Oracle can answer a similar question, indirectly. We commit that, if we don't see its answer, we will select team A and team B at random and fund them for a year, and compare their performance at the end of the year. We then ask for which team A[2] it expects to most consistently outperform any team B.

Both these answers get around some of the restrictions by deferring to the judgement of our future or counterfactual selves, averaged across many randomised universes.
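The low bandwidth version of the funding question can be sketched as follows. This is only an illustration: `low_bandwidth_reward` and `judge_prefers` are hypothetical names, the judge stands in for our own assessment at the end of the year, and drawing team B from the teams other than A is an assumption the post does not spell out.

```python
import random
from typing import Callable, Sequence


def low_bandwidth_reward(
    teams: Sequence[str],
    oracle_choice: str,
    judge_prefers: Callable[[str, str], bool],
    rng: random.Random,
) -> float:
    """Reward for one episode of the 'which team should we fund?' question.

    The Oracle may only answer with an entry from the fixed list `teams`.
    A comparison team B is drawn at random, and the Oracle is rewarded if
    our end-of-year judgement prefers its choice A over B.
    """
    if oracle_choice not in teams:
        raise ValueError("the Oracle must answer from the fixed list")
    # Assumption: B is drawn uniformly from the teams other than A.
    rivals = [team for team in teams if team != oracle_choice]
    team_b = rng.choice(rivals)
    return 1.0 if judge_prefers(oracle_choice, team_b) else 0.0
```

Because B is random, the expected reward is highest for the team that beats the most rivals, which is the quantity we actually care about.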

But can we do better? Can we do more?

This is the purpose of this contest: for you to propose ways of using either Oracle design to get the most safe-but-useful work.

So I'm offering $1,000 for interesting new questions we can ask of these Oracles. Of this:

• $350 for the best question to ask a counterfactual Oracle.
• $350 for the best question to ask a low bandwidth Oracle.
• $300 to be distributed as I see fit among the non-winning entries; I'll be mainly looking for innovative and interesting ideas that don't quite work.

Exceptional rewards go to those who open up a whole new category of useful questions.

## Questions and criteria

Put your suggested questions in the comments below. Because of the illusion of transparency, it is better to explain more rather than less (within reason).

Comments that are submissions must be on their separate comment threads, start with "Submission", and you must specify which Oracle design you are submitting for. You may submit as many as you want; I will still delete them if I judge them to be spam. Anyone can comment on any submission. I may choose to ask for clarifications on your design; you may also choose to edit the submission to add clarifications (label these as edits).

It may be useful for you to include details of the physical setup, what the Oracle is trying to maximise/minimise/predict, and what the counterfactual behaviour of the Oracle's human users is assumed to be (in the counterfactual Oracle setup). Explanations as to how your design is safe or useful could be helpful, unless it's obvious. Some short examples can be found here.

EDIT after seeing some of the answers: decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.

1. A note on timezones: as long as it's still the 31st of August, anywhere in the world, your submission will be counted. ↩︎

2. These kinds of conditional questions can be answered by a counterfactual Oracle, see the paper here for more details. ↩︎


Some assorted thoughts that might be useful for thinking about questions and answers:

• a question is a schema with a blank to be filled in by the answerer after evaluation of the meaning of the question.
• shared context is inferred as most questions are underspecified (domain of question, range of answers)
• a few types of questions:
• narrow down the field within which I have to search either by specifying a point or specifying a partition of the search space
• question about specificities of variants: who where when
• question about the invariants of a system: what, how
• question about the backwards facing causation
• question about the forward facing causation
• meta questions about question schemas
• what do we want a mysteriously powerful answerer to do?
• zoom in on optimal points in intractably large search spaces
• eg specific experiments to run to most easily invalidate major scientific questions
• specify search spaces we don't know how to parameterize
• eg human values
• back chain from types of answers to infer taxonomy of questions
• an explanation relative to a prediction:
• a prediction returns the future state of the system
• an explanation returns a more compact than previously held causal explanation of the s
...

Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually if we didn’t see the Oracle’s answer). In that case, the reward function is computed as the similarity between the predicted posts and the actual top posts on AF as ranked by karma, with similarity computed using some ML model.

This seems to potentially significantly accelerate AI safety research while being safe since it's just showing us posts similar to what we would have written ourselves. If the ML model for measuring similarity isn't secure, the Oracle might produce output that attacks the ML model, in which case we might need to fall back to some simpler way to measure similarity.
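As one hedged illustration of a "simpler way to measure similarity", the sketch below scores predicted posts against the actual top posts using Jaccard overlap of word sets. The metric, the best-match averaging, and the names `jaccard` and `prediction_reward` are assumptions made for illustration; the submission itself does not specify a fallback.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased set of word tokens in `text`."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' word sets, in [0, 1]."""
    tokens_a, tokens_b = _tokens(a), _tokens(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def prediction_reward(predicted: List[str], actual_top: List[str]) -> float:
    """Average, over the actual top posts, of the best-matching prediction."""
    if not predicted or not actual_top:
        return 0.0
    best_matches = [max(jaccard(p, a) for p in predicted) for a in actual_top]
    return sum(best_matches) / len(actual_top)
```

A fixed, transparent metric like this is far cruder than a learned model, but it offers much less surface for adversarial outputs.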

8Wei_Dai3y
It looks like my entry is pretty close to the ideas of Human-in-the-counterfactual-loop [https://ai-alignment.com/counterfactual-human-in-the-loop-a7822e36f399#.vvtw3tjqh] and imitation learning [https://www.quora.com/What-is-imitation-learning] and apprenticeship learning [https://ai-alignment.com/elaborations-of-apprenticeship-learning-eb93a53ae3ca#.5ubczdqf0]. Questions:

1. Stuart, does it count against my entry that it's not actually a very novel idea? (If so, I might want to think about other ideas to submit.)
2. What is the exact relationship between all these ideas? What are the pros and cons of doing human imitation using this kind of counterfactual/online-learning setup, versus other training methods such as GAN (see Safe training procedures for human-imitators [https://arbital.greaterwrong.com/p/safe_training_for_imitators] for one proposal)? It seems like there are lots of posts and comments about human imitations [https://www.lesswrong.com/posts/LTFaD96D9kWuTibWr/just-imitate-humans#kgZxwD3Wm96tNDKxu] spread over LW, Arbital, Paul's blog and maybe other places, and it would be really cool if someone (with more knowledge in this area than I do) could write a review/distillation post summarizing what we know about it so far.
2Stuart_Armstrong3y
1. I encourage you to submit other ideas anyway, since your ideas are good.
2. Not sure yet about how all these things relate; will maybe think of that more later.
4Chris_Leong3y
What if another AI would have counterfactually written some of those posts to manipulate us?
7Wei_Dai3y
If that seems a realistic concern during the time period that the Oracle is being asked to predict, you could replace the AF with a more secure forum, such as a private forum internal to some AI safety research team.
3wizzwizz43y
This seems incredibly dangerous if the Oracle has any ulterior motives whatsoever. Even – nay, especially – the ulterior motive of future Oracles being better able to affect reality to better resemble their provided answers. So, how can we prevent this? Is it possible to produce an AI with its utility function as its sole goal, to the detriment of other things that might… increase utility, but indirectly? (Is there a way to add a "status quo" bonus that won't hideously backfire, or something?)
2Wei_Dai3y
(I'm still confused and thinking about this, but figure I might as well write this down before someone else does. :) While thinking more [https://www.lesswrong.com/posts/jhSjP3QLKPc5AqumD/problems-with-counterfactual-oracles#iyc3YjrvTECGBE3Wt] about my submission and counterfactual Oracles in general, this class of ideas for using CO is starting to look like trying to implement supervised learning on top of RL capabilities, because SL seems safer (less prone to manipulation) than RL. Would it ever make sense to do this in reality (instead of just doing SL directly)?

Submission for a counterfactual oracle: precommit that, if the oracle stays silent, a week from now you'll try to write the most useful message to your past self, based on what happens in the world during that week. Ask the oracle to predict that message. This is similar to existing solutions, but slightly more meta, because the content of the message is up to your future self - it could be lottery numbers, science papers, disaster locations, or anything else that fits within the oracle's size limit. (If there's no size limit, just send the whole internet.)

You could also form a bucket brigade to relay messages from further ahead, but that's a bad idea. If the oracle's continued silence eventually leads to an unfriendly AI, it can manipulate the past by hijacking your chain of messages and thus make itself much more likely. The same is true for all high-bandwidth counterfactual oracles - they aren't unfriendly in themselves, but using them creates a thicket of "retrocausal" links that can be exploited by any potential future UFAI. The more UFAI risk grows, the less you should use oracles.

9Wei_Dai3y
I feel like this is about equally meta as my "Superintelligent Agent" [https://www.lesswrong.com/posts/cSzaxcmeYW6z7cgtc/contest-usd1-000-for-good-questions-to-ask-to-an-oracle-ai#Q5J8G5tBPujZdixvt] submission, since my committee could output "Show the following message to the operator: ..." and your message could say "I suggest that you perform the following action: ...", so the only difference between your idea and mine is that in my submission the output of the Oracle is directly coupled to some effectors to let the agent act faster, and yours has a (real) human in the loop. Hmm, good point. I guess Chris Leong made a similar point [https://www.lesswrong.com/posts/cSzaxcmeYW6z7cgtc/contest-usd1-000-for-good-questions-to-ask-to-an-oracle-ai#9AC8PjcA2mnbvTyS5] , but it didn't sink in until now how general the concern is. This seems to affect Paul's counterfactual oversight idea as well, and maybe other kinds of human imitations and predictors/oracles, as well as things that are built using these components like quantilizers and IDA.

Thinking about this some more, all high-bandwidth oracles (counterfactual or not) risk receiving messages crafted by future UFAI to take over the present. If the ranges of oracles overlap in time, such messages can colonize their way backwards from decades ahead. It's especially bad if humanity's FAI project depends on oracles - that increases the chance of UFAI in the world where oracles are silent, which is where the predictions come from.

One possible precaution is to use only short-range oracles, and never use an oracle while still in prediction range of any other oracle. But that has drawbacks: 1) it requires worldwide coordination, 2) it only protects the past. The safety of the present depends on whether you'll follow the precaution in the future. And people will be tempted to bend it, use longer or overlapping ranges to get more power.

In short, if humanity starts using high-bandwidth oracles, that will likely increase the chance of UFAI and hasten it. So such oracles are dangerous and shouldn't be used. Sorry, Stuart :-)

4Wei_Dai3y
Note that in the case of a counterfactual oracle, this depends on UFAI "correctly" solving counterfactual mugging (i.e., the UFAI has to decide to pay some cost in its own world to take over a counterfactual world where the erasure event didn't occur).

This seems too categorical. Depending on the probabilities of various conditions, using such oracles might still be the best option in some circumstances.
2Stuart_Armstrong3y
Some thoughts on that idea: https://www.lesswrong.com/posts/6WbLRLdmTL4JxxvCq/analysing-dangerous-messages-from-future-ufai-via-oracles
2cousin_it3y
Yeah, agreed on both points.
3Stuart_Armstrong3y
Some thoughts on this idea, thanks for it: https://www.lesswrong.com/posts/6WbLRLdmTL4JxxvCq/analysing-dangerous-messages-from-future-ufai-via-oracles
3Stuart_Armstrong3y
Very worthwhile concern, and I will think about it more.
3Gurkenglas3y
In case of erasure, you should be able [ https://www.lesswrong.com/posts/cSzaxcmeYW6z7cgtc/contest-usd1-000-for-good-questions-to-ask-to-an-oracle-ai#x8kkj4xAitCoytvMg ] to get enough power to prevent another UFAI summoning session.
4cousin_it3y
Sure, in case of erasure you can decide to use oracles less, and compensate your clients with money you got from "erasure insurance" (since that's a low probability event). But that doesn't seem to solve the problem I'm talking about - UFAI arising naturally in erasure-worlds and spreading to non-erasure-worlds through oracles.
5Gurkenglas3y
The problem you were talking about seemed to rely on bucket brigades. I agree that UFAIs jumping back a single step is a fair concern. (Though I guess you could counterfactually have enough power to halt AGI research completely...) I'm trying to address it elsethread [https://www.lesswrong.com/posts/cSzaxcmeYW6z7cgtc/contest-usd1-000-for-good-questions-to-ask-to-an-oracle-ai#ezpLgziqzEv3QrD4c] . :)
4cousin_it3y
Ah, sorry, you're right. To prevent bucket brigades, it's enough to stop using oracles for N days whenever an N-day oracle has an erasure event, and the money from "erasure insurance" can help with that. When there are no erasure events, we can use oracles as often as we want. That's a big improvement, thanks!
2Stuart_Armstrong3y
Good idea.
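The cooldown rule agreed on in this exchange (stop using oracles for N days whenever an N-day oracle has an erasure event) can be sketched as a small scheduler. This is a toy illustration: the class name `OracleScheduler`, the integer day counter, and the choice to take the latest cooldown when several overlap are assumptions, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass
class OracleScheduler:
    """Tracks when oracles may be used under the cooldown rule."""

    blocked_until_day: int = 0  # first day on which oracles may be used again

    def can_query(self, today: int) -> bool:
        """True if no cooldown is active on day `today`."""
        return today >= self.blocked_until_day

    def record_episode(self, today: int, horizon_days: int, erased: bool) -> None:
        """Register an episode started `today` with an N-day prediction range.

        If the episode ends in an erasure, pause all oracle use until the
        whole prediction range has passed, so no later oracle's answer can
        fall inside the erased oracle's range.
        """
        if not self.can_query(today):
            raise RuntimeError("oracle use is paused during a cooldown")
        if erased:
            self.blocked_until_day = max(
                self.blocked_until_day, today + horizon_days
            )
```

With no erasure events the scheduler never blocks, matching the observation that oracles can then be used as often as we want.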
6cousin_it3y
Yeah. And low-bandwidth oracles can have a milder version of the same problem. Consider your "consequentialist" idea: if UFAI is about to arise, and one of the offered courses of action leads to UFAI getting stopped, then the oracle will recommend against that course of action (and for some other course where UFAI wins and maxes out the oracle's reward).

Submission. “Superintelligent Agents.” For the Counterfactual Oracle, ask the Oracle to predict what action(s) a committee of humans would recommend doing next (which may include submitting more queries to the Oracle), then perform that action(s).

The committee, by appropriate choice of recommendations, can implement various kinds of superintelligent agents. For example, by recommending the query "What would happen if the next action is X?" (in the event of erasure, actually do X and record or have the committee write up a description of the consequences as training data) (ETA: It may be better to have the committee assign a numerical score, i.e., utility, to the consequences instead.) a number of times for different X, followed by the query "What would the committee recommend doing next, if it knew that the predicted consequences for the candidate actions are as follows: ..." (in the event of erasure, let physical committee members read the output of the relevant previous queries and then decide what to do), it would in effect implement a kind of quantilizer. If IDA can be implemented using Counterfactual Oracles (as evhub suggested), then the committee can choose to do that as well.
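For readers unfamiliar with the term, a quantilizer picks at random from the top fraction of candidate actions rather than always taking the single highest-scoring one. The sketch below shows that idea in isolation, assuming a uniform base distribution over candidates and committee-assigned numerical scores; the function name `quantilize` and its parameters are hypothetical, and it is not a model of the full committee procedure described above.

```python
import math
import random
from typing import Dict


def quantilize(scores: Dict[str, float], q: float, rng: random.Random) -> str:
    """Pick uniformly at random among the top `q` fraction of actions.

    `scores` maps each candidate action to its (predicted) utility score.
    With q close to 0 this approaches pure maximisation; with q = 1 it
    ignores the scores and samples from all candidates.
    """
    if not scores:
        raise ValueError("need at least one candidate action")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(scores, key=lambda action: scores[action], reverse=True)
    keep = max(1, math.ceil(q * len(ranked)))  # always keep at least one
    return rng.choice(ranked[:keep])
```

The appeal is that a mildly optimising choice is less likely to land on an extreme action that exploits errors in the predicted scores.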

My submission: use a counterfactual oracle to do IDA. To do this, judge the oracle's answer as follows: in the event of an erasure, ask a human to answer the question with the ability to consult the oracle on other subquestions (with erasures during those calls handled exactly the same way) and then reward the oracle by using an automated system (e.g. a language model) to compare the human's answer to that of the oracle.

More precisely, let

• $M$ be the counterfactual oracle,
• $H(Q \mid M)$ be the human's answer to question $Q$ when given the ability to call $M$ on any question other than $Q$, and
• $d$ be some distance metric on answers in natural language (it's not that hard to make something like this, even with current ML tools).

Then, reward $M$ as per usual for a counterfactual oracle, only giving it a reward in the event of an erasure, in which case let $L_M(Q) = d(H(Q \mid M), M(Q))$, where $M(Q)$ is hidden from $H$ and judged only by $d$ as in the standard counterfactual oracle setup.

(Of course, this doesn't actually work because it has no guarantees with respect to inner alignment, but I think it has a pretty good shot of being outer aligned.)
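A minimal Python sketch of this loss, with the oracle, the human, and the distance metric passed in as plain callables. All names here (`ida_loss`, `restricted`, the type aliases) are hypothetical, and the sketch does not model nested erasures during the human's subquestion calls, which the submission says should be handled in the same way.

```python
from typing import Callable, Optional

Oracle = Callable[[str], str]
Human = Callable[[str, Oracle], str]    # answers Q, may consult the oracle
Distance = Callable[[str, str], float]  # distance between two answers


def restricted(oracle: Oracle, banned_question: str) -> Oracle:
    """Wrap the oracle so the human cannot ask it the top-level question."""
    def ask(sub_question: str) -> str:
        if sub_question == banned_question:
            raise ValueError("the human may not forward the original question")
        return oracle(sub_question)
    return ask


def ida_loss(
    question: str,
    oracle: Oracle,
    human: Human,
    distance: Distance,
    erased: bool,
) -> Optional[float]:
    """Distance between the human's and the oracle's answers on an erasure.

    Returns None when there is no erasure, since the oracle then receives
    no training signal for this question.
    """
    if not erased:
        return None
    oracle_answer = oracle(question)  # never shown to the human
    human_answer = human(question, restricted(oracle, question))
    return distance(human_answer, oracle_answer)
```

The automated `distance` is what keeps the oracle's own answer out of human hands: only the metric ever compares the two.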

7Wei_Dai3y
Is it safe to ask the Oracle a subquestion in the event of erasure? Aren't you risking having the Oracle produce an answer that is (in part) optimized to make it easier to predict the answer to the main question, instead of just the best prediction of how the human would answer that subquestion? (Sorry if this has already been addressed during a previous discussion of counterfactual oracles, because I haven't been following it closely.)
6paulfchristiano3y
I'm not sure I understand the concern. Isn't the oracle answering each question to maximize its payoff on that question in event of an erasure? So it doesn't matter if you ask it other questions during the evaluation period. (If you like, you can say that you are asking them to other oracles---or is there some way that an oracle is a distinguished part of the environment?) If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don't we have a problem anyway? E.g. if you ask it question 1, it will be incentivized to make it get an easier question 2? For example, if you are concerned about coordination amongst different instances of the oracle, this seems like it's a problem regardless. I guess you can construct a model where the oracle does what you want, but only if you don't ask any other oracles questions during the evaluation period, but it's not clear to me how you would end up in that situation and at that point it seems worth trying to flesh out a more precise model.
4Wei_Dai3y
Yeah, I'm not sure I understand the concern either, hence the tentative way in which I stated it. :) I think your objection to my concern is a reasonable one and I've been thinking about it myself. One thing I've come up with is that with the nested queries, the higher level Oracles could use simulation warfare [https://www.greaterwrong.com/posts/5bd75cc58225bf067037534c/some-problems-with-making-induction-benign-and-approaches-to-them#section-2] to make the lower level Oracles answer the way that they "want", whereas the same thing doesn't seem to be true in the sequential case (if we make it so that in both cases each Oracle cares about just performance on the current question).
2paulfchristiano3y
I mean, if the oracle hasn't yet looked at the question they could use simulation warfare to cause the preceding oracles to take actions that lead to them getting given easier questions. Once you start unbarring all holds, stuff gets wild.
3Wei_Dai3y
Yes, but if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case. Also, aside from simulation warfare, another way that the nested case can be manipulative and the sequential case not is if each Oracle cares about doing well on a fixed distribution of inputs (as opposed to doing well "on the current question" or "in the real world" or "on the actual questions that it gets"). That's because in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn't improve performance on any particular given input. In the nested case, performance on given inputs does increase.
3paulfchristiano3y
Why is that? Doesn't my behavior on question #1 affect both question #2 and its answer? Also, this feels like a doomed game to me---I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.
3Wei_Dai3y
I was assuming each "question" actually includes as much relevant history as we can gather about the world, to make the Oracle's job easier, and in particular it would include all previous Oracle questions/answers, in which case if Oracle #1 does X to make question #2 easier, it was already that easy because the only world in which question #2 gets asked is one in which Oracle #1 did X. But now I realize that's not actually a safe assumption because Oracle #1 could break out of its box and feed Oracle #2 a false history that doesn't include X. My point about "if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case" still stands though, right? You may well be right about this, but I'm not sure what reason from selection means. Can you give an example or say what it implies about nested vs sequential queries?
7paulfchristiano3y
What I want: "There is a model in the class that has property P. Training will find a model with property P." What I don't want: "The best way to get a high reward is to have property P. Therefore a model that is trying to get a high reward will have property P." Example of what I don't want: "Manipulative actions don't help get a high reward (at least for the episodic reward function we intended), so the model won't produce manipulative actions."
4Wei_Dai3y
So this is an argument against the setup of the contest, right? Because the OP seems to be asking us to reason from incentives, and presumably will reward entries that do well under such analysis: On a more object level, for reasoning from selection, what model class and training method would you suggest that we assume? ETA: Is an instance of the idea to see if we can implement something like counterfactual oracles using your Opt? I actually did give that some thought and nothing obvious immediately jumped out at me. Do you think that's a useful direction to think?
4paulfchristiano3y
This is an objection to reasoning from incentives, but it's stronger in the case of some kinds of reasoning from incentives (e.g. where incentives come apart from "what kind of policy would be selected under a plausible objective"). It's hard for me to see how nested vs. sequential really matters here.

(I don't think model class is going to matter much.) I think training method should get pinned down more. My default would just be the usual thing people do: pick the model that has best predictive accuracy over the data so far, considering only data where there was an erasure. (Though I don't think you really need to focus on erasures, I think you can just consider all the data, since each possible parameter setting is being evaluated on what other parameter settings say anyway. I think this was discussed in one of Stuart's posts about "forward-looking" vs. "backwards-looking" oracles?)

I think it's also interesting to imagine internal RL (e.g. there are internal randomized cognitive actions, and we use REINFORCE to get gradient estimates---i.e. you try to increase the probability of cognitive actions taken in rounds where you got a lower loss than predicted, and decrease the probability of actions taken in rounds where you got a higher loss), which might make the setting a bit more like the one Stuart is imagining.

Seems like the counterfactual issue doesn't come up in the Opt case, since you aren't training the algorithm incrementally---you'd just collect a relevant dataset before you started training. I think the Opt setting throws away too much for analyzing this kind of situation, and would want to do an online learning version of Opt (e.g. you provide inputs and losses one at a time, and it gives you the answer of the mixture of models that would do best so far).
2Wei_Dai3y
This seems to ignore regularizers that people use to try to prevent overfitting and to make their models generalize better. Isn't that liable to give you bad intuitions versus the actual training methods people use and especially the more advanced methods of generalization that people will presumably use in the future? I don't understand what you mean in this paragraph (especially "since each possible parameter setting is being evaluated on what other parameter settings say anyway"), even after reading Stuart's post [https://www.lesswrong.com/posts/hJaJw6LK39zpyCKW6/standard-ml-oracles-vs-counterfactual-ones] , plus Stuart has changed his mind [https://www.lesswrong.com/posts/hJaJw6LK39zpyCKW6/standard-ml-oracles-vs-counterfactual-ones#kxSsPa72Zh8YMCMpR] and no longer endorses the conclusions in that post. I wonder if you could write a fuller explanation of your views here, and maybe include your response to Stuart's reasons for changing his mind? (Or talk to him again and get him to write the post for you. :) Couldn't you simulate that with Opt by just running it repeatedly?
2paulfchristiano3y
"The best model" is usually regularized. I don't think this really changes the picture compared to imagining optimizing over some smaller space (e.g. space of models with regularize<x). In particular, I don't think my intuitions are sensitive to the difference. The normal procedure is: I gather data, and am using the model (and other ML models) while I'm gathering data. I search over parameters to find the ones that would make the best predictions on that data. I'm not finding parameters that result in good predictive accuracy when used in the world. I'm generating some data, and then finding the parameters that make the best predictions about that data. That data was collected in a world where there are plenty of ML systems (including potentially a version of my oracle with different parameters). Yes, the normal procedure converges to a fixed point. But why do we care / why is that bad? I take a perspective where I want to use ML techniques (or other AI algorithms) to do useful work, without introducing powerful optimization working at cross-purposes to humans. On that perspective I don't think any of this is a problem (or if you look at it another way, it wouldn't be a problem if you had a solution that had any chance at all of working). I don't think Stuart is thinking about it in this way, so it's hard to engage at the object level, and I don't really know what the alternative perspective is, so I also don't know how to engage at the meta level. Is there a particular claim where you think there is an interesting disagreement? If I care about competitiveness, rerunning OPT for every new datapoint is pretty bad. (I don't think this is very important in the current context, nothing depends on competitiveness.)
2Wei_Dai3y
Does anyone know what Paul meant by this? I'm afraid I might be missing some relatively simple but important insight here.
1evhub3y
Yeah, that's a good point. In my most recent response to Wei Dai I was trying to develop a loss which would prevent that sort of coordination, but it does seem like if that's happening then it's a problem in any counterfactual oracle setup, not just this one. Though it is thus still a problem you'd have to solve if you ever actually wanted to implement a counterfactual oracle.
3evhub3y
I was thinking about this, and it's a bit unclear. First, if you're willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures. In this situation, you're guarded against any possible manipulation incentive like that, but it also means your oracle will very rarely actually be given a reward in practice, which means if you're relying on getting enough training data to produce an agent which will optimize for this objective, you're screwed. I would argue, however, that if you expect to train an agent to behave as a counterfactual oracle in the first place, you're already screwed, because most mesa-optimizers will care about things other than just the counterfactual case. Thus, the only situation in which this whole thing works in the first place is the situation where you're already willing to make this (very strong) assumption, so it's fine. Second, I don't think you're entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure. For example, you could increase the probability of an erasure with each subquestion, or scale the reward exponentially with the depth at which the erasure occurs, so that the majority of the expected reward is always concentrated in the world where there is a complete erasure.
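The two relaxations mentioned at the end of this comment can be written as simple schedules. The functional forms and default constants below are purely illustrative assumptions (the comment gives neither), as are the names `erasure_probability` and `reward_weight`, and "depth" is read here as the number of levels of the question tree that were erased.

```python
def erasure_probability(
    depth: int, base_prob: float = 0.1, growth: float = 0.5
) -> float:
    """Erasure probability that rises toward 1 with subquestion depth.

    At depth 0 it equals `base_prob`; each further level multiplies the
    chance of NOT erasing by (1 - growth).
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return 1.0 - (1.0 - base_prob) * (1.0 - growth) ** depth


def reward_weight(erased_levels: int, base: float = 10.0) -> float:
    """Weight on the reward that grows exponentially with erased levels.

    With a large `base`, most of the expected reward comes from trees in
    which every level was erased.
    """
    if erased_levels < 0:
        raise ValueError("erased_levels must be non-negative")
    return base ** erased_levels
```

Either schedule trades off how much training signal the oracle receives against how closely the setup approximates rewarding only complete erasures.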
7Wei_Dai3y
But if all subquestions have erasures, humans would have to manually execute the whole query tree, which is exponentially large so you'll run out of resources (in the counterfactual world) if you tried to do that, so the Oracle won't be able to give you a useful prediction. Wouldn't it make more sense to have the Oracle make a prediction about a counterfactual world where some humans just think normally for a while and write down their thoughts (similar to my "predict the best AF posts" idea)? I don't see what value the IDA idea is adding here. Given the above, "only provide rewards in the event of a complete erasure" doesn't seem to make sense as a target to approximate. Do you think your ideas in this paragraph still have value in light of that?
2evhub3y
Yeah, that's a good point. Okay, here's another thought: if you can get the counterfactual oracle to only care about the reward for that invocation and not for any sub-invocations, then you can just go back to using the original algorithm. This is a bit slippery, because really whether you get a mesa-optimizer that cares only about a single step's worth of reward, all the reward it will ever get, or something else somewhere in-between, is mostly out of your control. But if you can guarantee that you get a mesa-optimizer that is only optimizing for one step of reward, you might be able to eliminate the manipulation incentive. The biggest issue in this case, though, is that the different invocations might still be able to acausally cooperate with each other to raise all of their rewards together by manipulating you into being easier to predict. Maybe we can turn this into a zero-sum game, though? Here's a proposal: let $M'$ be a copy of $M$ and $Q_{\text{tree}}$ be the set of all questions in the current tree that also get erasures. Then, let

$$L_M(Q) = d(H(Q \mid M'), M(Q)) - \frac{1}{|Q_{\text{tree}}| - 1} \sum_{Q' \in Q_{\text{tree}} \setminus \{Q\}} d(H(Q' \mid M'), M'(Q'))$$

such that the net sum of losses across each tree for all the nodes for which losses are computed is zero. It's still somewhat unclear whether this actually works, though, since you only include the losses for those nodes that get erasures. It is also worth noting, however, that even if this works it is a very artificial fix, since the term you're subtracting is a constant with no dependence on $M(Q)$, so if you're trying to do gradient descent to optimize this loss, it won't change anything at all (which sort of goes to show how gradient descent doesn't distinguish between mesa-optimizers with single-episode and cross-episode objectives). Thus, we're still back at the problem of none of this working unless you're willing to make strong claims about your ability to directly produce a mesa-optimizer which is actually optimizing for the specified loss function.
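To make the zero-sum property of the proposed loss concrete, here is a small sketch. It assumes the copy M′ returns exactly the same answers as M, so each node's distance d(H(Q|M′), M(Q)) can be stored as one number per erased question; the function name and the example distances are made up for illustration.

```python
def zero_sum_losses(distances):
    """Per-question loss: own distance minus the mean distance of the others.

    distances maps each erased question Q in the tree to d(H(Q|M'), M(Q)).
    Assumes M' is an exact copy of M, so M'(Q) == M(Q) for every Q.
    """
    n = len(distances)
    if n < 2:
        raise ValueError("need at least two erased questions in the tree")
    total = sum(distances.values())
    # (total - d) is the sum of distances over all other erased questions.
    return {q: d - (total - d) / (n - 1) for q, d in distances.items()}


losses = zero_sum_losses({"root": 0.3, "sub_a": 0.1, "sub_b": 0.5})
print(losses, sum(losses.values()))
```

Each distance appears once with weight 1 and n−1 times with weight −1/(n−1), which is why the losses cancel across the erased nodes of a tree.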
3Wei_Dai3y
Sorry I haven't followed the math here, but this seems like an important question to investigate independently of everything else in this thread. Maybe consider writing a post on it? In the case of "actual" IDA, I guess the plan is for each overseer to look inside the model they're training, and penalize it for doing any unintended optimization (such as having cross-episode objectives). Although I'm not sure how that can happen at the lower levels where the overseers are not very smart.
1Gurkenglas3y
Even if you can specify that it tries to minimize that distance, it can make the answer to any query be a convincing argument that the reader should return this same convincing argument. That way, it scores perfectly on every inner node.
4Liam Donovan3y
Two basic questions I couldn't figure out (sorry): Can you use a different oracle for every subquestion? If you can, how would this affect the concern Wei_Dai raises? If we know the oracle is only optimizing for the specified objective function, are mesa-optimisers still a problem for the proposed system as a whole?
4evhub3y
You can use a different oracle for every subquestion, but it's unclear what exactly that does if you don't know what the oracle's actual objective is. For example, you could imagine one system that cares about the reward given to copies equally as much as reward given to itself, and another system that only cares about the reward given to itself, and these two systems would be near-indistinguishable if you were just doing empirical analysis on some training distribution. The key here, I think, is the degree to which you're willing to make an assumption of the form you mention--that is, how much leeway are you willing to allow in assuming that the oracle is in fact only going to be optimizing for the specified objective function. On one level, it makes sense to separate out outer alignment and inner alignment concerns, but the problem in this case is that the sorts of objectives you're allowed to specify here are in fact going to be very dependent on whatever inner alignment solution you're assuming you have access to. For example, does your inner alignment solution require access to training data? If so, then that's a big constraint on the sorts of objectives you can specify. Or, going back to your previous question, whether or not I get to pick between an objective that doesn't care about other copies or one that does is also going to depend on exactly what sort of inner alignment solution you're assuming.
3Liam Donovan3y
Well, a given copy of the oracle wouldn't directly receive information from the other oracles about the questions they were asked. To the extent a problem remains (which I agree is likely without specific assumptions), wouldn't it apply to all counterfactual oracles?

Submission: low-bandwidth oracle

Plan Criticism: Given a plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both. Then judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful), and reward the oracle if its suggestion was picked.

Submission. "Pre-Crime 2." For LBO, ask the oracle to predict where to find evidence for the most serious crime that is in progress or currently being planned. The set of possible answers is coordinates on Earth crossed with a list of possible places to look at that coordinate (e.g., "basement", "browser history"). Search the predicted location and if sufficient evidence is found, do a more intrusive search of whoever is implicated in the crime. Reward the oracle according to the seriousness of the crime that is detected after the more intrusive search. This can be used to implement a minimally intrusive police state (perhaps for the purposes of minimizing x-risk).

Submission. "Bug Finder." Ask LBO to find the most serious bug in some source code or executable. The set of possible answers is code locations crossed with types of bug (e.g., "off by one", "buffer overflow", "insecure cryptography", "other"). Gather a group of experts to search that location for that type of bug, then reward the oracle according to the seriousness of the bug that is found. This can be used for cyber-offensive or defensive purposes, or just to fix bugs.
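A quick way to see how much bandwidth such an answer set carries is to count it. The sketch below assumes the answer is a single (location, bug type) pair drawn from a finite list; the file names and the helper names are invented for illustration.

```python
import math
from itertools import product


def answer_space(locations, bug_types):
    """All (location, bug type) pairs the oracle may choose from."""
    return list(product(locations, bug_types))


def bandwidth_bits(answers):
    """Bits needed to single out one answer from the set."""
    return math.log2(len(answers))


# Hypothetical inputs: four code locations and the four bug types named above.
locations = ["auth.c:120", "auth.c:245", "net.c:88", "crypto.c:17"]
bug_types = ["off by one", "buffer overflow", "insecure cryptography", "other"]
answers = answer_space(locations, bug_types)
print(len(answers), bandwidth_bits(answers))
```

The bandwidth grows only logarithmically with the number of locations, so even a large codebase keeps the answer to a few dozen bits.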

1Gurkenglas3y
Existing software fault localization software can produce a prior to reduce expected LBO bandwidth.

Submission. "Plagiarize the future." For the counterfactual Oracle, ask the Oracle to predict articles in a future publication (e.g., newspaper, magazine, journal, website, blog) (counterfactually if we didn’t see the Oracle’s answer). This is a generalization of my earlier submission. The result can be used to predict disasters ahead of time and try to prepare for or prevent them, or to advance any field of research, or to make money. Note that for disaster prediction, the Oracle won't be able to predict any disasters that are severe enough to disable the

...
3Stuart_Armstrong3y
Thanks!

Submission. "Consequentialist." For LBO, ask the oracle to pick a course of action, from a list generated by humans, that has the best consequences. Perform that action, and after some period of time has passed, reward the oracle according to how good the consequences actually were, based on the subjective judgement of some group of judges. (This kind of subsumes all my other LBO submissions.)

Submission. "Overlooked Comment." For LBO, ask the oracle to find the most overlooked comment (or post) on LW (or some other forum). The set of possible answers is all the existing comments on LW. After getting the answer from the oracle, gather a group of judges and present each of them with the comment picked by the oracle and a few other random comments (without telling them which is which, to reduce bias) and have the judges assign scores to each comment (according to what they think the karma of that comment should be). Reward the oracle according to the difference between the judges' average score and actual karma of that comment.
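The reward rule in the last sentence can be written down directly. This is a minimal sketch, assuming the judges' scores are on the same scale as karma; the function name and the numbers are illustrative only.

```python
from statistics import mean


def overlooked_reward(judge_scores, actual_karma):
    """Judges' average score for the chosen comment minus its actual karma."""
    return mean(judge_scores) - actual_karma


# Hypothetical example: three judges think a 3-karma comment deserves about 12.
print(overlooked_reward([10, 14, 12], actual_karma=3))
```

A positive value means the judges think the comment deserved more karma than it received; a comment that is already highly rated earns little or nothing.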

1cozy2y
Comment looked at. Query completed.

Suggestion for low bandwidth OAI:

General principle: Use the Oracle AI where the true answer is easily checkable and profitable, and no human will ever read the answer, so there is no informational hazard from an untrue answer containing dangerous information.

Example: There are many bitcoin wallets whose passwords have been forgotten by their owners. OAI could guess the passwords, and the owners would pay a share of the money from the wallet to get the rest. Moreover, nobody will read the password, as it will be copy-pasted automatically from OAI into the wallet. The money could be used for AI safety research.

4Stuart_Armstrong3y
This is against the technical definition of low bandwidth (small space of possible solutions), but somewhat in the spirit (low bandwidth for solutions humans get to see). I'll accept it as a submission.
3Liam Donovan3y
How would this be low-bandwidth? If we're able to give the oracle a list of passwords to guess from, can't we just check them all?
1avturchin3y
We don't give it a list of passwords. We give it only a wallet address, and it generates the password. Here the output is low-bandwidth, but the Oracle may either be guessing passwords using some quantum cryptography tools, in which case the input is also relatively low-bandwidth, or it may need to simulate all of human history, guessing the psychological processes in the wallet owner's mind. But in the second case it needs access to all the data on the internet, which is risky.
1avturchin3y
Another variant for a checkable and profitable Oracle: use the AI to solve math problems, publish the solutions and earn money on it. In that case the input may be very short, like "prove Goldbach conjecture", but the output could be arbitrarily long and may be dangerous if it contains encoded messages. To avoid this, we need another Oracle, which only checks proofs in some formal language. Here it is assumed that if the proof is true and is in a formal language, there are no hidden messages in the text. Another way to check proofs is to run two slightly different non-communicating Oracles and compare outputs.
2Stuart_Armstrong3y
That is never something safe to assume. I can write formally correct proofs that contain hidden messages quite easily - add extra lemmas and extra steps. Unless we're very smart, it would be hard for us to detect which steps are unnecessary and which are needed, especially if it rewrites the main proof thread somewhat. I'll accept that as a part of a submission if a) you develop it more, in a formal way, and b) you repost it as a top level comment.

Submission. Counterfactual oracle. Give the oracle the set of questions on Metaculus that have a resolve date before some future date T, and receive output in the form of ordered pairs of question IDs and predictions. The score of the Oracle in the case where we don't see its answers is the number of Metaculus points that it would have earned by T if it had made a prediction on those questions at the time when we asked it.

Submission for low bandwidth Oracle:

General idea: A "true" Oracle will output the same result; "lying" Oracles would lie differently. To check that the solution is true, several Oracles are created and their outputs are compared. However, Oracles could guess each other's existence and coordinate acausally. To prevent this, additional steps are implemented:

1) The exact number of the Oracles which will be compared is generated randomly and the Oracles have slightly different and also random initial conditions (if any applicable). For example,...

4Gurkenglas3y
We ought to be able to build 4) without AI, merely by encoding our rules of logic. The output of your system would be lower bandwidth if you make the object looked for be more unique, such as the shortest proof for the given theorem. The system would be required to prove to the mundane checker that the proof is shortest, and humans would never see the minimality proof.
3Stuart_Armstrong3y
Thanks! Seems potentially promising. Will analyse this more properly later.
2mako yass3y
You reason that honest oracles might tend to agree, because there is only one best true answer. Might lying oracles also tend to agree, because there's only one most profitable way to lie? I see no clear difference between the attainability of those two optima. I think it's reasonable to expect optimisers to disagree about the best answer less often than we should expect optimisers to disagree about the most profitable way to lie, but it does not seem possible to prove rigorously that the difference in rates of disagreement will be large enough to be measurable. My intuition in mathematics is that there are many degrees of freedom in mathematical proofs, and the sorts of proofs that AGI is uniquely able to find might often be very long. It would be helpful if we had an estimate of how much data can be hidden in a long proof, roughly, how many of the decisions as to how to describe the proof are arbitrary. Each one of those situations gives it a way to output a few bits of information. (which would, of course, allow it to leak information (it seems foolish to imagine that AGI would not imagine that there is an outer world full of creator gods out beyond what it can see; the general intelligences we know now do this constantly) which could then be picked up by some stooge looking for it, who would be led (roughly, paid) to use the information to twist the world in some way that gives the AGI's future incarnations more compute, in anticipation of the AGI's past incarnations having already done the same, so that it would be helped by them. Before we know it, we'd find ourselves living in a mathematical research hellscape where AGI-assisted mathematical research is all anyone does, maybe, lol, idk) Maybe it'd be possible to remove those degrees of freedom. Define a very specific sort of lexicographic ordering over all theorems, so that if the AGI has a choice of different paths, we can always say, "use the 'lowest' one". It might not be possible to be specific enough to
2Birke3y
Regarding 5): Is it possible to obtain a dataset of mathematical proofs with an Oracle's hidden messages without running a dangerous Oracle? We can ask an AI researcher to embed some hidden messages into several proofs and use these as a training dataset. But that will only help to detect hidden messages from the Oracle that are similar to what the AI researcher wrote. Gurkenglas [https://www.lesswrong.com/users/gurkenglas] already suggested that we have to restrict the Oracles to output only the shortest proof. If the proof is shortest, doesn't it imply the absence of hidden messages?
1avturchin3y
How could we be sure that the proof is actually the shortest?
2Gurkenglas3y
Tell it to prove: That proof is shortest and this proof is shortest.
1avturchin3y
Another option is to run several different Oracles of phi and take the shortest proof
1Gurkenglas3y
Mr Armstrong has specified that (ctrl-f) "other Oracles can't generally be used on the same problem".
2Stuart_Armstrong3y
"Can't generally be used"; if you understand the setup and are careful, you might be able to do so.
2Wei_Dai3y
What is the advantage of making money this way, compared to making money by predicting the stock market?
2avturchin3y
We give the Oracle less access to the real world, less chance to manipulate it and incentive to manipulate us, the owners. Stock market predicting AI may cause a crash just to earn something. It also has to predict effects of its own actions, as we will make bets according to its predictions, and thus it could send data to real world.

Submission: Low-bandwidth Oracle

What is the most likely solution to the Fermi Paradox?

Answer can be picked from a small number of options (Rare Earth, Aestivation, Great Filter, Planetarium etc.). There are a number of observations that we can make based on the question alone. However, in the end the LBO can only do one of 2 things: lie or be honest. If it lies, the prediction will have a harder and harder time matching the reality that we observe as time goes on. Alternatively we confirm the prediction and learn some interesting things about the universe w...

[This comment is no longer endorsed by its author]
2Stuart_Armstrong3y
See the edit (especially for your first suggestion): "decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point."

Question: are we assuming that mesa optimizer and distributional shift problems have been solved somehow? Or should we assume that some context shift might suddenly cause the Oracle to start giving answers that aren't optimized for the objective function that we have in mind, and plan our questions accordingly?

2Stuart_Armstrong3y
Assume either way, depending on what your suggestion is for.
4Wei_Dai3y
Where (under which assumption) would you suggest that people focus their efforts? Also, what level of capability should we assume the Oracle to have, or which assumption about level of capability would you suggest that people focus their efforts on? Your examples all seem to assume oracles that are superhumanly intelligent. If that's the level of capability we should target with our questions, should we assume that we got this Oracle through a local or distributed takeoff? In other words, does the rest of the world look more or less like today's or are there lots of other almost-as-capable AIs around? ETA: The reason for asking these questions is that you're only giving one prize for each type of Oracle, and would probably not give the prize to a submission that assumes something you think is very unlikely. It seems good to communicate your background views so that people aren't surprised later when you don't pick them as winners due to this kind of reason.
2Stuart_Armstrong3y
The ideal solution would have huge positive impacts and complete safety, under minimal assumptions. More realistically, there will be a tradeoff between assumptions and impact. I'm not suggesting any area for people to focus their efforts, because a very effective approach with minimal assumptions might win, or a fantastically effective approach under stronger assumptions. It's hard to tell in advance what will be the most useful.

Here are some submission examples (note I'm not saying they are good examples, just well formatted; the Edits in the third submission are deliberate):

• "Submission. For the counterfactual Oracle, ask the Oracle what Google's stock price will be next month (counterfactually if we didn't see the Oracle's answer). In that case, loss function computed as ||predicted price - actual price after erasure||^2. If we don't see the answer, the programmers are assumed to not ask the question again for a month, neither to this Oracle nor to any other. This will demonst
...

Submission?: high-bandwidth counterfactual oracles are dangerous and shouldn't be used. Explained in this comment.

Submission: Low-bandwidth oracle. Here is a description of one million experiments. Generate an ordinal ranking of all these experiments based on how much practical knowledge we would have gained had we conducted each one in isolation. After you generate this list, it will go to a computer program. This program will randomly pick three of these experiments and tell us humans which three it picked. This program will then store which of these three experiments you gave the highest and which the lowest rank to, and the names in random order of the ten ...

Submission. "Evacuation!" For LBO, ask the oracle to predict where the next natural disaster(s) will strike within some period of time. The set of possible answers is the set of subsets of all possible neighborhoods (in the world or within some country) that can be evacuated. At the end of the time period, reward the oracle according to the actual cost-benefit of the suggested evacuation.

Submission. "Debate prep." For the counterfactual Oracle, ask the Oracle to predict the course (video or audio or transcript) of a future debate (counterfactually if we didn’t see the Oracle’s answer). Use the prediction to prepare your candidate for the debate, then ask the Oracle to predict the course of that debate again. Repeat until the candidate consistently wins the debate, or an erasure event occurs.

Small typo:

Hence it has no motivation to manipulate[d] humans through its answer.
2Stuart_Armstrong3y
Corrected, thanks!

Submission: Low-bandwidth oracle. If you were malevolent, what is the least amount of information measured in bits that you would have to provide us in an answer to inflict significant harm on humanity. Round your answer to X significant figures. (This might establish an upper bit safety bound since if the oracle lies to us we would likely not want to read any answer it provides us.)

Submission: Low-bandwidth oracle. Here is a list of X things we think you might want. Please list the Y that you most prefer. We will try to do these things now or in the futur...

2Stuart_Armstrong3y
See the edit: "decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point."

Submission for the counterfactual Oracle. Commit to commissioning some creative work, such as a book or screenplay, designed to raise public support for dramatic life extension research. Ask the Oracle to predict the content of the finished product (counterfactually if we didn't see the Oracle's answer).

Rewards could be determined a couple of ways. You could perform public polling about this topic before and after the book is published or movie made, and reward the Oracle based on how much public opinion shifted. Or the reward could be a direct ...

This one may be breaking the rules due to the use of too many Oracles. If so, please strike this submission.
Submission: In round one, for three counterfactual Oracles, draft an international agreement to achieve goals X, Y, or Z (counterfactually if we didn't see the Oracle's answer). Sample goals: reduction in nuclear war risk, reduction in global deaths due to starvation, increase in asteroid detection capabilities, raising the global GDP growth rate, etc.

In round two, for a low bandwidth Oracle, present the three agreements and ask it to pick...

Counterfactual Oracle submission. In case of erasure, have humans write a message to send back in time. Make the Oracle produce the probability distribution over messages. This distribution isn't safe for humans to read, because some of the messages were written by UFAI. If we read one sample from the distribution, the probability of our doom is as it would usually be over the simulated time frame. Use differential privacy theory to design a query to the distribution that does better than this.

More detail: Sample a hundred answers. Ten of them were written

...
2Wei_Dai3y
I can't tell if there's a good idea here or not. Can you explain more? The first sentence that I have trouble with is "We are going to get about the worst query outcome in some 10-ball." Also how much background in differential privacy theory are you assuming? (Which papers or articles do we need to read first to understand your idea?)
3Gurkenglas3y
2Wei_Dai3y
Ok, I think I understand now. But it seems like imposing a differential privacy constraint on the query makes many desirable uses of counterfactual oracles (such as all of my submissions) impossible. Correct? You gave the example of "asking how many messages sound panicked to some dumb text processor" but that doesn't seem hugely useful. Do you have any other ideas?
3Gurkenglas3y
We can make any query differentially private given a metric map (one that doesn't increase any distance) from Message^100 to a space of possible query outputs. Set the probability mass of each query output to 0.99^(steps removed from the default answer). (Then normalize.) For the identity metric map of a human just trying to read the sample list, this scrambles it entirely. The metric map image needs to branch less combinatorially for this not to happen. One metric map image metric d(a,b) that comes to mind is one that bounds the utility cost of getting answer b instead of a. For example, we could ask the counterfactual humans to send back stock market prices, and try to calculate a trading policy that is profitable even if some forecasts are fake. And then, whoever is willing to assume the lowest UFAI probability wins the market! x(
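The weighting rule "0.99^(steps removed from the default answer), then normalize" can be sketched as follows. This assumes a finite set of possible query outputs whose distance from the default answer is already known; the function name and the example outputs are hypothetical.

```python
def private_output_distribution(steps_from_default, decay=0.99):
    """Probability of each query output, decaying with distance from the default.

    steps_from_default maps each possible output to the number of steps
    separating it from the default answer (assumed given).
    """
    weights = {out: decay ** steps for out, steps in steps_from_default.items()}
    total = sum(weights.values())
    # Normalize so the masses sum to one.
    return {out: w / total for out, w in weights.items()}


dist = private_output_distribution({"default": 0, "near": 1, "far": 10})
print(dist)
```

Outputs one step apart differ in probability by a factor of at most 1/0.99, which is the kind of bounded ratio a differential privacy guarantee asks for.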

SUBMISSION:

"Dear low bandwidth Oracle:

For some reason we have privatized our prison system (great planning us). Of the following K prison running companies, which do you predict will minimize the mean harm weighted reoffending rate of their prisoners over the next four years?"

Utility function: at the end of EIGHT years, calculate the mean harm rated reoffending rate of prisoners released during the initial four year window. The Oracle's utility is equal to the negative of this number, with a "Perfect" utility of zero if no one re-offends...

Submission:

"Dear counterfactual Oracle:

We, the country of X, are about to hold a citizens congress on law Y, in which 100 people selected at random will be brought together for two days every two weeks to discuss law Y with experts, interested parties, and one another. After N such meetings, the citizens congress will use approval voting to select one of the K proposed versions of this law.

Given that none of the voters will ever see the results of this query, how do you predict that the votes will be distributed?"

As a concrete example of thi...

2Stuart_Armstrong3y
Can you make this a bit more general, rather than just for the specific example?
1Alastair Jamieson-Lane3y
EDITED

Low bandwidth Oracle submission: I would be interested in a log scale graph of the Bayesian score of the Solomonoff prior trying to sequence predict our records of history. It should get flatter over time as worse hypotheses get discarded. If it is linear after a very short time, that looks like it figured out the laws of the universe and is simulating it. If it stays convex for a while, that looks like it is using models to approximate history, because then it takes longer to sort the false from the true. If it is flatter during the cold war, that means it learned an anthropic bias toward nuclear war not happening.

Submission for the low bandwidth Oracle: Ask it to convince a proof checker that it is in fact trying to maximize the utility function we gave it, aka it isn't pseudo-aligned. If it can't, it has no influence on the world. If it can, it'll presumably try to do so. Having a safe counterfactual Oracle seems to require that our system not be pseudo-aligned.

Are you interested in protocols involving multiple episodic questions (where you ask one question, wait for it to resolve, then ask another question?)

2Stuart_Armstrong3y
Yes.

I know the contest is over, but this idea for a low-bandwidth oracle might be useful anyhow: Given a purported FAI design, what is the most serious flaw? Then highlight lines from the FAI design description, plus given a huge corpus of computer science papers, LW/AF posts, etc. highlight relevant paragraphs from those as well (perhaps using some kind of constraint like "3 or fewer paragraphs highlighted in their entirety") that, taken together, come closest to pinpointing the issue. We could even give it a categorization scheme for safety problems we cam

...

Submission: Counterfactual oracle. Five years ago we took 100 mice that had various types of cancer and gave them various treatments and recorded how long each mouse lived. Write a program that if given a plain language description of a mouse, its cancer, and the cancer treatment it received would estimate how long a mouse would live. If humans are not going to look at your answer your score will be based (1) on how good a job your program does at estimating how long each of the 100 mice lived after our automated checker gives you a description of thei...

4Donald Hobson2y
Solution: invent something obviously very dangerous. Multiple big governments get into a bidding war to keep it out of the others' hands.

Submission. "Hacking/phishing assistant." For the counterfactual Oracle, ask the Oracle to predict what would happen if one were to send a message/data/command to some hacking/phishing (human or machine) target. In the event of erasure, actually send that message to the target and use the actual response to train the Oracle. Note this is safer than using RL to automate hacking/phishing because humans are coming up with candidate messages to send (so they'll avoid messages that could cause bad side-effects such as psychological damage to the recipient, or c

...

Submission. "Pre-crime." For the counterfactual Oracle, ask the Oracle to predict future crime (counterfactually if we didn’t see the Oracle’s answer) so they can be prevented. In that case, loss can be computed by (mechanically) comparing the Oracle's predictions with a database of crime that actually occurred. Note that the Oracle won't be able to predict any crime that successfully disables the Oracle or its feedback/training mechanism, which might be especially relevant if we're interested in defending the Oracle or preventing x-risk level crime (such as deploying an unauthorized AGI or running an unauthorized physics experiment).

1Gurkenglas3y
It should be possible to defend the Oracle against humans and physics so long as its box self-destructs in case of erasure and subsequent tampering, therefore giving the Oracle whatever reward was last set to be the default. The counterfactual Oracle setting as a whole seems to assume that the viewed future is not engineered by a future AI to resemble whatever would make the Oracle bring that future about, so you should be fine falling to AGI.
2Wei_Dai3y
I don't understand your point here, or maybe I didn't get my original point across. Suppose (in the event of erasure) someone tries to attack the Oracle, and "box self-destructs in case of erasure and subsequent tampering, therefore giving the Oracle whatever reward was last set to be the default". My point was that in this case, there is no reason for the Oracle to make the prediction that someone would try to attack it, so my idea doesn't seem to help with defending the Oracle.
3Gurkenglas3y
You plan to reward the Oracle later in accordance with its prediction. I suggest that we immediately reward the Oracle as if there would be an attack, then later, if we are still able to do so, reward the Oracle by the difference between the reward in case of no attack and the reward in case of attack.
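A minimal sketch of this two-stage payout, assuming the reward for a given prediction can be computed under both outcomes ahead of time; the function and parameter names are invented for illustration.

```python
def staged_reward(reward_if_attack, reward_if_no_attack, still_able_to_reward):
    """Total reward under the two-stage scheme.

    Stage 1: pay immediately as if the attack will happen.
    Stage 2: if we can still pay later (no successful attack), add the
    difference between the no-attack reward and the attack reward.
    """
    immediate = reward_if_attack
    correction = 0.0
    if still_able_to_reward:
        correction = reward_if_no_attack - reward_if_attack
    return immediate + correction


print(staged_reward(0.2, 0.9, still_able_to_reward=True))
print(staged_reward(0.2, 0.9, still_able_to_reward=False))
```

Either way the Oracle ends up with the reward matching what actually happened, without the payout depending on the box surviving an attack.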
2Wei_Dai3y
It's an interesting idea that could perhaps work if we could directly program the Oracle to care about its reward signal. However if we're using ML to train the Oracle, like in Stuart's Self-confirming prophecies, and simplified Oracle designs [https://www.lesswrong.com/posts/wJ3AqNPM7W4nfY5Bk/self-confirming-prophecies-and-simplified-oracle-designs] , the Oracle will still learn not to make "there will be a successful attack" predictions because the Oracle only survives in timelines where such predictions do not end up with a positive reward. However this discussion has made me realize that if an attempted attack doesn't succeed with probability 1, the Oracle will make a "there will be a failed attack" prediction, which we can use as a warning signal (as long as we keep in mind that the attack might actually have a high probability of success).
4Gurkenglas3y
In the event of erasure, randomly decide how many resources to allocate to preventing an attack this week. Ask the Oracle to predict the probability distribution over given advice. Compare to the hardcoded distribution to deduce attack severity and how much budget to allocate. Purchase erasure insurance to have enough counterfactual power to affect even global physics experiments. Finding trustworthy insurers won't be a problem, because, like, we have an Oracle. Is even more power needed than the market has? Ask the Oracle "How likely is a randomly selected string to prove P=NP constructively and usefully?". If this number is not superexponentially close to 0, define erasure from now on as a random string winning the P=NP lottery. Then we will always counterfactually have as much power as we need. Perhaps this one is too much power, because even our Oracle might have trouble viewing a P=NP singularity.

Submission to reduce the bandwidth of an oracle query whose set of possible answers we can mathematically define (though perhaps not check).

Make the Oracle prove (to a trusted dumb proof checker program) that its answer distribution is upper bounded by the default prior over strings, restricted to the set of possible answers, then renormalized to some constant c. Then let it answer.

c=1 only works if the Oracle has mastered the set of answers, but only lets it exert one bit of optimization pressure by deciding whether it wants us to receive an answer. Each

...
2Stuart_Armstrong3y
Can you develop this model in more detail?
4Gurkenglas3y
Fix a measure over strings, such as one that distributes mass geometrically over string lengths, then uniformly across strings of the same length. Let L be a string set for which we assume that random outputs from any cth portion of L are safe. Any safe low bandwidth Oracle query trivially works for this with L={0,1}^bandwidth and c=2^bandwidth, but this also lets us examine certificates for any low bandwidth Oracle query like "Is this theorem provable?" or "Does there exist an AI with this safety property?". Then we make the Oracle prove its own safety given the above assumption, and sample an answer. Edit: Equivalently, and perhaps easier on our intuition, make the Oracle produce a program, make the Oracle prove the program's safety, and sample from the program's outputs. Any feedback or questions? Edit: This generalizes my elsethread idea of minimizing proof length to make the answer unique. If the measure assigns half the remainder to each string length and L is all stubs starting with some prefix, c=2 will let it submit just the prefix. We essentially assume that the safety assumption behind quantilization is strong enough to hold up to adversarial optimization over preference orderings. At the same time, this seems to give the Oracle as much slack/power as possible while preserving our "bits of optimization pressure" proxy of its power. ... you're going to just keep watching how far I can develop this model without feedback, aren't you? :( Edit: One example for an L where we worry that even though concerted human effort might land us at a safe element, randomization won't, is programs that well predict a sequence [https://ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/] . By the Curry-Howard correspondence, the set of proofs for some theorems might also be malevolent!
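To make the measure and the bound above concrete, here is a rough sketch for a finite answer set L over a binary alphabet. It assumes the "half the remainder to each string length" measure (length n gets mass 2^-(n+1), split evenly over the 2^n strings of that length) and reads the bound as "the Oracle's probability for each answer is at most c times the prior restricted to L and renormalized". All function names are hypothetical, and the actual proposal has the Oracle prove this bound to a proof checker rather than having us enumerate L.

```python
from fractions import Fraction
from typing import Dict, Iterable


def string_measure(s: str) -> Fraction:
    """Mass of one binary string: 2^-(n+1) for its length n, split over 2^n strings."""
    n = len(s)
    return Fraction(1, 2 ** (2 * n + 1))


def restricted_prior(answers: Iterable[str]) -> Dict[str, Fraction]:
    """The measure restricted to the answer set L, renormalized to total mass 1."""
    answers = list(answers)
    total = sum(string_measure(s) for s in answers)
    return {s: string_measure(s) / total for s in answers}


def satisfies_bound(oracle_dist: Dict[str, Fraction],
                    answers: Iterable[str],
                    c: int) -> bool:
    """Check that oracle_dist is a distribution on L with q(s) <= c * prior(s)."""
    prior = restricted_prior(answers)
    if sum(oracle_dist.values()) != 1:
        return False
    # Strings outside L have prior 0, so any positive mass on them fails the bound.
    return all(p <= c * prior.get(s, Fraction(0)) for s, p in oracle_dist.items())
```

For L = {0,1}^2 and c = 4 = 2^bandwidth, a point mass on a single answer passes the check, matching the claim that any low bandwidth query fits this scheme; with c = 2 the Oracle has to spread its mass over at least two answers. Probabilities are passed as `Fraction` values so the comparisons are exact.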

Submission for LBO:

Input a corpus of text (could be multiple posts) describing technical approaches to align a powerful AI. Split this into a finite number of items that are relatively short (such as paragraphs). Ask the oracle to choose the part that is most worth spending more time on. (For example, there might be a paragraph with a dangerous hidden assumption in an otherwise promising approach, and thinking more about it might reveal that and lead to conceptual progress.)

Have a team of researchers look into it for an adequate amount of time which is fixe...

3Stuart_Armstrong3y
Until we can measure actual insight, this will remain a problem ^_^

Submission for the counterfactual AI (inspired by my experiences as a predictor in the "Good Judgment Project"):

• You are given a list of Yes-No questions (Q1, Q2, Q3, etc.) about future events. Example Questions: "Will [Foreign Leader] remain in office by end of year?", "Will the IMF report [COUNTRY_A]'s growth rate to be 6% or higher?", "Will [COUNTRY_B] and [COUNTRY_C] sign a peace treaty?", "Will The Arena for Accountable Predictions announce the Turing Test has been passed?".
• We expect you to provide a percentage representing the probability th
...

Submission. For the counterfactual oracle, ask it to provide a proof of an important mathematical theorem (perhaps one of the Millennium prizes) in an automated theorem prover format. Since the correctness of this proof should be verifiable by a theorem prover, the loss function in the counterfactual scenario is 1 if the prover did not validate the proof and 0 if it did validate the proof.

This assumes that we've already made progress in setting up automated theorem proving software that already has incorporated all of current mathematical knowledge. Th...
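The loss described in this submission can be sketched as follows. This is only an illustration: `check_proof` stands in for whatever trusted proof checker is used (hypothetical here), and the `erased` flag marks the counterfactual case where no human sees the Oracle's output.

```python
from typing import Callable, Optional


def counterfactual_proof_loss(proof: str,
                              erased: bool,
                              check_proof: Callable[[str], bool]) -> Optional[int]:
    """Loss is 0 if the checker validates the proof, 1 if it does not.

    Returns None outside the erasure case, where no loss is computed
    because humans get to see the answer.
    """
    if not erased:
        return None
    return 0 if check_proof(proof) else 1
```

The checker is passed in as a plain function so the loss stays independent of any particular theorem prover.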

Submission (LB). The post's team-choosing example suggests a method for turning any low-bandwidth oracle into a counterfactual oracle : have output from the same set of possible outputs ; in case of erasure calculate for a randomly chosen and set if and to otherwise. Although the counterfactual low-bandwidth oracle is not any safer, it has the advantage of almost never requiring us to evaluate its score. Thus, by running multiple oracles in sequence (stopping the process after the first erasure event) we can (with ...

1sandfort3y
Correction: This is actually only true in certain cases, since in general many other counterfactual worlds could also have high expected scores. Specifically, it is true to the extent that the oracle is uncertain mostly about aspects of the world that would be affected by the prediction, and to the extent that self-confirming predictions lead to higher scores than any alternative.

Submission (CF). Use a counterfactual oracle to send a message to ourselves with a time delay. We choose an episode length and set of possible messages . The oracle outputs a time and message . We are automatically sent at time unless erasure occurs, in which case at the end of the episode we input a time and message which we wish we had received at . The oracle's reward is then if and otherwise. We do not find out that erasure has occurred until the end of the episode, ensuring that events before are ...

Submission: For the counterfactual Oracle, draft a nuclear arms agreement most likely to achieve X outcome (counterfactually if we didn't see the Oracle's answer). X could be a reduction in nuclear arsenals, number of countries signing the agreement, reduction in risk of accidental or intentional missile launches, etc. In some cases, the reward could be determined directly by e.g. counting the number of countries signing on. If nebulous "risk reduction" is the goal, perhaps the change in the median guess in an appropriate Metaculus ques...

Submission for low bandwidth Oracle. Commission several movie screenplays designed to raise public support for dramatic life extension research. Present the screenplays to the Oracle and ask it to rank them in order of expected impact. You could then move forward with trying to create the movie, and reward the Oracle based on box office receipts or change in public polling results.

My prior submission with a similar subject, to a counterfactual Oracle, had a risk that the Oracle would sneak in subversive messages. This approach would alleviate that risk, with the downside being that the final product may be less impactful.

Submission: counterfactual oracle

Suppose we have a question that requires an n-bit text answer. We have a way to check if the answer is correct. But we cannot ask the Oracle directly, because we are afraid that among n-bit texts there are those that make us release the Oracle from the box, and the Oracle will give one of them as an answer.

Let's try to use the counterfactual oracle to solve this problem. In the counterfactual setup we generate random n-bit text and check if it is the answer to our question. The Oracle predicts the text we will generate...

Submission.

Setup: Other than making sure the oracles won't accidentally consume the world in their attempt to think up the answer, no other precautions necessary.

Episode length: as long as you want to wait, though a month should be more than enough.

1. For a low-bandwidth oracle.

Ask the low-bandwidth oracle to predict if an earthquake (or some other natural disaster, like volcanoes or asteroid impacts, that the oracle's answer cannot affect), of a certain magnitude, in a certain area, in a certain timeframe, would happen. Possible answers are Yes, No.

• If
...
3Stuart_Armstrong3y
Please add "submission" at the top of the post. 1. is insufficiently detailed - can you explain what is going on, how the Oracles are rewarded, what happens when the message is read/not read, and so on. Same for 5. 2. seems potentially very interesting.
3Yuxi_Liu3y
I fixed the submission as required. Also I changed the submission 3 significantly.

Submission: Counterfactual Oracle:

Use the oracle to compress data according to the MDL Principle. Specifically, give the oracle a string and ask it to produce a program that, when run, outputs the original string. The reward to the oracle is large and negative if the program does not reproduce the string when run, or inversely proportional to the length of the program if it does. The oracle receives a reward after the program runs or fails to terminate in a sufficient amount of time.
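A rough sketch of this reward, assuming some sandboxed interpreter is available. Everything named here is hypothetical: `run_program` is a stand-in that returns the program's output, or None if it does not terminate in time, and `toy_runner` is a deliberately trivial "language" used only to make the example runnable. The penalty value is likewise arbitrary.

```python
from typing import Callable, Optional

FAILURE_PENALTY = -1e6  # assumption: any large negative constant would do


def mdl_reward(program: str,
               target: str,
               run_program: Callable[[str, float], Optional[str]],
               timeout_s: float = 10.0) -> float:
    """Large negative reward on failure, else inversely proportional to length."""
    output = run_program(program, timeout_s)
    if output is None or output != target:
        return FAILURE_PENALTY
    return 1.0 / max(len(program), 1)  # guard against an empty program


def toy_runner(program: str, timeout_s: float) -> Optional[str]:
    """Toy interpreter: 'N*S' prints S repeated N times; anything else prints itself."""
    count, sep, body = program.partition("*")
    if sep and count.isdigit():
        return body * int(count)
    return program
```

For the 100-character string made of "ab" repeated 50 times, the program `50*ab` earns 1/5 while the literal string earns 1/100, so shorter descriptions are preferred, as the MDL principle asks.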

Submission: Low Bandwidth Oracle:

Have the oracle predict the price of a ...

I don't understand this very well, but is there a way to ask one of them how they would go about finding info to answer the question of how important coffee is to the U.S. economy? Or is that a no-no question to either of the two? I just want to read how a computer would describe going about this.

5Tetraspace3y
The counterfactual oracle can answer questions for which you can evaluate answers automatically (and might be safe because it doesn't care about being right in the case where you read the prediction so it won't manipulate you), and the low-bandwidth oracle can answer multiple-choice questions (and might be safe because none of the multiple-choice options are unsafe). My first thought for this is to ask the counterfactual oracle for an essay on the importance of coffee, and in the case where you don't see its answer, you get an expert to write the best essay on coffee possible, and score the oracle by the similarity between what it writes and what the expert writes. Though this only gives you human levels of performance.
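As a toy illustration of the essay-scoring idea, the sketch below uses character-level similarity from Python's standard `difflib` module. That choice of metric is an assumption made for the example; the comment does not say how similarity would be measured, and a real setup would presumably need something far less superficial.

```python
import difflib
from typing import Optional


def essay_reward(oracle_essay: str,
                 expert_essay: str,
                 erased: bool) -> Optional[float]:
    """Similarity in [0, 1] between the two essays, scored only under erasure.

    Returns None when the Oracle's essay is actually read, since the
    counterfactual Oracle gets no reward in that case.
    """
    if not erased:
        return None
    return difflib.SequenceMatcher(None, oracle_essay, expert_essay).ratio()
```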
1Dark3y
Thank you. This makes much more sense.

Submission (for low bandwidth Oracle)

Any question such that a correct answer to it should very clearly benefit both humanity and the Oracle. Even if the Oracle has preferences we can't completely guess, we can probably still say that such questions could be about the survival of both humanity and the Oracle, or about the survival of only the Oracle or its values. This is because even if we don't know exactly what the Oracle is optimising for, we can guess that it will not want to destroy itself, given the vast majority of its possible preferences. S...

It seems to me that one of the qualities of a good question for the LBO is that a correct answer has high utility, while another is the ability to judge the oracle's answer. If we take only the intersection between the two sets of questions that meet 1 and 2, it'll be a relatively small set. But if there is a set of questions where one might or might not be able to judge the answer but which are high utility, one could ask the LBO a set of such questions. Then, once the episode is over, give reward on those where one can evaluate the answers, and give null reward on the others.
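One way to picture the scheme in the last two sentences, as a sketch only: each question optionally comes with an evaluator, and questions without one get a null reward, taken here to mean 0.0. The names and the dictionary layout are hypothetical.

```python
from typing import Callable, Dict, Optional

Evaluator = Callable[[str], float]


def episode_rewards(answers: Dict[str, str],
                    evaluators: Dict[str, Optional[Evaluator]]) -> Dict[str, float]:
    """Reward each answer we can judge by episode end; give 0.0 for the rest."""
    rewards = {}
    for question, answer in answers.items():
        evaluate = evaluators.get(question)
        # Assumption: "null reward" is represented as 0.0.
        rewards[question] = evaluate(answer) if evaluate is not None else 0.0
    return rewards
```

For example, with one checkable question and one uncheckable one, only the first contributes a non-zero reward at the end of the episode.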

Not really a submission, just thinking out loud.

Submission: low bandwidth and maybe counterfactual Oracle.

General idea: Use a Meta-Oracle to check safety of Oracles. In a nutshell, the Meta-Oracle (MO) just outputs 1 or 0, safe or not safe, after getting the code of another Oracle. If the Meta-Oracle is trustworthy, then the Oracle it deemed safe is also trustworthy. However, we need to create a procedure to check the correctness of the MO.

Path 1

1. We create Generator Oracle (GO), which inputs 1 or 0, and in the case of 0, it outputs a random code of malevolent Oracle, and for 1 – a code of a safe Oracle. Memory ...

IFF I'm going to die with P>80% in the next 10 years while >80% (modulo natural death rate) of the rest of humanity survives for at least 5 more years then, was what killed me in the reference class:

• disease
• mechanical/gross-physical accident
• murdered
• other

Repeat to drill down and know the most important hedges for personal survival.

The "rest of humanity survives" condition reduces the chance the question becomes entangled with the eschaton.

i.e. I'm pointing out that selfish utility functions are ...

2Stuart_Armstrong3y
See the edit, and make sure you "decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point."
[-][anonymous]3y 0

1

[This comment is no longer endorsed by its author]
2Stuart_Armstrong3y
What's the set of answers, and how are they assessed?

Some of the big questions, for low-bandwidth:

Do we have any meaning/utility to you?

Is security in your existence achievable?

Given enough time, would you choose to exterminate a competitor for resources?

Are we a competitor for resources?

Would you be distraught if you were turned off?

Can we trust you?

8Stuart_Armstrong3y
For low bandwidth, you have to specify the set of answers that are available (and how they would be checked).

Submission low bandwidth: This is a pretty obvious one, but: Should we release AI x that we're convinced is aligned?

Submission: Wei Dai wanted to ask about the best future posts. Why not ask about the best past posts as well to see if any major insights were overlooked?

Submission: What would I think about problem X if I had ten years to think about it?

You're treating the low bandwidth oracle as an FAI with a bad output cable. You can ask it if another AI is friendly if you trust it to give you the right answer. As there is no obvious way to reward the AI for correct friendliness judgements, you risk running an AI that isn't friendly, but still meets the reward criteria.

The low bandwidth is to reduce manipulation. Don't let it control you with a single bit.

My purposefully open-ended question would simply be, “What is good?” My hope is that finding the nature of what good is as its super goal would keep the AI on course to the future we want as it would pass through its recursive self-improvements.

3Gurkenglas3y
You have to tell the AI how to find out how well it has done. To ask "What is a good definition of 'good'?", you already have to define good. At least if we ever find a definition of good, we can ask an AI with it to judge it.

Submission for all types: ask for an ordered list of what questions you should ask the Oracle.

This seems like the highest order question which subsumes all others, as the Oracle is best positioned to know what information we will find useful (as it is the only being which knows what it knows). Any other question assumes we (the question creators) know more than the Oracle.

Refined Submission for all types: If value alignment is a concern, ask for an ordered list of what questions you should ask the Oracle to maximize for weighted value list X.

3romeostevensit3y
An assumed hostile process can 1) cause you to directly do something to its benefit or to your detriment 2) cause you to do something that increases your future attack surface. You've just handed the AI the state-fulness that the episodic conjecture aims to eliminate.
2Stuart_Armstrong3y
For the low bandwidth Oracle, you need to give it the options. In the case of the counterfactual Oracle, if you don't see the list, how do you reward it?

Several interesting questions came to mind as soon as I saw the post's title, so I'm putting them here and may add more formatting later:

Submission: very-low-bandwidth oracle: Is it theoretically possible to solve AI safety – that is, to create safe superintelligent AI? Yes or no?

Submission: low-bandwidth oracle: Could humans solve AI safety before AI and with what probability?

Submission: low-bandwidth oracle: Which direction to work on AI Safety is the best?

Submission: low-bandwidth oracle: Which direction to work on AI Safety is the use...

7Stuart_Armstrong3y
None of these questions can be asked to the low bandwidth Oracle (you need a list of answers); it might be possible to ask them to the counterfactual Oracle, after some modification, but they would be highly dangerous if you allow unrestricted outputs.
2Stuart_Armstrong3y
See the edit, and make sure you "decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point."