Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually if we didn’t see the Oracle’s answer). In that case, the reward function is computed as the similarity between the predicted posts and the actual top posts on AF as ranked by karma, with similarity computed using some ML model.
This seems to have the potential to significantly accelerate AI safety research while being safe, since it's just showing us posts similar to what we would have written ourselves. If the ML model for measuring similarity isn't secure, the Oracle might produce output that attacks the ML model, in which case we might need to fall back to some simpler way to measure similarity.
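As a rough illustration of what a "simpler way to measure similarity" could look like, here is a minimal sketch that scores predictions by word-set (Jaccard) overlap. The choice of Jaccard overlap, the function names, and the best-match averaging are all assumptions made for illustration; the submission does not specify any of them.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two texts."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def reward(predicted: list[str], actual_top: list[str]) -> float:
    """Average, over the actual top posts, of the best match among predictions.

    Hypothetical scoring rule: only computed in the erasure case, when
    nobody has read the Oracle's predicted posts.
    """
    if not predicted or not actual_top:
        return 0.0
    best_matches = [max(jaccard(p, a) for p in predicted) for a in actual_top]
    return sum(best_matches) / len(actual_top)
```

A measure this crude is easy to audit, which is the point of the fallback, but it would reward surface word overlap rather than shared ideas.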
I encourage you to submit other ideas anyway, since your ideas are good.
Not sure yet about how all these things relate; will maybe think of that more later.
Some assorted thoughts that might be useful for thinking about questions and answers:
Submission for a counterfactual oracle: precommit that, if the oracle stays silent, a week from now you'll try to write the most useful message to your past self, based on what happens in the world during that week. Ask the oracle to predict that message. This is similar to existing solutions, but slightly more meta, because the content of the message is up to your future self - it could be lottery numbers, science papers, disaster locations, or anything else that fits within the oracle's size limit. (If there's no size limit, just send the whole internet.)
You could also form a bucket brigade to relay messages from further ahead, but that's a bad idea. If the oracle's continued silence eventually leads to an unfriendly AI, it can manipulate the past by hijacking your chain of messages and thus make itself much more likely. The same is true for all high-bandwidth counterfactual oracles - they aren't unfriendly in themselves, but using them creates a thicket of "retrocausal" links that can be exploited by any potential future UFAI. The more UFAI risk grows, the less you should use oracles.
Thinking about this some more, all high-bandwidth oracles (counterfactual or not) risk receiving messages crafted by future UFAI to take over the present. If the ranges of oracles overlap in time, such messages can colonize their way backwards from decades ahead. It's especially bad if humanity's FAI project depends on oracles - that increases the chance of UFAI in the world where oracles are silent, which is where the predictions come from.
One possible precaution is to use only short-range oracles, and never use an oracle while still in prediction range of any other oracle. But that has drawbacks: 1) it requires worldwide coordination, 2) it only protects the past. The safety of the present depends on whether you'll follow the precaution in the future. And people will be tempted to bend it, use longer or overlapping ranges to get more power.
In short, if humanity starts using high-bandwidth oracles, that will likely increase the chance of UFAI and hasten it. So such oracles are dangerous and shouldn't be used. Sorry, Stuart :-)
Submission. “Superintelligent Agents.” For the Counterfactual Oracle, ask the Oracle to predict what action(s) a committee of humans would recommend doing next (which may include submitting more queries to the Oracle), then perform that action(s).
The committee, by appropriate choice of recommendations, can implement various kinds of superintelligent agents. For example, by recommending the query "What would happen if the next action is X?" (in the event of erasure, actually do X and record or have the committee write up a description of the consequences as training data) (ETA: It may be better to have the committee assign a numerical score, i.e., utility, to the consequences instead.) a number of times for different X, followed by the query "What would the committee recommend doing next, if it knew that the predicted consequences for the candidate actions are as follows: ..." (in the event of erasure, let physical committee members read the output of the relevant previous queries and then decide what to do), it would in effect implement a kind of quantilizer. If IDA can be implemented using Counterfactual Oracles (as evhub suggested), then the committee can choose to do that as well.
My submission: use a counterfactual oracle to do IDA. To do this, judge the oracle's answer as follows: in the event of an erasure, ask a human to answer the question with the ability to consult the oracle on other subquestions (with erasures during those calls handled exactly the same way) and then reward the oracle by using an automated system (e.g. a language model) to compare the human's answer to that of the oracle.
More precisely, let
Then, reward M as per usual for a counterfactual oracle, only giving it a reward in the event of an erasure, in which case let
where M(Q) is hidden from H and judged only by d as in the standard counterfactual oracle setup.
(Of course, this doesn't actually work because it has no guarantees wrt inner alignment, but I think it has a pretty good shot of being outer aligned.)
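A minimal sketch of one episode of this scheme, under stated assumptions: `oracle`, `human`, and `distance` are hypothetical callables standing in for M, H, and d, and `erasure_prob` is an invented parameter. For brevity the sketch does not model erasures inside the human's sub-calls, which the submission says should be handled the same way.

```python
import random


def run_episode(question, oracle, human, distance, erasure_prob=0.01, rng=random):
    """Run one counterfactual-oracle episode.

    Returns (answer_shown_to_us, reward). The reward is None unless an
    erasure occurred, because the oracle is only scored on erasures.
    """
    answer = oracle(question)
    if rng.random() < erasure_prob:
        # Erasure: the oracle's answer is hidden from the human, who answers
        # the question themselves and may consult the oracle on subquestions.
        human_answer = human(question, oracle)
        # An automated comparison (d) is the only thing that sees the answer.
        return None, -distance(human_answer, answer)
    return answer, None
```

The reward is the negated distance, so an oracle that predicts the human's assisted answer exactly gets the maximum reward of zero.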
Submission: low-bandwidth oracle
Plan Criticism: Given a plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both. Then judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful) and reward the oracle if its suggestion was picked.
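The comparison step could be sketched as follows. This is only an illustration: `judge_more_useful` is a hypothetical stand-in for the human judgment made after period X, and the 1/0 reward values are an assumption, since the submission only says the oracle is rewarded if its suggestion is picked.

```python
import random


def plan_criticism_reward(lines, oracle_pick, judge_more_useful, rng=random):
    """Reward 1.0 if the oracle's line of thought beats a random baseline.

    `lines` is the human-written list of candidate lines of thought;
    `judge_more_useful(a, b)` returns whichever of the two was judged
    more useful after both were explored.
    """
    others = [line for line in lines if line != oracle_pick]
    baseline = rng.choice(others)  # the randomly chosen comparison line
    winner = judge_more_useful(oracle_pick, baseline)
    return 1.0 if winner == oracle_pick else 0.0
```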
Submission. "Pre-Crime 2." For LBO, ask the oracle to predict where to find evidence for the most serious crime that is in progress or currently being planned. The set of possible answers is coordinates on Earth crossed with a list of possible places to look at that coordinate (e.g., "basement", "browser history"). Search the predicted location and if sufficient evidence is found, do a more intrusive search of whoever is implicated in the crime. Reward the oracle according to the seriousness of the crime that is detected after the more intrusive search. This can be used to implement a minimally intrusive police state (perhaps for the purposes of minimizing x-risk).
Submission. "Bug Finder." Ask LBO to find the most serious bug in some source code or executable. The set of possible answers is code locations crossed with types of bug (e.g., "off by one", "buffer overflow", "insecure cryptography", "other"). Gather a group of experts to search that location for that type of bug, then reward the oracle according to the seriousness of the bug that is found. This can be used for cyber-offensive or defensive purposes, or just to fix bugs.
Submission. "Plagiarize the future." For the counterfactual Oracle, ask the Oracle to predict articles in a future publication (e.g., newspaper, magazine, journal, website, blog) (counterfactually if we didn’t see the Oracle’s answer). This is a generalization of my earlier submission. The result can be used to predict disasters ahead of time and try to prepare for or prevent them, or to advance any field of research, or to make money. Note that for disaster prediction, the Oracle won't be able to predict any disasters that are severe enough to disable the
Added to the open archive of rationalist/EA cash-prize contests.
Submission. "Consequentialist." For LBO, ask the oracle to pick a course of action, from a list generated by humans, that has the best consequences. Perform that action, and after some period of time has passed, reward the oracle according to how good the consequences actually were, based on the subjective judgement of some group of judges. (This kind of subsumes all my other LBO submissions.)
Submission. "Overlooked Comment." For LBO, ask the oracle to find the most overlooked comment (or post) on LW (or some other forum). The set of possible answers is all the existing comments on LW. After getting the answer from the oracle, gather a group of judges and present each of them with the comment picked by the oracle and a few other random comments (without telling them which is which, to reduce bias) and have the judges assign scores to each comment (according to what they think the karma of that comment should be). Reward the oracle according to the difference between the judges' average score and actual karma of that comment.
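A minimal sketch of the blind judging step and the reward described above. The function names, the number of decoy comments `k`, and the use of a plain arithmetic mean are assumptions for illustration.

```python
import random
from statistics import mean


def judging_slate(picked, others, k, rng=random):
    """Mix the oracle's pick with k random other comments, in random order,
    so judges cannot tell which comment the oracle chose."""
    decoys = rng.sample([c for c in others if c != picked], k)
    slate = decoys + [picked]
    rng.shuffle(slate)
    return slate


def overlooked_reward(judge_scores, actual_karma):
    """Judges' average score for the picked comment minus its actual karma."""
    return mean(judge_scores) - actual_karma
```

For example, a comment with 5 karma that judges score at 10 and 20 yields a reward of 10, while a comment already rated at its judged value yields 0.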
Suggestion for low bandwidth OAI:
General principle: Use the Oracle AI where the true answer is easily checkable and profitable, and no human will ever read the answer, so there is no informational hazard from an untrue answer containing dangerous information.
Example: There are many bitcoin wallets whose passwords have been forgotten by their owners. The OAI could guess the passwords, and the owners would pay a share of the money from the wallet to get the rest. Moreover, nobody will read the password, as it will be copy-pasted automatically from the OAI into the wallet. The money could be used for AI safety research.
Submission. Counterfactual oracle. Give the oracle the set of questions on Metaculus that have a resolve date before some future date T, and receive output in the form of ordered pairs of question IDs and predictions. The score of the Oracle in the case where we don't see its answers is the number of Metaculus points that it would have earned by T if it had made a prediction on those questions at the time when we asked it.
Submission for low bandwidth Oracle:
General idea: A "true" Oracle will always output the same result, while "lying" Oracles would lie differently. To check that the solution is true, several Oracles are created and their outputs are compared. However, the Oracles could guess each other's existence and coordinate acausally. To prevent this, additional steps are implemented:
1) The exact number of the Oracles which will be compared is generated randomly and the Oracles have slightly different and also random initial conditions (if any applicable). For example,...
Submission: Low-bandwidth Oracle
What is the most likely solution to the Fermi Paradox?
Answer can be picked from a small number of options (Rare Earth, Aestivation, Great Filter, Planetarium etc.). There are a number of observations that we can make based on the question alone. However, in the end the LBO can only do one of 2 things: lie or be honest. If it lies, the prediction will have a harder and harder time matching the reality that we observe as time goes on. Alternatively we confirm the prediction and learn some interesting things about the universe w...
Question: are we assuming that mesa optimizer and distributional shift problems have been solved somehow? Or should we assume that some context shift might suddenly cause the Oracle to start giving answers that aren't optimized for the objective function that we have in mind, and plan our questions accordingly?
Here are some submission examples (note I'm not saying they are good examples, just well formatted; the Edits in the third submission are deliberate):
Submission?: high-bandwidth counterfactual oracles are dangerous and shouldn't be used. Explained in this comment.
Submission: Low-bandwidth oracle. Here is a description of one million experiments. Generate an ordinal ranking of all these experiments based on how much practical knowledge we would have gained had we conducted each one in isolation. After you generate this list, it will go to a computer program. This program will randomly pick three of these experiments and tell us humans which three it picked. This program will then store which of these three experiments you gave the highest and which the lowest rank to, and the names in random order of the ten ...
Submission. "Evacuation!" For LBO, ask the oracle to predict where the next natural disaster(s) will strike within some period of time. The set of possible answers is the set of subsets of all possible neighborhoods (in the world or within some country) that can be evacuated. At the end of the time period, reward the oracle according to the actual cost-benefit of the suggested evacuation.
Submission. "Debate prep." For the counterfactual Oracle, ask the Oracle to predict the course (video or audio or transcript) of a future debate (counterfactually if we didn’t see the Oracle’s answer). Use the prediction to prepare your candidate for the debate, then ask the Oracle to predict the course of that debate again. Repeat until the candidate consistently wins the debate, or an erasure event occurs.
Hence it has no motivation to manipulate humans through its answer.
Submission: Low-bandwidth oracle. If you were malevolent, what is the least amount of information measured in bits that you would have to provide us in an answer to inflict significant harm on humanity. Round your answer to X significant figures. (This might establish an upper bit safety bound since if the oracle lies to us we would likely not want to read any answer it provides us.)
Submission: Low-bandwidth oracle. Here is a list of X things we think you might want. Please list the Y that you most prefer. We will try to do these things now or in the futur...
Submission for the counterfactual Oracle. Commit to commissioning some creative work, such as a book or screenplay, designed to raise public support for dramatic life extension research. Ask the Oracle to predict the content of the finished product (counterfactually if we didn't see the Oracle's answer).
Rewards could be determined a couple of ways. You could perform public polling about this topic before and after the book is published or movie made, and reward the Oracle based on how much public opinion shifted. Or the reward could be a direct ...
This one may be breaking the rules due to the use of too many Oracles. If so, please strike this submission. Submission: In round one, for three counterfactual Oracles, draft an international agreement to achieve goals X, Y, or Z (counterfactually if we didn't see the Oracle's answer). Sample goals: reduction in nuclear war risk, reduction in global deaths due to starvation, increase in asteroid detection capabilities, raising the global GDP growth rate, etc. In round two, for a low bandwidth Oracle, present the three agreements and ask it to pick...
Counterfactual Oracle submission. In case of erasure, have humans write a message to send back in time. Make the Oracle produce the probability distribution over messages. This distribution isn't safe for humans to read, because some of the messages were written by UFAI. If we read one sample from the distribution, the probability of our doom is as it would usually be over the simulated time frame. Use differential privacy theory to design a query to the distribution that does better than this.
More detail: Sample a hundred answers. Ten of them were written
"Dear low bandwidth Oracle:
For some reason we have privatized our prison system (great planning us). Of the following K prison running companies, which do you predict will minimize the mean harm weighted reoffending rate of their prisoners over the next four years?"
Utility function: at the end of EIGHT years, calculate the mean harm weighted reoffending rate of prisoners released during the initial four year window. The Oracle's utility is equal to the negative of this number, with a "Perfect" utility of zero if no one re-offends...
"Dear counterfactual Oracle:
We, the country of X, are about to hold a citizens congress on law Y, in which 100 people selected at random will be brought together for two days every two weeks to discuss law Y with experts, interested parties, and one another. After N such meetings, the citizens congress will use approval voting to select one of the K proposed versions of this law.
Given that none of the voters will ever see the results of this query, how do you predict that the votes will be distributed?"
As a concrete example of thi...
Low bandwidth Oracle submission: I would be interested in a log scale graph of the Bayesian score of the Solomonoff prior trying to sequence predict our records of history. It should get flatter over time as worse hypotheses get discarded. If it is linear after a very short time, that looks like it figured out the laws of the universe and is simulating it. If it stays convex for a while, that looks like it is using models to approximate history, because then it takes longer to sort the false from the true. If it is flatter during the cold war, that means it learned an anthropic bias toward nuclear war not happening.
Submission for the low bandwidth Oracle: Ask it to convince a proof checker that it is in fact trying to maximize the utility function we gave it, aka it isn't pseudo-aligned. If it can't, it has no influence on the world. If it can, it'll presumably try to do so. Having a safe counterfactual Oracle seems to require that our system not be pseudo-aligned.
Are you interested in protocols involving multiple episodic questions (where you ask one question, wait for it to resolve, then ask another question?)
I know the contest is over, but this idea for a low-bandwidth oracle might be useful anyhow: Given a purported FAI design, what is the most serious flaw? Then highlight lines from the FAI design description, plus given a huge corpus of computer science papers, LW/AF posts, etc. highlight relevant paragraphs from those as well (perhaps using some kind of constraint like "3 or fewer paragraphs highlighted in their entirety") that, taken together, come closest to pinpointing the issue. We could even give it a categorization scheme for safety problems we cam
Submission: Counterfactual oracle. Five years ago we took 100 mice that had various types of cancer and gave them various treatments and recorded how long each mouse lived. Write a program that if given a plain language description of a mouse, its cancer, and the cancer treatment it received would estimate how long a mouse would live. If humans are not going to look at your answer your score will be based (1) on how good a job your program does at estimating how long each of the 100 mice lived after our automated checker gives you a description of thei...
Submission. "Hacking/phishing assistant." For the counterfactual Oracle, ask the Oracle to predict what would happen if one were to send a message/data/command to some hacking/phishing (human or machine) target. In the event of erasure, actually send that message to the target and use the actual response to train the Oracle. Note this is safer than using RL to automate hacking/phishing because humans are coming up with candidate messages to send (so they'll avoid messages that could cause bad side-effects such as psychological damage to the recipient, or c
Submission. "Pre-crime." For the counterfactual Oracle, ask the Oracle to predict future crimes (counterfactually if we didn’t see the Oracle’s answer) so they can be prevented. In that case, loss can be computed by (mechanically) comparing the Oracle's predictions with a database of crimes that actually occurred. Note that the Oracle won't be able to predict any crime that successfully disables the Oracle or its feedback/training mechanism, which might be especially relevant if we're interested in defending the Oracle or preventing x-risk level crime (such as deploying an unauthorized AGI or running an unauthorized physics experiment).
Submission to reduce the bandwidth of an oracle query whose set of possible answers we can mathematically define (though perhaps not check).
Make the Oracle prove (to a trusted dumb proof checker program) that its answer distribution is upper bounded by the default prior over strings, restricted to the set of possible answers, then renormalized to some constant c. Then let it answer.
c=1 only works if the Oracle has mastered the set of answers, but only lets it exert one bit of optimization pressure by deciding whether it wants us to receive an answer. Each
Submission for LBO:
Input a corpus of text (could be multiple posts) describing technical approaches to align a powerful AI. Split this into a finite number of items that are relatively short (such as paragraphs). Ask the oracle to choose the part that is most worth spending more time on. (For example, there might be a paragraph with a dangerous hidden assumption in an otherwise promising approach, and thinking more about it might reveal that and lead to conceptual progress.)
Have a team of researchers look into it for an adequate amount of time which is fixe...
Submission for the counterfactual AI (inspired by my experiences as a predictor in the "Good Judgment Project"):
Submission. For the counterfactual oracle, ask it to provide a proof of an important mathematical theorem (perhaps one of the Millennium Prize Problems) in an automated theorem prover format. Since the correctness of this proof should be verifiable by a theorem prover, the loss function in the counterfactual scenario is 1 if the prover did not validate the proof and 0 if it did.
This assumes that we've already made progress in setting up automated theorem proving software that has already incorporated all of current mathematical knowledge. Th...
Submission (LB). The post's team-choosing example suggests a method for turning any low-bandwidth oracle O into a counterfactual oracle O′: have O′ output o from the same set of possible outputs L; in case of erasure, calculate R(l) for a randomly chosen l ∈ L and set R′(o) = R(l) if o = l and to −∞ otherwise. Although the counterfactual low-bandwidth oracle is not any safer¹, it has the advantage of almost never requiring us to evaluate its score. Thus, by running multiple oracles in sequence (stopping the process after the first erasure event) we can (with ...
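The reward rule for the converted oracle can be written down directly. A minimal sketch, assuming `answers` is the finite set L as a list and `R` is the original low-bandwidth reward function (both names are placeholders):

```python
import math
import random


def counterfactual_reward(o, answers, R, rng=random):
    """Reward R'(o) in the erasure case.

    Pick one l uniformly from the answer set, evaluate only R(l), and give
    the oracle R(l) if its output o matches l, and -infinity otherwise.
    """
    l = rng.choice(answers)
    return R(l) if o == l else -math.inf
```

Only one candidate answer is ever evaluated per erasure, which is why the scheme almost never requires us to compute a score.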
Submission (CF). Use a counterfactual oracle to send a message to ourselves with a time delay. We choose an episode length T and set of possible messages M. The oracle outputs a time t_o < T and message m_o ∈ M. We are automatically sent m_o at time t_o unless erasure occurs, in which case at the end of the episode we input a time t∗ and message m∗ which we wish we had received at t∗. The oracle's reward is then −(t∗ − t_o)² if m_o = m∗ and −∞ otherwise. We do not find out that erasure has occurred until the end of the episode, ensuring that events before t_o are ...
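The reward in the erasure case is a squared timing error gated on an exact message match. A minimal sketch, writing the oracle's output time and message as `t_o` and `m_o` and the values we wish we had received as `t_star` and `m_star` (the function name is a placeholder):

```python
import math


def delayed_message_reward(t_o, m_o, t_star, m_star):
    """Negative squared timing error if the message matches, else -infinity."""
    if m_o != m_star:
        return -math.inf
    return -((t_star - t_o) ** 2)
```

For instance, predicting the right message two time units early or late gives a reward of -4, and an exactly timed correct message gives the maximum reward of 0.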
Submission: For the counterfactual Oracle, draft a nuclear arms agreement most likely to achieve X outcome (counterfactually if we didn't see the Oracle's answer). X could be a reduction in nuclear arsenals, number of countries signing the agreement, reduction in risk of accidental or intentional missile launches, etc. In some cases, the reward could be determined directly by e.g. counting the number of countries signing on. If nebulous "risk reduction" is the goal, perhaps the change in the median guess in an appropriate Metaculus ques...
Submission for low bandwidth Oracle. Commission several movie screenplays designed to raise public support for dramatic life extension research. Present the screenplays to the Oracle and ask it to rank them in order of expected impact. You could then move forward with trying to create the movie, and reward the Oracle based on box office receipts or change in public polling results.
My prior submission with a similar subject, to a counterfactual Oracle, had a risk that the Oracle would sneak in subversive messages. This approach would alleviate that risk, with the downside being that the final product may be less impactful.
Submission: counterfactual oracle
Suppose we have a question that requires n-bit text answer. We have a way to check if the answer is correct. But we cannot ask the Oracle directly, because we are afraid that among n-bit texts there are those that make us release the Oracle from the box, and the Oracle will give one of them as an answer.
Let's try to use the counterfactual oracle to solve this problem. In the counterfactual setup we generate random n-bit text and check if it is the answer to our question. The Oracle predicts the text we will generate...
Setup: Other than making sure the oracles won't accidentally consume the world in their attempt to think up the answer, no other precautions necessary.
Episode length: as long as you want to wait, though a month should be more than enough.
Ask the low-bandwidth oracle to predict if an earthquake (or some other natural disaster, like volcanoes or asteroid impacts, that the oracle's answer cannot affect), of a certain magnitude, in a certain area, in a certain timeframe, would happen. Possible answers are Yes, No.
Submission: Counterfactual Oracle:
Use the oracle to compress data according to the MDL Principle. Specifically, give the oracle a string and ask it to produce a program that, when run, outputs the original string. The reward to the oracle is large and negative if the program does not reproduce the string when run, or inversely proportional to the length of the program if it does. The oracle receives a reward after the program runs or fails to terminate in a sufficient amount of time.
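A minimal sketch of this reward, assuming the oracle's "program" is Python source that prints the string to standard output. The penalty value, the timeout, and the use of character count as program length are illustrative assumptions; a real setup would also need to sandbox the program rather than run it directly.

```python
import subprocess
import sys


def mdl_reward(program: str, target: str, timeout_s: float = 5.0,
               penalty: float = -1e9) -> float:
    """Reward 1/len(program) if the program prints exactly `target`.

    Returns a large negative penalty if the program crashes, prints
    something else, or fails to terminate within `timeout_s` seconds.
    """
    try:
        # NOTE: runs the program unsandboxed; for illustration only.
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return penalty
    if result.returncode != 0 or result.stdout != target:
        return penalty
    return 1.0 / max(len(program), 1)
```

Shorter programs that still reproduce the string earn strictly higher reward, which is the pressure toward compression the submission is after.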
Submission: Low Bandwidth Oracle:
Have the oracle predict the price of a ...
I don't understand this very well, but is there a way to ask one of them how they would go about finding info to answer the question of how important coffee is to the U.S. economy? Or is that a no-no question to either of the two? I just want to read how a computer would describe going about this.
Submission (for low bandwidth Oracle)
Any question such that a correct answer to it should very clearly benefit both humanity and the Oracle. Even if the Oracle has preferences we can't completely guess, we can probably still say that such questions could be about the survival of both humanity and the Oracle, or about the survival of only the Oracle or its values. This is because even if we don't know exactly what the Oracle is optimising for, we can guess that it will not want to destroy itself, given the vast majority of its possible preferences. S...
It seems to me that one of the qualities of a good question for the LBO is that a correct answer has high utility, while another is the ability to judge the oracle's answer. If we take only the intersection between the two sets of questions that meet 1 and 2, it'll be a relatively small set. But if there is a set of questions where one might or might not be able to judge the answer but which are high utility, one could ask the LBO a set of such questions. Then, once the episode is over, give reward on those where one can evaluate the answers, and give null reward on the others.
Not really a submission, just thinking out loud.
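The batching idea above could be sketched like this. Treating "null reward" as 0.0 is an assumption, as are the function names; `evaluate` is a hypothetical judge that returns a score, or None when the answer cannot be judged.

```python
def episode_rewards(answers, evaluate, null_reward=0.0):
    """Score each answered question at the end of the episode.

    `answers` maps question -> oracle answer. Questions whose answers
    cannot be judged (evaluate returns None) receive the null reward.
    """
    rewards = {}
    for question, answer in answers.items():
        score = evaluate(question, answer)
        rewards[question] = null_reward if score is None else score
    return rewards
```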
Submission: low bandwidth and maybe counterfactual Oracle.
General idea: Use a Meta-Oracle to check the safety of Oracles. In a nutshell, the Meta-Oracle (MO) just outputs 1 or 0, safe or not safe, after getting the code of another Oracle. If the Meta-Oracle is trustworthy, then the Oracle it deemed safe is also trustworthy. However, we need to create a procedure to check the correctness of the MO.
1. We create a Generator Oracle (GO), which takes 1 or 0 as input, and in the case of 0 it outputs the code of a random malevolent Oracle, and for 1, the code of a safe Oracle. Memory ...
Submission: low bandwidth oracle, ask:
IFF I'm going to die with P>80% in the next 10 years while >80% (modulo natural death rate) of the rest of humanity survives for at least 5 more years then, was what killed me in the reference class:
Repeat to drill down and know the most important hedges for personal survival.
The "rest of humanity survives" condition reduces the chance the question becomes entangled with the eschaton.
i.e. I'm pointing out that selfish utility functions are ...
Some of the big questions, for low-bandwidth:
Do we have any meaning/utility to you?
Is security in your existence achievable?
Given enough time, would you choose to exterminate a competitor for resources?
Are we a competitor for resources?
Would you be distraught if you were turned off?
Can we trust you?
Submission low bandwidth: This is a pretty obvious one, but: Should we release AI x that we're convinced is aligned?
Submission: Wei Dai wanted to ask about the best future posts. Why not ask about the best past posts as well to see if any major insights were overlooked?
Submission: What would I think about problem X if I had ten years to think about it?
You're treating the low-bandwidth oracle as an FAI with a bad output cable. You can ask it if another AI is friendly if you trust it to give you the right answer. As there is no obvious way to reward the AI for correct friendliness judgements, you risk running an AI that isn't friendly, but still meets the reward criteria.
The low bandwidth is to reduce manipulation. Don't let it control you with a single bit.
My purposefully open-ended question would simply be, “What is good?” My hope is that finding the nature of what good is as its super goal would keep the AI on course to the future we want as it would pass through its recursive self-improvements.
Submission for all types: ask for an ordered list of what questions you should ask the Oracle.
This seems like the highest order question which subsumes all others, as the Oracle is best positioned to know what information we will find useful (as it is the only being which knows what it knows). Any other question assumes we (the question creators) know more than the Oracle.
Refined Submission for all types: If value alignment is a concern, ask for an ordered list of what questions you should ask the Oracle to maximize for weighted value list X.
Several interesting questions came to mind as soon as I saw the post's title, so I'm putting them here, though I may add more formatting later:
Submission: very-low-bandwidth oracle: Is it theoretically possible to solve AI safety – that is, to create safe superintelligent AI? Yes or no?
Submission: low-bandwidth oracle: Could humans solve AI safety before AI and with what probability?
Submission: low-bandwidth oracle: Which direction of work on AI Safety is the best?
Submission: low-bandwidth oracle: Which direction of work on AI Safety is the use...