
posted on 2022-10-24 — also cross-posted on lesswrong, see there for comments

QACI: question-answer counterfactual intervals

edit: see also the QACI table of contents.

PreDCA (not a dependency or component of this proposal, but an inspiration for it) attempts to build a framework in which the AI uses a bunch of math to figure out who its predecessor (the user) is and what their utility function is. it seems hard to predict whether that math would accurately capture a subset of the user's mind whose utility function we'd actually want, so in this post i offer an alternative which i feel has a higher chance of being useful.

just like PreDCA, the question-answer counterfactual intervals (QACI) proposal relies on the AI's ability to ponder counterfactuals about what could happen in the world, possibly using infra-bayesian physicalism. it proceeds as follows:

  1. have the AI's user stand in front of a computer
  2. the AI is hardcoded to first generate a large random text file, and send it to the user's computer — we call this file the "question"
  3. the user opens the text file, ponders what it says for a day, and at the end of that day sends a text file with their answer back to the AI
  4. the AI, which was hardcoded to do literally nothing until it got that answer, starts running the rest of its code which would consist of an inner-aligned system following a particular goal
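
to make those steps concrete, here's a minimal python sketch of that hardcoded prelude; the helper names (send_to_user_computer, wait_for_answer_file, pursue_goal) and the file size are just illustrative stubs, not actual components of the proposal:

```python
import os

QUESTION_BYTES = 10**6  # size of the random "question"; the exact size is arbitrary

def send_to_user_computer(question: bytes) -> None:
    """stub: whatever channel delivers the question file to the user's computer."""
    raise NotImplementedError

def wait_for_answer_file() -> bytes:
    """stub: block, doing literally nothing else, until the user sends back their answer."""
    raise NotImplementedError

def pursue_goal(question: bytes, answer: bytes) -> None:
    """stub: the rest of the AI, an inner-aligned system whose goal can depend on
    counterfactuals over this (question, answer) pair."""
    raise NotImplementedError

def hardcoded_prelude() -> None:
    question = os.urandom(QUESTION_BYTES)  # step 2: generate a large random "question"
    send_to_user_computer(question)        # ...and send it to the user's computer
    answer = wait_for_answer_file()        # steps 3-4: wait (about a day) for the answer
    pursue_goal(question, answer)          # only then run the rest of the code
```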

we can make the AI's goal dependent on "what answer would i have gotten if i'd sent a different question?"; we'll call such instances of counterfactually considering the question-answer interval "queries". this doesn't immediately solve alignment, but just like goal-program bricks, it's a device that we can use to build more complex decision processes which would serve to guide the AI's actions.
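
as a very rough sketch of what "making the goal dependent on a query" could look like: suppose query stands for the AI's estimate of the counterfactual answer (however it chooses to compute that estimate), and the goal simply asks the counterfactual user for a utility function. everything below is illustrative, not a concrete design:

```python
def query(counterfactual_question: bytes) -> bytes:
    """stub: the AI's estimate of 'what answer would i have gotten
    if i'd sent this question instead of the actual random one?'"""
    raise NotImplementedError

def utility_from_answer(answer: bytes):
    """stub: interpret the answer as a utility function over worlds
    (one non-natural-language way to do this is sketched further down)."""
    raise NotImplementedError

def goal(world) -> float:
    # the simplest building block: a single query asking for a utility
    # function, which the rest of the AI then tries to maximize
    answer = query(b"please reply with the utility function you would want "
                   b"a powerful AI to maximize, written as formal code")
    return utility_from_answer(answer)(world)
```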

note that the AI might run a relatively high-detail simulation of what the user would answer, or it could just make rough guesses; if it's properly designed, it should allocate its computational resources to guessing what the user would answer at whatever degree of detail it needs. either way, its choice would be guided by a fundamental intent to satisfy its eventual goal, so it shouldn't manipulate the simulation into giving answers that would make its job easier; it would ultimately strive for however much accuracy it thinks it can afford.

and just like in PreDCA, because we make the AI point to a user who precedes its existence, it can't (barring weird acausal stuff) hack the user to affect what it'd predict the user to say; the user's values are locked in, which is desirable anyway.

here are some ideas as to how to use such queries to hopefully guide the AI's actions towards good worlds:

  - ask the counterfactual user to provide a utility function for the AI to maximize
  - chain queries into a sequence, where each answer is passed along as the next query's question, so that the user can deliberate across many counterfactual intervals (and possibly request that more copies of the query be run) before settling on a utility function or a decision

these ideas don't involve the AI interpreting natural language: the utility function could be written in trivially parseable python math code; decisions, or requests for running multiple new copies, could be asked for using magic strings followed by formal code; and so on. for example, in the case of a sequence of queries, the AI is told to predict what happens when the text file is just passed verbatim from one query to the next, until a particular magic string is detected verbatim at the start of a query.
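
for instance, here's a rough sketch of the sequence-of-queries case, reusing the same illustrative query stub: the text file is passed verbatim from query to query until one starts with a magic string, and whatever follows that string is parsed as python math code rather than natural language. the magic string and the exact answer format are made up for the example:

```python
import ast

FINAL_MAGIC = b"%%FINAL-ANSWER%%"  # made-up magic string; any fixed, rare token would do

def query(text: bytes) -> bytes:
    """stub: the counterfactual answer to sending `text` as the question (as above)."""
    raise NotImplementedError

def run_query_sequence(initial_text: bytes) -> bytes:
    """pass the text file verbatim from one query to the next, until a query's text
    starts with the magic string; return whatever follows that string."""
    text = initial_text
    while not text.startswith(FINAL_MAGIC):
        text = query(text)
    return text[len(FINAL_MAGIC):]

def utility_from_math_code(payload: bytes):
    """treat the payload as a single python expression in a variable `world`,
    parsed as code rather than interpreted as natural language."""
    tree = ast.parse(payload.decode(), mode="eval")
    code = compile(tree, "<final answer>", "eval")
    return lambda world: eval(code, {"__builtins__": {}}, {"world": world})
```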

notice that, because there is no inherent limit on the text file's size, it can start with a #!/bin/bash shebang and be a script that builds a large piece of software, which each query can develop further and use to transmit knowledge to the next query more efficiently, at only very minimal overhead for each of those queries.
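
as a sketch of that layout (the header contents, the marker, and the helper names are all made up for illustration): the file keeps a fixed script at the top which builds the shared software, plus an append-only notes section that each query extends and reads at very little cost:

```python
# hypothetical layout for the text file that gets passed between queries: a fixed
# script header that builds the shared software, followed by an append-only notes
# section that the software displays to each counterfactual user.
TOOLKIT_HEADER = b"""#!/bin/bash
# (made-up contents) fetch and build the shared deliberation software,
# then open the notes section of this very file in it
build_deliberation_software && open_notes "$0"
"""

NOTES_MARKER = b"\n### NOTES ###\n"

def initial_file(first_note: bytes) -> bytes:
    return TOOLKIT_HEADER + NOTES_MARKER + first_note

def append_note(file_text: bytes, note: bytes) -> bytes:
    """what one counterfactual user does before answering: keep everything verbatim
    and append their findings for the next query to read."""
    return file_text + b"\n" + note

def read_notes(file_text: bytes) -> bytes:
    """the minimal per-query overhead: everything after the marker is the notes so far."""
    return file_text.split(NOTES_MARKER, 1)[1]
```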

finally, this proposal should not be too difficult to expand upon.

note that with a proper implementation of embedded agency, such an AI would care about its own internal computations just as much as about what happens in the rest of the world; so if this scheme indeed leads to an aligned AI, then that alignment would also cover taking care of the risks caused by running simulations of queries in such a high level of detail that their inhabitant(s) are moral patients. in fact, thanks to embedded agency, perhaps the whole "let sequences of queries decide how to make decisions" idea could apply naturally to how queries are used to make decisions.

