edit: for a more easily understood explanation, see also a narrative explanation of QACI; for context, see formal alignment and clarifying formal alignment implementation. for some math ideas relating to this plan, see a rough sketch of formal aligned AI using QACI and blob location. see also the state of my alignment research.
PreDCA (not a dependency/component, but an inspiration for this) attempts to build a framework in which the AI tries to determine its predecessor's utility function using a bunch of math, to figure out who the user is and what their utility function is. it seems hard to predict whether the math would accurately capture a subset of the user's mind whose utility function we'd like, so in this post i offer an alternative which i feel has a higher chance of being useful.
just like PreDCA, the question-answer counterfactual intervals (QACI) proposal relies on the ability of an AI to ponder counterfactuals of what can happen in the world, possibly using infra-bayesian physicalism. it proceeds as follows:
we can make the AI's goal dependent on "what answer would i have gotten if i'd sent a different question?" — we'll call "queries" such instances of counterfactually considering the question-answer interval. this doesn't immediately solve alignment, but just like goal-program bricks, it's a device that we can use to build more complex decision processes which would serve to guide the AI's actions.
note that the AI might run a relatively high detail simulation of what the user would answer, or it could just make rough guesses; if it's properly designed, it should allocate its computational resources to guessing what the user would answer to whatever degree of detail it needs. nevertheless, its choice would be guided by a fundamental intent to satisfy its eventual goal, so it shouldn't manipulate the simulation to give answers that would make its job easier — it would ultimately strive for however much accuracy it thinks it can afford.
and just like in PreDCA, because we make the AI point to a user who precedes its existence, it can't (barring weird acausal stuff) hack the user to affect what it'd predict the user to say; the user's values are locked in, which is desirable anyways.
here are some ideas as to how to use such queries to hopefully guide the AI's actions towards good worlds:
these ideas don't involve the AI interpreting natural language — the utility function could be written in trivially parseable python math code, decisions or requests for running multiple new copies could be asked for using magic strings followed by formal code, and so on. for example, in the case of a sequence of queries, the AI is told to predict what happens when the text file is just passed verbatim from one query to the next, until a particular magic string is detected verbatim at the start of a query.
notice that, because there is no inherent limit on the text file's size, it can start with a #!/bin/bash shebang and be a script that builds a large piece of software that each query is able to develop and use to more efficiently transmit knowledge to the next query, for only very minimal overhead to each of those queries.
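as a rough illustration of the sequence-of-queries mechanism described above, here is a minimal python sketch. everything named in it is an assumption made for the example, not part of the proposal: the `MAGIC` marker value, the `Query` type, the `run_query_sequence` helper, the `max_steps` cap, and the `toy_user` stand-in for the counterfactual user. in the real scheme the AI would be predicting what the user answers rather than calling a function.

```python
from typing import Callable

# hypothetical marker; the post only says "a particular magic string"
MAGIC = "FINAL:"

# a query maps the text sent as the question to the text the user sends back
Query = Callable[[str], str]


def run_query_sequence(query: Query, first_question: str, max_steps: int = 1000) -> str:
    """pass the text verbatim from one query to the next, stopping when the
    magic string appears verbatim at the start of a query's input."""
    text = first_question
    for _ in range(max_steps):
        if text.startswith(MAGIC):
            # everything after the marker is treated as the final formal payload
            return text[len(MAGIC):]
        text = query(text)
    # max_steps is an assumption added so that the sketch always terminates
    raise RuntimeError("no magic string detected within max_steps queries")


def toy_user(question: str) -> str:
    """stand-in user: adds one line of notes per query, finishes on the third."""
    notes = question + "note\n"
    if notes.count("note\n") >= 3:
        return MAGIC + "u = lambda world: 0.0"
    return notes


if __name__ == "__main__":
    print(run_query_sequence(toy_user, ""))
```

the loop checks for the marker before issuing the next query, matching "detected verbatim at the start of a query"; the accumulated text between queries plays the role of the text file that each query can extend for the next one.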
finally, this proposal should not be too difficult to expand upon:
note that with a proper implementation of embedded agency, such an AI would care about its own internal computations just as much as about what happens in the rest of the world; so if this scheme indeed leads to an aligned AI, then that alignment would cover taking care of the risks caused by running simulations of queries in such a high level of detail that their inhabitant(s) would be moral patients. in fact, thanks to embedded agency, perhaps the whole "let sequences of queries decide how to make decisions" idea could apply naturally to how queries themselves are used to make decisions.