a rough sketch of formal aligned AI using QACI

posted on 2022-12-11 — also cross-posted on lesswrong, see there for comments

in this post, i put forth some of my current thoughts about the shape of a formal aligned AI using QACI for its decision — "decision" in the singular here, as this is sufficient when the AI's decision can be "run me again but with these diffirent inputs". as it turns out, this doesn't require solving as many things as i'd thought — it seems like QACI might be general enough to delegate picking a decision theory and solving embedded agency to the counterfactual consideration of the past user.

we'll posit:

as a convention, we'll use a prime $'$ to denote counterfactual values, and we'll be denoting questions $q$ , and answers $r$ (for "response") to avoid confusion with $a$ for actions.
the AI is denoted $G : ℕ \times ℕ \times ℕ \to A$ , taking as input an observation as well as the user's original question and answer, denoted $q$ and $r$ . it returns an action from the set of all possible actions, $A$ .
$K ⁻ : ℕ \to [0; 1]$ is gonna be a simplicity measure based on kolmogorov complexity, where $\sum_{x \in ℕ} K ⁻ (x) = 1$ .
all sets of functions $A \to B$ will be countable sets of computable functions. in particular, $W ≔ () \to ⊥$ will be the set of computable hypotheses for worlds, represented as non-halting programs taking no input with $()$ and $⊥$ being respectively the unit and bottom types.
finally, we'll implicitely "cast" mathematical objects as natural numbers wherever appropriate, given that they're all sampled from countable sets anyways. when they're cast to or from natural numbers, assume a reasonable bijection between their type and $ℕ$ .

we'll define the following:

a "carver" function $C : W \times ℕ \to 2^{(ℕ \times ℕ \to W) \times ℕ \times (W \to ℕ)}$ which returns a set of tuples of:

a function $t_{x}$ for extracting a piece of data "in the same way" as $x$ is in $w$ but from any other world
a piece of data $t_{w \ x}$ that represents "everything else" than $x$ in the world
a function $t_{w}$ for counterfactually putting another piece of data $x'$ back in $w$ , alongside $t_{w \ x}$

this is done by splitting a world $w$ into the piece of data $x$ , and "everything else", denoted $t_{w \ x}$ . in practice with arbitrary other worlds, $t_{x}$ would return "garbage" the immense majority of the time, but the hope is that given a same carving $t_{x}^{q}, t_{w}^{q}$ for the question $q$ , a same carving $t_{q}^{r}, t_{w}^{r}$ for the answer would work enough time to give a signal that would tend to beat the overall noise of the failing cases.

$\begin{matrix} C (w, x) ≔ { & (t_{w}, t_{w \ x}, t_{x}) | \\ t_{w} \in ℕ \times ℕ \to W, t_{w \ x} \in ℕ, t_{x} \in W \to ℕ, \\ \forall w' \in ℕ : t_{x} (w') = x \Leftrightarrow w' = w, \\ \forall x' \in ℕ : t_{w} (t_{w \ x}, x') = w \Leftrightarrow x' = x} \end{matrix}$

we'll define $QACI : W \times ℕ \times ℕ \times ℕ \to 2^{ℕ \times ℝ}$ , the question-answer counterfactual interval device used to consider answers to counterfactual questions $q'$ , given a world hypothesis $w$ and a known question $q$ and answer $r$ :

$QACI (w, q, r, q') ≔ {(t_{x}^{r} (t_{w}^{q} (t_{w \ x}^{q}, q')), K ⁻ ((t_{w}^{q}, t_{x}^{q}, t_{w}^{r}, t_{x}^{r}))) | (t_{w}^{q}, t_{w \ x}^{q}, t_{x}^{q}) \in C (w, q), (t_{w}^{r}, t_{w \ x}^{r}, t_{x}^{r}) \in C (w, r)}$

note how $K ⁻$ measures the simplicity of all four $t$ functions together so as to favor them being simple but also similar, but ignores the simplicity of the "rest of the world" $t_{w \ x}$ values..

finally, we can define our AI $G : ℕ \times ℕ \times ℕ \to A$ as a function of $q, r$ but also an observation $o$ which could, in practice, be anything that lets the AI as well as the user better locate themselves in the set of possible worlds.

$G (q, r, o) ≔ \underset{a \in A}{argmax} \sum_{\begin{matrix} w \in W \\ (r', α) \in QACI (w, q, r, (w, a)) \end{matrix}} K ⁻ (w) \cdot α \cdot max (0, min (1, r' (q, r, o)))$

what's going on here, is for each world hypothesis $w$ weighed by its simplicity $K ⁻ (w)$ we're finding carvings for $q$ and $r$ that let us inject a counterfactual question and get a counterfactual answer, favoring simpler carvings using $α$ . the counterfactual rerun of the past user is given the world hypothesis $w$ being considered, as well as the action $a$ — we're not giving the user $q, r, o$ because that would make the counterfactual question guaranteed to be strictly larger than the original question, which seems like it might break things, and is not absolutely necessary. the user gets to return an arbitrary function to be evaluated, to which $q, r, o$ are actually passed. we are, however, clamping the result of $r'$ inside $[0; 1]$ and then multiplying it by $K ⁻ (w) \cdot α$ ourselves, so as to avoid erroneous carvings getting to influence the decision process too much by returning arbitrary large values. the hope here is merely that desired carvings generally outweigh undesired carvings, and especially adverserial carvings.

ways this could go wrong include:

the carvings could fail to sufficiently select the actual instances of the past-user in the world; in particular, $t_{x}^{r}$ could be returning so much "garbage" when given counterfactual worlds $t_{w}^{q} (t_{w \ x}^{q}, q')$ that are different from $w$ that the signal from the carvings that do work end ups completely drowned out.
the carvings locate question-answering users anywhere in the world, including in the future. this allows for adverserial intelligences to occupy most of the set of simple and coherently-answering carvings, thus steering our AI away from following the actual user's decisions. a solution to this would be to sufficiently strongly favor carvings that select question-answering processes that are causally upstream of the AI itself; that is to say, causally upstream of $q, r, o$ .
weird implementation details as for how turing machines are encoded could dominate most of the signal of values returned by $r'$ . in addition, the user could be unsure as to how to return a meaningful scalar given a particular potential action. these points could both be partially addressed by passing to the carvings pairs of actions $a_{1}, a_{2}$ for all possible pairs of different actions from $A$ , and have the counterfactual user select a preferred action, rather than relying on a scalar returned about a single action. if $r' (\dots, a_{1}, a_{2})$ and $r' (\dots, a_{2}, a_{1})$ agree as to which of $a_{1}$ and $a_{2}$ is the preferable action, then that's an actual signal that it is preferable. adding more "symmetries" might make noise cancel itself out.
i don't think we get to require $C$ to return carvings that work for arbitrarily large counterfactual payloads, nor do we need to. in practice, i expect a constant finite (though large) bit length is to be used for $x$ .
there's a lot of sets-of-all-programs being sampled from here, leaving a lot of room for demons in the solomonoff prior if those are a concern. typically i tend to favor decision theoritic solutions to those, and maybe a correct $QACI$ implementation would return action-functions $r'$ which would depend on whatever decision theory is correct, such that this can be delegated? but it feels like this system has ways to go wrong before that, in what programs get to control most of the "mass" returned by $QACI$ to begin with.

this is of course highly uncomputable. the intent here, is to use something like logical induction to approximate good results to this function. what makes me hopeful that a powerful AI can make helpful guesses as to what actions this process would find, if it is indeed aligned, is that even i, a mere human mind, feel like i can make some helpful guesses as to what actions this process would find.

posted on 2022-12-11 — also cross-posted on lesswrong, see there for comments