in my recent rough sketch for aligned AI, i mention that my solution doesn't look like it needs to solve embedded agency or decision theory to work. how could this be?

first, i think it's important to distinguish two different types of AI:

  • one-shot AI: an AI program that runs once, outputs one action, and then stops.
  • continuous AI: an AI program which takes actions and makes observations over time, typically while learning things about the world (including itself) as it goes.
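
to make this distinction concrete, here's a rough sketch of the two shapes as code. everything below (the names, the types, the whole framing as python) is my own illustration, not anything from the sketch for aligned AI:

```python
# illustrative only: a way to make the one-shot / continuous distinction concrete.
from typing import Callable

Observation = bytes
Action = bytes

# a one-shot AI is just a function: given whatever input we hand it, it outputs
# a single action and then stops.
OneShotAI = Callable[[bytes], Action]

# a continuous AI is a process: it keeps observing and acting over time,
# updating some internal state (its beliefs about the world, including itself).
def run_continuous_ai(
    state: object,
    update: Callable[[object, Observation], object],
    policy: Callable[[object], Action],
    observe: Callable[[], Observation],
    act: Callable[[Action], None],
) -> None:
    while True:
        state = update(state, observe())  # learn from the latest observation
        act(policy(state))                # act, then loop forever
```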

i typically focus on building a one-shot aligned AI, for the following two reasons.

first, note that one-shot AI is actually complete. that is to say, it can produce the same behavior as continuous AI: it simply has to make its action be "here's a bunch of code for a continuous AI; run it." it is true that it might take more work to get an AI that is smart enough to know to do this, rather than an AI that merely updates its beliefs over time, but i think it might not be that hard to give an AI priors which will point it to the more specific action-set of "design an AI to run". or at least, sufficiently not-hard that we'd be happy to be only that problem away from alignment.
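
as a toy illustration of this completeness argument (purely hypothetical scaffolding, with made-up names; the "continuous AI" here is a trivial stand-in, the point is only the shape of the action):

```python
# hypothetical scaffolding for the reduction: a one-shot AI whose single action
# is "here's a bunch of code for a continuous AI; run it" can reproduce any
# behavior a continuous AI could have, once that action is actually run.

def one_shot_ai(question: bytes) -> str:
    # the single output is itself the source code of a continuous AI.
    return (
        "while True:\n"
        "    observation = input()\n"
        "    print('action chosen given ' + repr(observation))\n"
    )

continuous_ai_source = one_shot_ai(b"maximize this formal goal")
# exec(continuous_ai_source)  # running the output is the step that has to be safe
```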

second, one-shot AI is much simpler. this lets you do something like asking "hey, what would be an AI which, when placed in a world, would maximize this formal goal we'd like in that world?" and then our one-shot AI, even if it has no notion that it exists in a world, will realize that because it's outputting a program which will itself later be run in a world and subject to its physical laws, that program must solve embedded agency. in a sense, we have delegated embedded agency to the one-shot AI — and that does seem easier to me, because we can ask our one-shot AI to consider the utility of the world "from the top level". the question we'd ask our one-shot AI would be something like:

$$\underset{a \,\in\, \{0,1\}^*}{\arg\max} \;\; \sum_{w} p(w) \cdot U(w(a))$$

where our one-shot AI, given that its output $a$ will be a string of bits, is asked what output to give to worlds-except-for-$a$ (each such world written $w$) such that the resulting non-halting computation $w(a)$ will be preferred by our aligned utility function $U$ (all of this weighed, as usual, by $p(w)$, which is the simplicity of each world $w$).
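
as a toy, finite stand-in for that question (everything here is made up for illustration: two hand-written "worlds-except-for-the-action", a made-up simplicity prior, and a made-up utility over outcomes; the real sum is over all worlds and is nowhere near computable like this):

```python
# toy stand-in for argmax_a sum_w p(w) * U(w(a)), with everything made up.
from typing import Callable, Dict

Outcome = str
WorldExceptAction = Callable[[str], Outcome]  # takes the output bitstring a, yields an outcome

# two hypothetical worlds-except-for-the-action w, each a hole waiting for a.
toy_worlds: Dict[str, WorldExceptAction] = {
    "w1": lambda a: "good future" if a == "10" else "mediocre future",
    "w2": lambda a: "good future" if a.endswith("0") else "bad future",
}

simplicity_prior = {"w1": 0.7, "w2": 0.3}                                           # p(w)
aligned_utility = {"good future": 1.0, "mediocre future": 0.3, "bad future": 0.0}   # U

def score(action: str) -> float:
    # sum over worlds w of p(w) * U(w(a))
    return sum(simplicity_prior[name] * aligned_utility[world(action)]
               for name, world in toy_worlds.items())

# the one-shot AI's answer: the bitstring (here of length 2) maximizing the score.
best_action = max(("00", "01", "10", "11"), key=score)
print(best_action, score(best_action))  # -> 10 1.0
```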

(our one-shot AI must still be inner-aligned; and even if it is, it might still need to be boxed with regards to everything other than that output, so it doesn't, for example, hack its way out while it's improving itself and tile the universe with compute dedicated to better answering this question we've asked it. if it is inner-aligned and we've asked it the right question, however, then running its output should be safe, i think.)

does this also let us delegate decision theory, in a way that gets us the real decision theory we want and not some instrumentally convergent proxy decision theory? i'm not as sure about this, but i think it depends not just on our one-shot AI, but also on properties of the question being asked, including the presumably aligned utility function. for example, if we use the QACI device, then we just need the counterfactual user-intervals being run to decide that their preferred actions must be taken under their preferred decision theory.

this brings me to one-shot QACI: at the moment, i believe QACI is best designed as a device to bootstrap aligned AI, rather than as the device that aligned AI should use to make every discrete decision. for this purpose, it might be good to use something like one-shot QACI in our one-shot AI: a single, giant (hypothetical) graph of counterfactual user-intervals deciding on what a better design for an aligned utility function or decision process or aligned AI would be, which our one-shot AI would execute.
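
to gesture at the shape of that graph, here's a sketch under my own heavy simplifications: i'm modeling each counterfactual user-interval as a pure function from a question to either a final answer or a follow-up question for the next interval, and the graph as a simple chain. the real object is a piece of math about counterfactual humans, not something you'd literally run:

```python
# heavily simplified sketch of a one-shot QACI chain: each counterfactual
# user-interval gets a question and returns either a final answer (say, a better
# aligned utility function or AI design) or a follow-up question for the next
# interval. the final answer is deterministic given the first question and the
# intervals, even if no one ever computes it directly.

from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Answer:
    text: str        # the final output of the whole graph

@dataclass
class FollowUp:
    question: str    # handed to the next counterfactual user-interval

UserInterval = Callable[[str], Union[Answer, FollowUp]]

def qaci_answer(first_question: str, interval: UserInterval) -> str:
    step = interval(first_question)
    while isinstance(step, FollowUp):   # keep following the chain of intervals
        step = interval(step.question)
    return step.text
```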

this isn't necessarily to say that our one-shot AI would figure out the exact answer to that QACI graph; but the answer to the QACI graph would be deterministic — like in insulated goal-programs, except without everyone being killed for sure. maybe the one-shot AI would decide that the best way to figure out the answer to that QACI is to put in the world an AI which would acquire compute and use that to figure it out, but the point is that it would be figuring out the exact same question, with the same theoretical exact answer.
