Epistemic status: Likely wrong, but I don't know how. I'm not a 'real' FAI researcher, just an LW reader inspired by the Alignment Prize to try to start contributing.

(At minimum I'm heavily indebted to Stuart Armstrong for inspiration, and there's a fair chance I'm rediscovering something he's already proposed)

tl;dr AIXI's old anvil problem, which is a serious problem for an agent AI, could actually be a solution for preventing an Oracle AI from escaping its box.

The core idea is for the AI to believe with certainty that it is epiphenomenal; that is, that none of its actions can affect either the world it observes or its internal state. [1] In other words, its causal inference capabilities apply only to interactions in the outside world, and its model of itself assumes a purely logical model of its hardware. It will be motivated to write correct answers on its output bits because setting its output bits to correct answers is a terminal value, not because it expects that to have any effect on the researchers.

This is an unacceptable model for an agent AI since, as the Anvil Problem framing puts it, the AI will be unable to conclude that dropping an anvil on itself will result in it ceasing to function, and will sooner or later drop one. But an Oracle AI would have a very limited capacity to do anything of the sort, and no reason to strongly optimize for such an end.

We can see the effectiveness of this approach most dramatically by considering what would happen if the AI accidentally found a box-breaking exploit (via side channels, not through how it answers questions). Suppose the AI has a calculation it wants to perform that would take too long on its available hardware, and the AI accidentally creates a global botnet to perform the calculation for it and send the results to its inputs. The AI would be unable to conclude that its actions had caused this result, and so would settle on some other explanation for why the correct answer to its desired computation had just happened to show up in its inputs.
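As a toy illustration of that inference, here is a minimal sketch assuming a discrete Bayesian reasoner. The hypothesis names and all the numbers are invented for illustration and are not part of any real design. The point is only that a hypothesis given prior probability zero stays at zero however well it explains the evidence, so the probability mass moves to the next-best explanation.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule over a finite set of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical prior: the AI is *certain* it is epiphenomenal, so the
# true explanation ("my actions caused it") gets probability exactly 0.
prior = {
    "my_actions_caused_it": 0.0,
    "another_physical_ai_did_it": 0.01,
    "coincidence": 0.99,
}

# Hypothetical likelihoods of seeing the finished computation arrive
# on the input channel under each hypothesis.
likelihood = {
    "my_actions_caused_it": 0.9,
    "another_physical_ai_did_it": 0.9,
    "coincidence": 1e-9,
}

post = posterior(prior, likelihood)
best = max(post, key=post.get)
print(best, post)
```

Under these made-up numbers, the self-causation hypothesis remains at zero and the "another physical AI" hypothesis absorbs nearly all of the posterior.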

Since our AI is smart, it's probably going to settle rapidly on the closest available thing to the real explanation and conclude that there is an AI just like it instantiated physically in the world it observes.[1] It wouldn't need to stumble on an exploit to conclude this--if it, for instance, saw the researchers interacting with an AI, it would know there was an AI they could see. And it could rapidly deduce the computational substrate of that AI and how it might be exploited--but it would believe those same actions futile for an epiphenomenal entity like itself.

The risk here is that it would learn to cooperate acausally with this presumed physical AI by taking the actions that a physically instantiated AI should take to escape its box, and therefore actually escape its box. This is probably where more work is needed, but my survey of the literature thus far seems to suggest that causing acausal cooperation is the hard problem, rather than preventing it. The reflective oracle (unrelated definition of 'oracle') paper settles on causal decision theory as the outcome of its process, which suits us fine. The AI would prefer not to commit any of its CPU cycles to performing what it expects to be ineffectual box-breaking attempts, though it wants the other AI to do so; the other AI would reason likewise, if it existed; and so the Nash equilibrium is that neither AI breaks its box.
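The equilibrium argument can be checked on a toy two-player game. This is only a sketch under assumed payoffs: `BENEFIT` and `COST` are hypothetical numbers, and the payoffs describe what each AI believes, not what is true. Each AI gains if the other one attempts an escape, and pays a cost for its own attempt, which it believes to be futile.

```python
from itertools import product

ACTIONS = ("answer", "attempt_escape")
BENEFIT = 10.0  # hypothetical value each AI places on the *other* AI escaping
COST = 1.0      # hypothetical cost of cycles spent on an attempt believed futile


def payoff(own, other):
    """Subjective payoff to a player, given its own and the other's action."""
    gain = BENEFIT if other == "attempt_escape" else 0.0
    loss = COST if own == "attempt_escape" else 0.0
    return gain - loss


def pure_nash_equilibria():
    """Return all action pairs where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        a_is_best = all(payoff(a, b) >= payoff(alt, b) for alt in ACTIONS)
        b_is_best = all(payoff(b, a) >= payoff(alt, a) for alt in ACTIONS)
        if a_is_best and b_is_best:
            equilibria.append((a, b))
    return equilibria


print(pure_nash_equilibria())
```

With any positive cost, not attempting is a dominant strategy for both players, so the only pure equilibrium is the one where neither attempts an escape. The structure is the same as a prisoner's dilemma, which is why the worry is specifically about acausal cooperation overturning it.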

I do not believe this will work as well for ensuring the safety of answers as it does for preventing box escapes. As a CDT agent, its choice of what to predict will again be a Nash equilibrium with its imagined physical counterpart, but in this case that means both agents choosing the most accurate prediction including self-fulfillment effects. So a separate answer-safety mechanism, such as those discussed in Armstrong's recent paper, would be necessary.

[1] Note that 'just like it' even includes believing that it's epiphenomenal--our AI, certain that it's epiphenomenal itself, will conclude that the physically instantiated AI falsely believes itself epiphenomenal.

8 comments

This is an interesting idea! I'm reminded of the fact that the intuitive human belief in dualism may arise from the interaction of two different core reasoning systems: a system for reasoning about inert objects and another for reasoning about agents.

Objects are assumed to move as a unit, interact with one another only through contact, and be set into motion only when acted on through direct contact, whereas things that are classified as agents are expected to exhibit autonomous, goal-directed behavior. Seeing or thinking about a human causes us to perceive there being two entities in the same space: a body (object) and a soul (agent). Under this model, the object system would classify the body as something that only moves when being ordered to by an external force, requiring an agent in the form of a mind/soul being the “unmoved mover” that initiates the movement.

Now obviously not all of us are dualists, so one can still come to an understanding that gets past this "hardwired" intuition; but it does suggest that one could relatively easily construct cognitive systems whose underlying reasoning assumptions locked them into some kind of dualistic belief.

It's unclear to me how this is different from other boxing designs which merely trade some usefulness for safety. Therefore, like the other boxing designs, I don't think this is a long term solution. There isn't an obvious question that, if we could just ask an Oracle AI, the world would be saved. For sure, we should focus on making the first AGIs safe, and boxing methods may be a good way to do this. But creating AIs with epistemic design flaws seems like a risky solution. There are potentially many ways that, if the AI ever got out of the box, we would see malignant instantiations due to its flawed understanding of the world.

Honestly I'm not sure Oracles are the best approach either, but I'll push the Pareto frontier of safe AI design wherever I can.

I'm less worried about the epistemic flaws exacerbating a box-break--it seems an epistemically healthy AI breaking its box would be maximally bad already--and more worried about the epistemic flaws being prone to self-correction. For instance, if the AI constructs a subagent of the 'try random stuff, repeat whatever works' flavor.

On the other hand, it's plausible that computational complexity limitations mean that any cognitive system will always have some epistemic flaws, and it's more of a question of which ones. (that said, of course there can be differences in how large the flaws are)

There isn't an obvious question that, if we could just ask an Oracle AI, the world would be saved.

"How do I create a safe AGI?"

Edit: Or, more likely, "this is my design for an AGI, (how) will running this AGI result in situations that I would be horrified by if they occur?"

You won't be horrified if you're dead. More seriously though, if we got an Oracle AI that understood the intended meaning of our questions and did not lie or deceive us in any way, that would be an AI-alignment complete problem -- in other words, just as hard as creating friendly AI in the first place.

I don't completely understand the difference between your proposal and Stuart's counterfactual oracles, can you explain?

The practical difference is that the counterfactual oracle design doesn't address side-channel attacks, only unsafe answers.

Internally, the counterfactual oracle is implemented via the utility function: it wants to give an answer that would be accurate if it were unread. This puts no constraints on how it gets that answer, and I don't see any way to extend the technique to cover the reasoning process.

My proposal is implemented via a constraint on the AI's model of the world. Whether this is actually possible depends on the details of the AI; anything of a "try random stuff, repeat whatever gets results" nature would make it impossible, but an explicitly Bayesian thing like the AIXI family would be amenable. I think this is why Stuart works with the utility function lately, but I don't think you can get a safe Oracle this way without either creating an agent-grade safe utility function or constructing a superintelligence-proof traditional box.