This was an idea developed along with Eliezer when he was at the FHI in November.
In my post on "reduced impact AI" I talked about how you could estimate the impact of an AI by (hypothetically) observing a slice of the future. When Eliezer was here last November, we talked about how this could be cashed out, and came up with a model.
Ignore the "output channel" from the original reduced impact post, and consider this situation:
We have two computers running in two similarly shielded rooms. One of them is a simple algorithm running through the digits of e (call it run-e), which was deleted soon afterwards. The other is our dear old friend clippy, the AI motivated to maximise paperclips. Without removing clippy's paperclip preferences, we'd like to motivate it to have next to no impact on the world. How should we define "next to no impact"?
The obvious thing is to compare it with run-e, and require it to have no more expected impact than that machine. How could this be measured? Well, the original idea was to put a stochastic process on clippy as it was turned on (inside the shielded room) and say that clippy has reduced impact if we can't find any extra evidence, from a vantage point 50 years in the future, that clippy was successfully turned on. Now put the same stochastic process on run-e and define: clippy has reduced impact if, from that vantage point 50 years in the future, we have no more evidence that clippy was turned on than we have that run-e was turned on.
Since run-e was simply thinking about e for a while, within a shielded room, then getting deleted, we should have very little posterior evidence about it. But how do we determine what counts as evidence in 50 years' time? After all, run-e emitted some radiation (if it was on), so there should be some evidence in 50 years that it was on. So how to define "no more evidence"?
By using random sampling. If clippy went wild and filled the universe with paperclips, it would rapidly become obvious: there would be paperclips all over the place, a lack of unprocessed iron in planetary crusts, light beams showing images of mass paperclip factories, etc. If some superintelligence started sampling random data about particles in 50 years' time, it wouldn't take long for it to deduce whether there'd been an unconstrained clippy or not - in any case, it would (almost certainly) deduce this information long before it deduced whether run-e was turned on.
We can't be sure that we've set the probabilities of clippy and run-e being turned on as being exactly the same. So we'll use likelihood ratios rather than posterior probabilities, since likelihood ratios do not depend on those prior probabilities. Set some likelihood ratio - say 90-to-1 - and say that clippy has reduced impact if:
It is expected that a hypothetical superintelligence, sampling randomly from data about the universe in 50 years' time, would require equal amounts of evidence for [reaching 90-to-1 likelihood ratio that clippy was turned on or not] versus [reaching 90-to-1 likelihood ratio that run-e was turned on or not].
In that case, we can truly say that clippy has had no more impact than run-e.
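As a rough illustration of this criterion, here is a minimal Python sketch of a toy model. Everything in it is an assumption made for illustration and is not part of the original proposal: each randomly sampled datum is treated as a yes/no "trace" whose probability depends on whether the machine was turned on, the observer updates a log-likelihood ratio until it reaches 90-to-1 either way, and "equal amounts of evidence" is read as "equal expected number of samples, up to a tolerance". The trace probabilities, the function names and the tolerance are all hypothetical.

```python
import math
import random


def samples_to_threshold(p_on, p_off, truth_on, rng, ratio=90.0,
                         max_samples=1_000_000):
    """Count samples until the likelihood ratio reaches ratio-to-1 either way.

    p_on / p_off: probability (assumed, toy values) that one sampled datum
    shows a trace of the machine, given that it was / was not turned on.
    """
    threshold = math.log(ratio)
    step_trace = math.log(p_on / p_off)
    step_no_trace = math.log((1.0 - p_on) / (1.0 - p_off))
    p_true = p_on if truth_on else p_off
    llr = 0.0
    for n in range(1, max_samples + 1):
        llr += step_trace if rng.random() < p_true else step_no_trace
        if abs(llr) >= threshold:
            return n
    return max_samples  # safety cap; not expected to be hit with these values


def expected_samples(p_on, p_off, trials=100, seed=0):
    """Average sample count over trials where the machine is on half the time."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        truth_on = rng.random() < 0.5
        total += samples_to_threshold(p_on, p_off, truth_on, rng)
    return total / trials


def has_reduced_impact(n_clippy, n_run_e, tolerance=0.1):
    """Toy reading of 'equal amounts of evidence': sample counts within tolerance."""
    return abs(n_clippy - n_run_e) <= tolerance * n_run_e


# Hypothetical numbers: run-e leaves only a faint trace in the sampled data,
# while an unconstrained clippy leaves traces almost everywhere.
run_e = expected_samples(p_on=0.06, p_off=0.05)
unconstrained_clippy = expected_samples(p_on=0.5, p_off=0.001)

print("run-e:", run_e)
print("unconstrained clippy:", unconstrained_clippy)
print("reduced impact?", has_reduced_impact(unconstrained_clippy, run_e))
```

In this toy model the unconstrained clippy is detected after a handful of samples while run-e takes far more, so the criterion fails, as intended. A clippy that left traces no stronger than run-e's would need a comparable number of samples and would pass.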
Obvious failure mode
The obvious failure mode is if clippy expands, determines whether run-e was turned on or not, and then spreads that information across the future lightcone (as well as its paperclips). Then evidence for run-e would be just as easy to find as evidence for clippy - both run-e and clippy would have the same impact, but that impact would be huge, which is not what we want!
Here we might demand to see independent evidence - evidence that run-e was turned on that is not also evidence that clippy was turned on. Then if clippy plasters "run-e lived!" all over the universe, this wouldn't help it, as this is strong evidence for clippy's existence as well. This point deserves some more thought, though.
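One possible way to make "independent evidence" concrete, offered only as a sketch and not as a settled definition: count an observation towards run-e only if it leaves the likelihood ratio for clippy close to 1. In the Python below, the pair representation, the `neutral_band` cutoff and the example numbers are all hypothetical.

```python
import math


def independent_log_evidence(observations, neutral_band=0.1):
    """Sum log-likelihood ratios for run-e over clippy-neutral observations.

    observations: iterable of (lr_run_e, lr_clippy) pairs, each a likelihood
    ratio for "was turned on" versus "was not turned on".
    neutral_band: assumed cutoff; an observation counts as independent only if
    the absolute log-likelihood ratio for clippy is at most this value.
    """
    total = 0.0
    for lr_run_e, lr_clippy in observations:
        if abs(math.log(lr_clippy)) <= neutral_band:
            total += math.log(lr_run_e)
    return total


# Hypothetical observations.
faint_radiation = (1.05, 1.0)       # weak evidence for run-e, silent on clippy
run_e_lived_sign = (1000.0, 1000.0)  # "run-e lived!" also reveals clippy

print(independent_log_evidence([faint_radiation, run_e_lived_sign]))
```

Under this filter the "run-e lived!" message contributes nothing towards run-e, because it is equally strong evidence for clippy. Whether a hard cutoff like this is the right notion of independence is exactly the part that needs more thought.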