I would like to discuss a potential problem with some configurations of the reduced-impact AI specified in "Reduced impact AI: no back channels". Feedback would be appreciated.
The AI proposed in the referenced article is intended to influence the world only through a single output channel. The essence of the design is that the AI tries to avoid making the world perceptibly different from what it would have been if the AI had never been turned on and the contents of its output channel had instead been created by "excited, noisy gas, and random fluctuations in pressure or something".
The potential problem occurs when the AI's output size is sufficiently large. I know there are already concerns about a boxed or reduced-impact AI having an output that is too large or too flexible and causing a catastrophe through the output channel. However, my concern may call for tighter bounds on the size of the output channel than there would otherwise be.
The output was said to be influenced by noisy gas. But this doesn't necessarily rule out the possibility that the gas caused the output to be what it was for some sort of systematic reason. For example, the gas might behave chaotically, but a sufficiently powerful system could still, in principle, predict and control it and use that to control the output channel. Suppose the AI considers both the hypothesis that the output was generated for no systematic reason and the hypothesis that it was generated for such a reason.
The agent would consider the hypothesis that its output was generated by noisy gas with no systematic reason behind the particular output produced. However, its actual output would be extremely unlikely to come from such a process. For example, suppose the output is 1 kB long. If the noise set every bit of the output at random, with 1 being as likely as 0, then the probability of generating that exact output would be 1 in 2**(1024 * 8) = 2**8192 ~= 10**2466. A 100-byte message would have a probability of about 1 in 10**241, which would probably also be sufficient to cause the problem.
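The arithmetic above can be checked with a short sketch. It assumes, as the paragraph does, that each bit is set independently with 1 as likely as 0; the function name is only for illustration.

```python
import math

def log10_inverse_probability(num_bytes: int) -> float:
    """Return x such that a specific output of this size has a 1 in 10**x
    chance of arising from uniform random bits.

    Assumption: every bit is independent and equally likely to be 0 or 1.
    """
    num_bits = 8 * num_bytes
    return num_bits * math.log10(2)

for size in (100, 1024):
    exponent = log10_inverse_probability(size)
    print(f"{size} bytes: about 1 in 10**{exponent:.0f}")
# 100 bytes: about 1 in 10**241
# 1024 bytes: about 1 in 10**2466
```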
So, I think the AI would consider other hypotheses for the cause of the output. Maybe some chaos-controlling aliens came to Earth and set the output as a prank. Or maybe the AI is in a simulation and the simulators did the same thing. Or maybe some mysterious other optimization process figured out what the AI should do and did it instead by manipulating the noise. All of these hypotheses seem pretty improbable, but still far more probable than 1 in 10**2466. That is why I think one of these, or some other systematic cause, may be the AI's dominant hypothesis.
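One way to make this comparison concrete is in log-odds form. The sketch below makes two assumptions that are not from the article: a made-up prior odds of 1 in 10**100 for a systematic cause, and that such a cause would produce the observed output with probability close to 1. Both are illustrative placeholders, not estimates.

```python
import math

def log10_posterior_odds(log10_prior_odds: float, num_bits: int,
                         log10_likelihood_systematic: float = 0.0) -> float:
    """log10 of the posterior odds (systematic cause : pure noise).

    log10_prior_odds: hypothetical prior odds in favor of a systematic cause.
    num_bits: length of the observed output; pure noise is assumed to
        produce it with probability 2**-num_bits.
    log10_likelihood_systematic: assumed log10 probability that the
        systematic cause yields exactly this output (0.0 means certainty).
    """
    log10_likelihood_noise = -num_bits * math.log10(2)
    return log10_prior_odds + log10_likelihood_systematic - log10_likelihood_noise

# Illustrative prior: systematic causes are 10**100 times less likely a priori.
for bits in (1, 800, 8192):
    odds = log10_posterior_odds(-100.0, bits)
    favored = "systematic cause" if odds > 0 else "pure noise"
    print(f"{bits} bits: log10 odds = {odds:.1f}, favors {favored}")
```

Under these assumed numbers, the systematic hypothesis wins as soon as the output is longer than roughly 333 bits, since that is where 2**n overtakes 10**100.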
If the AI does adopt such an exotic explanation, its understanding of the world could be pretty weird and pretty wrong. It's not clear to me that an AI minimizing impact under such an inaccurate world model would actually minimize its impact in our real world. For example, the AI might think it's in a simulation in which only its immediate environment exists, so it might not worry about doing something that would destroy the rest of the real world. Or perhaps it reasons that whatever system has the power to produce its output from noisy gas would just undo anything the AI does if it wanted to. Perhaps the AI would do what we want, but that seems difficult to verify, considering the challenge of figuring out what the AI's explanation for the output would be.
One possible way to deal with this is to specify that the output was set by noise that was not manipulated with any "systematic reason". I'm not sure how you could formalize this idea, though.
A simpler way that might work is to just make sure the output length is extremely low. A question requiring an n-bit-long answer could be converted to n questions, each asking for the value of one bit in the output you're looking for. This might not completely eliminate the problem of the AI placing higher probability on exotic hypotheses, but it might come close.
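The bit-by-bit conversion could be sketched as follows. Here `ask_bit` is a hypothetical stand-in for posing one single-bit question to the AI; nothing about how the AI would actually be queried is specified in the article, so the demo uses a fake oracle that reads bits from a fixed message.

```python
from typing import Callable

def answer_bit_by_bit(ask_bit: Callable[[int], int], num_bits: int) -> bytes:
    """Rebuild an n-bit answer from n separate one-bit questions.

    ask_bit(i) is a hypothetical interface: it poses question i and
    returns a single bit (0 or 1), most significant bit first.
    """
    if num_bits % 8 != 0:
        raise ValueError("this sketch only handles whole bytes")
    value = 0
    for i in range(num_bits):
        bit = ask_bit(i)
        if bit not in (0, 1):
            raise ValueError(f"question {i} returned a non-bit answer: {bit!r}")
        value = (value << 1) | bit
    return value.to_bytes(num_bits // 8, "big")

# Stand-in oracle for demonstration only: reads bits out of a fixed message.
secret = b"hi"

def demo_oracle(i: int) -> int:
    return (secret[i // 8] >> (7 - i % 8)) & 1

print(answer_bit_by_bit(demo_oracle, 8 * len(secret)))  # b'hi'
```

Under the uniform-noise assumption, each individual one-bit answer has a 1 in 2 chance of arising from noise, so the noise hypothesis is never astronomically disfavored for any single question.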