Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

My trollability post showed that any computable probability distribution on logic which updates on proofs in a Bayesian way can in principle be manipulated into nearly-arbitrary beliefs by an adversary. (At least, this is true in my provability-based treatment, and in a version which updates on theorems directly rather than on their provability.) Here, I sketch a result which suggests that it's fairly easy to do so, blocking some potential routes around the problem. I will use notation and concepts from the original trollability post.

First, note that Corollary 1 does not depend on the odds ratio of 1/2 which Theorem 1 implies we can obtain. A weaker troll might not have the computational power to search for proofs which have so much leverage over our beliefs. All the troll needs is a sequence of theorems $e_0, e_1, e_2, \ldots$ such that the product of the likelihood ratios goes to zero. (Note: what Theorem 1 establishes is exactly that we can always find an $e_n$ so that this likelihood ratio is less than or equal to 1/2.) The likelihood ratio doesn't even need to be bounded away from one. It could be that each new $e_n$ is able to reduce the probability of the target sentence $S$ less and less (meaning the likelihood ratio is closer and closer to one), but the overall effect is still to modify the probability of $S$ arbitrarily. This is guaranteed if the product of the ratios converges to zero.
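
For a concrete illustration (numbers of my own, not from either post): the individual ratios can tend to one while their product still vanishes. For example,

$$\prod_{n=1}^{N} \frac{n}{n+1} = \frac{1}{N+1} \longrightarrow 0,$$

even though each factor $\frac{n}{n+1}$ gets arbitrarily close to one.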

(Analogously, we would talk about this product going to infinity for Corollary 2. I'll deal just with the case of driving probabilities down here; as in the previous post, the idea is that if we can't keep probabilities from being driven arbitrarily down, we also cannot keep them from being driven arbitrarily up, so we get non-convergence.)

It seems natural to hope, then, that an agent could avoid manipulation by predictable trolls. If the agent is predicting the troll quite accurately, then it must be placing high probability on each $e_n$ ahead of time; so, each likelihood ratio must be close to one. If the predictions converge to probability one very quickly, then perhaps the likelihood ratios converge to one quickly enough to avoid the divergence result.
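
For instance (again just an illustrative schedule, not something from the proofs): if the $n$-th update had likelihood ratio $1 - 2^{-n}$, the product would stay bounded away from zero,

$$\prod_{n=1}^{\infty} \left(1 - 2^{-n}\right) > 0 \quad \text{since} \quad \sum_{n=1}^{\infty} 2^{-n} < \infty,$$

so the probability of $S$ would converge to a positive limit rather than being driven to zero.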

This turns out to be impossible for fairly simple trolls.

Before sketching the proof, I should explain what I am not doing. I am not saying that an agent cannot block a troll by beating it to theorems. This totally works: if the agent already knows a theorem by the time the troll presents it, then there is nothing to update on. Of course, there is still a problem if the agent is beating the troll to theorems by anticipating what the troll will try next; in that case, the troll is manipulating the agent just as much as before. However, if the agent is so fast that it can beat the troll to theorems while not changing its search-order, the troll is toothless.

This is little comfort, however. My real concern in all of this is that the proof-order of the agent itself can cause non-convergence. We do not yet know whether convergence is possible under any proof-ordering at all. The troll is only a personification of a bad proof-order in an agent's internal theorem-prover. So, it's sufficient to consider the case where all the theorems are being proved by the troll, and the agent acts as a passive receiver of information. If we can find any restrictions on the troll which allow convergence in that case, we are happy. The hope is that a prior which models the troll's sequence well could converge. So, let's generalize the theorem from last time to block that approach as much as we can.

Let $P_n$ be the probability distribution at time $n$, and let $e_n$ be the truth which the troll proves to the agent at that time to get it to update to $P_{n+1}$. I'll stick to discussing trolls which use $e_n$ of the form $S \rightarrow t_n$, with $t_n$ a theorem of PA, as in the proofs in the previous post. Furthermore, suppose that the troll eventually uses each theorem of PA as a $t_n$. I assume that PA is undecidable, and that probabilities are nonzero unless they are forced to be zero by coherence. What we want is:

$$\lim_{n \rightarrow \infty} P_n(S) > 0.$$

Along the lines of the proof of Theorem 1, we can see that when $e_n$ has the form $S \rightarrow t_n$, we have $P_n(e_n \mid \neg S) = 1$, so the likelihood ratio governing the update is just $P_n(e_n \mid S)$. So, the desired result is equivalent to:

$$\prod_{n=0}^{\infty} P_n(e_n \mid S) > 0.$$
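
To spell out the equivalence (a sketch in the notation above, assuming $0 < P_0(S) < 1$): since $P_n(e_n \mid \neg S) = 1$, each update multiplies the odds of $S$ by $P_n(e_n \mid S)$, so telescoping gives

$$\frac{P_N(S)}{P_N(\neg S)} = \frac{P_0(S)}{P_0(\neg S)} \prod_{n=0}^{N-1} P_n(e_n \mid S),$$

and $P_N(S)$ stays bounded away from zero exactly when this product does.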

Note, however, that this product is a chain of conditional probabilities giving the probability of the next $e_n$ conditioned on all the previous $e_i$ and on $S$; it equals $\lim_{N \rightarrow \infty} P_0(e_0, \ldots, e_{N-1} \mid S)$. If the limit is greater than zero, then we can sample theories from $P_0$ and have a nonzero chance of sampling all the $e_n$. But we also have a nonzero chance of sampling $S$ along with them. Combined with all the $e_n$, $S$ implies all the theorems of PA. Due to coherence, this implies that we have a nonzero chance of sampling a complete and consistent extension of PA. This is impossible.
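
To see why these events get positive probability jointly (a sketch, viewing $P_0$, via coherence, as a measure over complete theories, and using countable additivity for the infinite conjunction):

$$P_0\Big(S \wedge \bigwedge_{n \geq 0} e_n\Big) = P_0(S) \cdot \lim_{N \rightarrow \infty} P_0(e_0, \ldots, e_{N-1} \mid S) > 0.$$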

The conclusion is that the construction in my previous post does not need to cleverly look for $e_n$ which satisfy the 1/2 cutoff. All that's necessary is to go through all the theorems of PA. Rather than a proof that any coherent prior (in the sense of the previous post) is vulnerable to some troll, this gives a single troll to which they are all vulnerable.
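
Here is a toy finite sketch of that single troll (my own illustration; the worlds and numbers are made up, and the real result concerns computable coherent distributions over all sentences of arithmetic). A handful of "worlds" stand in for complete theories, a uniform prior stands in for the coherent distribution, and the troll simply conditions on $S \rightarrow t_n$ for each toy theorem in turn:

```python
import itertools

# Toy finite analogue (illustration only, not the construction from the post).
# A "world" is a truth assignment to the target sentence S and to toy
# "theorems" t_1..t_4; a coherent prior is just a distribution over worlds.
# To mirror the fact that the real prior puts zero mass on "S together with
# every theorem of PA", we exclude the single world where S and all t_n hold.

N_THEOREMS = 4

worlds = [w for w in itertools.product([False, True], repeat=1 + N_THEOREMS)
          if not (w[0] and all(w[1:]))]          # w = (S, t_1, ..., t_4)

prior = {w: 1.0 / len(worlds) for w in worlds}   # uniform toy prior

def condition(dist, event):
    """Bayesian update: restrict to worlds satisfying `event` and renormalize."""
    mass = sum(p for w, p in dist.items() if event(w))
    return {w: (p / mass if event(w) else 0.0) for w, p in dist.items()}

def prob_S(dist):
    return sum(p for w, p in dist.items() if w[0])

dist = prior
print("P(S) initially:", prob_S(dist))
for n in range(1, N_THEOREMS + 1):
    # The troll proves the implication S -> t_n, i.e. (not S) or t_n.
    dist = condition(dist, lambda w, n=n: (not w[0]) or w[n])
    print(f"P(S) after conditioning on S -> t_{n}:", prob_S(dist))
```

In the toy, $P(S)$ reaches zero once every theorem has been used, because no remaining world satisfies $S$ together with all of them; in the real setting the same effect appears only in the limit.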

Furthermore, we cannot immunize against any troll which eventually uses all theorems as its $t_n$. This is counter-intuitive, because it means the agent may have much more processing power than the troll, and still be incapable of predicting the troll well enough as it conditions on the troll's outputs. We can imagine a coherent prior which also has powerful sequence-prediction distributions baked in. The troll might be enumerating theorems in quite a predictable way, following a low-order polynomial algorithm, while we use some bounded version of Solomonoff induction or something. The fact is, we will either (a) make our prior uncomputable, (b) have to weaken the sequence prediction to the point where it does not converge to predicting probability one for consistent patterns, or (c) make the probability distribution incoherent.
