Suppose I am a self-modifying AI reasoning about my own behavior (or about the behavior of another AI I am designing).
Humans seem to find it very important to trust their own deductions. That is, humans seem to believe "things I believe are probably true." Thus a human would not self-modify to stop acting on their beliefs. How do we formalize this sort of trust?
For simplicity, say I exist in a fixed environment without uncertainty whose description I know (including my own existence in it). The only sort of statements I care about are mathematical statements: any property of my environment can be expressed as a purely mathematical statement. In order to act intelligently in this environment, I have a mathematical deduction engine which proves mathematical statements; my decisions are informed by the output of my deduction engine.
I am considering replacing my deduction engine with a pseudo-random statement evaluator in order to save energy (after all, I need energy for many purposes). Why shouldn't I do it? You might hope that the deduction engine would be able to tell me that my deduction engine is useful; that my deduction engine isn't just a pseudo-random statement evaluator. But an important property of mathematical deduction is the second incompleteness theorem: no reasonable proof system X can prove that X doesn't prove false statements. In fact the situation is more dire, by Löb's theorem: if X ever proves a statement of the form "X wouldn't prove Y if Y weren't true" then X must also prove Y itself.
My question is: what sort of confidence in its own reasoning can a consistent thinker actually have? I know that an agent sure of its own correctness will start believing everything. But what about an agent who is 99% confident of its own correctness? What does Löb's theorem look like when applied to probabilistic beliefs?
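For reference, the two facts used here can be written compactly. The notation is my own shorthand and not anything fixed above: $\Box_X Y$ abbreviates "X proves Y", and X is assumed to be a consistent, recursively axiomatized system strong enough to talk about its own proofs.

```latex
% Löb's theorem: if X proves "if X proves Y, then Y", then X proves Y.
\[
  X \vdash (\Box_X Y \rightarrow Y)
  \quad\Longrightarrow\quad
  X \vdash Y .
\]
% Taking Y = \bot gives the second incompleteness theorem:
% \Box_X \bot \rightarrow \bot is the statement "X is consistent",
% so if X proved its own consistency it would prove \bot.
\[
  X \vdash \neg \Box_X \bot
  \quad\Longrightarrow\quad
  X \vdash \bot .
\]
```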
For example, suppose the algorithm X believes itself to be well-calibrated in the following sense. For any statement A and any probability p and any time T, consider the statements S1 = "At time T, X believes A with probability p," and S2 = "At time T, X believes A with probability p, and A is really true." We say that X believes itself to be well-calibrated about A at time T if, for all p, the probability X assigns to S2 is exactly p times the probability X assigns to S1.
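In symbols, the condition reads as follows. The notation is assumed for illustration only: $P_X$ is the probability X currently assigns to a statement, and $B_T(A)$ is the probability X assigns to A at time T.

```latex
% S_1 and S_2 as defined in the text.
\[
  S_1 \;=\; \bigl[\, B_T(A) = p \,\bigr],
  \qquad
  S_2 \;=\; \bigl[\, B_T(A) = p \;\wedge\; A \,\bigr].
\]
% X believes itself well-calibrated about A at time T if, for all p,
\[
  P_X(S_2) \;=\; p \cdot P_X(S_1).
\]
% Whenever P_X(S_1) > 0 this is the same as P_X(A \mid B_T(A) = p) = p.
```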
Löb's theorem says that this sort of belief in well-calibration is impossible if X is capable of carrying out complicated reasoning: X would infer with probability 1 that "At time T, if X believes A with probability 1 then A is true." If X believes this for a large enough time T then X can apply the inferences in Löb's theorem and arrive at belief in A with probability 1 (I think).
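The first step of that inference is just the p = 1 case of the calibration condition. A sketch, assuming X's probabilities are finitely additive over these statements, and writing $P_X$ for X's current probability and $B_T(A)$ for the probability X assigns to A at time T (both notations are mine):

```latex
% Calibration at p = 1:
\[
  P_X\bigl(B_T(A) = 1 \wedge A\bigr) \;=\; 1 \cdot P_X\bigl(B_T(A) = 1\bigr).
\]
% Subtracting from P_X(B_T(A) = 1) and using additivity:
\[
  P_X\bigl(B_T(A) = 1 \wedge \neg A\bigr) \;=\; 0 .
\]
% The implication is the negation of that conjunction, so
\[
  P_X\bigl(B_T(A) = 1 \rightarrow A\bigr)
  \;=\; 1 - P_X\bigl(B_T(A) = 1 \wedge \neg A\bigr)
  \;=\; 1 .
\]
```

This only establishes the premise of the Löb-style argument; whether the remaining steps carry over to probabilistic beliefs is the part I am unsure about.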
Now what if X believes itself to be well-calibrated only with probability 99%, in the following sense. Define S1 and S2 as before. What if, for all p, the probability X assigns to S2 is between (p-0.01) and (p+0.01) times the probability X assigns to S1? Does this lead to contradictory beliefs? Is this a strong enough belief to ensure that X will keep itself running and invest effort in improving its performance?
Of course, given that we don't have any grasp of probabilistic reasoning about mathematics, maybe thinking about this issue is premature. But I would like to understand what probabilistic reasoning might look like in interesting cases to get more of a handle on the general problem.
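To make the exact and the relaxed conditions concrete, here is a toy check on a finite table of beliefs. Everything in it is hypothetical: the function name, the `joint` table format, and the example numbers are made up for illustration, and the sketch says nothing about the self-referential difficulty, since it simply takes X's probabilities as given.

```python
from fractions import Fraction


def calibration_violations(joint, eps=Fraction(0)):
    """Return the values of p at which the calibration condition fails.

    joint maps (p, a_is_true) to the probability X assigns to
    "at time T, X believes A with probability p, and A has truth
    value a_is_true". This table format is an assumption.
    eps = 0 is the exact condition; eps = 1/100 is the relaxed one.
    """
    violations = []
    for p in sorted({p for (p, _) in joint}):
        s2 = joint.get((p, True), Fraction(0))        # P(S2)
        s1 = s2 + joint.get((p, False), Fraction(0))  # P(S1)
        # Condition: (p - eps) * P(S1) <= P(S2) <= (p + eps) * P(S1)
        if not ((p - eps) * s1 <= s2 <= (p + eps) * s1):
            violations.append(p)
    return violations


# Made-up example: X is slightly overconfident when it says 9/10.
example = {
    (Fraction(9, 10), True): Fraction(89, 200),
    (Fraction(9, 10), False): Fraction(11, 200),
    (Fraction(1, 2), True): Fraction(1, 4),
    (Fraction(1, 2), False): Fraction(1, 4),
}

exact = calibration_violations(example)
relaxed = calibration_violations(example, eps=Fraction(1, 100))
```

In this example the exact condition fails at p = 9/10, where P(S2) is 89/200 but would need to be 90/200, while the relaxed condition with a tolerance of 0.01 holds everywhere. Exact fractions are used so that the boundary case is not blurred by floating-point rounding.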
Are there other good ways for X to believe in its own well-calibration? For example, in my description there is no explicit statement "X is well calibrated" which X can reason about, only particular special cases. Should there be such a statement? My description seems like it has many potential shortcomings, but I don't see anything better.