If we want to reason with probability theory, we seem to be stuck as soon as we try to reason about mathematics.
You can skip this paragraph and the next if you're familiar with the problem. But if you're not, here's an illustration. Suppose your friend has some pennies that she would like to arrange into a rectangle, which of course is impossible if the number of pennies is prime. Let's call the number of pennies N. Your friend would like to use probability theory to decide whether it's worth trying: if there's at least a 50% chance that Prime(N), she won't bother trying to make the rectangle. You might imagine that if she counts them and finds that there's an odd number, this is evidence of Prime(N); if she furthermore notices that the digits don't sum to a multiple of three, this is further evidence of Prime(N). In general, each test of compositeness that she knows should, if it fails, raise the probability of Prime(N).
But what happens instead is this. Suppose you both count them, and find that N=53. Being a LessWrong reader, you of course recognize from recently posted articles that N=53 implies Prime(N), though she does not. But this means that P(N=53) <= P(Prime(N)). If you're quite sure of N=53—that is, P(N=53) is near 1—then P(Prime(N)) is also near 1. There's no way for her to get a gradient of uncertainty from simple tests of compositeness. The probability is just some number near 1.
In general, conditional on the axioms, mathematical theorems have probability 1 if they're true, and 0 if they're false. Deriving these probabilities is exactly as difficult as deriving the theorems themselves.
A way of assigning actual probabilities to theorems occurred to me today. I usually see this problem discussed by folks who want to develop formal models of AI, and I don't know if this'll be helpful at all for that. But it's something I can imagine myself using when I want to reason about a mathematical conjecture in a basically reasonable way.
The basic idea is just, don't condition on the axioms. Condition on a few relevant things implied by the axioms. Then, maximize entropy subject to those constraints.
Like, say you're trying to figure out whether a number N is prime. Don't condition on any properties of numbers, any facts about addition or multiplication, etc.; just treat Prime(N) as a predicate. Then condition on a few facts about Prime(N). I'd start with the prime number theorem and a theorem about the speed of its convergence, which I imagine would lead to a prior distribution P(Prime(N)|N=n, F1) = 1/log(n) (where F1 is Fact1, the fact that the prime number theorem is true). Now, let's say we want to be able to update on noticing that a number is odd, so let's add Fact2: every prime other than 2 is odd (which we can idealize as P(Odd(N)|Prime(N)) = 1), and half of all numbers are odd. This allows us to use Bayes' theorem:

P(Prime(N)|Odd(N), N=n, F1, F2) = P(Odd(N)|Prime(N), N=n, F1, F2) × P(Prime(N)|N=n, F1, F2) / P(Odd(N)|N=n, F1, F2) = (1 × 1/log(n)) / (1/2) = 2/log(n),
which gives us the intuitive conclusion that observing that a number is odd doubles the probability that it's prime.
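Here's a minimal sketch of that update in Python. The function names are mine, and the 1/log(n) prior is the handwaved one from above:

```python
import math

def prior_prime(n):
    # Prior from the prime number theorem heuristic: P(Prime(N) | N=n, F1) ~ 1/log(n).
    return 1.0 / math.log(n)

def posterior_given_odd(n):
    # Bayes update on Odd(N), idealizing P(Odd | Prime) = 1 (ignoring the
    # single even prime 2) and P(Odd) = 1/2: the posterior is twice the prior.
    return (1.0 * prior_prime(n)) / 0.5

print(prior_prime(53))          # ~0.25
print(posterior_given_odd(53))  # ~0.50, i.e. doubled
```

Note that this robot assigns 53 only about a 50% chance of being prime even after the oddness update, which is exactly the gradient of uncertainty the fully informed robot can't give you.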
It's a weird thing to do, leaving out part of your knowledge. Here's how I think of it, though, so that it's less weird.
Suppose that the way you do statistics is by building Jaynesian robots. I'm referring here to an image used over and over again in Jaynes's book Probability Theory: The Logic of Science. The image is that we're building robots that assign plausibilities to statements, and we try to construct these robots in such a way that these plausibilities are useful to us. That is, we try to make robots whose conclusions we'll actually believe, because otherwise, why bother?
One of his "desiderata", his principles of construction, is that the robot gives equal plausibility assignments to logically equivalent statements, which gets us into all this trouble when we try to build a robot we can ask about theorems. But I'm keeping this desideratum; I'm building fully Jaynesian robots.
All I'm doing is hiding some of my knowledge from the robot. And this actually makes sense to me, because sometimes I'd rather have an ignorant Jaynesian robot than a fully informed one.
This is because speed is always important, and sometimes an ignorant Jaynesian robot is faster. Like, let's say my friend and I are building robots to compete in a Number Munchers-like game, where you're given lots of very large numbers and have to eat only the primes. And let's make the levels timed, too. If my friend builds a Jaynesian robot that knows the fundamental axioms of number theory, it's going to have to thoroughly test the primeness of each number before it eats it, and it's going to run out of time. But let's say I carefully program mine with enough facts to make it effective, but not so many facts that it's slow. I'll win.
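A toy version of the two robots, under my own assumptions about what "enough facts" means (a handful of divisibility tests, each applied as a Bayes update just like the oddness one above — surviving the test for q multiplies the probability by q/(q-1)):

```python
import math

SMALL_PRIMES = [2, 3, 5, 7, 11, 13]

def fast_robot_eats(n, threshold=0.5):
    # Ignorant Jaynesian robot: knows only the 1/log(n) prior and a few
    # divisibility facts. Eats n if its posterior probability of primeness
    # clears the threshold.
    p = 1.0 / math.log(n)
    for q in SMALL_PRIMES:
        if n == q:
            return True
        if n % q == 0:
            return False  # definitely composite
        # P(not divisible by q | prime) = 1, P(not divisible by q) = (q-1)/q,
        # so surviving the test multiplies p by q/(q-1).
        p *= q / (q - 1)
    return min(p, 1.0) >= threshold

def slow_robot_eats(n):
    # Fully informed robot: exhaustive trial division, always correct, slow.
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))
```

The fast robot will happily eat the occasional composite that passes all its cheap tests (289 = 17², for instance); that's the price of ignorance, paid in exchange for speed.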
This doesn't really apply to modeling AI, does it? I mean, it's a way to assign probabilities, but not the best one. Humans often do way better using, I don't know, analogy or whatever they do. So why would a self-modifying AI continue using this, after it finds a better way?
But it's very simple, and it does seem to work. Except for how I totally handwaved turning the prime number theorem into a prior. There could be all sorts of subtleties in getting from "prime number theorem, plus some theorem on speed of convergence, plus maximizing entropy" to "P(Prime(N)|N=n) = 1/log(n)". I at least know that it's not rigorously provable exactly as stated, since 1/log(n) is above 1 for small n.
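For what it's worth, a quick check of where the prior breaks down:

```python
import math

for n in (2, 3, 10, 100):
    print(n, 1 / math.log(n))
# 1/log(2) ≈ 1.44 > 1, so as a "probability" the prior is already broken
# at n = 2; for n >= 3 it stays below 1 and decreases.
```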
And, I honestly don't know how to do any better than this, and still use probability theory. I have that Haim Gaifman paper, "Reasoning with Limited Resources and Assigning Probabilities to Arithmetical Statements", in which he describes an entirely different way of conceptualizing the problem, along with a much better probabilistic test of primeness. It's long though, and I haven't finished it yet.