Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The Goldbach conjecture is probably correct; so was Fermat's last theorem


Reminiscent of Freeman Dyson's 2005 answer to the question: "what do you believe is true even though you cannot prove it?":

Since I am a mathematician, I give a precise answer to this question. Thanks to Kurt Gödel, we know that there are true mathematical statements that cannot be proved. But I want a little more than this. I want a statement that is true, unprovable, and simple enough to be understood by people who are not mathematicians. Here it is.

Numbers that are exact powers of two are 2, 4, 8, 16, 32, 64, 128 and so on. Numbers that are exact powers of five are 5, 25, 125, 625 and so on. Given any number such as 131072 (which happens to be a power of two), the reverse of it is 270131, with the same digits taken in the opposite order. Now my statement is: it never happens that the reverse of a power of two is a power of five.

The digits in a big power of two seem to occur in a random way without any regular pattern. If it ever happened that the reverse of a power of two was a power of five, this would be an unlikely accident, and the chance of it happening grows rapidly smaller as the numbers grow bigger. If we assume that the digits occur at random, then the chance of the accident happening for any power of two greater than a billion is less than one in a billion. It is easy to check that it does not happen for powers of two smaller than a billion. So the chance that it ever happens at all is less than one in a billion. That is why I believe the statement is true.
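Dyson's finite check is easy to reproduce; here is a quick sketch covering 2^1 through 2^200 (the cutoff 200 is an arbitrary choice):

```python
# Check Dyson's claim empirically: the digit-reversal of a power of two
# is never a power of five (tested here for 2^1 through 2^200).
def is_power_of_five(m: int) -> bool:
    if m < 1:
        return False
    while m % 5 == 0:
        m //= 5
    return m == 1

hits = [k for k in range(1, 201)
        if is_power_of_five(int(str(2 ** k)[::-1]))]
print(hits)  # expected: [] -- no counterexamples in this range
```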

But the assumption that digits in a big power of two occur at random also implies that the statement is unprovable. Any proof of the statement would have to be based on some non-random property of the digits. The assumption of randomness means that the statement is true just because the odds are in its favor. It cannot be proved because there is no deep mathematical reason why it has to be true. (Note for experts: this argument does not work if we use powers of three instead of powers of five. In that case the statement is easy to prove because the reverse of a number divisible by three is also divisible by three. Divisibility by three happens to be a non-random property of the digits).

It is easy to find other examples of statements that are likely to be true but unprovable. The essential trick is to find an infinite sequence of events, each of which might happen by accident, but with a small total probability for even one of them happening. Then the statement that none of the events ever happens is probably true but cannot be proved.

How does the randomness of the digits imply that the statement cannot be proven? Superficially the quote seems to use two different notions of randomness, namely "we cannot detect any patterns" (i.e. a pure random generator is the best predictor we have) and "we have shown that there can be no patterns" (i.e. we have shown no other predictor can do any better). Is this a known result from Ergodic Theory?

I was about to type out a rebuttal of this, but halfway through I realized I actually agree with you. The "some non-random property" of the digits of the powers of two is that they are all digits found in order inside of powers of two. I would even go so far as to say that if the statement really can't be proven (even taking into account the fact that the digits aren't truly random) then there's a sense in which it isn't true. (And if it can't be proven false, then I'd also say it isn't false.)

If it can't be proven false, then it definitely isn't false!

Equivalently: If it's false, then it can be proven false.

Why do I say that? Well, if it's false, then there exists a power of two that is the reverse of a power of five. But that has a very short proof: just write down the smallest example.

As to the case where it can't be proven either way: I would say that it has to be true in that case, but this might be one of those things that sufficiently diehard constructivists would agree with you on.

I am extremely interested in examples of heuristic arguments like this where the naive version leads you astray and it takes much more sophisticated arguments to fix the problem. I'd be happy to pay a $500 bounty for any examples that I find really interesting, in number theory or elsewhere. (Usual disclaimers, my judgment of what counts as interesting will be completely arbitrary and unaccountable, *etc.*)

The best example I found was the Chebyshev bias that primes are very slightly more likely to be 3 mod 4 than 1 mod 4. The simplest explanation I know of this goes through an explicit formula for the von Mangoldt function's distribution mod 4 as explained by Terry Tao here.

(Roughly speaking: the prime *powers* should be uniform mod 4, based on an argument about L-functions that I don't understand. But the prime squares are all 1 mod 4, so the primes need to be 3 mod 4 slightly more often to cancel out. It's very surprising that it works this way---if you made me guess I would have quite strongly predicted that the prime powers would skew towards 1 mod 4, instead of the primes accommodating the squares like that.)

That post also describes another bias where prime gaps are surprisingly non-uniform, though once it's pointed out that one feels much more obvious to me.
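The mod 4 bias is already visible at modest cutoffs; a quick sieve-based sketch (the cutoff 10^5 is an arbitrary choice):

```python
# Sieve primes up to N and compare the residue classes 1 and 3 mod 4.
N = 100_000
sieve = bytearray([1]) * (N + 1)
sieve[0] = sieve[1] = 0
for i in range(2, int(N ** 0.5) + 1):
    if sieve[i]:
        sieve[i * i :: i] = bytearray(len(range(i * i, N + 1, i)))
count_1 = sum(1 for n in range(3, N + 1, 2) if sieve[n] and n % 4 == 1)
count_3 = sum(1 for n in range(3, N + 1, 2) if sieve[n] and n % 4 == 3)
print(count_1, count_3)  # 3 mod 4 leads slightly at this cutoff
```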

Another example is that the strong form of Cramér's conjecture is false, i.e. the prime gaps are not as consistently small as you would expect based on a simple random model. I don't have a good understanding of why this happens, and people don't seem shocked by it (but right now I feel kind of shocked).

I'm not really interested in diophantine equations where you can override a simple heuristic argument by factorization / analysis modulo fixed primes / dividing by common factors. It's pretty easy to see why that happens, the heuristic argument seems like a good *prima facie* guess, and it's easy to see how to adjust it once you notice new arguments.

It so happens that I asked a relevant question on MathOverflow recently. Very difficult to find any elementary explanation for this phenomenon.

That's another great example.

Thinking through the heuristic argument:

- About 1/log(n) of the numbers up to n are prime, and those have value −1.
- As we multiply by more and more factors, the value should alternate between −1 and +1.
- But starting at −1 should introduce a bias towards −1.
- It's not clear how big the bias would be. A natural guess would be in the ballpark of 1/log(n).
- A bit more precisely, I think the number of prime factors is roughly 1 plus a poisson with mean log(log(n)). (I'm just getting that from some random google results, e.g. here. But it seems intuitively right: each number has a probability of about 1/p of being divisible by p and a probability of about 1/log(n) of being prime, and the sum of 1/p is log(log(n)).)
- Poisson distributions are more likely to be even than odd (and we're adding 1 so more likely to be odd than even). This computation gives you a bias of order e^(−2 log(log(n))). Putting in the mean, we get a bias of order 1/log(n)^2.
- That's still way bigger than the real bias, which is conjecturally of order n^(−1/2).

So tentatively I'm pretty surprised by how close this is to 50-50. I do feel like I'm probably overlooking something or messed up that argument. If not then this is a bit of a different situation from the other surprises, since there's a heuristic argument suggesting a big bias that we don't actually see. It would be good to understand what's up here.

(From my perspective this is probably the more interesting kind of failure, since it suggests that this is a case where the heuristic argument really "was wrong" instead of merely overlooking something that could be pointed out by a more sophisticated argument.)

The Poisson approximation is not good: it's actually a known theorem (Erdős–Kac) that the number of prime factors of n for large n behaves like it's distributed normally with mean log(log(n)) and standard deviation √(log(log(n))). ~~Since the whole even/odd distinction relies crucially on the Poisson approximation, it also fails here~~. (This part is incorrect, the distinction between Poisson and normal doesn't matter in this case because both approximations are too low resolution to be useful. See my next comment in the thread.)

It's easy to give a heuristic justification for this: the basic idea is that if you look at the interval [1, n] for n large, then the indicator functions 1[p divides m], which take the value 1 if p divides the input and 0 otherwise, are all "independent enough" for, say, p ≤ √n.

Now, you want to know the distribution of

ω(m) = the sum over primes p ≤ n of 1[p divides m].

Primes between √n and n don't contribute much to this sum, because the sum of the reciprocals of the primes up to n scales like log(log(n)), so you can pretend they don't exist. When you do that, the remaining indicator functions are all uncorrelated enough with each other thanks to the Chinese remainder theorem that you can apply some version of the central limit theorem to conclude this random variable defined on [1, n] basically behaves as if it's normally distributed. It's easy to check both its mean and variance are log(log(n)).
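The mean claim is easy to probe empirically; a sketch counting distinct prime factors up to 10^6 (the bound is my arbitrary choice, and the empirical mean exceeds log(log(n)) by a known additive constant, so only rough agreement is expected):

```python
# Empirical check: the number of distinct prime factors omega(n) has mean
# close to log(log(N)) for n up to N (up to a bounded additive constant).
import math

N = 10 ** 6
omega = [0] * (N + 1)
for p in range(2, N + 1):
    if omega[p] == 0:  # p is prime: no smaller prime has marked it yet
        for multiple in range(p, N + 1, p):
            omega[multiple] += 1
empirical_mean = sum(omega[2:]) / (N - 1)
print(empirical_mean, math.log(math.log(N)))
```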

That's the same heuristic argument I was imagining for why it would be Poisson (except I wasn't being careful and thinking about CRT, or explicitly cutting out the large primes, just naively assuming independence by default). When mean and variance are equal, isn't a Poisson distribution extremely close to a normal distribution? Or put differently: if you are adding up coins whose probability of heads is close to 0, I'd expect the Poisson approximation to be as good as the normal approximation. There are higher-order differences between the two distributions, but is it specifically known to be more normal than Poisson?

I guess the big problem is that you need an extremely accurate approximation to get at the parity, and neither of these approximations is very accurate? E.g. adding 1 is a pretty small impact on the distribution, but a huge impact on the parity.

Another pass at the heuristic argument:

- The number of copies of a prime p that divides a random number is geometric. The probability that the count is odd is 1/(p+1). I'd guess these events are all pretty independent for p ≤ √n based on what you said.
- If we have a bunch of coins where the p-th coin has probability 1/(p+1) of being odd, the probability of the sum being even is (1/2)(1 + the product over p of (1 − 2/(p+1))).
- That's a bias of order 1/log(n)^2 again.
- So it seems like the count of small prime divisors should have a very significant bias towards being even.
- So I guess all the action is in the large prime divisors, and probably from the anticorrelation with the count of small prime divisors (seems like the magic is that 0 isn't a valid answer). I guess we kind of knew that anyway, since it's going to flip the sign of the bias. In expectation there are O(1) large prime divisors, but that's still enough to totally dominate the calculation.

This doesn't feel like a promising line of attack. Would be interesting to check the claim that the count of small prime divisors with multiplicities is quite even-skewed.
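A quick simulation of that last claim (the bounds N = 10^5 and B = 100 are arbitrary choices; the heuristic above predicts the even counts should win by a couple of percent):

```python
# Tally the parity of the number of prime divisors <= B (with multiplicity)
# for each n <= N. The heuristic predicts a skew towards even counts.
N, B = 100_000, 100
small_primes = [p for p in range(2, B + 1)
                if all(p % q for q in range(2, int(p ** 0.5) + 1))]
even = odd = 0
for n in range(2, N + 1):
    m, count = n, 0
    for p in small_primes:
        while m % p == 0:
            m //= p
            count += 1
    if count % 2 == 0:
        even += 1
    else:
        odd += 1
print(even, odd)  # expect even > odd
```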

I agree with all of that, and I think if you simulated it you'd indeed find a large bias for the number of small prime divisors to be even. The problem is that 1/log(n)^2 is such a small bias in expectation that it will already be offset by just the primes in the interval (√n, n], which contribute on the order of 1/log(n) to the average in the opposite direction, since all primes trivially have an odd number of prime factors.

Furthermore, there are subtleties here about exactly how much independence you need. If you want everything to be jointly independent then you can really only work with the primes up to log(n) while being safe - this is because the product of the primes up to log(n) is roughly of order n. Once you go past that, while correlations involving only a small number of primes are still fine, correlations involving lots of primes break down, and for parity problems you need to control the entire joint distribution in a fine way.

This is not a problem for the normal approximation because to show convergence in distribution to a normal distribution, you just need to show all the moments converge to the right values and use a result like Stone-Weierstrass to approximate any continuous test function uniformly by a polynomial. You can do this just by working with primes up to a bound depending on the exact moment you're studying. However, this result is really "low resolution", as you correctly identify.

A couple of examples from quadratic residue patterns modulo primes:

**Example 1.** Let p be a large prime. How many elements x ∈ {1, …, p−2} are there such that both x and x+1 are quadratic residues mod p?

Since half of the elements mod p are quadratic residues and the events "x is a QR" and "x+1 is a QR" look like they are independent, a reasonable guess is p/4. This is the correct main term, but what about the error? A natural square-root error term is *not* right: one can show that the error is O(1), with the exact value depending only on whether p is 1 or 3 mod 4. (The proof is by elementary manipulations with the Legendre symbol, see here.) So there's hidden structure that makes the error smaller than what a naive randomness heuristic suggests.
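Example 1 can be checked directly via Euler's criterion; in the sketch below the O(1) error is written out as an exact closed form, (p−5)/4 for p ≡ 1 (mod 4) and (p−3)/4 for p ≡ 3 (mod 4), which matches the counts (the prime list is an arbitrary choice):

```python
# Count x in {1, ..., p-2} with x and x+1 both quadratic residues mod p,
# and compare against the exact closed form (p-5)/4 or (p-3)/4.
def is_qr(a: int, p: int) -> bool:
    return pow(a, (p - 1) // 2, p) == 1  # Euler's criterion

checks = []
for p in [101, 103, 1009, 1019, 10007, 10009]:
    count = sum(1 for x in range(1, p - 1)
                if is_qr(x, p) and is_qr(x + 1, p))
    expected = (p - 5) // 4 if p % 4 == 1 else (p - 3) // 4
    checks.append(count == expected)
print(checks)  # expect all True
```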

**Example 2.** Let p be a large prime. How many elements x are there such that all of x, x+1 and x+2 are quadratic residues mod p?

Again, the obvious guess for the main term is correct (there are roughly p/8 such x), so consider the error term. The error is not O(1) this time! The next guess is a square-root error term. Perhaps as p ranges over the primes, the error term is (after suitable normalization) normally distributed (as is motivated by the central limit theorem)? This is not correct either!

The error is in fact bounded in absolute value by roughly 2√p, following from a bound on the number of points on elliptic curves modulo p. And for the distribution of the error, if one normalizes the error by dividing by √p (so that the resulting value is in a bounded interval), the distribution behaves like 2cos(θ), where θ is distributed on [0, π] according to the Sato–Tate measure as p ranges over the primes (see here). This is a deep result, which is not easy to motivate in a couple of sentences, but again there's hidden structure that the naive randomness heuristic does not account for.

(And no, one does not get normal distribution for longer streaks of quadratic residues either.)
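Example 2's square-root-sized error is also easy to observe numerically; a sketch (the prime list is an arbitrary choice, and the assertion only checks that the deviation from p/8 stays well within √p):

```python
# Count x with x, x+1, x+2 all quadratic residues mod p. The deviation
# from p/8 shows square-root cancellation, not the O(1) behaviour of
# Example 1.
import math

def is_qr(a: int, p: int) -> bool:
    return pow(a, (p - 1) // 2, p) == 1  # Euler's criterion

errors = []
for p in [1009, 1013, 10007, 10009]:
    count = sum(1 for x in range(1, p - 2)
                if is_qr(x, p) and is_qr(x + 1, p) and is_qr(x + 2, p))
    errors.append((p, count - p / 8))
print(errors)
```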

Readers interested in this topic should probably read Terence Tao's excellent discussions of probabilistic heuristics in number theory, e.g. this post discussing Fermat's last theorem, the ABC conjecture, and twin primes or this post on biases in prime number gaps. Those posts really helped improve my understanding of how such heuristic arguments work, and there are some cool surprises.

Isn't the Stuart conjecture an extremely weak form of the Lander, Parkin and Selfridge conjecture? If you specialize their conjecture to the case of a single n-th power on one side, then it implies your conjecture but with n−1 instead of n/2. (I'm not sure why you picked n/2.)

I think the fundamental thing about Fermat's last theorem for n=3 is that you can divide by any common divisor of x and y. Once you consider a minimal solution with (x, y, z) relatively prime, and consider the factorization x^3 + y^3 = (x+y)(x^2 − xy + y^2), the n=3 case is also very straightforward to argue heuristically. [ETA: this is not true, it actually seems to fundamentally be a subtle claim related to the elliptic curve structure.]

I don't think it's a fair deduction to conclude that Goldbach's conjecture is "probably true" based on an estimate of the measure (or probability) of the set of possible counterexamples being small. The conjecture is either true or false, but more to the point I think you are using the words probability and probable in two different ways (the measure-theoretic sense, and the sense of uncertainty about the truth value of a statement), which obfuscates (at least to me) what exactly the conclusion of your argument is.

There is of course a case to be made about whether it matters if Goldbach's conjecture should be considered as true if the first counterexample is larger than any number that could possibly and reasonably manifest in physical reality. Maybe this was what you were getting at, and I don't really have a strong or well thought out opinion either way on this.

Lastly, I wonder whether there are examples of heuristic calculations which make the wrong prediction about the truth value of the conjecture to which they pertain. I'm spitballing here, but it would be interesting to see what the heuristics for Fermat's theorem say about Euler's sum of powers conjecture (of which Fermat's last theorem is the K = 2 case), since we know that the conjecture is false for K = 4. More specifically, how can we tell a good heuristic from a bad one? I'm not sure, and I also don't mean to imply that heuristics are useless; rather, maybe they are useful because they (i) give one an idea of whether to try to prove something or look for a counterexample, and (ii) give a rough idea of why something should be true or false, and what direction a proof should go in (e.g. for Goldbach's conjecture, it seems like one needs precise statements about how the primes behave like random numbers).

Note that the probabilistic argument fails for n=3 for Fermat's last theorem; call this (3,2) (power=3, number of summands is 2).

So we know (3,2) is impossible; Euler's conjecture is the equivalent of saying that (n+1,n) is also impossible for all n. However, the probabilistic argument fails for (n+1,n) the same way as it fails for (3,2). So we'd expect Euler's conjecture to fail, on probabilistic grounds.

In fact, the surprising thing on probabilistic grounds is that Fermat's last theorem is true for n=3.

> There is of course a case to be made about whether it matters if Goldbach's conjecture should be considered as true if the first counterexample is larger than any number that could possibly and reasonably manifest in physical reality. Maybe this was what you were getting at, and I don't really have a strong or well thought out opinion either way on this.

That would be sufficient, but there are more easily met conditions like:

1)

if you have a bunch of sequences of observations (100 or 1000 years' worth)

and none of them include the counterexample as an observation with high probability

2)

The frequency is low enough that it isn't worth accounting for.

For example, if you flip a coin it comes up heads or tails - is it worth bringing up another possibility?

> So Andrew Wiles's genius was in showing there were no unexpected obstructions for the "likely" outcome to be true. That's why the proof is so hard: he was trying to prove something very "likely", and show an absence of structure, rather than a presence, without knowing what that structure could be.

This is a poor description of Wiles's proof; in fact, I would call it diametrically wrong. Wiles proved the *presence* of a very rigid structure - not the absence - and the presence of this structure implied FLT via the work of other mathematicians.

I don't have a great understanding of the broader point you are making, so I don't know how big an issue this mistake presents. However, be aware that the paradigm you've extracted from the ideas in this post has led to at least one incorrect prediction.

I'll try to explain how Wiles's proof diverges from your model of it by way of analogy. Suppose that we instead wanted to prove Fermat's first theorem:

**Fermat's first theorem:** For every *even* integer n, there are no nontrivial integer solutions to the equation x^n + y^n = −z^n.

Further suppose that in our world, mathematicians know about the notion of positivity and absolute values, but the proof of the following fact has long evaded them.

**Positivity conjecture:** For every nonzero integer x and every even integer n, we have x^n > 0.

The positivity conjecture is a very important structural fact about the integers. And it immediately implies Fermat's first theorem (since the left-hand side must be positive, but the right-hand side must be negative unless x,y,z are all 0). So Fermat's first theorem follows from an important structural fact.

However, in our supposed world, mathematicians don't have access to the positivity conjecture. They might perform the exact same analysis in your post (it goes through verbatim!), and conclude that if you check Fermat's first theorem for enough n, then it is probably true. However, it is *not* true that the proof of FFT via the positivity conjecture is "proving an absence of structure" - quite the opposite!

The analogue of the positivity conjecture in the real world is the Modularity theorem. This is what Wiles actually proved, and it was already known that the Modularity theorem implies FLT. And as with the positivity conjecture, the Modularity theorem is a very powerful **structural** result. To give a slogan, it says that every elliptic curve over the rational numbers is "modular," meaning that it relates in an appropriate way to an object called a modular form.

> Wiles proved the presence of a very rigid structure - not the absence - and the presence of this structure implied FLT via the work of other mathematicians.

If you say that "Wiles proved the Taniyama–Shimura conjecture" (for semistable elliptic curves), then I agree: he's proved a very important structural result in mathematics.

If you say he proved Fermat's last theorem, then I'd say he's proved an important-but-probable *lack of structure* in mathematics.

So yeah, he proved the existence of structure in one area, and (hence) the absence of structure in another area.

And "to prove Fermat's last theorem, you have to go via proving the Taniyama–Shimura conjecture", is, to my mind, strong evidence for "proving lack of structure is hard".

I'm not sure what's going on with the small numbers though. When I use the formula to try to get the probability of Goldbach's conjecture holding for all numbers > 2, I get around 10^-15.

More sophisticated formulae that take into account the even-odd pattern still get around 10^-6. Of course, this is sensitively dependent on exactly where you start. But it seems that this probabilistic reasoning showing so strongly that a small even number shouldn't be a sum of two primes indicates a problem with the probabilistic model.

Yes, I got that result too. The problem is that the prime number theorem isn't a very good approximation for small numbers. So we'd need a slightly more sophisticated model that has more low numbers.

I suspect that moving from "sampling with replacement" to "sampling without replacement" might be enough for low numbers, though.

For a more precise probabilistic approach to Fermat's last theorem by Feynman, which takes into account the distribution of the n-th powers, see this amazing article: http://www.lbatalha.com/blog/feynman-on-fermats-last-theorem

EDIT: Added a section on Euler's conjecture.

## The Goldbach conjecture is likely

The Goldbach conjecture is that "every even integer above two is the sum of two primes". For example, 4=2+2, 6=3+3, 8=5+3, and so on.
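The conjecture is cheap to verify for small even numbers; a minimal sketch (the cutoff 10^5 is an arbitrary choice):

```python
# Verify Goldbach's conjecture for every even number 4 <= n <= N.
N = 100_000
sieve = bytearray([1]) * (N + 1)
sieve[0] = sieve[1] = 0
for i in range(2, int(N ** 0.5) + 1):
    if sieve[i]:
        sieve[i * i :: i] = bytearray(len(range(i * i, N + 1, i)))
failures = [n for n in range(4, N + 1, 2)
            if not any(sieve[p] and sieve[n - p] for p in range(2, n // 2 + 1))]
print(failures)  # expected: [] -- no exceptions below N
```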

Though this is a mathematically precise statement, we can talk about the "probability" of it being correct. How so?

Well, by the prime number theorem, the probability of a random number less than N being prime is 1/log(N). So if we add together all pairs of primes less than N, we get (N/log(N))^2 different sums; these sums will all be less than 2N.

So, is N itself one of these sums? Well, the "probability" that it's not the total of any given sum is 1 − 1/(2N); therefore the probability of it being the total of none of the sums is:

(1 − 1/(2N))^((N/log(N))^2) = ((1 − 1/(2N))^(2N))^(N/(2 log(N)^2)) ≈ (1/e)^(N/(2 log(N)^2)).

So the probability of N being the total of such a sum is roughly:

1 − e^(−N/(2 log(N)^2)).

Therefore, the probability of all numbers N being the total of such a sum is roughly:

p_2 = ∏_{N=2}^{∞} (1 − e^(−N/(2 log(N)^2))).

Now, the infinite product p_2 converges to a non-zero number if and only if the sum ∑_{N=2}^{∞} e^(−N/(2 log(N)^2)) converges to a finite number. That series can be seen to be convergent (for example, by noting that e^(−N/(2 log(N)^2)) < 1/N^2 for large enough N and using the comparison test).

If we use computers to get an estimate of p_2, we get a pretty low probability. However, most of that improbability mass is on the low numbers, and the Goldbach conjecture has been tested up to 4×10^18. So, if we assume it's valid up to 1000, we numerically get:

p_1000 = ∏_{N=1000}^{∞} (1 − e^(−N/(2 log(N)^2))) ≈ 0.9961.
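This numerical estimate is easy to reproduce; a sketch that truncates the infinite product at N = 200,000 (terms beyond that are negligibly close to 1 at this precision):

```python
# Estimate p_1000 = prod_{N=1000}^{inf} (1 - exp(-N / (2 log(N)^2)))
# by truncating the product at a large cutoff.
import math

product = 1.0
for N in range(1000, 200_001):
    product *= 1.0 - math.exp(-N / (2 * math.log(N) ** 2))
print(round(product, 4))  # close to the 0.9961 quoted above
```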

So the Goldbach conjecture is pretty likely, and, the more examples we discover where it holds, the more likely it is to hold all the way to infinity.

## "Probabilities" of logical facts

The above reasoning seems dubious. The primes are not defined by random sampling among the natural numbers; quite to the contrary, they come from a mathematical rule of extreme precision. So what do these probabilities mean?

Let X be an infinite set of numbers, selected from the natural numbers in a way that looks like the prime number theorem (e.g. the n-th element of X is approximately n·log(n)). Then what we've shown is that, if such an X obeys the "X-Goldbach conjecture" up to 1000, then we'd expect it to obey it all the way to infinity.

Thus the Goldbach conjecture can be restated as "in terms of sums of two elements, the prime numbers behave like a typical sequence selected in a prime-number-theorem way".

So the Goldbach conjecture is not saying that there is something special about the primes; in fact, it's saying the opposite, that the primes are typical of similar sequences, that nothing in the specific ways that the primes are selected has an impact on the sum of two primes. So the Goldbach conjecture is essentially saying "there is no obstruction to the primes being typical in this way".

## One obstruction

Did you notice that, so far, at no point did I require N to be an even number? But all the primes except for 2 are odd. So the distribution of sums of primes is very (very!) heavily skewed towards even numbers; most odd numbers will not appear at all. So, that is one clear obstruction to the possible values of the sum, coming from the way the primes are constructed. The Goldbach conjecture is therefore saying that there are no additional obstructions beyond this one condition on parity.

In fact, the Goldbach conjecture has changed; 1 used to be seen as a prime number, and the original conjecture included 2=1+1 as another example. Then 1 was removed from the list of prime numbers, and it turned out, as far as we can tell, that 2 was the only even number we lost from the list of sums.

If we removed 2 from the list of primes, we'd only lose 4=2+2 as a sum. Similarly, if we strike out the first m primes, we expect - on probabilistic grounds - that "all numbers greater than a given n are the sums of two primes (first m primes not included)". If that were to fail, then there's a really interesting obstruction out there.

## Fermat's last theorem was likely (for n>3)

We can show, similarly, that Fermat's last theorem was very likely on probabilistic grounds. The theorem states that, for n > 2, there do not exist natural numbers x, y, z > 0 such that x^n + y^n = z^n.

Fix z and n > 3. Counting 1 and z, there are z natural numbers less than or equal to z. Therefore there are z^2 possible sums x^n + y^n, all less than 2z^n. So the probability that any two of these n-th powers sum to z^n is z^2/(2z^n) = 1/(2z^(n−2)).

So the probability that there are no z's such that z^n = x^n + y^n is:

p_{2,n} = ∏_{z=2}^{∞} (1 − 1/(2z^(n−2))).

The sum ∑_{z=2}^{∞} (1/2)·1/z^(n−2) converges. Moreover, we can also sum over n: ∑_{z=2}^{∞} ∑_{n=4}^{∞} (1/2)·1/z^(n−2) = ∑_{z=2}^{∞} (1/2)·z^(−2)/(1 − 1/z). This also converges. So the probability of Fermat's last theorem was non-zero, at least for n > 3; add on the fact that the theorem was proved for many n and checked for many x, y, and z, and it was very probable that the theorem was correct even before it was proved.
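One can compute these products numerically to see the contrast between n = 3 and n ≥ 4; a sketch (the cutoff 10^5 is an arbitrary choice):

```python
# Partial products for p_{2,n} = prod_{z=2..z_max} (1 - 1/(2 z^(n-2))).
# For n >= 4 the product converges to a positive number; for n = 3 it
# decays to zero like a constant over sqrt(z_max).
def partial_product(n: int, z_max: int) -> float:
    product = 1.0
    for z in range(2, z_max + 1):
        product *= 1.0 - 1.0 / (2 * z ** (n - 2))
    return product

p3 = partial_product(3, 100_000)  # keeps shrinking as z_max grows
p4 = partial_product(4, 100_000)  # has essentially converged
print(p3, p4)
```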

So Andrew Wiles's genius was in showing there were no unexpected obstructions for the "likely" outcome to be true. That's why the proof is so hard: he was trying to prove something very "likely", and show an absence of structure, rather than a presence, without knowing what that structure could be.

## Euler's conjecture was unlikely

Euler's conjecture was that you need to sum at least n n-th powers to get another n-th power; Fermat's last theorem establishes this for n=3, and Euler theorised that this extended to all n.

Euler's conjecture is in fact false; for n=4 we have three fourth powers that sum to another fourth power:

95800^4 + 217519^4 + 414560^4 = 422481^4.

There are counterexamples known for n=5 as well, so the conjecture is false, and not just for one value of n.
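Both counterexamples are quick to verify in exact integer arithmetic (Frye's for n = 4 above, and Lander and Parkin's 27^5 + 84^5 + 110^5 + 133^5 = 144^5 for n = 5):

```python
# Verify the known counterexamples to Euler's conjecture exactly.
lhs4 = 95800 ** 4 + 217519 ** 4 + 414560 ** 4   # Frye's n = 4 example
rhs4 = 422481 ** 4
lhs5 = 27 ** 5 + 84 ** 5 + 110 ** 5 + 133 ** 5  # Lander & Parkin, n = 5
rhs5 = 144 ** 5
print(lhs4 == rhs4, lhs5 == rhs5)  # True True
```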

More interesting from our perspective, we expect it to be false on probabilistic grounds. Recall that the argument about Fermat's last theorem does not work for n=3; it fails because the crucial sum is of the type 1+1/2+1/3+1/4+…, which diverges.

Similarly, if we estimate the probability of Euler's conjecture, we get terms like the following (for some constants Cn):

p_{2,n} = ∏_{z=2}^{∞} (1 − C_n/z^(n−(n−1))) = ∏_{z=2}^{∞} (1 − C_n/z).

This goes to zero for the same reason as the n=3 case.

So, on probabilistic grounds, we expect Fermat's last theorem to be true for n≥4, and we expect Euler's conjecture to be false.

The only unexpected result here is that Fermat's last theorem and Euler's conjecture are true for n=3. So something about the structure of the problem for n=3 is moving the result away from the probabilistic outcome.

## The "Stuart conjecture"

Based on what I wrote about Euler's conjecture, I'll hazard a conjecture myself, which I believe to be true on probabilistic grounds: if there are k integers whose n-th powers sum non-trivially to another n-th power, then k is greater than or equal to n/2.

Fermat's last theorem shows this is true for n = 1, 2, 3, 4, 5, and 6.