# 27

Eliezer wrote a post warning against unrealistically confident estimates, in which he argued that you can't be 99.99% sure that 53 is prime. Chris Hallquist replied with a post arguing that you can.

That particular case is tricky. There have been many independent calculations of the first hundred prime numbers. 53 is a small enough number that I think someone would notice if Wikipedia included it erroneously. But can you be 99.99% confident that 1159 is a prime? You found it in one particular source. Can you trust that source? It's large enough that no one would notice if it were wrong. You could try to verify it, but if I write a Perl or C++ program, I can't even be 99.9% sure that the compiler or interpreter will interpret it correctly, let alone that the program is correct.

Rather than argue over the number of nines to use for a specific case, I want to emphasize the the importance of not assigning things probability zero or one. Here's a real case where approximating 99.9999% confidence as 100% had disastrous consequences.

I developed a new gene-caller for JCVI. Genes are interpreted in units of 3 DNA nucleotides called codons. A bacterial gene starts with a start codon (usually ATG, TTG, or GTG) and ends at the first stop codon (usually TAG, TGA, or TAA). Most such sequences are not genes. A gene-caller is a computer program that takes a DNA sequence and guesses which of them are genes.

The first thing I tried was to create a second-order Markov model on codons, and train it on all of the large possible genes in the genome. (Long sequences without stop codons are unlikely to occur by chance and are probably genes.) That means that you set P = 1 and go down the sequence of each large possible gene, codon by codon, multiplying P by the probability of seeing each of the 64 possible codons in the third position given the codons in the first and second positions. Then I created a second Markov model from the entire genome. This took about one day to write, and plugging these two models into Bayes' law as shown below turned out to work better than all the other single-method gene-prediction algorithms developed over the past 30 years.

But what probability should you assign to a codon sequence that you've never seen? A bacterial genome might have 4 million base pairs, about half of which are in long possible genes and will be used for training. That means your training data for one genome has about 2 million codon triplets. Surprisingly, a little less than half of all possible codon triplets do not occur at all in that data (DNA sequences are not random). What probability do you assign to an event that occurs zero times out of 2 million?

This came up recently in an online argument. Another person said that, if the probability that X is true is below your detection threshold or your digits of accuracy, you should assign P(X) = 0, since any other number is just made up.

Well, I'd already empirically determined whether that was true for the gene caller. First, due to a coding error, I assigned such events P(X) = 1 / (64^3 * size of training set), which is too small by about 64^3. Next I tried P(X) = 0.5 / (size of training set), which is approximately correct. Finally I tried P(X) = 0. I tested the results on genomes where I had strong evidence for what where and were not genes.

How well do you think each P(X) worked?

The two non-zero probabilities gave nearly the same results, despite differing by 6 orders of magnitude. But using P(X) = 0 caused the gene-caller to miss hundreds of genes per genome, which is a disastrous result. Why?

Any particular codon triplet that was never found in the training set would have a prior of less than one in 4 million. But because a large number of triplets are in genes outside the training set, that meant some of those triplets (not most, but about a thousand of them) had true priors of being found somewhere in those genes of nearly one half. (You can work it out in more detail by assuming a Zipf law distribution of priors, but I won't get into that.)

So some of them did occur within genes in that genome, and each time one did, its assigned probability of zero annihilated all the hundreds of other pieces of evidence for the existence of that gene, making the gene impossible to detect.

You can think of this using logarithms. I computed

P(gene | sequence) = P(sequence | gene) * P(gene) / P(sequence)

where P(sequence) and P(sequence | gene) are computed using the two Markov models. Each of them is the product of a sequence of Markov probabilities. Ignoring P(gene), which is constant, we can compute

log(P(gene|sequence)) ~ log(P(sequence | gene)) - log(P(sequence)) =

sum (over all codon triplets in the sequence) [ log(P(codon3 | codon1, codon2, gene)) - log(P(codon3 | codon1, codon2)) ]

You can think of this as adding the bits of information it would take to specify that triplet outside of a gene, and subtracting the bits of information it would take to specify that information inside a gene, leaving bits of evidence that it is in a gene.

If we assign P(codon3 | codon1, codon2, gene) = 0, the number of bits of information it would take to specify "codon3 | codon1, codon2" inside a gene is -log(0) = infinity. Assign P(X) = 0 is claiming to have infinite bits of information that X is false.

Going back to the argument, the accuracy of the probabilities assigned by the Markov model are quite low, probably one to three digits of accuracy in most cases. Yet it was important to assign positive probabilities to events whose probabilities were at least seven orders of magnitude below that.

It didn't matter what probability I assigned to them! Given hundreds of other bits scores to add up, changing the number of bits taken away by one highly improbable event by 10 had little impact. It just matters not to make it zero.