Trouble with Bayes Theorem? (The actual math is confusing)

by TheatreAddict · 1 min read · 25th Sep 2011 · 17 comments

This is probably going to sound utterly ridiculous, but I have a sad confession.

I've read Yudkowsky's post on Bayes' Theorem (http://yudkowsky.net/rational/bayes) five times. I've written down the equation. Tried to formulate an answer. 

I still don't understand it. That being said, I've lived my entire life under the false mentality that maths is boring and painful, and it's just recently I've tried to actually understand the concepts I learn in school, and not just temporarily memorize them for the next exam. 

Here's the problem, on Yudkowsky's post: 

"1% of women at age forty who participate in routine screening have breast cancer.  80% of women with breast cancer will get positive mammographies.  9.6% of women without breast cancer will also get positive mammographies.  A woman in this age group had a positive mammography in a routine screening.  What is the probability that she actually has breast cancer?" 

When Eliezer changes the percentages to real numbers:

"100 out of 10,000 women at age forty who participate in routine screening have breast cancer.  80 of every 100 women with breast cancer will get a positive mammography.  950 out of  9,900 women without breast cancer will also get a positive mammography.  If 10,000 women in this age group undergo a routine screening, about what fraction of women with positive mammographies will actually have breast cancer?"

 

When I see this version of the problem, I can make the answer come out to 7.8 percent. I do this by taking the 80 women and dividing by the 80 women plus the 950 women, so 80/80+950 (or 80/1030=.078). So I get 7.8%, which should be the right answer.

 

But when I try to do the same with percentages, it all gets sort of screwy. I take the 80 percent of women (.8) divided by that same 80 percent (.8) plus 9.5 percent of women without cancer who test positive for it (.095). So I get .8/.8+.095=89%.

I feel like I'm making a really, really stupid error. But I just don't know what it is. >_> 


But when I try to do the same with percentages, it all gets sort of screwy. I take the 80 percent of women (.8) divided by that same 80 percent (.8) plus 9.5 percent of women without cancer who test positive for it (.095). So I get .8/.8+.095=89%.

Let's look at those numbers closely. In particular the numbers on the denominator (80% + 9.5%). What are those percentages of?

  • 80% of women with breast cancer
  • 9.5% of women without cancer

When we add 80% + 9.5% and get 89.5%, what is that 89.5% of? Nothing that makes sense, because those percentages are of two completely different groups. It's like adding 80% of an apple to 10% of an orange. We end up with 90% of an appleorange, which just looks silly.

I recommend taking a look at An Intuitive Explanation of Eliezer Yudkowsky’s Intuitive Explanation of Bayes’ Theorem.

I came to recommend that page and the Children's Book version of Bayes' Theorem. Bayes' Theorem didn't click until I read both of those.

It may seem silly, but I still visualize Bayesian updates as colored polygons.

You're forgetting the "base rate" in your calculation: the actual rate of cancer in the population. What you should really be taking the ratio of is (the fraction of all women that have cancer and test positive) / (the fraction of all women that test positive, whether or not they have cancer). In percentages, that's

(80% of the 1% of women who have cancer, who correctly test positive) = 0.8 * 0.01.

divided by

(80% of the 1% of women who have cancer, who correctly test positive) together with (9.6% of the 99% of women who don't have cancer, who test positive anyway) = 0.8 * 0.01 + 0.096 * 0.99.

So the ratio is (0.8 * 0.01) / (0.8 * 0.01 + 0.096 * 0.99), and that does equal 0.078.
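The ratio above can be sketched in a few lines of Python. This is only an illustration, not something from Eliezer's post: the function name `posterior` and its parameter names are made up here, and the three inputs are the 1%, 80% and 9.6% figures from the problem.

```python
def posterior(prior, p_pos_given_cancer, p_pos_given_healthy):
    """Return P(cancer | positive test). Hypothetical helper for illustration."""
    # Fraction of ALL women who have cancer and test positive.
    true_pos = p_pos_given_cancer * prior
    # Fraction of ALL women who do not have cancer but test positive anyway.
    false_pos = p_pos_given_healthy * (1 - prior)
    return true_pos / (true_pos + false_pos)


result = posterior(prior=0.01, p_pos_given_cancer=0.8, p_pos_given_healthy=0.096)
print(round(result, 3))  # 0.078
```

Multiplying each test rate by the size of its own group is the step that was missing from the .8/(.8+.095) attempt.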

Thanks. I'm pretty sure I understand now. Although I'm not sure why I get the correct answer when I'm working with the actual numbers and not percentages when I do the math wrong.

But when I do the math like you wrote, I get the right answer for the percentages. So I get that part. But aren't I ignoring the base rate in the actual numbers one? Or no?

Although I'm not sure why I get the correct answer when I'm working with the actual numbers and not percentages when I do the math wrong.

I know it makes more sense to you now, but I want to point out that reality isn't school, and nobody is going to take marks off for using actual numbers or ratios instead of percentages (the 'pure' way that the teacher prefers or what-have-you).

A calculator more reliably gets me the answer than mental arithmetic, and so I use a calculator at work even though it seems lazier than doing it in my head - in the same way, if ratios and actual numbers more reliably let you use Bayes Theorem than percentages, use actual numbers and all the people who think it's purer to use percentages be damned.

I'm awfully glad to hear that, I'm not a big fan of percentages... Real numbers just come easier to me, I suppose.

Once I figure out the formula itself, then I feel comfortable using a calculator, but I hate using a calculator if I don't understand the mental math to begin with.

You're not. Remember, you're not taking 80 of the 10,000 women in the population. You're only taking 80 of the 100 women with breast cancer. Likewise, it's not 9.6% of all the women, it's 9.6% of the women who don't have breast cancer, or 950/9900. The wording of the problem already took the base rates into account, so when you're plugging the real numbers in, you are automatically taking the base rates into account. By giving you 80/100 and 950/9900, Eliezer already did the division for you.

....Oh.

Well, thanks Owen, Swimmy. I now understand Bayes Theorem significantly more than I did a half hour ago. :)

You can think of accounting for the base rate as equivalent to using the actual numbers.

How many women have cancer and test positive? 0.8 probability * 0.01 population. How many women don't have cancer and test positive? 0.096 probability * 0.99 population.

When you use the actual numbers of people, you get those numbers by using the base rate: 10,000 women total, of which 100 have cancer (that's the base rate in action), of which 80 test positive, etc. So if you use the numbers 80 (= 0.8 * 0.01 * 10000) and 950 (≈ 0.096 * 0.99 * 10000), you're not ignoring the base rate. You would be ignoring the base rate if you used the numbers 8000 and 960 (80% and 9.6% of the population of 10,000, respectively), but those numbers don't refer to any relevant groups of people.
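Here is a small sketch of that point, assuming the same population of 10,000 used in the problem. The variable names are my own. It computes the answer from head counts, from fractions, and from the base-rate-ignoring numbers 8000 and 960.

```python
population = 10_000
base_rate = 0.01

with_cancer = population * base_rate          # 100 women
without_cancer = population - with_cancer     # 9,900 women

true_positives = 0.8 * with_cancer            # 80 women
false_positives = 0.096 * without_cancer      # 950.4, which the problem rounds to 950

from_counts = true_positives / (true_positives + false_positives)
from_fractions = (0.8 * 0.01) / (0.8 * 0.01 + 0.096 * 0.99)

# Ignoring the base rate: both test rates applied to the whole population.
ignoring_base_rate = 8000 / (8000 + 960)

print(round(from_counts, 3), round(from_fractions, 3), round(ignoring_base_rate, 3))
# 0.078 0.078 0.893
```

The counts and the fractions agree because the population size cancels out of the ratio. The third number rounds to the same 89% that came out of the percentage attempt in the original post.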

But aren't I ignoring the base rate in the actual numbers one? Or no?

The actual numbers in the problem were chosen in such a way as to make the base rates obvious. Here is another version using real numbers where the base rates aren't quite so obvious. See if you can get it right:

"1 out of every 100 women at age forty who participate in routine screening have breast cancer. 80 out of every 100 women with breast cancer will get positive mammographies. 96 out of every 1,000 women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?"

80/80+950 (or 80/1030=.078)

In mathematical notation this is written 80/(80+950), because when you evaluate an expression you do multiplication and division before addition, and parentheses before either. 80/80+950 actually equals 951. This isn't what is causing your problem right now, but it would be if you typed the expression into an advanced calculator.
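Python follows the same precedence rules, so it works as a quick check of the two readings. The variable names are just for illustration.

```python
without_parentheses = 80 / 80 + 950    # parsed as (80 / 80) + 950
with_parentheses = 80 / (80 + 950)     # the intended calculation

print(without_parentheses)             # 951.0
print(round(with_parentheses, 3))      # 0.078
```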

Does this help make it intuitive? I haven't seen it explained this way.

After going through all this trouble, did you find math less boring than before?

Here's how I think about these problems. You know that the woman tested positive. There are two kinds of women who test positive: women who actually have breast cancer and correctly test positive (the true positives), and women who don't have breast cancer but mistakenly test positive (the false positives). How big are these two groups?

First, the true positives: women who actually have breast cancer and correctly test positive. 1% of women actually have breast cancer, and 80% of them test positive, so 0.8% of women are in that group (.01 x .8).

Then, the false positives: women who don't have breast cancer but mistakenly test positive. 99% of women don't have breast cancer, and 9.6% of them mistakenly test positive, so 9.5% of women are in that group (.99 x .096).

What you care about is the relative size of the two groups. For every 0.8 women who have cancer and test positive, 9.5 women don't have cancer but still test positive. That's a 0.8:9.5 ratio; if you want to turn it into a percentage it's 0.8/10.3 = 7.8% of the women who test positive are ones that actually have cancer.

So instead of using the whole formula, I think through the problem and do three simple calculations along the way. If you only need to calculate a rough estimate, you can do this even quicker with less calculating. Glancing at the numbers, about 1% of women are in the true positive group, and about 10% of women are in the false positive group, so about 1/11 of women who test positive have cancer (9%). That's pretty close to the actual answer of 7.8%.
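A rough sketch of the same reasoning in Python, treating the two groups as odds and then converting the odds to a probability. The variable names are mine, and the "rough" version simply rounds the groups to 1% and 10% as described above.

```python
true_pos = 0.01 * 0.8       # about 0.8% of all women: cancer and a positive test
false_pos = 0.99 * 0.096    # about 9.5% of all women: no cancer but a positive test

odds = true_pos / false_pos            # roughly 0.8 : 9.5
exact = odds / (1 + odds)              # odds converted to a probability

rough_odds = 1 / 10                    # "about 1%" against "about 10%"
rough = rough_odds / (1 + rough_odds)  # 1/11

print(round(exact, 3), round(rough, 3))  # 0.078 0.091
```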

[anonymous]

I absolutely understand this situation. Been there, done that, got the Post-It note which explains the concept in Rubix-code and lives permanently in my notebook.

You've got your p(can+|mam+) = (p[mam+|can+] p[can+]) / p(mam+). In numbers, that's .8 (eighty percent of 1%) divided by 9.5 (9.6% of 99%) plus that same .8. What you're doing, I think, is forgetting that the eighty percent of women with cancer who test positive comprise .8% of all the women who get tested.
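Written out in full, with the denominator p(mam+) expanded over the two groups of women, the formula might look like the following. The notation (can+/can- for having or not having cancer, mam+ for a positive mammography) follows the comment above; the numbers are the 1%, 80% and 9.6% from the problem, and `\text` assumes the amsmath package.

```latex
\[
P(\text{can}^{+} \mid \text{mam}^{+})
  = \frac{P(\text{mam}^{+} \mid \text{can}^{+})\, P(\text{can}^{+})}
         {P(\text{mam}^{+} \mid \text{can}^{+})\, P(\text{can}^{+})
          + P(\text{mam}^{+} \mid \text{can}^{-})\, P(\text{can}^{-})}
  = \frac{0.8 \times 0.01}{0.8 \times 0.01 + 0.096 \times 0.99}
  = \frac{0.008}{0.10304} \approx 0.078
\]
```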

Also, it shook me up like nobody's business that the percentage of women without cancer who tested positive kept switching between problems, between 9.5 and 9.6! (I'm easily shaken.)

The error is pretty basic, but it is not stupid because you asked for help. The only stupid question is the one which is not asked. This is a common statement, but a true one and I mean it.

[This comment is no longer endorsed by its author]

[anonymous]

Owen has your answer for you here, but I noticed a part of your math that could have told you the idea was wrong before you checked the final percentage. When you added the percentages, you ended up with .895, or an 89.5% chance of getting a positive result from the test (you added the positive rates as if they referred to the same group). But it's fairly clear that more than 10.5% of women will get negative results, so adding those probabilities can't be right. (This is a useful way to check when you know the negative rate but not the actual counts.)
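That sanity check can be sketched like this, assuming the problem's figures and their complements (20% of women with cancer and 90.4% of women without it test negative). The variable names are my own.

```python
p_cancer = 0.01

# Each rate is weighted by the size of the group it applies to.
p_pos = 0.8 * p_cancer + 0.096 * (1 - p_cancer)   # all positive results
p_neg = 0.2 * p_cancer + 0.904 * (1 - p_cancer)   # all negative results

print(round(p_pos, 3), round(p_neg, 3))  # 0.103 0.897
```

Weighted this way, positives and negatives add up to 1, and roughly 90% of women test negative, which is nowhere near what an 89.5% positive rate would leave room for.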