Strong Evidence is Common

by Mark Xu, 1 min read, 13th Mar 2021, 42 comments


Bayes' Theorem, Rationality
Curated
This is a linkpost for https://markxu.com/strong-evidence

Portions of this are taken directly from Three Things I've Learned About Bayes' Rule.

One time, someone asked me what my name was. I said, “Mark Xu.” Afterward, they probably believed my name was “Mark Xu.” I’m guessing they would have happily accepted a bet at 20:1 odds that my driver’s license would say “Mark Xu” on it.

The prior odds that someone’s name is “Mark Xu” are generously 1:1,000,000. Posterior odds of 20:1 imply that the odds ratio of me saying “Mark Xu” is 20,000,000:1, or roughly 24 bits of evidence. That’s a lot of evidence.
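
The arithmetic above can be checked with a short sketch. The function name `bits_of_evidence` is an illustrative choice, not something from the post; it divides the posterior odds by the prior odds and takes a base-2 logarithm.

```python
import math


def bits_of_evidence(prior_odds: float, posterior_odds: float) -> float:
    """Return log2 of the likelihood ratio that moves prior odds to posterior odds."""
    likelihood_ratio = posterior_odds / prior_odds
    return math.log2(likelihood_ratio)


# Prior odds of 1:1,000,000 and posterior odds of 20:1, as in the example above.
prior = 1 / 1_000_000
posterior = 20 / 1

print(f"likelihood ratio: {posterior / prior:,.0f}:1")
print(f"bits of evidence: {bits_of_evidence(prior, posterior):.2f}")
```

Under these numbers the result is a little over 24 bits, matching the "roughly 24 bits" figure.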

Seeing a Wikipedia page say “X is the capital of Y” is tremendous evidence that X is the capital of Y. Someone telling you “I can juggle” is massive evidence that they can juggle. Putting an expression into Mathematica and getting Z is enormous evidence that the expression evaluates to Z. Vast odds ratios lurk behind many encounters.

One implication of the Efficient Market Hypothesis (EMH) is that it is difficult to make money on the stock market. Generously, maybe only the top 1% of traders will be profitable. How difficult is it to get into the top 1% of traders? To be 50% sure you're in the top 1%, you only need 200:1 evidence. This seemingly large odds ratio might be easy to get.

On average, people are overconfident, but 12% aren't. It only takes 50:1 evidence to conclude you are much less overconfident than average. An hour or so of calibration training and the resulting calibration plots might be enough.
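
As a rough check on the two examples above, here is a minimal sketch of Bayes' rule in odds form. The helper name `posterior_probability` is hypothetical, and treating "12% aren't" as a 12% prior is an assumption about how the post intends the number to be used.

```python
def posterior_probability(prior_probability: float, likelihood_ratio: float) -> float:
    """Apply Bayes' rule in odds form and convert the result back to a probability."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Top 1% of traders: prior of 1%, evidence with a 200:1 likelihood ratio.
top_trader = posterior_probability(0.01, 200)

# Not overconfident: assumed prior of 12%, evidence with a 50:1 likelihood ratio.
calibrated = posterior_probability(0.12, 50)

print(f"P(top 1% trader | 200:1 evidence) = {top_trader:.3f}")
print(f"P(not overconfident | 50:1 evidence) = {calibrated:.3f}")
```

With these inputs, 200:1 evidence takes a 1% prior to about 67%, and 50:1 evidence takes a 12% prior to about 87%, so both ratios clear the 50% bar.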

Running through Bayes’ Rule explicitly might produce a bias towards middling values. Extraordinary claims require extraordinary evidence, but extraordinary evidence might be more common than you think.


Corollary: most beliefs worth having are extreme.

Though any belief so extreme wouldn't really feel like a "belief" in the colloquial sense: I don't internally label my belief that there is a chair under my butt as a "belief". That label instinctually gets used for things I am much less certain about, so most normal people doing an internal search for "beliefs" will only think of things that they are not extremely certain of. Most beliefs worth having are extreme, but most beliefs internally labeled as "belief" worth having are not extreme.

"Worth having" is a separate argument about relative value of new information. It is reasonable when markets exist or we are competing in other ways where we can exploit our relative advantage. But there's a different mistake that is possible which I want to note.

Most extreme beliefs are false; for every correct belief, there are many, many extreme beliefs that are false. Strong consensus on some belief is (evidence for the existence of) strong evidence of the truth of that belief, at least among the considered alternatives. So picking a belief on the basis of extremity ("Most sheeple think X, so consider Y") is doing this the wrong way around, because extremity alone is negligible evidence of value. (Prosecutor's fallacy.)

What makes the claim that extremity isn't a useful indicator of value less valid? That is, when should extreme beliefs even be considered?

I think the answer is when the evidence is both novel and cumulatively outweighs the prior consensus, or the belief is new / previously unconsidered. ("We went to the moon to inspect the landing site," not "we watched the same video again and it's clearly fake.") So we should only consider extreme beliefs, even on the basis of our seemingly overwhelming evidence, if the proposed belief is significantly newer than the extant consensus AND we have a strong argument that the evidence is not yet widely shared / understood.

You seem to have made two logical errors here. First, "This belief is extreme" does not imply "This belief is true", but neither does it imply "This belief is false". You shouldn't divide beliefs into "extreme" and "non-extreme" buckets and treat them differently. 

Second, you seem to be using "extreme" to mean both "involving very high confidence" and "seen as radical", the latter of which you might mean to be "in favour of a proposition I assign a very low prior probability". 

Restating my first objection, "This belief has prior odds of 1:1024" is exactly 10 bits of evidence against the belief. You can't use that information to update the probability downward, because -10 bits is "extreme", any more than you can update the probability upward because -10 bits is "extreme". If you could do that, you would have a prior that immediately requires updating based on its own content (so it's not your real prior), and I'm pretty sure you would either get stuck in infinite loops of lowering and raising the probability of some particular belief (based on whether it is "extreme" or not), or else be able to pump out infinite evidence for or against some belief.

Extreme, in this context, was implying far from the consensus expectation. That implies both "seen as radical" and "involving very high [consensus] confidence [against the belief]." 

Contra your first paragraph, I think, I claim that this "extremeness" is valid Bayesian evidence for it being false, in the sense that you identify in your third paragraph - it has low prior odds. Given that, I agree that it would be incorrect to double-count the evidence of being extreme. But my claim was that, holding "extremeness" constant, the newness of a claim was independent reason to consider it as otherwise more worthy of examination, (rather than as more likely,) since VoI was higher / the consensus against it is less informative. And that's why it doesn't create a loop in the way you suggested. 

So I wasn't clear in my explanation, and thanks for trying to clarify what I meant. I hope this explains better / refined my thinking to a point where it doesn't have the problem you identified - but if I'm still not understanding your criticism, feel free to try again.

Extreme, in this context, was implying far from the consensus expectation.

FWIW, my interpretation of Eliezer's comment was just that he meant high confidence.

There is a subtlety here. Large updates from extremely unlikely to quite likely are common. Large updates from quite likely to exponentially sure are harder to come by. Let's pick an extreme example: suppose a friend builds a coin-tossing robot. The friend sends you a 1 MB file, claiming it is the sequence of coin tosses. Your probability assigned to this particular sequence being the way the coin landed will jump straight from 2^-8,000,000 to somewhere between 1% and 99% (depending on the friend's level of trustworthiness and engineering skill). Note that the probability you assign to several other sequences increases too. For example, it's not that unlikely that your friend accidentally put a "not" in their code, so your probability on the exact opposite sequence should also be much greater than 2^-8,000,000. It's not that unlikely that they turned the sequence backwards, or XORed it with pi, or ... Do you see the pattern? You are assigning high probability to the sequences with low conditional Kolmogorov complexity relative to the existing data.

Now think about what it would take to get a probability of 1 - 2^-8,000,000 on the coin landing that particular sequence. All sorts of wild and wacky hypotheses have probability much greater than 2^-8,000,000, from the boring stuff like a dodgy component or other undetected bug, to more exotic hypotheses like aliens tampering with the coin-tossing robot, or dark lords of the matrix directly controlling your optic nerve. You can't get this level of certainty about anything ever. (Modulo concerns about what it means to assign p<1 to probability theory.)

You can easily update from exponentially close to 0, but you can't update to exponentially close to 1. This may have something to do with there being exponentially many very unlikely theories to start off with, but only a few likely ones.

If you have 3 theories that predict much the same observations, and all other theories predict something different, you can easily update to "probably one of these 3". But you can't tell those 3 apart. In AIXI, any Turing machine has a parade of slightly more complex, slightly less likely Turing machines trailing along behind it. The hypothesis "all the primes, and Graham's number" is only slightly more complex than "all the primes", and is very hard to rule out.

Meta point: this is one of those insights which is very likely to hit you over the head if you're doing practical technical work with probabilistic models, but not if you're just using them for small semi-intuitive problems (a use-case we often see on LW).

I remember the first time I wrote a mixture of gaussians clustering model, and saw it spitting out probabilities like 10^-5000, and thought it must be a bug. It wasn't a bug. Probabilities naturally live on a log scale, and those sorts of numbers are normal once we move away from artificially-low-dimensional textbook problems and start working with more realistic high-dimensional systems. When your data channel has a capacity of kilobytes or megabytes per data point, even if 99% of that information is irrelevant, that's still a lot of bits; the probabilities get exponentially small very quickly.
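
A small sketch of why such numbers force work on a log scale. The setup (5,000 independent observations, each with probability 0.1) is invented for illustration and is not the commenter's clustering model; it is chosen only because it yields a joint probability of 10^-5000.

```python
import math

# Hypothetical setup: 5,000 independent observations, each with probability 0.1.
per_observation_probability = 0.1
num_observations = 5000

# Multiplying directly underflows: the true value is far below the smallest
# positive double-precision float, so the running product collapses to zero.
direct_product = 1.0
for _ in range(num_observations):
    direct_product *= per_observation_probability

# Summing logarithms keeps the same quantity comfortably representable.
log10_probability = num_observations * math.log10(per_observation_probability)

print(direct_product)      # the direct product has underflowed
print(log10_probability)   # log10 of the joint probability
```

The direct product is useless here, while the log-scale value still carries the full information, which is why a probability like 10^-5000 need not indicate a bug.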

Tying back to an example in the post: if we're using ascii encoding, then the string "Mark Xu" takes up 49 bits. It's quite compressible, but that still leaves more than enough room for 24 bits of evidence to be completely reasonable.

Tying back to an example in the post: if we're using ascii encoding, then the string "Mark Xu" takes up 49 bits. It's quite compressible, but that still leaves more than enough room for 24 bits of evidence to be completely reasonable.

This paper suggests that spoken language is consistently ~39 bits/second.

https://advances.sciencemag.org/content/5/9/eaaw2594

I've curated this post. It's really changed my worldview, to realize how often I was able to quickly get substantial amounts of evidence, and resolve an important question. It's also related to The First Sample Gives the Most Information.

I also admire how short this post is. I do not know how to write such short posts.

It's worth noting that most of the strong evidence here is in locating the hypothesis.
That doesn't apply to the juggling example - but that's not so much evidence. "I can juggle" might take you from 1:100 to 10:1. Still quite a bit, but 10 bits isn't 24.

I think this relates to Donald's point on the asymmetry between getting from exponentially small to likely (commonplace) vs getting from likely to exponentially sure (rare). Locating a hypothesis can get you the first, but not the second.

It's even hard to get back to exponentially small chance of x once it seems plausible (this amounts to becoming exponentially sure of ¬x). E.g., if I say "My name is Mortimer Q. Snodgrass... Only kidding, it's actually Joe Collman", what are the odds that my name is Mortimer Q. Snodgrass? 1% perhaps, but it's nowhere near as low as the initial prior.
The only simple way to get all the way back is to lose/throw-away the hypothesis-locating information - which you can't do via a Bayesian update. I think that's what makes privileging the hypothesis such a costly error: in general you can't cleanly update your way back (if your evidence, memory and computation were perfectly reliable, you could - but they're not). The way to get back is to find the original error and throw it out.

How difficult is it to get into the top 1% of traders? To be 50% sure you're in the top 1%, you only need 200:1 evidence. This seemingly large odds ratio might be easy to get.

I don't think your examples say much about this. They're all of the form [trusted-in-context source] communicates [unlikely result]. They don't seem to show a reason to expect strong evidence may be easy to get when this pattern doesn't hold. (I suppose they do say that you should check for the pattern - and probably it is useful to occasionally be told "There may be low-hanging fruit. Look for it!")

Great comment, though I disagree with this line:

The only simple way to get all the way back is to lose/throw-away the hypothesis-locating information - which you can't do via a Bayesian update.

You can definitely do this via a Bayesian update. This is exactly the "explaining away" phenomenon in causal DAGs/Bayes nets: I notice the sidewalk is wet, infer that rain is likely, but then notice the sprinkler ran, so the (Bayesian) sprinkler-update sends my chance-of-rain back down to roughly its original value.
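
To make the "explaining away" pattern concrete, here is a minimal sketch that enumerates a three-variable network (rain, sprinkler, wet sidewalk). All of the probabilities below are assumptions chosen for illustration, and `prob_rain` is a hypothetical helper, not something from the comment.

```python
from itertools import product

# Assumed priors and conditional probabilities (illustrative only).
P_RAIN = 0.2
P_SPRINKLER = 0.3
# P(wet | rain, sprinkler), keyed by (rain, sprinkler).
P_WET = {
    (False, False): 0.01,
    (True, False): 0.9,
    (False, True): 0.9,
    (True, True): 0.99,
}


def joint(rain: bool, sprinkler: bool, wet: bool) -> float:
    """Joint probability of one complete assignment to the three variables."""
    p = (P_RAIN if rain else 1 - P_RAIN) * (P_SPRINKLER if sprinkler else 1 - P_SPRINKLER)
    p_wet = P_WET[(rain, sprinkler)]
    return p * (p_wet if wet else 1 - p_wet)


def prob_rain(**evidence: bool) -> float:
    """P(rain | evidence), summing the joint over all worlds consistent with the evidence."""
    numerator = 0.0
    denominator = 0.0
    for rain, sprinkler, wet in product([True, False], repeat=3):
        world = {"rain": rain, "sprinkler": sprinkler, "wet": wet}
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(rain, sprinkler, wet)
        denominator += p
        if rain:
            numerator += p
    return numerator / denominator


print(f"P(rain)                  = {prob_rain():.3f}")
print(f"P(rain | wet)            = {prob_rain(wet=True):.3f}")
print(f"P(rain | wet, sprinkler) = {prob_rain(wet=True, sprinkler=True):.3f}")
```

With these assumed numbers, the wet sidewalk raises the chance of rain from 0.2 to about 0.46, and learning that the sprinkler ran brings it back down to about 0.22, close to the prior but not exactly equal to it.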

Sure, but what I mean is that this is hard to do for hypothesis-location, since post-update you still have the hypothesis-locating information, and there's some chance that your "explaining away" was itself incorrect (or your memory is bad, you have bugs in your code...).

For an extreme case, take Donald's example, where the initial prior would be 8,000,000 bits against.
Locating the hypothesis there gives you ~8,000,000 bits of evidence. The amount you get in an "explaining away" process is bounded by your confidence in the new evidence. How sure are you that you correctly observed and interpreted the "explaining away" evidence? Maybe you're 20 bits sure; perhaps 40 bits sure. You're not 8,000,000 bits sure.

Then let's say you've updated down quite a few times, but not yet close to the initial prior value. For the next update, how sure are you that the stored value that you'll be using as your new prior is correct? If you're human, perhaps you misremembered; if a computer system, perhaps there's a bug...
Below a certain point, the new probability you arrive at will be dominated by contributions from weird bugs, misrememberings etc.
This remains true until/unless you lose the information describing the hypothesis itself.

I'm not clear how much this is a practical problem - I agree you can update the odds of a hypothesis down to no-longer-worthy-of-consideration. In general, I don't think you can get back to the original prior without making invalid assumptions (e.g. zero probability of a bug/hack/hoax...), or losing the information that picks out the hypothesis.

One implication of the Efficient Market Hypothesis (EMH) is that it is difficult to make money on the stock market. Generously, maybe only the top 1% of traders will be profitable.

Nitpick: it's incredibly easy to make money on the stock market: just put your money in it, ideally in an index fund. It goes up by an average of 8% a year. Almost all traders will be profitable, although many won't beat that 8% average. 

The entire FIRE movement is predicated on it being incredibly simple to make money on the stock market. It takes absolutely zero skill to be a sufficiently profitable trader, given a sizeable enough initial investment. 

I get that you're trying to convey above-market-rate returns here, but your wording is imprecise. 

Maybe a nitpick, but the driver's license posterior of 95% seems too high. (Or at least the claim isn't stated precisely.) I'd have less than a 95% success rate at guessing the exact name string that appears on someone's driver's license. Maybe there's a middle name between the "Mark" and the "Xu", maybe the driver's license says "Marc" or "Marcus", etc.

I think you can get to 95% with a phone number or a wifi password or similar, so this is probably just a nitpick.

In the specific scenario of "Mark Xu asks to bet me about the name on his driver's license", my confidence drops immediately because I start questioning his motives for offering such a weird bet.

Well, yes, if you interpret a lot of thought experiments literally, the proper response is more like "I think I'm having a stroke or that I overdosed on potent psychoactive substances or am asphyxiating." than anything 'in the spirit' of the experiments.

But it gets old fast to describe (yet again) how you'd answer any question posed to you by Omega with "You're a figment of my imagination." or whatever.

Also, many people with East Asian birth names go by some Anglicized given name informally, enough that I'm fairly sure randomly selected "Mark Xu"s in the US will have well below 95% at "has a driver's license that says 'Mark Xu'".

I also bet that if you just said "Mark Xu" there'd be a high rate of not knowing how the last name was spelled.

Yeah, I agree 95% is a bit high.

Linking my comment from the Forum:

I think in the real world there are many situations where (if we were to put explicit Bayesian probabilities on such beliefs, which we almost never do), beliefs with ex ante ~0 credence quickly get extraordinary updates. My favorite example is sense perception. If I woke up after sleeping on a bus and were to put explicit Bayesian probabilities on anticipating what I will see next time I open my eyes, then my belief I'd assign in the true outcome (ignoring practical constraints like computation and my near inability to have any visual imagery) has ~0 credence. Yet it's easy to get strong Bayesian updates: I just open my eyes. In most cases, this should be a large enough update, and I go on my merry way. 

But suppose I open my eyes and instead see people who are approximate lookalikes of dead US presidents sitting around the bus. Then at that point (even though the ex ante probability of this outcome and that of a specific other thing I saw isn't much different), I will correctly be surprised, and have some reasons to doubt my sense perception.

Likewise, if instead of saying your name is Mark Xu, you instead said "Lee Kuan Yew", I at least would be pretty suspicious that your actual name is Lee Kuan Yew.

I think a lot of this confusion in intuitions can be resolved by looking at what MacAskill calls the difference between unlikelihood and fishiness:

Lots of things are a priori extremely unlikely yet we should have high credence in them: for example, the chance that you just dealt this particular (random-seeming) sequence of cards from a well-shuffled deck of 52 cards is 1 in 52! ≈ 1 in 10^68, yet you should often have high credence in claims of that form.  But the claim that we’re at an extremely special time is also fishy. That is, it’s more like the claim that you just dealt a deck of cards in perfect order (2 to Ace of clubs, then 2 to Ace of diamonds, etc) from a well-shuffled deck of cards. 

Being fishy is different than just being unlikely. The difference between unlikelihood and fishiness is the availability of alternative, not wildly improbable, alternative hypotheses, on which the outcome or evidence is reasonably likely. If I deal the random-seeming sequence of cards, I don’t have reason to question my assumption that the deck was shuffled, because there’s no alternative background assumption on which the random-seeming sequence is a likely occurrence.  If, however, I deal the deck of cards in perfect order, I do have reason to significantly update that the deck was not in fact shuffled, because the probability of getting cards in perfect order if the cards were not shuffled is reasonably high. That is: P(cards not shuffled)P(cards in perfect order | cards not shuffled) >> P(cards shuffled)P(cards in perfect order | cards shuffled), even if my prior credence was that P(cards shuffled) > P(cards not shuffled), so I should update towards the cards having not been shuffled.

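Put another way, we can dissolve this by looking explicitly at Bayes' theorem:

P(hypothesis | evidence seen) = P(evidence seen | hypothesis) P(hypothesis) / P(evidence seen)

and in turn,

P(evidence seen) = P(evidence seen | hypothesis) P(hypothesis) + P(evidence seen | not hypothesis) P(not hypothesis)

P(evidence seen | hypothesis) is high in both the "fishy" and "non-fishy" regimes. However, P(evidence seen | not hypothesis) is much higher for fishy hypotheses than for non-fishy hypotheses, even if the surface-level evidence looks similar!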
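
The card example quoted above can be worked through numerically. This is a sketch under stated assumptions: the prior of 0.001 that the deck was not shuffled, the 0.5 chance of perfect order from an unshuffled deck, and the choice to give a random-seeming order the same likelihood under both hypotheses are all invented for illustration.

```python
import math

# Probability of any one specific ordering from a well-shuffled 52-card deck.
P_ORDER_IF_SHUFFLED = 1 / math.factorial(52)

# Assumptions for illustration only.
PRIOR_NOT_SHUFFLED = 0.001
P_PERFECT_ORDER_IF_NOT_SHUFFLED = 0.5
P_RANDOM_ORDER_IF_NOT_SHUFFLED = P_ORDER_IF_SHUFFLED  # no alternative makes it likely


def posterior_not_shuffled(p_evidence_if_not_shuffled: float) -> float:
    """P(not shuffled | observed ordering) via Bayes' theorem."""
    numerator = PRIOR_NOT_SHUFFLED * p_evidence_if_not_shuffled
    denominator = numerator + (1 - PRIOR_NOT_SHUFFLED) * P_ORDER_IF_SHUFFLED
    return numerator / denominator


fishy = posterior_not_shuffled(P_PERFECT_ORDER_IF_NOT_SHUFFLED)
merely_unlikely = posterior_not_shuffled(P_RANDOM_ORDER_IF_NOT_SHUFFLED)

print(f"P(not shuffled | perfect order)       = {fishy:.6f}")
print(f"P(not shuffled | random-seeming order) = {merely_unlikely:.6f}")
```

Both orderings are equally improbable under a shuffled deck, yet only the perfect order moves the posterior: it has an alternative hypothesis that predicts it well, and the random-seeming order does not.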

This is a really nice 'sharpening' of 'fishy'!

This is insightful.  The areas where strong evidence is common are largely those areas we don't intuitively think of as governed by probability theory and where classic logic performs well.  

It seems like someone could take this a little further even and show that the limiting case for strong evidence and huge likelihood ratios would just be logic.  This might be fruitful to unpack.  I could see it being the case that our instincts for seeking "certainty" make more sense than we give them credit for.  Gathering enough evidence sometimes allows reasoning to be performed using propositional logic with acceptable results. 

Such logic is many orders of magnitude cheaper to evaluate compute wise vs probabilistic reasoning, especially as we get into larger and larger causal networks.  There’s an obvious tradeoff between the cost to obtain more evidence vs more compute – it’s not always a choice that’s available (e.g., spend time investigating vs. spend time thinking/tinkering with models) but is often enough.

When I think about how I’m using the reasoning skills I’ve picked up here, that’s roughly what I’m having to do for real-world problems.  Use probabilistic reasoning to resolve simpler, more object-level propositions into true/false/maybe, then propositional logic to follow the implications.  Fall back to probabilistic reasoning whenever encountering a paradox or any of the other myriad problems with simple logic, or just for periodic sanity checking.

This comment is insightful!

The areas where strong evidence is common are largely those areas we don't intuitively think of as governed by probability theory and where classic logic performs well.

I'm pretty sure that statistics (as mathematics) all assume 'logic' (first-order logic at least), so I think this is also technically correct!

Gathering enough evidence sometimes allows reasoning to be performed using propositional logic with acceptable results.

Yes! Being able to use logic can be a fantastic super-power (when it works). Sometimes the universe really is like a Sudoku puzzle!

Being able to use both probabilities and logical statements, and appropriately, is a significant part of what I think David Chapman is gesturing at with what he calls 'meta-rationality'. And beyond both of those formal rational systems, there's an entire Platonic universe of alternative ontologies that can also be useful in some contexts (and for some purposes).

The prior odds that someone’s name is “Mark Xu” are generously 1:1,000,000. Posterior odds of 20:1 implies that the odds ratio of me saying “Mark Xu” is 20,000,000:1, or roughly 24 bits of evidence. That’s a lot of evidence.

There are 26 letters in the English alphabet. Even if, for simplicity, our encoding ignores word boundaries and message ending, that's log2(26) ≈ 4.7 bits per letter, so hearing you say "Mark Xu" is 28.2 bits of evidence total - more than the 24 bits required.

Of course - my encoding is flawed. An optimal encoding should assign "Mark Xu" fewer bits than, say, "Rqex Gh" - even though they both have the same number of letters. And "Maria Rodriguez" should be assigned an even shorter message even though it has more than twice the letters of "Mark Xu".

Measuring the amount of information given in messages is not as easy to do on actual real life cases as it is in theory...
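
The naive estimate in this comment (a uniform code over 26 letters, ignoring the space and the message ending) can be reproduced in a few lines. Variable names are illustrative, and the 20,000,000:1 ratio is the figure from the post.

```python
import math

name = "Mark Xu"

# Count only letters, since the naive encoding ignores word boundaries.
letters = [ch for ch in name if ch.isalpha()]

# A uniform code over the 26-letter alphabet costs log2(26) bits per letter.
bits_per_letter = math.log2(26)
naive_bits = len(letters) * bits_per_letter

# Bits implied by the post's 20,000,000:1 likelihood ratio.
required_bits = math.log2(20_000_000)

print(f"bits per letter: {bits_per_letter:.2f}")
print(f"naive bits in name: {naive_bits:.1f}")
print(f"bits required: {required_bits:.1f}")
```

As the comment itself notes, a uniform code overstates the information in real names, so this is an upper-end estimate and not a measurement.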

A study found that English has about 1.1 bits of information per letter, if you already know the message is in English. (XKCD's "What If" linked to the original.)

Isn't that the information density for sentences? With all the conjunctions, and with the limited number of different words that can appear in different places of the sentence, it's not that surprising we only get 1.1 bits per letter. But names should be more information dense - maybe not the full 4.7 (because some names just don't make sense) but at least 2 bits per letter, maybe even 3?

I don't know where to find (or how to handle) a big list of full names, so I'm settling for the (probably partial) lists of first names from https://www.galbithink.org/names/us200.htm (picked because the plaintext format is easy to process). I wrote a small script: https://gist.github.com/idanarye/fb75e5f813ddbff7d664204607c20321 

When I run it on the list of female names from the 1990s I get this:

$ ./names_entropy.py https://www.galbithink.org/names/s1990f.txt
Entropy per letter: 1.299113499617074

Any of the 5 rarest name are 1:7676.4534883720935
Bits for rarest name: 12.906224226276189
Rarest name needs to be 10 letters long
Rarest names are between 4 and 7 letters long

#1 Most frequent name is Christin, which is 8 letters long
Christin is worth 5.118397576228959 bits
Christin would needs to be 4 letters long

#2 Most frequent name is Mary, which is 4 letters long
Mary is worth 5.380839995073667 bits
Mary would needs to be 5 letters long

#3 Most frequent name is Ashley, which is 6 letters long
Ashley is worth 5.420441711983749 bits
Ashley would needs to be 5 letters long

#4 Most frequent name is Jesse, which is 5 letters long
Jesse is worth 5.4899422055346445 bits
Jesse would needs to be 5 letters long

#5 Most frequent name is Alice, which is 5 letters long
Alice is worth 5.590706018293878 bits
Alice would needs to be 5 letters long

And when I run it on the list of male names from the 1990s I get this:

$ ./names_entropy.py https://www.galbithink.org/names/s1990m.txt
Entropy per letter: 1.3429318549784128

Any of the 11 rarest name are 1:14261.4
Bits for rarest name: 13.799827993443198
Rarest name needs to be 11 letters long
Rarest names are between 4 and 8 letters long

#1 Most frequent name is John, which is 4 letters long
John is worth 5.004526222833823 bits
John would needs to be 4 letters long

#2 Most frequent name is Michael, which is 7 letters long
Michael is worth 5.1584658860672485 bits
Michael would needs to be 4 letters long

#3 Most frequent name is Joseph, which is 6 letters long
Joseph is worth 5.4305677416620135 bits
Joseph would needs to be 5 letters long

#4 Most frequent name is Christop, which is 8 letters long
Christop is worth 5.549228103371756 bits
Christop would needs to be 5 letters long

#5 Most frequent name is Matthew, which is 7 letters long
Matthew is worth 5.563161441124633 bits
Matthew would needs to be 5 letters long

So the information density is about 1.3 bits per letter. Higher than 1.1, but not nearly as high as I expected. But - the rarest names in these lists are about 1:14k - not 1:1m like OP's estimation. Then again - I'm only looking at given names - surnames tend to be more diverse. But that would also give them higher entropy, so instead of trying to figure out how to scale everything, let's just go with the given names, which I have numbers for (for simplicity, assume these lists I found are complete).

So - the rare names are about half as long as the number of letters required to represent them. The frequent names are anywhere between the number of letters required to represent them and twice that amount. I guess that is to be expected - names are not optimized to be an ideal representation, after all. But my point is that the amount of evidence needed here is not orders of magnitude bigger than the amount of information you gain from hearing the name.

Actually, due to what entropy is supposed to represent, on average the amount of information needed is exactly the amount of information contained in the name.
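
A minimal sketch of that last point: the bits carried by hearing a name equal its surprisal, -log2(p), and the entropy of the name distribution is the average surprisal. The name counts below are invented for illustration; they are not taken from the linked lists, and this is not the linked script.

```python
import math

# Hypothetical name counts, invented for illustration.
name_counts = {"John": 500, "Mary": 300, "Ashley": 150, "Mortimer": 50}
total = sum(name_counts.values())


def surprisal_bits(count: int) -> float:
    """Bits of information gained on hearing a name with this frequency."""
    return -math.log2(count / total)


# Entropy is the frequency-weighted average of the per-name surprisal.
entropy_bits = sum(
    (count / total) * surprisal_bits(count) for count in name_counts.values()
)

for name, count in name_counts.items():
    print(f"{name}: {surprisal_bits(count):.2f} bits")
print(f"entropy: {entropy_bits:.2f} bits")
```

Rare names carry more bits than the average and common names fewer, but the bits a name delivers always match the bits needed to single it out, which is why the two quantities agree on average.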

"Mark Xu" is an unusually short name, so the message-ending might actually contain most of the entropy.

The phrases "my name is Mark Xu" and "my name is Mortimer Q. Snodgrass" contain roughly the same amount of evidence, even though the second has 12 additional letters.  ("Mark Xu" might be a more likely name on priors, but it's nowhere near 2^(4.7 * 12) more likely.)

This interacts with the efficient market hypothesis in an interesting way. Suppose there were a magic button test that would instantly and freely tell anyone if they would make money (in expectation) on the market. It's not omniscient, but it is epistemically efficient compared to humans. All hopefuls press the button, and only trade if they pass the test. Those that pass make money. That money is coming from the people who actually want to buy or sell stuff, and the stubbornest of those who failed the magic button test.

(Assuming that people don't try to use this as a way to send info back in time, or otherwise munchkin it.)

To be 50% sure you're in the top 1%, you only need 200:1 evidence. This seemingly large odds ratio might be easy to get.

 

Of course it's easy! You just compare how much you've made, and how long you've stayed solvent, against the top 1% of traders. If you've already done just as well as the others, you're in the top 1%. Otherwise, you aren't.

It's much harder to become a top 1% trader. That's the point of the EMH and Eliezer's Inadequate Equilibria. Actually, some forms of the EMH go further and say that there's no such thing as a "top trader," because there are only two types: the lucky and the insolvent. There is no way to beat the market trading on publicly available information.

Of course it's easy! You just compare how much you've made, and how long you've stayed solvent, against the top 1% of traders. If you've already done just as well as the others, you're in the top 1%. Otherwise, you aren't.

This object-level example is actually harder than it appears: performance of a fund or trader in one time period generally has very low correlation to the next. See, e.g., this paper: https://www.researchgate.net/profile/David-Smith-256/publication/317605916_Evaluating_Hedge_Fund_Performance/links/5942df6faca2722db499cbce/Evaluating-Hedge-Fund-Performance.pdf

There's a fair amount of debate over how much data you need to evaluate whether a person is a consistently good trader. In my moderately-informed opinion, a trader who does well over 2 years is significantly more likely to be lucky than skilled.

That's why I said "how long you've stayed solvent." Thanks for bringing in a more nuanced statement of the argument + source.

I think the point is that you could become confident of this before spending a lot of time/money on the market, if there are other strong forms of evidence correlated with market success.

It's a very different project to determine whether you presently are at the top 1% of any domain, and what your chances are of entering the top 1% given a certain level of time, effort, and expense.

I’m guessing they would have happily accepted a bet at 20:1 odds that my driver’s license would say “Mark Xu” on it.

I think they wouldn't have, mostly because you (or someone) offering the bet is fairly strong evidence that the name on your driver's license is not, in fact, "Mark Xu".

I really like this post! I have a concerned intuition around 'sure, the first example in this post seems legit, but I don't think this should actually update anything in my worldview, for the real-life situations where I actively think about Bayes Rule + epistemics'. And I definitely don't agree with your example about top 1% traders. My attempt to put this into words:

1. Strong evidence is rarely independent. Hearing you say 'my name is Mark' to person A might be 20,000:1 odds, but hearing you then say it to person B is like 10:1 tops. Most hypotheses that explain the first event well, also explain the second event well. So while the first sample contains the most information, the second sample contains way less. Making this idea much less exciting.

It's much easier to get to middling probabilities than high probabilities. This makes sense, because I'm only going to explicitly consider the odds of <100 hypotheses for most questions, so a hypothesis with say <1% probability isn't likely to be worth thinking about. But to get to 99% it needs to defeat all of the other ones too

Eg, in the 'top 1% of traders' example, it might be easy to be confident I'm above the 90th percentile, but much harder to move beyond that.

2. This gets much messier when I'm facing an adversarial process. If you say 'my name is Mark Xu, want to bet about what's on my driver's license' this is much worse evidence because I now face adverse selection. Many real-world problems I care about involve other people applying optimisation pressure to shape the evidence I see, and some of this involves adversarial potential. The world does not tend to involve people trying to deceive me about world capitals.

An adversarial process could be someone else trying to trick me, but it could also be a cognitive bias I have, eg 'I want to believe that I am an awesome, well-calibrated person'. It could also be selection bias - what is the process that generated the evidence I see?

3. Some questions have obvious answers, others don't. The questions most worth thinking about are rarely the ones that are obvious. The ones where I can access strong evidence easily are much less likely to be worth thinking about. If someone disagrees with me, that's at least weak evidence against the existence of strong evidence.

How much evidence is breaking into the top 50 on metaculus in ~6 months?

I stayed out of finance years ago because I thought I didn't want to compete with Actually Smart People.

Then I jumped in when the prediction markets were clearly not dominated by the Actually Smart.

But I still don't feel confident to try in the financial markets.

As a rule of thumb, if you are in the top 1% of some non-lucrative thing (metacalc, Magic The Gathering, women's softball, etc) then there are millions of people in the US and a billion people in the world that would be better than you at it, if they put in as much effort as you did.

If you are the absolute acknowledged best at the niche thing, there's only thousands / millions who would have done better.

And many of those are devoting their effort towards something lucrative, like a stock market (or becoming CEO)

I think "a billion people in the world" is wrong here--it should only be about 75 million by pure multiplication.

They might have been talking about the total amount of people with the potential to become better than you at the specific thing rather than the pure percentage of people who would be if everyone tried.

Meta: I would enjoy seeing growth in the number of users with everyday objects as names. I look forward to one day seeing the HDMI Cable, the USB-C Adapter and the Laptop Bag in conversation with each other.

Maybe, but the US number lines up with 1% of the population, which matches the top-1% figure. If people outside the US are ~50x as likely to be top-1% at various hobbies, that's a bold statement that needs justification, not an obvious rule of thumb!

Or it could be across all time, which lines up with ~100 billion humans in history.