I remember this by analogy to Curry's paradox.
Where the sentence $C$ from Curry's paradox says "If this statement is true, then $P$", the Löb sentence $L$ says "if this statement is provable, then $P$", that is, $L \leftrightarrow (\square L \to P)$.
In Curry's paradox, if the sentence is true, that would indeed imply that $P$ is true. And with $L$, the situation is analogous, but with truth replaced by provability: if $L$ is provable, then $P$ is provable. That is, $\square L \to \square P$.
But, unlike in Curry's paradox, this is not what $L$ itself says! Replacing truth with provability has attenuated the sentence, destroying its ability to cause paradox.
If only $\square P \to P$, then we would have our paradox back... and that's Löb's theorem.
This is all about $L \to (\square L \to P)$, just one direction of the biimplication, whereas the post proves not just that but also the other direction, $(\square L \to P) \to L$. It seems that only this forward direction is used in the proof at the end of the post, though.
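To spell out how the paradox comes back (this is just the standard textbook derivation of Löb's theorem from the usual derivability conditions, not anything specific to the post):

$$\begin{aligned}
&\vdash L \to (\square L \to P) && \text{fixed point, forward direction} \\
&\vdash \square L \to \square(\square L \to P) && \text{necessitation, then distribution} \\
&\vdash \square L \to (\square\square L \to \square P) && \text{distribution again} \\
&\vdash \square L \to \square\square L && \text{internal necessitation} \\
&\vdash \square L \to \square P && \text{combining the last two lines} \\
&\vdash \square L \to P && \text{if only } \vdash \square P \to P \\
&\vdash L && \text{fixed point, backward direction} \\
&\vdash \square L && \text{necessitation} \\
&\vdash P && \text{modus ponens}
\end{aligned}$$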
You say "if we are to accurately model the world"...
If I am modelling the path of a baseball, and I write "F = mg", would you "correct" me that it's actually inverse square, that the Earth's gravitation cannot stay at this strength out to arbitrary heights? If you did, I would remind you that we are talking about a baseball game, and not shooting it into orbit—or conclude that you had an agenda other than determining where the ball lands.
What if I'm sampling from a population, and you catch me multiplying probabilities together, as if my draws are independent, as if the population is infinite? Yes, there is an end to the population, but as long as it's far away, the dependence induced by sampling without replacement is negligible.
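As a quick illustration (my own toy numbers, using scipy): with a sample that's a small fraction of the population, the exact without-replacement probabilities (hypergeometric) and the independence approximation (binomial) barely differ.

```python
# Compare P(k successes in the sample) computed exactly (sampling
# without replacement) vs. pretending the draws are independent.
from scipy.stats import binom, hypergeom

pop_size = 10_000   # population size (toy value)
successes = 3_000   # members of the population with the trait
draws = 50          # sample size, small relative to the population
k = 15              # successes observed in the sample

p = successes / pop_size
print(binom.pmf(k, draws, p))                        # independence approx.
print(hypergeom.pmf(k, pop_size, successes, draws))  # exact
# The two closely agree; the approximation only breaks down when
# draws is a sizable fraction of pop_size.
```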
Well, that's the question, whether to include an effect in the model or whether it's negligible. An effect like finite population size, diminishing gravity, or the "crowding" effects that turn an exponential growth model logistic.
And the question cannot be escaped just by noting the effect is important eventually.
Eliezer in 2008, in When (Not) To Use Probabilities, wrote:
To be specific, I would advise, in most cases, against using non-numerical procedures to create what appear to be numerical probabilities. Numbers should come from numbers.
Yeah... well, I thought of the $y$ because it sounds like we're getting the probabilities of $X$ from some experiment. So $y$ is the result of the experiment, which in this case is a vector of frequencies. When I put it like that, it sounds like $y$ is just a rhetorical device for saying that we have been given probabilities of $X$.
But I still seem to need $y$ for my dictionary. I have $P(X = x \mid Y = y)$. What is $P(X = x \mid Y = y)$? It is some kind of updated probability of $X$, right? Like we went from one probability to the other by doing an experiment. If I didn't write $y$, I'd need something like $P_{\text{old}}(X = x)$ and $P_{\text{new}}(X = x)$.
Reading again, it seems like this is exactly Jeffrey conditionalization. So whether you include some extra variable just depends on what you think of Jeffrey conditionalization.
I feel like I'm missing something, though, about what this experiment is and means. For example, I'm not totally clear on whether we have one state $X$, and a collection of replicates of the state $Y$; or is it a collection of replicates of $(X, Y)$ pairs?
Looking at the paper, I see the connection to Jeffrey conditionalization is made explicitly. And it mentions Pearl's "virtual evidence method"; is this what he calls introducing this $Y$? But no clarity on exactly what this experiment is. It just says:
But how should the above be generalized to the situation where the new information does not come in the form of a definite value for $X$, but as "soft evidence," i.e., a probability distribution over $X$?
By the way, regarding your coin toss example, I can at least say how this is handled in Bayesian statistics. There are separate random variables for each coin toss: $Y_1$ is the first, $Y_2$ is the second, etc. If you have $n$ coin tosses, then your sample is a vector containing $Y_1$ to $Y_n$. Then the posterior probability is $P(X = x \mid Y_1 = y_1, \ldots, Y_n = y_n)$. This will be covered in any Bayesian statistics textbook as "the Bernoulli model". My class used Hoff's book, which provides a quick start.
I guess this example suggests a single unknown $X$ (whether the coin is loaded or not) and $n$ replicates of $Y$.
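A minimal sketch of that setup (my own toy numbers; the 0.8 bias for the loaded coin is an assumption for illustration):

```python
import numpy as np

tosses = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])  # 1 = heads

models = {"fair": 0.5, "loaded": 0.8}  # P(heads | X = x)
prior = {"fair": 0.5, "loaded": 0.5}   # P(X = x)

# Conditional on X, the tosses Y_1 ... Y_n are independent, so the
# likelihood of the whole sample is a product over the replicates.
post = {}
for x, p in models.items():
    likelihood = np.prod(np.where(tosses == 1, p, 1 - p))
    post[x] = prior[x] * likelihood
total = sum(post.values())
print({x: v / total for x, v in post.items()})  # P(X = x | y_1 ... y_n)
```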
The "Classical derivation" made more sense to me after translating it to standard probability notation, so I'm commenting to share the "dictionary" I made for it, as well as the unexpected extra assumption I had to make.
The obvious: the post's outcomes correspond to the values $x$ of an unknown $X$, and the probabilities assigned to them are $P(X = x)$.
It got tricky with the updated probabilities. Instead of observing $X$, we observe something else that gives us a probability distribution over $X$. I considered this "something else" to be the value of some other unknown: $Y = y$. The probability distribution over $X$ is then a conditional distribution: $P(X = x \mid Y = y)$.
Hate to have $y$ on only one side of the dictionary like that... maybe I should have written it as $P_y(X = x)$... but I'll leave it as is.
Then,
$$P(A \mid Y = y) = \sum_x P(A \mid X = x, Y = y)\, P(X = x \mid Y = y).$$
Not quite the right formula for a simple interpretation of the update... if only
$$P(A \mid X = x, Y = y) = P(A \mid X = x).$$
This is conditional independence, which could be represented with this Bayes net: $Y \to X \to A$.
Then, we have
$$P(A \mid Y = y) = \sum_x P(A \mid X = x)\, P(X = x \mid Y = y).$$
That completes the dictionary.
So to do what feels like ordinary probability theory, I had to introduce this extra unknown $Y$ so that we have something to observe, and then to assume that $Y$ only provides information about $X$ (and indirectly about $A$, through $X$).
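A toy numeric check of this dictionary (my own numbers): build a joint distribution satisfying the Bayes net, condition on $Y = y$, and confirm it matches the Jeffrey-style sum.

```python
import numpy as np

P_y = np.array([0.3, 0.7])           # P(Y)
P_x_given_y = np.array([[0.9, 0.1],  # P(X | Y); rows indexed by y
                        [0.2, 0.8]])
P_a_given_x = np.array([0.25, 0.6])  # P(A=1 | X); A indep. of Y given X

# Joint P(y, x, a) under the net Y -> X -> A
joint = np.zeros((2, 2, 2))
for y in range(2):
    for x in range(2):
        joint[y, x, 1] = P_y[y] * P_x_given_y[y, x] * P_a_given_x[x]
        joint[y, x, 0] = P_y[y] * P_x_given_y[y, x] * (1 - P_a_given_x[x])

y_obs = 0
lhs = joint[y_obs, :, 1].sum() / joint[y_obs].sum()  # P(A=1 | Y=y) from joint
rhs = (P_a_given_x * P_x_given_y[y_obs]).sum()       # sum_x P(A=1|x) P(x|y)
print(lhs, rhs)  # equal, as the dictionary's final formula says
```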
The way you described the updated distribution over $X$, as some probability distribution resulting from an observation, but not a conditional distribution, is in philosophy called Jeffrey conditionalization. The Stanford Encyclopedia of Philosophy gives this example:
A gambler is very confident that a certain racehorse, called Mudrunner, performs exceptionally well on muddy courses. A look at the extremely cloudy sky has an immediate effect on this gambler’s opinion: an increase in her credence in the proposition (muddy) that the course will be muddy—an increase without reaching certainty. Then this gambler raises her credence in the hypothesis (win) that Mudrunner will win the race, but nothing becomes fully certain. (Jeffrey 1965 [1983: sec. 11.3])
The idea is, we go from one probability distribution over muddy to another, without becoming certain of anything. My introduction of $Y$ corresponds to introducing an unknown representing the status of the sky. I would say we are conditioning on $Y = \text{cloudy}$.
I recalled vaguely that Jaynes discussed Jeffrey conditionalization in Probability Theory, and criticized it for holding only in a special case. I took a look, and sure enough, it's in section 5.6, and he's pointing out exactly what I did, right down to the arrows, though he calls it a "logic flow diagram" rather than identifying it as a Pearl-style Bayes net.
The last formula in this post, the conservation of expected evidence, had a mistake which I've only just now fixed. Since I guess it's not obvious even to me, I'll put a reminder for myself here, which may not be useful to others. Really I'm just "translating" from the "law of iterated expectations" I learned in my stats theory class, which was: $E[E[\mathbf{Y} \mid \mathbf{X}]] = E[\mathbf{Y}]$.
This is using a notation which is pretty standard for defining conditional expectations. To define it, you can first consider the expected value given a particular value $x$ of the random variable $\mathbf{X}$. Think of that as a function of that particular value: $g(x) = E[\mathbf{Y} \mid \mathbf{X} = x]$. Then we define conditional expectation as a random variable, obtained by plugging in the random value of $\mathbf{X}$: $E[\mathbf{Y} \mid \mathbf{X}] = g(\mathbf{X})$. The problem with this notation is it gets confusing which capital letters are random variables and which are propositions, so I've bolded the random variables. But it makes it very easy to state the law of iterated expectations.
The law of iterated expectations also holds when "relativized". That is, $E[E[\mathbf{Y} \mid \mathbf{X}, A] \mid A] = E[\mathbf{Y} \mid A]$, where $A$ is an event. If we wanted to stick to just putting random variables behind the conditioning bar, we could have used the indicator function of that event.
And this translates to the statement in my post. $\mathbf{Y}$ is an indicator for the event $H$, which makes a conditional expectation of it a conditional probability of $H$. So $E[\mathbf{Y} \mid \mathbf{X}, A]$ is $P(H \mid \mathbf{X}, A)$. Our proposition $A$ is the background information; I used the same symbol there. And the right-hand side is another expectation of an indicator, and therefore also a probability.
I really didn't want to define this notation in the post itself, but it's how I'm trained to think of this stuff, so for my own confidence in the final formula I had to write it out this way.
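And a toy numeric version of the final formula (my own numbers, writing $D$ for the experiment's outcome): averaging the posterior over the prior predictive distribution of the data recovers the prior.

```python
import numpy as np

P_h = 0.4                               # P(H), the prior
P_d_given_h = np.array([0.7, 0.3])      # P(D = d | H)
P_d_given_not_h = np.array([0.2, 0.8])  # P(D = d | not H)

P_d = P_h * P_d_given_h + (1 - P_h) * P_d_given_not_h  # prior predictive
P_h_given_d = P_h * P_d_given_h / P_d                  # posterior, each d

print((P_d * P_h_given_d).sum())  # 0.4 again: E[P(H | D)] = P(H)
```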
It would be nice if you had the sexes of the siblings, since it's supposedly only the older brothers that count, though I don't really expect that to change anything.
Really the important thing is just to separate birth order from family size. Usually the way I think of this is, we can look at number of older brothers with a given number of older siblings. I like this setup because it looks like a randomized trial: I have two older siblings, so do you, and meiosis randomizes their sexes.
But I guess with the data you have, you can look at birth order with a given family size, so we don't have to worry about the effect of a larger or smaller family. I... don't think this is what you did? Did I misunderstand something? It seems like if cardinals come from smaller families, that would show up as lower birth orders.
With 9 million people I'd just split it into categories by number of siblings; with your data I'm not sure.
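A sketch of the stratified comparison I have in mind (column names invented, and toy rows standing in for the real dataset):

```python
import pandas as pd

# One row per person; hypothetical columns for illustration only.
df = pd.DataFrame({
    "n_older_siblings": [2, 2, 2, 2, 1, 1, 1, 0],
    "n_older_brothers": [2, 1, 0, 2, 1, 0, 1, 0],
    "is_cardinal":      [1, 0, 0, 1, 1, 0, 0, 0],
})

# Within a stratum of older-sibling count, sibling sexes are randomized
# by meiosis, so comparing mean older-brother counts between cardinals
# and non-cardinals is like comparing arms of a trial.
print(df.groupby(["n_older_siblings", "is_cardinal"])
        ["n_older_brothers"].mean())
```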
After reading this post, I handed over $200 for a month of ChatGPT Pro, and I don't think I can go back. o1-pro and Deep Research are next level. o1-pro often understands my code or what I'm asking about without a whole conversation of clarifying, whereas with other models it's more work than it's worth to get them focused on the real issue rather than something superficially similar. And then I can use Deep Research to get links to webpages relevant to what I'm working on. It's like... smart Google, basically. Google that knows what I'm looking for. I never would have known this existed if I hadn't handed over the $200.
Depends entirely on Cybercab. A driverless car can be made cheaper for a variety of reasons. If the self-driving tech actually works, and if it's widely legal, and if Tesla can mass-produce it at a low price, then they can justify that valuation. Cybercab is a potential solution to the problem that they need to introduce a low-priced car to get their sales growing again, but cheap electric cars are a competitive market now without much profit margin. But there are a lot of ifs.
I'm surprised at how hard it is for me to think of counterexamples.
I thought surely whale populations would qualify, due to the slow generation time, but it looks like humpback whale populations have already recovered from whaling, and blue whales will get there before long.
Thinking again—in my baseball example, gravity is pulling the ball into the domain of applicability of the constant acceleration model.
Maybe what's special about the exponential growth model is it implies escape from its own domain of applicability, in time that grows slowly (logarithmically) with the threshold.
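Spelling out that last claim (just the standard solution of the exponential model): if $x(t) = x_0 e^{rt}$, the model first crosses a threshold $K$ at

$$t^* = \frac{1}{r}\ln\frac{K}{x_0},$$

so raising the threshold a hundredfold only adds $\ln(100)/r \approx 4.6/r$ to the escape time.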