JonasMoss — LessWrong

The number of elements in $0 N$ won't change when removing every other element from it. The cardinality of $0 N$ is countable. And when you remove every other element, it is still countable, and indistinguishable from $0 N$ . If you're unconvinced, ask yourself how many elements $0 N$ with every other element removed contains. The set is certainly not larger than $N$ , so it's at most countable. But it's certainly not finite either. Thus you're dealing with a set of countably many 0s. As there is only one such multiset, $0 N$ equals $0 N$ with every other element removed.

That there is only one such multiset follows from the definition of a multiset, a set of pairs $(a, c)$ , where $a$ is an element and $c$ is its cardinality. It would also be true if we define multisets using sets containing all the pairs $(a, 1), (a, 2), \dots$ -- provided we ignore the identity of each pair. I believe this is where our disagreement lies. I ignore identities, working only with sets. I think you want to keep the identities intact. If we keep the identities, the set $\{(0, 1), (0, 2), (0, 3), \dots\}$ is not equal to $\{(0, 1), (0, 3), (0, 5), (0, 7), \dots\}$ , and my argument (as it stands) fails.

On infinite ethics

JonasMoss4y30

I don't understand what you mean. The upgraded individuals are better off than the non-upgraded individuals, with everything else staying the same, so it is an application of Pareto.

Now, I can understand the intuition that (a) and (b) aren't directly comparable due to identity of individuals. That's what I mean with the caveat "(Unless we add an arbitrary ordering relation on the utilities or some other kind of structure.)"

On infinite ethics

JonasMoss4y10

Pareto: If two worlds (w1 and w2) contain the same people, and w1 is better for an infinite number of them, and at least as good for all of them, then w1 is better than w2.

As far as I can see, the Pareto principle is not just incompatible with the agent-neutrality principle, it's incompatible with set theory itself. (Unless we add an arbitrary ordering relation on the utilities or some other kind of structure.)

Let's take a look at, for instance, $N \cup 0 N$ vs $2 N \cup 0 N$ , where $n N$ is the multiset containing $n, 2 n, 3 n, \dots$ and $\cup$ is the disjoint union. Now consider the following scenarios:

(a) Start out with $N \cup 0 N$ and multiply every utility by $2$ to get $2 N \cup 0 N$ . Since infinitely many people are better off and no one is worse off, $N \cup 0 N ≺ 2 N \cup 0 N$ .

(b) Start out with $2 N \cup 0 N$ and take every other of the $0$ -utilities from $0 N$ and change them to $1, 3, 5, \dots = 2 N - 1$ . Since a copy of $0 N$ is still left over, this operation leaves us with $N \cup 0 N$ . Again, since infinitely many are better off and no one worse off, $2 N \cup 0 N ≺ N \cup 0 N$ .

In conclusion, both $2 N \cup 0 N ≺ N \cup 0 N$ and $N \cup 0 N ≺ 2 N \cup 0 N$ , a contradiction.
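Scenario (b) can be checked on a finite truncation. The following is only a sketch: the function name `scenario_b` and the cutoff `k` are made up for illustration, and a finite truncation cannot reproduce the infinite case exactly. In particular, the number of zeros visibly drops from `2k` to `k` here, whereas in the infinite case both counts are countable and the multisets of zeros coincide.

```python
from collections import Counter

def scenario_b(k):
    """Finite truncation of scenario (b): first k terms of 2N and 2k zeros."""
    evens = [2 * i for i in range(1, k + 1)]   # truncated 2N: 2, 4, ..., 2k
    zeros = [0] * (2 * k)                      # truncated 0N
    before = Counter(evens + zeros)            # truncated 2N ∪ 0N

    # Change every other zero (positions 0, 2, 4, ...) into 1, 3, 5, ...
    upgraded = list(zeros)
    for j in range(k):
        upgraded[2 * j] = 2 * j + 1
    after = Counter(evens + upgraded)
    return before, after

before, after = scenario_b(5)
nonzero_after = sorted(u for u in after.elements() if u != 0)
print(nonzero_after)          # the nonzero utilities are now 1, 2, ..., 10
print(before[0], after[0])    # number of zeros before and after
```

Nobody's utility decreases in this step, and the nonzero utilities go from the even numbers up to `2k` to all of `1, ..., 2k`, which is the truncated version of passing from $2 N \cup 0 N$ to $N \cup 0 N$ .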

A Bayesian Aggregation Paradox

JonasMoss4y10

Okay, thanks for the clarification! Let's see if I understand your setup correctly. Suppose we have the probability measures $p_{E}$ and $p_{1}$ , where $p_{E}$ is the probability measure of the expert. Moreover, we have an outcome $x \in \{A, B, C\}$ .

In your post, you use $p_{1} (x ∣ z) \propto p_{E} (z ∣ x) p_{1} (x)$ , where $z$ is an unknown outcome known only to the expert. To use Bayes' rule, we must make the assumption that $p_{1} (z ∣ x) = p_{E} (z ∣ x)$ . This assumption doesn't sound right to me, but I suppose some strange assumption is necessary for this simple framework. In this model, I agree with your calculations.
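As a minimal sketch of this update, assuming $p_{1} (z ∣ x) = p_{E} (z ∣ x)$ as above: the function name `posterior` and all numbers below are hypothetical and chosen only for illustration.

```python
def posterior(prior, likelihood):
    """Return p_1(x | z) from p_1(x) and p_E(z | x), normalizing over x."""
    unnormalized = {x: likelihood[x] * prior[x] for x in prior}
    total = sum(unnormalized.values())
    return {x: v / total for x, v in unnormalized.items()}

# Hypothetical numbers, not taken from the post.
p1 = {"A": 0.5, "B": 0.25, "C": 0.25}           # my prior p_1(x)
pE_z_given_x = {"A": 0.2, "B": 0.4, "C": 0.8}   # expert likelihoods p_E(z | x)

post = posterior(p1, pE_z_given_x)
print(post)
```

With these numbers the unnormalized weights are 0.1, 0.1 and 0.2, so the posterior puts half of its mass on `C`.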

Yes! If I am understanding this right, I think this gets to the crux of the post. The compression is lossy, and necessarily loses some information.

I'm not sure. When we're looking directly at the probability of an event $x$ (instead of the probability of the probability of an event), things get much simpler than I thought.

Let's see what happens to the likelihood when you aggregate from the expert's point of view. Letting $x \in \{A, B, C\}$ , we need to calculate the expert's likelihoods $p_{E} (z ∣ A)$ and $p_{E} (z ∣ B \cup C)$ . In this case,

$$
\begin{aligned}
p_{E} (z ∣ B \cup C) &= \frac{p_{E} (B \cup C ∣ z)}{p_{E} (B \cup C)} p_{E} (z) \\
&= \frac{p_{E} (B ∣ z) + p_{E} (C ∣ z)}{p_{E} (B \cup C)} p_{E} (z) \\
&= \frac{p_{E} (z ∣ B) p_{E} (B) + p_{E} (z ∣ C) p_{E} (C)}{p_{E} (B) + p_{E} (C)},
\end{aligned}
$$

which is essentially your calculation, but from the expert's point of view. The likelihood $p_{E} (z ∣ B \cup C)$ depends on $p_{E} (B \cup C)$ , the prior of the expert, which is unknown to you. That shouldn't come as a surprise, as he needs to use his prior in order to combine the probabilities of the events $B$ and $C$ .

But the calculations are exactly the same from your point of view, leading to

$$
p_{1} (z ∣ B \cup C) = \frac{p_{E} (z ∣ B) p_{1} (B) + p_{E} (z ∣ C) p_{1} (C)}{p_{1} (B) + p_{1} (C)}
$$

Now, suppose we want to ensure in general that $p_{E} (z ∣ B \cup C) = p_{1} (z ∣ B \cup C)$ , which is what I believe you want to do, and which seems pretty natural, at least since we're allowed to assume that $p_{E} (z ∣ x) = p_{1} (z ∣ x)$ for all simple events $x$ . To ensure this, we will probably have to require that your priors are the same as the expert's. In other words, your joint distributions are equal, or $p_{1} (z, x) = p_{E} (z, x)$ .
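To see the dependence on the prior numerically, here is a small sketch. The helper `aggregated_likelihood` and all numbers are hypothetical. Both parties share the same simple likelihoods $p (z ∣ x)$ but have different priors over $B$ and $C$ .

```python
def aggregated_likelihood(likelihood, prior, events):
    """p(z | union of disjoint events) = sum p(z | x) p(x) / sum p(x)."""
    numerator = sum(likelihood[x] * prior[x] for x in events)
    denominator = sum(prior[x] for x in events)
    return numerator / denominator

# Hypothetical numbers: shared likelihoods p(z | x), different priors.
likelihood = {"A": 0.2, "B": 0.4, "C": 0.8}
prior_expert = {"A": 0.5, "B": 0.4, "C": 0.1}
prior_mine = {"A": 0.5, "B": 0.1, "C": 0.4}

lik_expert = aggregated_likelihood(likelihood, prior_expert, ["B", "C"])
lik_mine = aggregated_likelihood(likelihood, prior_mine, ["B", "C"])
print(lik_expert, lik_mine)   # approximately 0.48 and 0.72
```

Even though both priors assign the same total mass to $B \cup C$ , the aggregated likelihoods differ, because the split between $B$ and $C$ differs.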

Do you agree with this summary?

Harms and possibilities of schooling

JonasMoss4y10

Do you have a link to the research about the effect of a bachelor of education?

A Bayesian Aggregation Paradox

JonasMoss4y*10

I find the beginning of this post somewhat strange, and I'm not sure your post proves what you claim it does. You start out discussing what appears to be a combination of two forecasts, but present it as Bayesian updating. Recall that Bayes' theorem says $p (θ ∣ x) = p (x ∣ θ) p (θ) / p (x)$ . To use this theorem, you need both an $x$ (your data / evidence), and a $θ$ (your parameter). Using “posterior $\propto$ prior $\times$ likelihood” (with priors $p_{1}, p_{2}, p_{3}$ and likelihoods $e_{1}, e_{2}, e_{3}$ ), you're talking as if your expert's likelihood equals $p (x ∣ θ)$ – but is that true in any sense? A likelihood isn't just something you multiply with your prior, it is a conditional pmf or pdf with a different outcome than your prior.

I can see two interpretations of what you're doing at the beginning of your post:

1. You're combining two forecasts. That is, with $θ \in \{A, B, C\}$ being the outcome, you have your own pmf $p_{1} (θ)$ and the expert's $e = p_{2} (θ)$ , then combine them using $p (θ) \propto p_{1} (θ) p_{2} (θ)$ . That's fair enough, but I suppose $p (θ) \propto \sqrt{p_{1} (θ) p_{2} (θ)}$ or maybe $p (θ) \propto p_{1} (θ)^{q} p_{2} (θ)^{1 - q}$ for some $q \in [0, 1]$ would be a better way to do it.

2. It might be possible to interpret your calculations as a proper application of Bayes' rule, but that requires stretching it. Suppose $θ$ is your subjective probability vector for the outcomes $A, B, C$ and $x$ is the subjective probability vector for the event supplied by an expert (the value of $x$ is unknown to us). To use Bayes' rule, we will have to say that the evidence vector $e = p (x ∣ θ)$ , the probability of observing an expert judgment of $x$ given that $θ$ is true. I'm not sure we ever observe such quantities directly, and it is pretty clear from your post that you're talking about $e = p_{2} (θ)$ in the sense used above, not $p (x ∣ θ)$ .
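The pooling rules from the first interpretation can be compared in a few lines. This is only a sketch: the function names and the example pmf are made up, and the "identical forecasts" check is just one possible way to motivate the weighted version.

```python
def product_pool(p1, p2):
    """Pool two pmfs as p(θ) ∝ p1(θ) * p2(θ)."""
    unnormalized = {t: p1[t] * p2[t] for t in p1}
    total = sum(unnormalized.values())
    return {t: v / total for t, v in unnormalized.items()}

def log_linear_pool(p1, p2, q=0.5):
    """Pool two pmfs as p(θ) ∝ p1(θ)^q * p2(θ)^(1 - q), with q in [0, 1]."""
    unnormalized = {t: p1[t] ** q * p2[t] ** (1 - q) for t in p1}
    total = sum(unnormalized.values())
    return {t: v / total for t, v in unnormalized.items()}

# Hypothetical case: the expert reports exactly the same pmf as mine.
mine = {"A": 0.5, "B": 0.25, "C": 0.25}
expert = dict(mine)

prod = product_pool(mine, expert)
geo = log_linear_pool(mine, expert, q=0.5)
print(prod)   # sharper than either input: A gets about 2/3
print(geo)    # approximately the common forecast
```

When both forecasts agree, the plain product still moves mass towards `A`, while the weighted geometric pool returns the shared forecast, up to floating-point error.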

Assuming interpretation 1, the rest of your calculations are not that interesting, as you're using a method of knowledge pooling no one advocates.

Assuming interpretation 2, ~~the rest of your calculations are probably incorrect. I don't think there is a unique way to go from~~ $p (x ∣ θ)$ ~~to, let's say,~~ $p (x^{*} ∣ θ^{*})$ , ~~where~~ $x^{*}$ ~~is the expert's probability vector over~~ $A, A^{c}$ ~~and~~ $θ^{*}$ ~~your probability vector over~~ $A, A^{c}$ .

Harms and possibilities of schooling

JonasMoss4y*30

Children became grown-ups 200 years ago too. I don't think we need to teach them anything at all, much less anything in particular.

According to this SSC post, kids can easily catch up in math even if they aren't taught any math at all in the first 5 years of school.

In the Benezet experiment, a school district taught no math at all before 6th grade (around age 10-11). Then in sixth grade, they started teaching math, and by the end of the year, the students were just as good at math as traditionally-educated children with five years of preceding math education.

That would probably work for reading too, I guess. (Reading appears to require more purpose-built brain circuitry than math. At least I got that impression from reading Henrich's WEIRD. I don't have any references though.)

Magna Alta Doctrina

JonasMoss4y30

I found this post interesting, especially the first part, but extremely difficult to understand (yeah, that hard). I believe some of the analogies might be valuable, but it's simply too hard for me to confirm / disconfirm most of them. Here are some (but far from all!) examples:

1. About local optimizers. I didn't understand this section at all! Are you claiming that gradient descent isn't a local optimizer? Or are you claiming that neural networks can implement mesa-optimizers? Or something else?

2. The analogy to Bayesian reasoning feels forced and unrelated to your other points in the Bayes section. Moreover, Bayesian statistics typically doesn't work (it's inconsistent) when you ignore the normalizing constant. And in the case of neural networks, what is your prior? Unless you're thinking about approximate priors using weight decay, most neural networks do not employ priors on their parameters.

3. In your linear model, you seem to interpret the maximum likelihood estimator of the parameters as a Bayesian estimator. Am I on the right track here?

4. Building on your linear toy model, it is natural to understand the weight decay parameters as priors, as that is what they are. (In an exact sense; with L2 weight decay you're looking at ridge regression, which is linear regression with normal priors on the parameters, L1 weight decay corresponds to Laplace priors, etc.) But you don't do that. In what sense is it true that "the bayesian prior could be encoded purely in the initial weight distribution"? What's more, it seems to me you're thinking about the learning rate as your prior. I think this has something to do with your interpretation of the linear model maximum likelihood estimator as a Bayesian procedure...?
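The ridge / normal-prior correspondence in point 4 can be checked numerically in the simplest case. The sketch below assumes a one-parameter model $y_i = w x_i + \varepsilon_i$ with noise standard deviation `sigma` and a $N (0, τ^2)$ prior on $w$ . The data, `sigma`, `tau`, and the function names are all hypothetical. The MAP estimate is found by plain gradient descent on the negative log posterior and compared to closed-form ridge regression with penalty `lam = sigma**2 / tau**2`.

```python
def ridge_closed_form(xs, ys, lam):
    """Minimizer of sum (y - w x)^2 + lam * w^2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def map_by_gradient_descent(xs, ys, sigma, tau, step=0.005, iters=500):
    """MAP estimate of w under Gaussian noise and a N(0, tau^2) prior."""
    w = 0.0
    for _ in range(iters):
        # Gradient of the negative log posterior with respect to w.
        grad_lik = -sum(x * (y - w * x) for x, y in zip(xs, ys)) / sigma**2
        grad_prior = w / tau**2
        w -= step * (grad_lik + grad_prior)
    return w

# Hypothetical data and hyperparameters.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
sigma, tau = 0.5, 1.0

w_ridge = ridge_closed_form(xs, ys, lam=sigma**2 / tau**2)
w_map = map_by_gradient_descent(xs, ys, sigma, tau)
print(w_ridge, w_map)   # the two estimates agree up to numerical error
```

The step size is chosen small enough for this particular data set; other data would need a different step or a proper optimizer.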

Ordinary and unordinary decision theory

JonasMoss4y10

I disagree. Sometimes your entire payoffs also change when you change your action space (in the informal description of the problem). That is the point of the last example, where precommitment changes the possible payoffs, not only restricts the action space.

Ordinary and unordinary decision theory

JonasMoss4y50

Paradoxical decision problems are paradoxical in the colloquial sense (such as Hilbert's hotel or Bertrand's paradox), not the literal sense (such as "this sentence is false"). Paradoxicality is in the eye of the beholder. Some people think Newcomb's problem is paradoxical, some don't. I agree with you and don't find it paradoxical.
