JonasMoss
On infinite ethics
JonasMoss · 3y · 30

The number of elements in 0N won't change when you remove every other element from it. The cardinality of 0N is countable, and when you remove every other element, the result is still countable and indistinguishable from 0N. If you're unconvinced, ask yourself how many elements 0N with every other element removed contains. The set is certainly not larger than N, so it's at most countable. But it's certainly not finite either. Thus you're dealing with a set of countably many 0s, and since there is only one such multiset, 0N equals 0N with every other element removed.

That there is only one such multiset follows from the definition of a multiset: a set of pairs (a,c), where a is an element and c is its multiplicity (a cardinal). It would also be true if we defined multisets as sets containing all the pairs (a,1),(a,2),… -- provided we ignore the identity of each pair. I believe this is where our disagreement lies. I ignore identities, working only with sets; I think you want to keep the identities intact. If we keep the identities, the set {(0,1),(0,2),(0,3),…} is not equal to {(0,1),(0,3),(0,5),(0,7),…}, and my argument (as it stands) fails.
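A minimal Python sketch of the indistinguishability claim (my own illustration, not from the comment): represent 0N as an infinite stream of zeros and "remove every other element"; no finite prefix can tell the two apart.

```python
from itertools import islice

def zeros():
    # the multiset 0N, viewed as an infinite sequence of zeros
    while True:
        yield 0

def every_other(seq):
    # keep only the elements at even positions, i.e. remove every other one
    for i, x in enumerate(seq):
        if i % 2 == 0:
            yield x

# Any finite prefix of the thinned sequence matches the original:
assert list(islice(every_other(zeros()), 100)) == list(islice(zeros(), 100))
```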

On infinite ethics
JonasMoss · 3y · 30

I don't understand what you mean. The upgraded individuals are better off than the non-upgraded individuals, with everything else staying the same, so it is an application of Pareto.

Now, I can understand the intuition that (a) and (b) aren't directly comparable due to the identity of individuals. That's what I meant by the caveat "(Unless we add an arbitrary ordering relation on the utilities or some other kind of structure.)"

On infinite ethics
JonasMoss · 3y · 10

Pareto: If two worlds (w1 and w2) contain the same people, and w1 is better for an infinite number of them, and at least as good for all of them, then w1 is better than w2.

As far as I can see, the Pareto principle is not just incompatible with the agent-neutrality principle, it's incompatible with set theory itself. (Unless we add an arbitrary ordering relation on the utilities or some other kind of structure.)

Let's take a look at, for instance, N∪0N vs 2N∪0N, where nN is the multiset containing n,2n,3n,… and ∪ is the disjoint union. Now consider the following scenarios:

(a) Start out with N∪0N and multiply every utility by 2 to get 2N∪0N. Since infinitely many people are better off and no one is worse off, N∪0N≺2N∪0N.

(b) Start out with 2N∪0N and take every other of the 0-utilities from 0N and change them to 1,3,5,… (that is, to 2N−1). Since a copy of 0N is still left over, this operation leaves us with N∪0N. Again, since infinitely many are better off and no one is worse off, 2N∪0N≺N∪0N.

In conclusion, both 2N∪0N≺N∪0N and N∪0N≺2N∪0N, a contradiction.
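A finite spot-check of scenario (b), sketched in Python (my own illustration with made-up representation choices): describe each multiset by its multiplicity function, with `INF` standing in for "countably many", and verify that (b) really lands on N∪0N.

```python
INF = float('inf')  # stands in for countable multiplicity

def mult_N_union_0N(x):
    # N ∪ 0N: each positive integer once, plus countably many zeros
    if x == 0:
        return INF
    return 1 if x >= 1 else 0

def mult_2N_union_0N(x):
    # 2N ∪ 0N: each positive even integer once, plus countably many zeros
    if x == 0:
        return INF
    return 1 if x % 2 == 0 else 0

def mult_after_b(x):
    # Scenario (b): turn every other zero into 1, 3, 5, ...
    # Removing countably many zeros from 0N still leaves countably many.
    if x == 0:
        return INF
    if x % 2 == 1:
        return 1
    return mult_2N_union_0N(x)

# The result of (b) has the same multiplicities as N ∪ 0N:
assert all(mult_after_b(x) == mult_N_union_0N(x) for x in range(100))
```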

A Bayesian Aggregation Paradox
JonasMoss · 3y · 10

Okay, thanks for the clarification! Let's see if I understand your setup correctly. Suppose we have the probability measures pE and p1, where pE is the probability measure of the expert. Moreover, we have an outcome x∈{A,B,C}.

In your post, you use p1(x∣z) ∝ pE(z∣x)p1(x), where z is an unknown outcome known only to the expert. To use Bayes' rule, we must make the assumption that p1(z∣x) = pE(z∣x). This assumption doesn't sound right to me, but I suppose some strange assumption is necessary for this simple framework. In this model, I agree with your calculations.

Yes! If I am understanding this right, I think this gets to the crux of the post. The compression is lossy, and necessarily loses some information.

I'm not sure. When we're looking directly at the probability of an event x (instead of the probability of the probability an event), things get much simpler than I thought.

Let's see what happens to the likelihood when you aggregate from the expert's point of view. Letting x∈{A,B,C}, we need to calculate the expert's likelihoods pE(z∣A) and pE(z∣B∪C). In this case,

pE(z∣B∪C) = pE(B∪C∣z) pE(z) / pE(B∪C)
          = [pE(B∣z) + pE(C∣z)] pE(z) / pE(B∪C)
          = [pE(z∣B) pE(B) + pE(z∣C) pE(C)] / [pE(B) + pE(C)],

which is essentially your calculation, but from the expert's point of view. The likelihood pE(z∣B∪C) depends on pE(B∪C), the prior of the expert, which is unknown to you. That shouldn't come as a surprise, as he needs to use his prior in order to combine the probabilities of the events B and C.
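A quick numeric check of this prior-dependence (made-up numbers, my own sketch): holding the simple-event likelihoods fixed, the merged likelihood changes when only the prior changes.

```python
# Likelihoods pE(z|B) and pE(z|C), held fixed (hypothetical values)
pz = {'B': 0.7, 'C': 0.4}

def merged_likelihood(prior):
    # pE(z | B ∪ C) = [pE(z|B) pE(B) + pE(z|C) pE(C)] / [pE(B) + pE(C)]
    pB, pC = prior['B'], prior['C']
    return (pz['B'] * pB + pz['C'] * pC) / (pB + pC)

a = merged_likelihood({'B': 0.3, 'C': 0.2})   # = 0.58
b = merged_likelihood({'B': 0.1, 'C': 0.4})   # = 0.46
assert abs(a - 0.58) < 1e-12 and abs(b - 0.46) < 1e-12
assert a != b  # same likelihoods, different priors, different merged likelihood
```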

But the calculations are exactly the same from your point of view, leading to

p1(z∣B∪C) = [pE(z∣B) p1(B) + pE(z∣C) p1(C)] / [p1(B) + p1(C)].

Now, suppose we want to ensure in general that pE(z∣B∪C) = p1(z∣B∪C), which is what I believe you want to do, and which seems pretty natural, at least since we're allowed to assume that pE(z∣x) = p1(z∣x) for all simple events x. To ensure this, we will probably have to require that your priors are the same as the expert's. In other words, your joint distributions are equal: p1(z,x) = pE(z,x).

Do you agree with this summary?

Harms and possibilities of schooling
JonasMoss · 3y · 10

Do you have a link to the research about the effect of a bachelor of education?

A Bayesian Aggregation Paradox
JonasMoss · 3y* · 10

I find the beginning of this post somewhat strange, and I'm not sure your post proves what you claim it does. You start out discussing what appears to be a combination of two forecasts, but present it as Bayesian updating. Recall that Bayes' theorem says p(θ∣x) = p(x∣θ)p(θ)/p(x). To use this theorem, you need both an x (your data / evidence) and a θ (your parameter). Using "posterior ∝ prior × likelihood" (with priors p1, p2, p3 and likelihoods e1, e2, e3), you're talking as if your expert's likelihood equals p(x∣θ) -- but is that true in any sense? A likelihood isn't just something you multiply with your prior; it is a conditional pmf or pdf with a different outcome than your prior.

I can see two interpretations of what you're doing at the beginning of your post:

  1. You're combining two forecasts. That is, with θ∈{A,B,C} being the outcome, you have your own pmf p1(θ) and the expert's e = p2(θ), and you combine them using p(θ) ∝ p1(θ)p2(θ). That's fair enough, but I suppose p(θ) ∝ √(p1(θ)p2(θ)), or maybe p(θ) ∝ p1(θ)^q p2(θ)^(1−q) for some q∈[0,1], would be a better way to do it.
  2. It might be possible to interpret your calculations as a proper application of Bayes' rule, but that requires stretching it. Suppose θ is your subjective probability vector for the outcomes A,B,C and x is the subjective probability vector for the event supplied by an expert (the value of x is unknown to us). To use Bayes' rule, we will have to say that the evidence vector e=p(x∣θ), the probability of observing an expert judgment of x given that θ is true. I'm not sure we ever observe such quantities directly, and it is pretty clear from your post that you're talking about e=p2(θ) in the sense used above, not p(x∣θ).

Assuming interpretation 1, the rest of your calculations are not that interesting, as you're using a method of knowledge pooling no one advocates.

Assuming interpretation 2, the rest of your calculations are probably incorrect. I don't think there is a unique way to go from p(x∣θ) to, let's say, p(x∗∣θ∗), where x∗ is the expert's probability vector over {A, A^c} and θ∗ is your probability vector over {A, A^c}.
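The pooling rules in interpretation 1 can be sketched in a few lines of Python (made-up pmfs, my own illustration): the post's product rule versus the geometric pooling p(θ) ∝ p1(θ)^q p2(θ)^(1−q) suggested above.

```python
import numpy as np

def product_pool(p1, p2):
    # the post's rule: p(θ) ∝ p1(θ) · p2(θ)
    p = p1 * p2
    return p / p.sum()

def geometric_pool(p1, p2, q=0.5):
    # the alternative: p(θ) ∝ p1(θ)^q · p2(θ)^(1−q)
    p = p1**q * p2**(1 - q)
    return p / p.sum()

p1 = np.array([0.5, 0.3, 0.2])   # your pmf over {A, B, C} (hypothetical)
p2 = np.array([0.2, 0.5, 0.3])   # the expert's pmf (hypothetical)

for pooled in (product_pool(p1, p2), geometric_pool(p1, p2)):
    # both rules return a valid pmf, but generally different ones
    assert np.isclose(pooled.sum(), 1.0) and (pooled >= 0).all()
```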

Harms and possibilities of schooling
JonasMoss · 3y* · 30

Children became grown-ups 200 years ago too. I don't think we need to teach them anything at all, much less anything in particular.

According to this SSC post, kids can easily catch up in math even if they aren't taught any math at all in the first 5 years of school.

In the Benezet experiment, a school district taught no math at all before 6th grade (around age 10-11). Then in sixth grade, they started teaching math, and by the end of the year, the students were just as good at math as traditionally-educated children with five years of preceding math education.

That would probably work for reading too, I guess. (Reading appears to require more purpose-built brain circuitry than math. At least I got that impression from reading Henrich's WEIRD. I don't have any references though.)

Magna Alta Doctrina
JonasMoss · 3y · 30

I found this post interesting, especially the first part, but extremely difficult to understand (yeah, that hard). I believe some of the analogies might be valuable, but it's simply too hard for me to confirm / disconfirm most of them. Here are some (but far from all!) examples:

1. About local optimizers. I didn't understand this section at all! Are you claiming that gradient descent isn't a local optimizer? Or are you claiming that neural networks can implement mesa-optimizers? Or something else?

2. The analogy to Bayesian reasoning feels forced and unrelated to your other points in the Bayes section. Moreover, Bayesian statistics typically doesn't work (it's inconsistent) when you ignore the normalizing constant. And in the case of neural networks, what is your prior? Unless you're thinking about approximate priors using weight decay, most neural networks do not employ priors on their parameters.

3. In your linear model, you seem to interpret the maximum likelihood estimator of the parameters as a Bayesian estimator. Am I on the right track here?

4. Building on your linear toy model, it is natural to understand the weight decay parameters as priors, as that is what they are. (In an exact sense; with L2 weight decay you're looking at ridge regression, which is linear regression with normal priors on the parameters; L1 weight decay corresponds to Laplace priors, etc.) But you don't do that. In what sense is it true that "the bayesian prior could be encoded purely in the initial weight distribution"? What's more, it seems to me you're thinking of the learning rate as your prior. I think this has something to do with your interpretation of the linear model's maximum likelihood estimator as a Bayesian procedure...?
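The ridge/normal-prior correspondence mentioned above can be checked numerically (synthetic data, my own sketch): the closed-form ridge estimate equals least squares on data augmented with pseudo-observations encoding the normal prior.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=50)

lam = 0.5  # L2 weight-decay strength

# Closed-form ridge estimate: argmin ||y - Xb||^2 + lam ||b||^2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Same estimate from least squares with the normal prior b ~ N(0, (σ²/lam) I)
# written as sqrt(lam)·I pseudo-observations of zero:
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(3)])
y_aug = np.concatenate([y, np.zeros(3)])
map_estimate, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

assert np.allclose(ridge, map_estimate)
```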

Ordinary and unordinary decision theory
JonasMoss · 3y · 10

I disagree. Sometimes the payoffs themselves change when you change your action space (in the informal description of the problem). That is the point of the last example, where precommitment changes the possible payoffs, not merely restricts the action space.

Ordinary and unordinary decision theory
JonasMoss · 4y · 50

Paradoxical decision problems are paradoxical in the colloquial sense (such as Hilbert's hotel or Bertrand's paradox), not the literal sense (such as "this sentence is false"). Paradoxicality is in the eye of the beholder. Some people think Newcomb's problem is paradoxical, some don't. I agree with you and don't find it paradoxical.
