Jsevillamol

Finding the Central Limit Theorem in Bayes' rule

Great post! It helped me build an intuition for why this is true, and I came away pretty convinced that it is.

I especially liked how each step is well motivated, so that by the end I already knew where this was going.

One note:

In the last section you write that the convolution of the distribution equals the Fourier transform of the pointwise distributions.

But I think you meant to write that the Fourier transform of the convolution of the distributions is the pointwise product of their Fourier transforms (?).

This does not break the proof.
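For anyone who wants to check the corrected statement numerically, here is a minimal sketch. It assumes discrete distributions stored as NumPy arrays and uses the discrete Fourier transform with zero-padding; the function name and the fair-die example are my own illustration, not something from the post.

```python
import numpy as np

def convolution_theorem_gap(p, q):
    """Max absolute difference between FFT(p conv q) and FFT(p) * FFT(q)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Distribution of the sum of two independent discrete variables.
    conv = np.convolve(p, q)
    n = len(conv)  # len(p) + len(q) - 1
    lhs = np.fft.fft(conv)
    # Zero-pad both inputs to length n so the transforms line up.
    rhs = np.fft.fft(p, n) * np.fft.fft(q, n)
    return float(np.max(np.abs(lhs - rhs)))

# Hypothetical example: the sum of two fair six-sided dice.
die = np.full(6, 1 / 6)
gap = convolution_theorem_gap(die, die)
print(gap)  # should be at the level of floating-point rounding error
```

The gap stays at rounding-error level because zero-padding to the full output length makes the circular convolution computed by the DFT coincide with the ordinary one.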

EfficientZero: How It Works

Great post!

Do you mind if I ask how many free parameters EfficientZero has and how much compute was used to train it?

I tried scanning the paper but didn't find them readily available.

A Bayesian Aggregation Paradox

average log odds could make sense in the context in which there is a uniform prior

This is something I have heard from other people too, and I still cannot make sense of it. Why would questions where uninformed forecasters produce uniform priors make log-odds averaging work better?

A tendency for the questions asked to have priors of near 50% according to the typical unknowledgeable person would explain why more knowledgeable forecasters would assign more extreme probabilities on average: it takes more expertise to justifiably bring their probabilities further from 50%.

I don't understand your point. Why would forecasters care about what other people would do? They only want to maximize their own score.

If A, B, and C are mutually exclusive, then they can't all have 50% prior probability, so a pooling method that implicitly assumes that they do will not give coherent results.

This also doesn't make much sense to me, though it might be because I still don't understand the point about needing uniform priors for log-odds pooling.

Different implicit priors don't appear to be ruining anything.

Neat!

I conclude that the incoherent results in my ABC example cannot be blamed on switching between the uniform prior on {A,B,C} and the uniform prior on {A,¬A}, and, instead, should be blamed entirely on the experts having different beliefs conditional on ¬A, which is taken into account in the calculation using A,B,C, but not in the calculation using A,¬A.

I agree with this.
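Here is a small sketch of that conclusion. It assumes pooling by normalized geometric mean, which in the binary case is the same as averaging log-odds; the expert numbers and function names are made up for illustration.

```python
import numpy as np

def geometric_pool(probabilities):
    """Normalized geometric mean of expert distributions (one row per expert)."""
    probs = np.asarray(probabilities, dtype=float)
    pooled = np.exp(np.log(probs).mean(axis=0))
    return pooled / pooled.sum()

def pooled_p_a(experts_abc, binary):
    """Pooled P(A), either over {A, B, C} or after collapsing to {A, not A}."""
    experts = np.asarray(experts_abc, dtype=float)
    if binary:
        # Merge B and C into "not A" before pooling.
        not_a = experts[:, 1:].sum(axis=1)
        experts = np.column_stack([experts[:, 0], not_a])
    return float(geometric_pool(experts)[0])

# Hypothetical experts. Both pairs have the same marginals P(A) = 0.5 and 0.2.
same_conditional = [[0.5, 0.25, 0.25], [0.2, 0.4, 0.4]]  # both have P(B | not A) = 0.5
diff_conditional = [[0.5, 0.4, 0.1], [0.2, 0.1, 0.7]]    # P(B | not A) differs

for name, experts in [("same", same_conditional), ("different", diff_conditional)]:
    print(name, pooled_p_a(experts, binary=False), pooled_p_a(experts, binary=True))
```

When the experts agree on how not-A splits between B and C, the three-way and binary calculations give the same pooled P(A); when they disagree, the two calculations come apart even though the marginals on A are unchanged.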

Laplace's rule of succession

Solid post - it is good to have the full reasoning for Laplace's rule of succession in a single post instead of buried in a statistics post. I also liked the discussion on how to use it in practice - I'd love to see a full example using actual numbers if you feel like writing one!

On this topic I also recently enjoyed UnexpectedValues' post. He provides a cool proof / intuition for the rule of succession.
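As a tiny illustration of the kind of numeric example I have in mind, here is a sketch of the standard form of the rule, (successes + 1) / (trials + 2), which assumes a uniform prior over the unknown success rate. The counts below are hypothetical.

```python
from fractions import Fraction

def rule_of_succession(successes, trials):
    """Probability of success on the next trial under a uniform prior."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)

# Hypothetical data: 7 successes observed in 10 trials.
estimate = rule_of_succession(7, 10)
print(estimate)                   # 2/3
print(rule_of_succession(0, 0))   # 1/2, the estimate before seeing any data
```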

A Bayesian Aggregation Paradox

Note that you are making the same mistake as I did! Updates are not summarized in the same way as beliefs - for the update the "correct" way is to take an average of the likelihoods:

This does not invalidate the example though!

Thanks for the suggestion, I think it helps clarify the conundrum.

A Bayesian Aggregation Paradox

I like this framing.

This seems to imply that summarizing beliefs and summarizing updates are two distinct operations.

For summarizing beliefs we can still resort to summing:

But for summarizing updates we need to use an average - which in the absence of prior information will be a simple average:

Annoyingly, and as you point out, this is not a perfect summary - we are definitely losing information here, and subsequent updates will not be as exact as if we were working with the disaggregated odds.

I still find it quite disturbing that the update after summarizing depends on prior information - but I can't see how to do better than this, pragmatically speaking.
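To make the two operations concrete, here is a minimal sketch with three mutually exclusive hypotheses where the first two are merged into one group. The numbers and function names are hypothetical; the prior-weighted average is my reading of what the exact summary of an update looks like, with the simple average as the fallback when no prior is given.

```python
import numpy as np

def summarize_belief(prior, group):
    """Belief in a group of mutually exclusive hypotheses: sum the probabilities."""
    return float(np.sum(np.asarray(prior, dtype=float)[group]))

def summarize_update(likelihood, group, prior=None):
    """Likelihood of a group: prior-weighted average, or simple average without a prior."""
    lik = np.asarray(likelihood, dtype=float)[group]
    if prior is None:
        return float(lik.mean())
    weights = np.asarray(prior, dtype=float)[group]
    return float(np.sum(weights * lik) / np.sum(weights))

# Hypothetical numbers for hypotheses H0, H1, H2; we merge H0 and H1.
prior = np.array([0.1, 0.3, 0.6])
likelihood = np.array([0.9, 0.2, 0.5])
group = [0, 1]

# Exact answer from the disaggregated hypotheses.
joint = prior * likelihood
exact = float(joint[group].sum() / joint.sum())

def grouped_posterior(update):
    """Posterior of the merged group given a summarized update."""
    numerator = summarize_belief(prior, group) * update
    return float(numerator / (numerator + joint[2]))

weighted_post = grouped_posterior(summarize_update(likelihood, group, prior))
simple_post = grouped_posterior(summarize_update(likelihood, group))
print(exact, weighted_post, simple_post)
```

The prior-weighted summary reproduces the disaggregated posterior, while the simple average only does so when the merged hypotheses have equal priors, which is the dependence on prior information that bothers me.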

Jsevillamol's Shortform

From OpenAI Five's blogpost:

We’re still fixing bugs. The chart shows a training run of the code that defeated amateur players, compared to a version where we simply fixed a number of bugs, such as rare crashes during training, or a bug which resulted in a large negative reward for reaching level 25. It turns out it’s possible to beat good humans while still hiding serious bugs!

One common line of thought holds that goals are very brittle - small misspecifications will be amplified by optimization.

Yet OpenAI Five managed to wrangle good performance out of a seriously buggy reward function.

Hardly conclusive, but it would be interesting to see more examples of this. One could also do deliberate experiments to see how much you can distort a reward function before behaviour breaks.

My ML Scaling bibliography

For people interested in scaling laws, check out *"Parameter, Compute and Data Trends in Machine Learning"* - the biggest public database of milestone AI systems annotated with data on parameters, compute and data.

A visualization of the current dataset is available here.

(disclaimer: I am the main coordinator of the project)

Emergent modularity and safety

Relevant related work: NNs are surprisingly modular

https://arxiv.org/abs/2003.04881v2?ref=mlnews

On the topic of pruning neural networks, see the lottery ticket hypothesis

Related work:

Also, while this post focuses a lot on physics, my experience is that top-level math people are quite comfortable with informal mathematical reasoning.