Leon Lang

I'm a last-year PhD student at the University of Amsterdam working on AI Safety and Alignment, and specifically safety risks of Reinforcement Learning from Human Feedback (RLHF). Previously, I also worked on abstract multivariate information theory and equivariant deep learning. https://langleon.github.io/

Comments
X explains Z% of the variance in Y
Leon Lang · 7d

Thanks, I've replaced the word "likelihood" by "probability" in the comment above and in the post itself!

X explains Z% of the variance in Y
Leon Lang · 10d

Thanks, I think this is an excellent comment that gives lots of useful context.

To summarize briefly what foorforthought has already expressed: what I meant by the platonic variance explained is the explained variance independent of a specific sample or statistical model. But, as you rightly point out, this still depends on lots of context, in particular on crucial details of the study design and the population one studies.

X explains Z% of the variance in Y
Leon Lang · 10d

what is a measurable space?

I'm not sure if clarifying this is most useful for the purpose of understanding this post specifically, but for what it's worth: A measurable space is a set together with a set of subsets that are called "measurable". Those measurable sets are the sets to which we can then assign probabilities once we have a probability measure P (which in the post we assume to be derived from a density p, see my other comment under your original comment).

"the function X is constant," you mean its just one outcome like a die that always lands on one side?

I think that's what the commenter you replied to means, yes. (They don't seem to be active anymore)

what makes a function measurable? 

This is another technicality that might not be too useful to think about for the purpose of this post. A function is measurable if the preimages of all measurable sets are measurable. I.e.: f : X → Z, for two measurable spaces X and Z, is measurable if f⁻¹(A) ⊆ X is measurable for every measurable A ⊆ Z. For practical purposes, you can think of continuous functions or, in the discrete case, just any functions.
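
As a tiny concrete illustration (my own example, not from the original discussion): take the continuous function f : ℝ → ℝ, f(x) = x². Then, for instance,

f^{-1}\big([0, 4]\big) = [-2, 2], \qquad f^{-1}\big((1, \infty)\big) = (-\infty, -1) \cup (1, \infty),

and both preimages are again measurable subsets of ℝ, as the definition requires.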

X explains Z% of the variance in Y
Leon Lang · 10d

I'm sorry that the terminology of random variables caused confusion!
If it helps, you can basically ignore the formalism of random variables and instead simply talk about the probability of certain events. For a random variable X with values in a set 𝒳 and density p(x), an event is (up to technicalities that you shouldn't care about) any subset A ⊆ 𝒳. Its probability is given by the integral

P(A) := \int_{x \in A} p(x) \, dx.

In the case that 𝒳 is discrete and not continuous (e.g., in the case that it is the set of all possible human DNA sequences), one would take a sum instead of an integral:

P(A) := \sum_{x \in A} p(x).

The connection to reality is that if we sample x ∈ 𝒳 from the random variable X, then the probability of x lying in the event A is modeled as being precisely P(A). I think with these definitions, it should be possible to read the post again without getting into the technicalities of what a random variable is.
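
If it helps, here is a minimal sketch of the sum formula above in code (a made-up die example, not from the post):

```python
import numpy as np

# Hypothetical example: X is a fair six-sided die, so p(x) = 1/6 for each face.
faces = np.arange(1, 7)
p = np.full(6, 1 / 6)

# Event A = "the die shows an even number"; its probability is the sum of p(x) over x in A.
A = {2, 4, 6}
P_A = sum(p_x for x, p_x in zip(faces, p) if x in A)
print(P_A)  # 0.5
```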

I think this post would be much easier to learn from if it was a jupyter notebook with python code intermixed or R markdown.

At the end of the article I link to this piece of code showing how to do the twin study analysis. I hope that's somewhat helpful.

X explains Z% of the variance in Y
Leon Lang · 14d

They are synonyms! Both denote the expected value of (a function of) a random variable. (I had started writing mu, but then changed the notation for the remaining variance to also make the expected value explicit, as requested. Mu seemed like less appropriate notation for this. Maybe I'll change all mu to E once I have access to more than my phone again. Edit: I was too lazy to do that change :) ).

X explains Z% of the variance in Y
Leon Lang · 19d

Okay, you people convinced me to change the notation!

leogao's Shortform
Leon Lang · 20d

I now hate shapes, reshaping, squeezing, unsqueezing

Are you using einops and einsum? I've hated these somewhat less since I started using them. See here for more details.
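
In case a concrete example helps (a toy sketch with made-up shapes, assuming PyTorch; not taken from anyone's actual code):

```python
import torch
from einops import rearrange

x = torch.randn(8, 3, 32, 32)  # (batch, channels, height, width)

# Instead of x.view(8, 3, -1).permute(0, 2, 1), name the axes explicitly:
tokens = rearrange(x, "b c h w -> b (h w) c")  # shape (8, 1024, 3)

# einsum makes the contraction pattern explicit instead of relying on broadcasting rules:
attn = torch.randn(8, 1024, 1024)
out = torch.einsum("bij,bjc->bic", attn, tokens)  # shape (8, 1024, 3)
```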

X explains Z% of the variance in Y
Leon Lang · 20d

The idea that the mean or average is a good measure of central tendency of a distribution, or a good estimator, is so familiar we forget that it requires justification. For Normal distributions, it is the lowest MSE estimator, the maximum likelihood estimator, and is an unbiased estimator, but this isn't true of all distributions.  For a skewed, long-tailed distribution, for example, the median is a better estimator.

Is it correct to say that the mean is a good estimator whenever the variance is finite? If so, maybe I should have added that assumption to the post. 

I wonder how to think about that in the case of entropy, which you thought about analyzing. Differential entropy can also be infinite, for example. But the Cauchy distribution, which you mention, has infinite variance but finite differential entropy, at least.
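
A quick simulation of the contrast I have in mind (my own sketch, not from the parent comment): with finite variance the running sample mean settles down, while for the Cauchy distribution it keeps jumping around no matter how much data you add.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

normal_samples = rng.normal(size=n)            # finite variance
cauchy_samples = rng.standard_cauchy(size=n)   # infinite variance, undefined mean

# Running sample means after increasing numbers of samples.
for k in (100, 10_000, 100_000):
    print(k, normal_samples[:k].mean(), cauchy_samples[:k].mean())
# The normal running mean converges towards 0; the Cauchy running mean does not settle.
```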

1.sorry, I haven't figured out the equation editor yet.

You can type Cmd+4 to type inline latex formulas, and Cmd+m to type standalone latex formulas! Hope that helps. 

In principle, conceptually, you can estimate entropy directly from the probability density function (PDF) non-parametrically as H = sum(-P log2 P), where the sum is over all possible values of Y, and P is the probability Y takes on a given value. Likewise, you can estimate the mutual information directly from the joint probability distribution between X and Y, the equation for which I won't try to write out here without an equation editor.

Note: After writing the next paragraph, I noticed that you made essentially the same points further below in your answer, but I'm still keeping my paragraph here for completeness.

I was more wondering whether we can estimate them from data, where we don't get the ground-truth values for the probabilities that appear in the formulas for entropy and mutual information, at least not directly. If we have lots of data, then we can approximate a PDF, that is true, but I'm not aware of a way of doing so that is entirely principled or works without regularity assumptions. As an example, let's say we want to estimate the conditional entropy H(Y∣X) (a replacement for the "remaining variance" in my post) for continuous X and Y. I think in this case, if all sampled x-values differ from each other, you could in principle come to the conclusion that there is no uncertainty in Y conditional on X at all, since you observe only one Y-value for each X-value. But that would be severe overfitting, similar to what you'd expect in my section titled "When you have lots of data" for continuous X.
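
A minimal sketch of that failure mode (my own toy example, with a binary Y for simplicity): the naive plug-in estimate of H(Y∣X) collapses to zero as soon as every sampled x-value is unique.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)       # continuous X: all sampled values are distinct
y = (x > 0).astype(int)      # Y depends on X, but is genuinely binary

# Naive plug-in estimate: group Y-values by observed X-value and average the empirical entropies.
cond_entropy = 0.0
for xv in np.unique(x):
    ys = y[x == xv]          # with continuous X, each group contains exactly one sample
    p = np.bincount(ys, minlength=2) / len(ys)
    h = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    cond_entropy += h * len(ys) / n

print(cond_entropy)  # 0.0: the estimate claims Y is fully determined by X, i.e., severe overfitting
```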

Maybe it would be interesting to analyze the conditional entropy case for non-continuous distributions where variance makes less sense.

From my point of view we're largely in agreement; thanks for your further elaborations!

X explains Z% of the variance in Y
Leon Lang · 21d

Thanks for the comment! Actually, after writing the post, I also wondered why this concept isn't based on information theory :) I think what I'd enjoy most, if you wanted to write it, is probably an in-depth treatment of the differences in meaning, properties, and purpose of:

  • Entropy vs. variance
  • Mutual information vs. variance explained
  • Conditional entropy vs. average remaining variance
  • etc.
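
For reference, these are roughly the pairs of definitions I have in mind (in generic notation; the post's variance-based quantities may be set up slightly differently):

H(Y) = -\sum_y p(y) \log p(y) \quad \text{vs.} \quad \operatorname{Var}(Y) = \mathbb{E}\big[(Y - \mathbb{E}[Y])^2\big]

I(X;Y) = H(Y) - H(Y \mid X) \quad \text{vs.} \quad 1 - \frac{\mathbb{E}_x[\operatorname{Var}(Y \mid X = x)]}{\operatorname{Var}(Y)}

H(Y \mid X) = \mathbb{E}_x\big[H(Y \mid X = x)\big] \quad \text{vs.} \quad \mathbb{E}_x\big[\operatorname{Var}(Y \mid X = x)\big]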

But unlike variance explained, it does not require positing any model of how Y depends on X. This is powerful, because it gives you a fact about the X-Y relationship, not a fact about the goodness of some model.

Note that parts of my post are actually model-free! For example, the mathematical definition and the example of twin studies do not make use of a model.

But this is predicated on the implicit model that Y is a normally distributed variable.

I'm not aware of (implicitly) making that assumption in my post!

You can measure mutual information even if the form of the relationship is unknown or complicated.

Is this so? Suppose we want to measure differential entropy, as a simplified example, and the true density "oscillates" a lot. In that case, I'd expect the entropy to be different from what it would be if the density were smoother. But it might be hard to see the difference in a small dataset. The type of regularity/simplicity assumptions about the density might thus influence the result.
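
A rough sketch of what I mean (my own toy construction, using a Gaussian "comb" whose envelope matches a standard normal): the two densities have quite different differential entropies, but a kernel-density plug-in estimate from a small sample, which implicitly assumes smoothness, gives similar values for both.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)

# Oscillating density: mixture of narrow Gaussians (sigma = 0.01) on a grid,
# with mixture weights following the N(0, 1) envelope. Smooth density: N(0, 1).
centers = np.linspace(-4, 4, 81)
weights = norm.pdf(centers)
weights /= weights.sum()
sigma_c = 0.01

def sample_comb(n):
    idx = rng.choice(len(centers), size=n, p=weights)
    return centers[idx] + sigma_c * rng.normal(size=n)

def comb_pdf(x):
    return np.sum(weights[:, None] * norm.pdf(x[None, :], loc=centers[:, None], scale=sigma_c), axis=0)

# "True" differential entropies, estimated as -E[log p(X)] with many samples from the true density.
big = 50_000
print("true H(smooth):", -np.mean(np.log(norm.pdf(rng.normal(size=big)))))
print("true H(comb):  ", -np.mean(np.log(comb_pdf(sample_comb(big)))))

# Plug-in estimates from only 50 samples using a KDE, which implicitly assumes a smooth density.
n = 50
xs_smooth = rng.normal(size=n)
xs_comb = sample_comb(n)
print("KDE H(smooth):", -np.mean(np.log(gaussian_kde(xs_smooth)(xs_smooth))))
print("KDE H(comb):  ", -np.mean(np.log(gaussian_kde(xs_comb)(xs_comb))))
```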

X explains Z% of the variance in Y
Leon Lang · 23d

The notation in the post is inspired by similar notation for conditional entropy, H(Y∣X).

Posts

  • Leon Lang's Shortform (3y)
  • X explains Z% of the variance in Y (15d)
  • How to work through the ARENA program on your own (1mo)
  • [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF (9mo)
  • We Should Prepare for a Larger Representation of Academia in AI Safety (2y)
  • Andrew Ng wants to have a conversation about extinction risk from AI (2y)
  • Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios (2y)
  • [Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques (2y)
  • Natural Abstractions: Key Claims, Theorems, and Critiques (2y)
  • Andrew Huberman on How to Optimize Sleep (2y)
  • Experiment Idea: RL Agents Evading Learned Shutdownability (2y)