I asked the authors for feedback on my summary of the distributional generalization paper, and Preetum responded with the following (copied with his permission):

I agree with everything you've said in this summary, so my feedback below is mostly commentary / minor points.

- One intuitive way to think about Feature Calibration is that f(x) is "close to" a sample from p(y|x). Where the quality of the "closeness" is depends on the power of the classifier.

- Re. "classifiers which do not fit their train set": As you say, our paper mostly focuses on Distributional Generalization (DG) for interpolating models. But I am hopeful that DG actually holds much more generally, and we should really be thinking of generalization as saying "test and train behaviors are close *as distributions*".
Though we don't formalize this yet for non-interpolating models, there are some suggestive experiments in Section 7 (eg: the confusion matrix of a model on the test set remains close to its confusion matrix on the train set, throughout the training process. As you start to fit noise on the train set, you see exactly this noise start to appear on the test set. Regularization which prevents fitting noise on the train set, also prevents this noise from appearing at test time).

- For me, one of the most interesting implications of DG/feature-calibration is that it gives a separation between overparameterized and underparameterized regimes (in the scaling limits of large models/data). With enough data, large enough underparameterized models will converge to Bayes-optimal classifiers, whereas overparameterized models will not (assuming DG). That is, interpolation is not always "benign", it can actually hurt.

- You may like the discussion we added on these issues in the short version of our paper: Section 1.3 ("Related Work and Significance") here: https://mltheory.org/dg_short.pdf
(there is no new material in this pdf, outside the Related Work).

- Also, we have a number of supporting experiments for Feature Calibration in the appendix that didn't make it into the body (eg: more tasks for decision trees, and experiments with "bad" image classifiers like MLPs and RBF kernels).

- Sidenote: The "agreement property" has been bugging me for a while since it seems kind of magical. My current view is that "agreement" may be be a special case of a stronger (but less magical) property: The joint distribution (f(x), y) is statistically close to (f(x), f'(x))
on the test set, where f' is an independently-trained classifier.
This can also be seen as an instance of DG, and it implies the agreement property. I sketched this conjecture in this tweet: https://twitter.com/PreetumNakkiran/status/1385741115211530241
(But this is speculative -- not in the paper and hasn't been rigorously tested).

- I included this figure in a talk on DG recently -- point being that DG is a general definition, which includes both classical generalization and our new conjectures as special cases (and could include other yet-undiscovered behaviors).

- As mentioned at the end of our paper, there are *many* open questions remaining (and I would be very happy to see more work in this area).

Reply

[-]evhub5y40

Read more: Section 1.3 of this version of the paper

This is in the wrong spot.

Reply

[-]Rohin Shah5y20

Huh, weird. Probably a data entry error on my part. Fixed, thanks for catching it.

Reply

[-]romeostevensit5yΩ120

First-person vs. third-person: In a first-person perspective, the agent is central. In a third-person perspective, we take a “birds-eye” view of the world, of which the agent is just one part.

Static vs. dynamic: In a dynamic perspective, the notion of time is explicitly present in the formalism. In a static perspective, we instead have beliefs directly about entire world-histories.

I think these are two instances of a general heuristic of treating what have traditionally been seen as philosophical positions (e.g. here cognitive and behavioral views and A and B theories of time) instead as representations one can run various kinds of checks on to achieve more sample complexity reduction than using a single representation.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

24

[AN #148]: Analyzing generalization across more axes than just accuracy or loss

24

Ω 16

24

Ω 16

HIGHLIGHTS

TECHNICAL AI ALIGNMENT

AGENT FOUNDATIONS

FORECASTING

FIELD BUILDING

MISCELLANEOUS (ALIGNMENT)

FEEDBACK

PODCAST