Congratulations!
I'm very glad to see some serious progress on this agenda. If I'm being honest, I was less than excited by the latest long stretch of (published) results on it. It felt like no progress was being made on the core ideas, only some shoring-up of the foundations and a few minor peripheral insights. It looked concerningly plausible that this path routed through regions of math-space so thorny they were effectively impassable. (This might be an unfair characterization, particularly if those results looked more important in the context of unpublished research/considerations. Sorry if so.)
This result I am very much excited about. It seems to be a meaningful step "depthwards", and serves as an existence proof that depthwards progress is possible at all. Great job!
I wrote this tl;dr for a friend, and thought it worth sharing. I'm not sure it's accurate; I've only read the "Recap" section.
Here is how I understand it.
Suppose that, depending on the temperature, your mirror might be foggy and you might have goose pimples. As in, the temperature helps you predict those variables. But once you know the temperature, there's (approximately) nothing you learn about the state of your mirror from your skin, and vice versa. And! Once you know whether your mirror is foggy, there's basically nothing left to learn about the temperature by observing your skin (and vice versa).
But you still don't know the temperature once you observe those things.
This is a stochastic (approximate) natural latent. The stochasticity is that you don't know the temperature once you know the mirror and skin states.
Their theorem, iiuc, says that there does exist a variable where you (approximately) know its exact state after you've observed either the mirror or your skin.
(I don't currently understand exactly what coarse-graining process they're using to construct the exact natural latent).
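For concreteness, here's a toy numerical sketch of the example above (my own made-up numbers and labels T/M/S, not anything from the post): it checks the mediation condition, the two redundancy conditions, and the leftover uncertainty about temperature directly.

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a probability array (any shape)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cond_mi(joint, i, j, k):
    """I(A_i ; A_j | A_k) in bits, for a 3-axis joint distribution."""
    p = np.moveaxis(joint, [i, j, k], [0, 1, 2])
    return H(p.sum(axis=1)) + H(p.sum(axis=0)) - H(p) - H(p.sum(axis=(0, 1)))

# Made-up numbers: T = temperature bucket (0-3, cold to hot),
# M = mirror foggy?, S = goosebumps?  Both are noisy indicators of "warm vs cold".
nT, noise = 4, 0.02
P_T = np.full(nT, 1 / nT)
P_M_given_T = np.array([[1 - noise, noise] if t < 2 else [noise, 1 - noise] for t in range(nT)])
P_S_given_T = np.array([[noise, 1 - noise] if t < 2 else [1 - noise, noise] for t in range(nT)])

# Joint P[T, M, S] = P[T] P[M|T] P[S|T]: M and S are independent given T by construction.
joint = P_T[:, None, None] * P_M_given_T[:, :, None] * P_S_given_T[:, None, :]

print("mediation  I(M;S|T) =", cond_mi(joint, 1, 2, 0))  # 0: temperature mediates
print("redundancy I(T;S|M) =", cond_mi(joint, 0, 2, 1))  # ~0: skin adds ~nothing after mirror
print("redundancy I(T;M|S) =", cond_mi(joint, 0, 1, 2))  # ~0: mirror adds ~nothing after skin
print("H(T|M,S)            =", H(joint) - H(joint.sum(axis=0)))  # ~1 bit: T still unknown
```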
Yup, good example!
The theorem doesn't actually specify a coarse-graining process. The proof would say:
Because the middle bullet is not constructive, we don't technically specify a process. That said, one could specify a process straightforwardly by just starting from T' and pareto-improving the latent in a specific direction until one hits an optimum.
In this case, the coarse-graining would probably just be roughly (temperatures at which the mirror fogs) and (temperatures at which it doesn't), since that's the only nontrivial coarse-graining allowed by the setup (because the coarse-grained value must be approximately determined by the mirror-state).
Noting that this only works as an example if the two signals are (approximately) the same partition of temperatures, i.e. (temperatures at which the mirror fogs) is approximately the same as (temperatures at which you have goosebumps).
Yeah I think
And! Once you know whether your mirror is foggy, there's basically nothing left to learn about the temperature by observing your skin (and vice versa).
is supposed to be scoped under the "Suppose that" from the beginning of the paragraph
This is fantastic. I was a bit annoyed by the pareto optimality section and felt that surely there must be a way to skip that part of the proof. I tried a number of simple transformations that I intuitively thought would make the relevant quantities equal for the appropriate values. None worked. Lesson learned (again): test out ideas first before trying to prove them correct.
How did you work out you could use pareto-optimality? I'm guessing you got it from looking at properties of optimized and unoptimized empirical latents?
stochastic natural latents are relatively easy to test for in datasets
Why is it that stochastic natural latents are easier to test for than deterministic ones? Is it that you can just use the variables themselves as the latent and quickly compute the mediation error?
How did you work out you could use pareto-optimality? I'm guessing you got it from looking at properties of optimized and unoptimized empirical latents?
No, actually. The magic property doesn't always hold for pareto optimal latents; the resampling step is load bearing. So when we numerically experimented with optimizing the latents, we often got latents which didn't have the structure leveraged in the proof (though we did sometimes get the right structure, but we didn't notice that until we knew to look for it).
We figured it out by algebraically playing around with the first-order conditions for pareto optimality, generally trying to simplify them, and noticed that if we assumed zero error on one resampling condition (which at the time we incorrectly thought we had already proven was a free move), then it simplified down a bunch and gave the nice form.
Is it that you can just use the variables themselves as the latent and quickly compute the mediation error?
Yup.
Congrats!
Some interesting directions I think this opens up: intuitively, given a set of variables, we want natural latents to be approximately deterministic across a wide variety of (collections of) those variables, and if a natural latent is approximately deterministic w.r.t. a subset of the variables, then we want that subset to be as small as possible (e.g. strong redundancy is better than weak redundancy when the former is attainable).
The redundancy lattice seems natural for representing this: given an element of the redundancy lattice (a collection of subsets of the variables), we say a latent is a redund over that element if it's approximately deterministic w.r.t. each subset in it. E.g. a latent is weakly redundant if it's a redund over the all-but-one subsets (an approximately deterministic function of each such subset), and strongly redundant if it's a redund over the individual variables. If a latent is a redund over some element, our intuitive desiderata for natural latents correspond to that element containing more subsets (more redundancy) and each subset being small (less "synergy"). Combining this with the mediation condition can probably give us a notion of pareto-optimality for natural latents.
Another thing we could do: when we construct pareto-optimal natural latents over a set of variables, we add them to the original set to augment the redundancy lattice, so that new natural latents can be approximately deterministic functions of (collections of) existing natural latents. This naturally allows us to represent the "hierarchical nature of abstractions", where lower-level abstractions make it easier to compute higher-level ones.
A concrete setting where this can be useful is one where a bunch of agents receive different but partially overlapping sets of observations and aim to predict partially overlapping domains. Having a fine-grained collection of natural latents, redundant across different elements of the redundancy lattice, means we can easily zoom in on the smaller subset of latent variables that's (maximally) redundantly represented by all of the agents (& tell which domains of prediction these latents actually mediate).
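To make the "redund over a lattice element" idea concrete, here's a rough sketch (my own hypothetical code, with a made-up tolerance and a naive plug-in entropy estimator, not anything from the comment or the post) of how one might check, from discrete samples, which subsets of variables approximately determine a candidate latent, and read off the minimal such subsets:

```python
import numpy as np
from itertools import combinations
from collections import Counter

def H_rows(rows):
    """Empirical entropy (bits) of a list of hashable items."""
    counts = np.array(list(Counter(rows).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def cond_entropy(lam, X, subset):
    """Empirical H(lam | X_subset) in bits (naive plug-in estimator)."""
    xs = [tuple(row) for row in X[:, list(subset)]]
    return H_rows(list(zip(xs, lam))) - H_rows(xs)

def minimal_redund_subsets(lam, X, tol=0.05):
    """Subsets S with H(lam | X_S) <= tol, keeping only the minimal ones.
    Smaller subsets in the result = 'stronger' redundancy (less synergy)."""
    n = X.shape[1]
    good = [S for r in range(1, n + 1) for S in combinations(range(n), r)
            if cond_entropy(lam, X, S) <= tol]
    return [S for S in good if not any(set(T) < set(S) for T in good)]

# Demo: the latent is an exact copy of X_0 (and of X_1), so the minimal subsets
# are the singletons {X_0} and {X_1} -- "strong redundancy" over those variables.
rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=2000)
X = np.stack([x, x, rng.integers(0, 3, size=2000)], axis=1)
print(minimal_redund_subsets(x, X))   # -> [(0,), (1,)]
```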
Our posts on natural latents have involved two distinct definitions, which we call "stochastic" and "deterministic" natural latents. We conjectured that, whenever there exists a stochastic natural latent (to within some approximation), there also exists a deterministic natural latent (to within a comparable approximation). Four months ago, we put up a bounty to prove this conjecture.
We've been bottlenecked pretty hard on this problem, and spent most of the last four months attacking it. At long last, we have a proof. As hoped, the proof comes with some qualitative new insights about natural latents, and we expect it will unbottleneck a bunch of future work. The main purpose of this post is to present the proof.
This post officially closes the corresponding bounty.
Recap: What Was The Problem Again?
(This section is mostly copied from the bounty post.)
Some Intuition From The Exact Case
In the exact case, in order for a natural latent to exist over random variables X1,X2, the distribution has to look roughly like this:
Each value of X1 and each value of X2 occurs in only one "block", and within the "blocks", X1 and X2 are independent. In that case, we can take the (exact) natural latent to be a block label.
Notably, that block label is a deterministic function of X.
However, we can also construct other natural latents for this system: we simply append some independent random noise to the block label. That natural latent is not a deterministic function of X; it's a "stochastic" natural latent.
In the exact case, if a stochastic natural latent exists, then the distribution must have the form pictured above, and therefore the block label is a deterministic natural latent. In other words: in the exact case, if a stochastic natural latent exists, then a deterministic natural latent also exists.
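Here's a small sketch of that exact-case picture (a made-up block-structured distribution, not anything from the post): build a joint P[X1,X2] with two blocks, recover the block label as connected components of the support graph, and check that X1 and X2 are independent within each block.

```python
import numpy as np

# A block-structured joint distribution over X1 (4 values) and X2 (3 values):
# each X1 value and each X2 value touches exactly one block, and within a block
# X1 and X2 are independent (each block is a rank-one sub-matrix).
P = np.zeros((4, 3))
P[0:2, 0:2] = 0.6 * np.outer([0.5, 0.5], [0.3, 0.7])   # block A
P[2:4, 2:3] = 0.4 * np.outer([0.2, 0.8], [1.0])        # block B

# Recover the block label: connected components of the bipartite support graph,
# via a tiny union-find over the X1-values and X2-values.
n1, n2 = P.shape
parent = list(range(n1 + n2))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i
for i, j in zip(*np.nonzero(P)):
    parent[find(i)] = find(n1 + j)
label_x1 = np.array([find(i) for i in range(n1)])
label_x2 = np.array([find(n1 + j) for j in range(n2)])
print("block label as a function of X1:", label_x1)
print("block label as a function of X2:", label_x2)

# Within each block, X1 and X2 are independent, so the block label is an exact
# (and deterministic) natural latent for this distribution.
for b in set(label_x1):
    block = P[np.ix_(label_x1 == b, label_x2 == b)]
    block = block / block.sum()
    assert np.allclose(block, np.outer(block.sum(axis=1), block.sum(axis=0)))
```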
Our goal here is to prove that this still holds in the approximate case, using the same information theoretic approximation methods used in our other posts on natural latents (and explained here).
The Problem
"Stochastic" Natural Latents
Stochastic natural latents were introduced in the original Natural Latents post. Any latent Λ over random variables X1,X2 is defined to be a stochastic natural latent when it satisfies these diagrams:
... and Λ is an approximate stochastic natural latent (with error ϵ) when it satisfies the approximate versions of those diagrams to within ϵ, i.e.
ϵ≥DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ])
ϵ≥DKL(P[X,Λ]||P[X2]P[X1|X2]P[Λ|X1])
ϵ≥DKL(P[X,Λ]||P[X1]P[X2|X1]P[Λ|X2])
Key thing to note: if Λ satisfies these conditions, then we can create another stochastic natural latent Λ′ by simply appending some random noise to Λ, independent of X. This shows that Λ can, in general, contain arbitrary amounts of irrelevant noise while still satisfying the stochastic natural latent conditions.
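A minimal sketch of these three errors as code (my own toy representation, a 3-axis array P[x1, x2, λ], not anything from the post), plus a numerical check of the "appending noise" remark:

```python
import numpy as np

def H(p):
    """Shannon entropy (bits)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def stochastic_nl_errors(P):
    """P[x1, x2, lam] -> (mediation, redundancy_1, redundancy_2) in bits.
    Each KL divergence above reduces to a conditional mutual information:
    I(X1;X2|Λ), I(Λ;X2|X1), and I(Λ;X1|X2) respectively."""
    P_x1_lam, P_x2_lam, P_x = P.sum(axis=1), P.sum(axis=0), P.sum(axis=2)
    P_lam, P_x1, P_x2 = P.sum(axis=(0, 1)), P.sum(axis=(1, 2)), P.sum(axis=(0, 2))
    mediation = H(P_x1_lam) + H(P_x2_lam) - H(P) - H(P_lam)
    redund_1  = H(P_x1_lam) + H(P_x) - H(P) - H(P_x1)
    redund_2  = H(P_x2_lam) + H(P_x) - H(P) - H(P_x2)
    return mediation, redund_1, redund_2

# "Key thing to note" check: appending an independent fair coin to Λ doubles its
# cardinality but leaves all three errors exactly unchanged.
rng = np.random.default_rng(0)
P = rng.random((3, 3, 2)); P /= P.sum()              # generic toy joint over (X1, X2, Λ)
P_noisy = np.stack([0.5 * P, 0.5 * P], axis=-1).reshape(3, 3, 4)
print(stochastic_nl_errors(P))
print(stochastic_nl_errors(P_noisy))                 # same three numbers
```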
"Deterministic" Natural Latents
Deterministic natural latents were introduced in a post by the same name. Any latent Λ over random variables X1,X2 is defined to be a deterministic natural latent when it satisfies these diagrams:
... and Λ is an approximate deterministic natural latent (with error ϵ) when it satisfies the approximate versions of those diagrams to within ϵ, i.e.
ϵ≥DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ])
ϵ≥H(Λ|X1)
ϵ≥H(Λ|X2)
See the linked post for an explanation of a variable appearing multiple times in a diagram, and of how the approximation conditions for those diagrams simplify to entropy bounds.
Note that the deterministic natural latent conditions, either with or without approximation, imply the stochastic natural latent conditions; a deterministic natural latent is also a stochastic natural latent.
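Same toy representation as the sketch above (again mine, not the post's): the deterministic-latent errors replace the two redundancy KLs with conditional entropies, and since I(Λ;X2|X1) ≤ H(Λ|X1) (and likewise with X1, X2 swapped), approximate determinism does imply approximate redundancy.

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def deterministic_nl_errors(P):
    """P[x1, x2, lam] -> (mediation, H(Λ|X1), H(Λ|X2)) in bits."""
    P_x1_lam, P_x2_lam, P_lam = P.sum(axis=1), P.sum(axis=0), P.sum(axis=(0, 1))
    P_x1, P_x2 = P.sum(axis=(1, 2)), P.sum(axis=(0, 2))
    mediation = H(P_x1_lam) + H(P_x2_lam) - H(P) - H(P_lam)
    return mediation, H(P_x1_lam) - H(P_x1), H(P_x2_lam) - H(P_x2)

# Check the implication on a generic joint: the redundancy error I(Λ;X2|X1) is
# always upper-bounded by H(Λ|X1), so small entropy error => small redundancy error.
rng = np.random.default_rng(1)
P = rng.random((3, 4, 2)); P /= P.sum()
redund_1 = H(P.sum(axis=1)) + H(P.sum(axis=2)) - H(P) - H(P.sum(axis=(1, 2)))  # I(Λ;X2|X1)
_, H_lam_x1, _ = deterministic_nl_errors(P)
assert redund_1 <= H_lam_x1 + 1e-9
print(redund_1, "<=", H_lam_x1)
```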
The Goal
We'd like a proof that, if a stochastic natural latent exists over two variables X1,X2 to within approximation ϵ, then a deterministic natural latent exists over those two variables to within an approximation of roughly ϵ (specifically, 9ϵ).
The Proof
Key Ideas
There are two key ideas to the proof.
The first key idea is to use resampling to obtain a latent which satisfies one of the natural latent conditions exactly, and the others approximately.
The second key idea is to consider pareto optimal stochastic natural latents - i.e. latents with pareto minimal error on the three natural latent conditions.
It turns out that stochastic natural latents which exactly satisfy one of the natural latent conditions and are pareto optimal work like the exact case, even when no exact natural latent exists.
Specifically: pareto optimal stochastic natural latents Λ′ over X1,X2 which satisfy one redundancy condition exactly can always be coarse grained into a deterministic natural latent - i.e.
So, fΛ(Λ′) is itself a natural latent with the same errors as Λ′, and it's exactly a deterministic function of X.
This was a big and very welcome surprise to us!
Math
Assumptions & Preconditions
We will assume P[X,Λ]>0 for the original stochastic natural latent Λ. This implies that there are no nontrivial exact natural latents.[1] We make this assumption mainly for mathematical convenience; since we're interested in practical approximation anyway, it is a reasonable assumption to use. But we do expect the assumption can be dropped, at the cost of more details to handle in the proof.
The main preconditions for our proof are that three random variables X1, X2, Λ approximately satisfy the three natural latent conditions, i.e.
or, written out (and simplified a little),
First redundancy condition: ϵ≥DKL(P[X,Λ]||P[X]P[Λ|X1])
Second redundancy condition: ϵ≥DKL(P[X,Λ]||P[X]P[Λ|X2])
Mediation condition: ϵ≥DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ])
Resampling Conserves Naturality
A previous post showed that resampling conserves redundancy. Specifically, we can construct a new latent Λ′ by sampling from P[Λ|X2]; the new joint distribution is then
P[Λ′=λ,X=x] = P[X=x] P[Λ=λ|X2=x2]
Given the two redundancy conditions and P[X,Λ]>0, the new latent Λ′ satisfies the second redundancy condition perfectly (by construction), and satisfies the first redundancy condition to within 9ϵ. (Leveraging all three natural latent conditions, empirical tests also strongly suggest that that bound can be improved from 9ϵ to 3ϵ, but that is not yet proven.)
Now, imagine that instead of constructing Λ′ by sampling from P[Λ|X2], we instead construct X′1 by sampling from P[X1|X2]. Key insight: this results in exactly the same joint distribution as sampling Λ′ from P[Λ|X2], i.e.
P[X′1=x1, X2=x2, Λ=λ] = P[X1=x1, X2=x2, Λ′=λ] = P[X2=x2] P[X1=x1|X2=x2] P[Λ=λ|X2=x2]
By the same "resampling conserves redundancy" theorem, the X′1 construction satisfies the mediation condition to within 9ϵ (rather than the first redundancy condition). But since the X′1 construction and the Λ′ construction yield the same joint distribution, with one of them implying the first redundancy condition is satisfied to within 9ϵ and the other implying the mediation condition is satisfied to within 9ϵ, that same joint distribution must satisfy both conditions to within 9ϵ.
Putting that all together: starting from a latent Λ over X1,X2 which satisfies all three natural latent conditions to within ϵ, we can construct a new latent Λ′ which satisfies the second redundancy condition perfectly, and satisfies the other two conditions to within 9ϵ.
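Here's a sketch of that resampling construction in the same toy array representation used above (mine, not the authors' code): build Λ′ via P[Λ′=λ, X] = P[X]·P[Λ=λ|X2] and confirm the second redundancy error is exactly zero. (The 9ϵ bound on the other two conditions is the theorem's content, not something this toy check establishes.)

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def errors(P):
    """(mediation, redundancy_1, redundancy_2) in bits for P[x1, x2, lam]."""
    P_x1_lam, P_x2_lam, P_x = P.sum(axis=1), P.sum(axis=0), P.sum(axis=2)
    return (H(P_x1_lam) + H(P_x2_lam) - H(P) - H(P.sum(axis=(0, 1))),
            H(P_x1_lam) + H(P_x) - H(P) - H(P.sum(axis=(1, 2))),
            H(P_x2_lam) + H(P_x) - H(P) - H(P.sum(axis=(0, 2))))

def resample_from_x2(P):
    """New latent Λ′ with P[Λ′=λ, X] = P[X] · P[Λ=λ | X2]."""
    P_x = P.sum(axis=2)                                      # P[X1, X2]
    P_x2_lam = P.sum(axis=0)                                 # P[X2, Λ]
    P_lam_given_x2 = P_x2_lam / P_x2_lam.sum(axis=1, keepdims=True)
    return P_x[:, :, None] * P_lam_given_x2[None, :, :]

rng = np.random.default_rng(2)
P = rng.random((3, 3, 3)); P /= P.sum()                      # strictly positive toy joint
P_prime = resample_from_x2(P)
print("before:", errors(P))
print("after: ", errors(P_prime))                            # second redundancy error = 0
assert abs(errors(P_prime)[2]) < 1e-9
```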
We'll use that new latent as our starting point for the second half of the proof, in which we look for pareto improvements upon Λ′.
Pareto Minimization -> Single Objective Minimization
In the second half of the proof, we'll consider taking pareto improvements upon Λ′, i.e. looking for a latent Λ∗ with pareto-optimal error on the three natural latent conditions. Since we're pareto improving on Λ′, Λ∗ will have zero error on the second redundancy condition (i.e. P[X,Λ∗]=P[X]P[Λ∗|X2]), and at most 9ϵ error on the other two conditions.
First, we convert our pareto minimization problem into a single objective minimization problem in the standard way, in order to use the standard optimization toolset.
A latent Λ∗ is defined by P[Λ∗|X]. We want to characterize latents for which the errors on the three natural latent conditions are pareto minimal, holding P[X1,X2] constant. The three errors are:
(Note that we've written these slightly differently from the previous section. They are equivalent, and these expressions will save some minor rearrangement in the proof.)
To use the usual optimization toolset, we convert the pareto minimization problem into a single objective minimization problem by assigning weights α to each error. Our single objective is
obj:=αmedDKL(X1←Λ∗→X2)+α1DKL(Λ∗→X1→X2)+α2DKL(Λ∗→X2→X1)
Any pareto minimum for the original problem must be a minimum of obj for some (αmed,α1,α2) with αi≥0 for all indices (including med, which is an ordinary index). Without loss of generality, we assume ∑iαi=1. In general, different α's in the single objective problem pick out different pareto minima in the original problem.
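A rough sketch of that scalarization (hypothetical code, not the numerical experiments mentioned in the comments): parameterize P[Λ∗|X] with a softmax, hold P[X] fixed, and minimize the α-weighted sum of the three errors. A generic optimizer will only find local optima, so this is illustrative rather than a faithful way to enumerate the pareto frontier.

```python
import numpy as np
from scipy.optimize import minimize

def H(p):
    p = p[p > 1e-300]
    return -np.sum(p * np.log2(p))

def errors(P):
    """(mediation, redundancy_1, redundancy_2) in bits for P[x1, x2, lam]."""
    P_x1_lam, P_x2_lam, P_x = P.sum(axis=1), P.sum(axis=0), P.sum(axis=2)
    return np.array([H(P_x1_lam) + H(P_x2_lam) - H(P) - H(P.sum(axis=(0, 1))),
                     H(P_x1_lam) + H(P_x) - H(P) - H(P.sum(axis=(1, 2))),
                     H(P_x2_lam) + H(P_x) - H(P) - H(P.sum(axis=(0, 2)))])

def joint_from_theta(theta, P_x, n_lam):
    """P[X, Λ∗] = P[X] · softmax(theta[x]); theta parameterizes P[Λ∗|X]."""
    logits = theta.reshape(*P_x.shape, n_lam)
    P_lam_given_x = np.exp(logits - logits.max(axis=-1, keepdims=True))
    P_lam_given_x /= P_lam_given_x.sum(axis=-1, keepdims=True)
    return P_x[:, :, None] * P_lam_given_x

def obj(theta, P_x, n_lam, alpha):
    return float(alpha @ errors(joint_from_theta(theta, P_x, n_lam)))

rng = np.random.default_rng(3)
P_x = rng.random((3, 3)); P_x /= P_x.sum()                   # fixed observed P[X1, X2]
alpha = np.array([0.4, 0.3, 0.3])                            # (α_med, α_1, α_2)
n_lam = 3
res = minimize(obj, rng.normal(size=P_x.size * n_lam), args=(P_x, n_lam, alpha),
               method="L-BFGS-B")
print("errors at a local optimum:", errors(joint_from_theta(res.x, P_x, n_lam)))
```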
Lagrangian & First Order Conditions
Now we turn the crank.
We (implicitly so far) have two constraints on our optimization problem: the conditional distribution P[Λ∗|X] must be normalized (∑Λ∗ P[Λ∗|X] = 1 for each X), and each probability must be nonnegative (P[Λ∗|X] ≥ 0).
We introduce Lagrange multipliers ν[X], μ[Λ∗,X], respectively, for these two constraints. That gives the Lagrangian
L := obj − ∑X ν[X](∑Λ∗ P[Λ∗|X] − 1) − ∑Λ∗,X μ[Λ∗,X] P[Λ∗|X]
Differentiating L with respect to P[Λ∗|X] (at constant P[X]), simplifying, and absorbing some terms into the Lagrange multipliers yields our first order condition:
0=lnP[X|Λ∗]−(α1+αmed)lnP[X1|Λ∗]−(α2+αmed)lnP[X2|Λ∗]−ν′[X]−μ[X,Λ∗]
where the Lagrange multiplier ν′ has absorbed some terms which depend only on X.
Note that, while the term μ[X,Λ∗] looks completely arbitrary, it is constrained by complementary slackness: μ[X,Λ∗] can be nonzero only if P[Λ∗|X] is zero, i.e. μ[X,Λ∗]P[Λ∗|X]=0 for all X,Λ∗.
Putting The Pieces Together & Solving The Equations
Earlier, we established that a latent exists which satisfies the second redundancy condition perfectly and has error at most 9ϵ on the other two conditions. So, let's use our first order condition to characterize specifically those pareto optimal latents which perfectly satisfy the second redundancy condition.
Perfect satisfaction of the second redundancy condition means P[X|Λ∗]=P[X1|X2]P[X2|Λ∗]. Substituting that into the first order condition and simplifying gives
0=−(α1+αmed)lnP[X1|Λ∗]+α1lnP[X2|Λ∗]−ν′′[X]−μ[X,Λ∗]
(Spelling out the substitution: lnP[X|Λ∗] = lnP[X1|X2] + lnP[X2|Λ∗]; the lnP[X1|X2] term depends only on X and is absorbed into ν′′[X], and the coefficient on lnP[X2|Λ∗] becomes 1−(α2+αmed) = α1, using ∑iαi=1.)
Now, pick values x,λ0,λ1 such that P[X=x|Λ∗=λ0]>0 and P[X=x|Λ∗=λ1]>0. Then μ[X=x,Λ∗=λ0]=μ[X=x,Λ∗=λ1]=0 by complementary slackness, and we can subtract the first order conditions at (x,λ1) and (x,λ0) to get
0=−(α1+αmed)(lnP[X1=x1|Λ∗=λ1]−lnP[X1=x1|Λ∗=λ0])+α1(lnP[X2=x2|Λ∗=λ1]−lnP[X2=x2|Λ∗=λ0])
Note that one of those terms depends on X1 (but not X2), and the other depends on X2 (but not X1), so the only way they can add to 0 for all X values for which P[X|Λ∗=λ0]>0 and P[X|Λ∗=λ1]>0 is if both are equal to some C(λ0,λ1) which does not depend on X:
C(λ0,λ1)=(α1+αmed)(lnP[X1|Λ∗=λ1]−lnP[X1|Λ∗=λ0])
C(λ0,λ1)=α1(lnP[X2|Λ∗=λ1]−lnP[X2|Λ∗=λ0])
Both of those equations must hold for all X such that P[X|Λ∗=λ0]>0 and P[X|Λ∗=λ1]>0.
Notably, our assumption P[X,Λ]>0[2] implies P[X]>0 implies P[X1|X2]>0. Combined with P[X|Λ∗]=P[X1|X2]P[X2|Λ∗] (the perfect second redundancy condition for Λ∗), that means P[X1|Λ∗]>0 for all X1,Λ∗ - or, put differently, for all X1, Λ∗, there exists some X2 such that P[X1,X2,Λ∗]>0. So,
C(λ0,λ1)=(α1+αmed)(lnP[X1|Λ∗=λ1]−lnP[X1|Λ∗=λ0])
must hold for all X1, whenever the supports of P[X2|Λ∗=λ0] and P[X2|Λ∗=λ1] overlap at all. Shuffling terms around, we get
P[X1|Λ∗=λ1] = P[X1|Λ∗=λ0] e^(C(λ0,λ1)/(α1+αmed))
Sum on X1 on both sides, and we get 1 = e^(C(λ0,λ1)/(α1+αmed)), implying C(λ0,λ1)=0 and therefore P[X1|Λ∗=λ1]=P[X1|Λ∗=λ0].
In short: given two Λ∗ values λ0,λ1, if there exists any X2 such that P[X2|Λ∗=λ0]>0 and P[X2|Λ∗=λ1]>0 (i.e. the support of P[X2|Λ∗] overlaps for the two Λ∗ values), then the two values yield exactly the same distribution P[X1|Λ∗].
Furthermore, since C(λ0,λ1)=0 whenever P[X2|Λ∗=λ0] and P[X2|Λ∗=λ1] have overlapping support, we also have
P[X2|Λ∗=λ0]=P[X2|Λ∗=λ1]
anywhere that both of those quantities are nonzero.
A (Non-Strict) Pareto Improvement Via Coarse Graining
A quick recap of where that last section leaves us. We've established that: (a) starting from Λ′, we can find a pareto optimal latent Λ∗ which still satisfies the second redundancy condition perfectly and has at most 9ϵ error on the other two natural latent conditions; and (b) for any such latent, whenever two values λ0,λ1 have overlapping support of P[X2|Λ∗], they yield exactly the same distribution P[X1|Λ∗], and P[X2|Λ∗=λ0]=P[X2|Λ∗=λ1] wherever both are nonzero.
Now, assume the supports of P[X2|Λ∗=λ0] and P[X2|Λ∗=λ1] overlap somewhere, and consider coarse graining those two values of Λ∗. Compared to Λ∗ itself, how does the coarse grained variable g(Λ∗) score on each of the natural latent conditions?
So, without making the errors on any of the three natural latent conditions any worse, we can coarse grain all Λ∗ values λ0,λ1 for which the supports of P[X2|Λ∗=λ0] and P[X2|Λ∗=λ1] overlap somewhere.
Once all such coarse graining is performed, we have a new coarse grained latent g(Λ∗) for which the supports of P[X2|g(Λ∗)] are nonoverlapping across all pairs of distinct g(Λ∗) values.
In other words: g(Λ∗) is exactly a deterministic function of X2 (and therefore still perfectly satisfies the second redundancy condition), and satisfies the first redundancy condition and mediation condition each to within 9ϵ.
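A sketch of that coarse-graining step (again my own toy code, not the authors'): merge latent values whose P[X2|Λ∗] supports overlap, using connected components over latent values. Note the guarantee that this doesn't worsen any errors is specific to the pareto-optimal latents characterized above; on a generic latent this merge can be lossy (and can collapse everything into one value).

```python
import numpy as np

def coarse_grain_by_x2_support(P, tol=1e-12):
    """P[x1, x2, lam] -> P[x1, x2, g]: merge latent values whose P[X2|Λ∗] supports overlap."""
    support = P.sum(axis=0) > tol           # support[x2, lam]: is x2 in the support of λ?
    n_lam = P.shape[2]
    parent = list(range(n_lam))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a in range(n_lam):
        for b in range(a + 1, n_lam):
            if np.any(support[:, a] & support[:, b]):
                parent[find(a)] = find(b)
    roots = sorted({find(i) for i in range(n_lam)})
    relabel = {r: g for g, r in enumerate(roots)}
    P_cg = np.zeros(P.shape[:2] + (len(roots),))
    for lam in range(n_lam):
        P_cg[:, :, relabel[find(lam)]] += P[:, :, lam]
    return P_cg

# Tiny usage example: three latent values; the first two have overlapping X2-support.
P = np.zeros((2, 3, 3))
P[:, 0, 0] = [0.10, 0.10]; P[:, 1, 0] = [0.05, 0.05]   # λ=0 supported on X2 in {0, 1}
P[:, 1, 1] = [0.10, 0.10]                              # λ=1 supported on X2 in {1}
P[:, 2, 2] = [0.20, 0.30]                              # λ=2 supported on X2 in {2}
P_cg = coarse_grain_by_x2_support(P)
print(P_cg.shape)   # (2, 3, 2): λ=0 and λ=1 merged, λ=2 kept separate
# Each X2 value now determines the coarse-grained value, i.e. H(g(Λ∗) | X2) = 0.
```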
Finally, A Deterministic Natural Latent
Lastly, note that H(g(Λ∗)|X2)=0 and X2→X1→g(Λ∗) to within 9ϵ together imply 9ϵ≥H(g(Λ∗)|X1). Combined with the mediation condition, that implies g(Λ∗) is approximately a deterministic natural latent, with errors at most 9ϵ on the mediation condition, 9ϵ on H(g(Λ∗)|X1), and 0 on H(g(Λ∗)|X2).
Can we do better?
The main room for improvement of the bounds in this proof is in the resampling step. The resampling conserves redundancy post notes where those bounds could be improved, and presents a little empirical evidence that they can be improved to 3ϵ (using all three natural latent conditions) or 4ϵ (using only redundancy).
What's Next?
We've been bottlenecked pretty hard on this theorem for the past 3-4 months.
Now that we finally have it, we expect to largely abandon stochastic natural latents in favor of deterministic natural latents. For instance, one immediate next step will be to rewrite our Illiad paper from last year to work with deterministic natural latents, which will eliminate the weakest parts of that paper and give a much more compelling case. (No, we're not linking to the old paper, because the new one is going to be a lot better.)
On another front: stochastic natural latents are relatively easy to test for in datasets, by looking for three variables each of which mediates between the other two. Now we have some idea of what to do with those triples when we find them: compute the deterministic constraint between them.
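A sketch of what such a test could look like on discrete data (the estimator, threshold, and synthetic demo are all my own choices, not the post's): for each triple of columns, estimate the three pairwise conditional mutual informations and flag triples where all of them are small.

```python
import numpy as np
from itertools import combinations
from collections import Counter

def H_rows(rows):
    counts = np.array(list(Counter(rows).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def cond_mi(a, b, c):
    """Empirical I(a; b | c) in bits from discrete sample columns (naive plug-in)."""
    return (H_rows(list(zip(a, c))) + H_rows(list(zip(b, c)))
            - H_rows(list(zip(a, b, c))) - H_rows(list(c)))

def mediating_triples(data, tol=0.1):
    """Triples of columns where each variable approximately mediates between the other two."""
    found = []
    for i, j, k in combinations(range(data.shape[1]), 3):
        errs = [cond_mi(data[:, j], data[:, k], data[:, i]),   # i mediates
                cond_mi(data[:, i], data[:, k], data[:, j]),   # j mediates
                cond_mi(data[:, i], data[:, j], data[:, k])]   # k mediates
        if max(errs) <= tol:
            found.append(((i, j, k), errs))
    return found

# Synthetic demo: column 2 is a shared bit; columns 0 and 1 are slightly noisy copies of it.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=5000)
flip = lambda x: np.where(rng.random(x.size) < 0.01, 1 - x, x)
data = np.stack([flip(z), flip(z), z], axis=1)
print(mediating_triples(data))   # the triple (0, 1, 2) should qualify
```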
Beyond those two immediate projects, we expect this result to be foundational for basically all of our work on natural latents going forward.
[1] This is because P[X, Λ] > 0 implies P[X] > 0, and nontrivial exact natural latents necessarily require some values of X to have probability 0.
[2] Note that that's Λ, i.e. the original latent, not Λ∗.