Daniel C — LessWrong

Congratulations!

I would guess the issue with KL relates to the fact that a bound on $D_{KL} (P ∥ Q)$ permits situations where $P (X = x)$ is small but $Q (X = x)$ is large (as we take the expectation under $P$), whereas JS penalizes both ways.

In particular, in the original theorem on resampling using KL divergence, the assumption bounds KL w.r.t. the joint distribution $P (X, Λ)$, so there may be situations where the resampled probability $Q (X = x, Λ = λ) = P (X = x) P (Λ = λ | X_{2} = x_{2})$ is large but $P (X = x, Λ = λ)$ is small. But the intended conclusion bounds the KL under the resampled distribution $Q$, so the error on the values $(X = x, Λ = λ)$ would be weighted much more under $Q$ than under $P$. Since we're taking the expectation under $Q$ for the conclusion, the bound on the other resampling error under $P$ becomes insufficient.
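A quick numerical sketch of the asymmetry (the two distributions below are made up for illustration, they are not from the post): when $P$ puts almost no mass on an outcome that $Q$ weights heavily, $D_{KL}(P ∥ Q)$ stays small while $D_{KL}(Q ∥ P)$ blows up, and JS treats both directions the same.

```python
from math import log2

def kl(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence in bits (symmetric in p and q)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical example: the second outcome is rare under P but common under Q.
P = [1 - 1e-9, 1e-9]
Q = [0.9, 0.1]

print("KL(P||Q) =", kl(P, Q))  # expectation under P: the rare outcome barely counts
print("KL(Q||P) =", kl(Q, P))  # expectation under Q: the same outcome dominates
print("JS(P,Q)  =", js(P, Q), "JS(Q,P) =", js(Q, P))
```

Shrinking the `1e-9` further leaves `kl(P, Q)` essentially unchanged but makes `kl(Q, P)` grow without bound, which is the failure mode described above.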

Resampling Conserves Redundancy (Approximately)

Daniel C9d30

Would this still give us guarantees on the conditional distribution ?

E.g. Mediation: $D_{KL} (P (X_{1}, X_{2}, Λ) ∥ P (X_{1} | Λ) P (X_{2} | Λ) P (Λ))$ $= D_{KL} (P (X_{1}, X_{2} | Λ) P (Λ) ∥ P (X_{1} | Λ) P (X_{2} | Λ) P (Λ))$ $= D_{KL} (P (X_{1}, X_{2} | Λ) ∥ P (X_{1} | Λ) P (X_{2} | Λ))$

is really about the expected error conditional on individual values of $Λ$ , & it seems like there are distributions with high mediation error but low error when the latent is marginalized inside $D_{K L}$ , which could be load-bearing when the agents cast out predictions on observables after updating on $Λ$
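A toy instance of such a distribution, as a sketch. The XOR construction is my own choice of example, and I'm reading "marginalized inside $D_{KL}$" as comparing $P (X_{1}, X_{2})$ against $\sum_{Λ} P (X_{1} | Λ) P (X_{2} | Λ) P (Λ)$, which is an assumption about the intended quantity. With $X_{1}, X_{2}$ independent fair bits and $Λ = X_{1} \oplus X_{2}$, the mediation error is a full bit while the marginalized error vanishes.

```python
from itertools import product
from math import log2

# Joint P(x1, x2, lam): X1, X2 are independent fair bits and Lambda = X1 XOR X2.
joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

def marginal(dist, keep):
    """Marginalize a dict-based distribution onto the index positions in `keep`."""
    out = {}
    for outcome, p in dist.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def kl_bits(p, q):
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pv * log2(pv / q[k]) for k, pv in p.items() if pv > 0)

p_lam = marginal(joint, (2,))
p_x1_lam = marginal(joint, (0, 2))
p_x2_lam = marginal(joint, (1, 2))

# Factored model P(x1 | lam) P(x2 | lam) P(lam) = P(x1, lam) P(x2, lam) / P(lam).
factored = {
    (x1, x2, lam): p_x1_lam[(x1, lam)] * p_x2_lam[(x2, lam)] / p_lam[(lam,)]
    for x1, x2, lam in product((0, 1), repeat=3)
}

mediation_error = kl_bits(joint, factored)
marginalized_error = kl_bits(marginal(joint, (0, 1)), marginal(factored, (0, 1)))
print(mediation_error, marginalized_error)  # 1.0 0.0
```

Conditional on either value of $Λ$ the two bits are perfectly (anti-)correlated, so predictions made after updating on $Λ$ are badly off under the factored model even though the marginal over observables matches exactly.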

johnswentworth's Shortform

Daniel C23d30

The current theory is based on classical Hamiltonian mechanics, but I think the theorems apply whenever you have a Markovian coarse-graining. Fermion doubling is a problem for spacetime discretization in the quantum case, so the coarse-graining might need to be different. (E.g. coarse-grain the entire Hilbert space, which might have locality issues, but those are probably not load-bearing for algorithmic thermodynamics.)

On outside view, quantum reduces to classical (which admits Markovian coarse-graining) in the correspondence limit, so there must be some coarse-graining that works.

johnswentworth's Shortform

Daniel C24d61

I also talked to Aram recently & he's optimistic that there's an algorithmic version of the generalized heat engine where the hot vs cold pool correspond to high vs low k-complexity strings. I'm quite interested in doing follow-up work on that

johnswentworth's Shortform

Daniel C24d76

The continuous state-space is coarse-grained into discrete cells where the dynamics are approximately Markovian (the theory is currently classical) & the "laws of physics" probably refers to the stochastic matrix that specifies the transition probabilities of the discrete cells (otherwise we could probably deal with infinite precision through limit computability)

Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies

Daniel C1mo30

As in, take a set of variables X, then search for some set of its (non-overlapping?) subsets such that there's a nontrivial natural latent over it? Right, it's what we're doing here as well.

I think the subsets can actually be partially overlapping. For instance, you may have a $Λ$ that's approximately deterministic w.r.t. $\{X_{1}, X_{2}\}$ and $\{X_{2}, X_{3}\}$ but not $X_{2}$ alone. Weak redundancy (approximately deterministic w.r.t. $X_{\bar{i}}$ $\forall i$) is also an example of redunds across overlapping subsets
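A minimal toy construction of the overlapping case (my own example, exact rather than approximate): take independent fair bits $Λ$ and $R$, and set $X_{2} = R$, $X_{1} = X_{3} = Λ \oplus R$. Then the latent is recoverable from $\{X_{1}, X_{2}\}$ and from $\{X_{2}, X_{3}\}$, but $X_{2}$ alone says nothing about it.

```python
from collections import Counter
from itertools import product
from math import log2

# Each outcome is ((x1, x2, x3), lam), uniform over the hidden bits (lam, r).
outcomes = []
for lam, r in product((0, 1), repeat=2):
    x2 = r
    x1 = lam ^ r
    x3 = lam ^ r
    outcomes.append(((x1, x2, x3), lam))

def cond_entropy(idx):
    """H(Lambda | X_idx) in bits under the uniform distribution over `outcomes`."""
    n = len(outcomes)
    joint = Counter((tuple(x[i] for i in idx), lam) for x, lam in outcomes)
    marg = Counter(tuple(x[i] for i in idx) for x, _ in outcomes)
    return -sum(c / n * log2(c / marg[key]) for (key, _), c in joint.items())

print("H(L | X1, X2) =", cond_entropy((0, 1)))  # zero: determined by {X1, X2}
print("H(L | X2, X3) =", cond_entropy((1, 2)))  # zero: determined by {X2, X3}
print("H(L | X2)     =", cond_entropy((1,)))    # one bit: X2 alone is uninformative
```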

Research Agenda: Synthesizing Standalone World-Models

Daniel C1mo30

Mm, this one's shaky. Cross-hypothesis abstractions don't seem to be a good idea, see here.

yea so I think the final theory of abstraction will have a weaker notion of equivalence, especially when we incorporate ontology shifts. E.g. we want to say that water is the same concept before and after we discover water is H2O, but the discovery obviously breaks predictive agreement (indeed, the Solomonoff version of natural latent is more robust to the agreement condition)

Also, you can totally add new information/abstraction that is not shared between your current and new hypothesis, & that seems consistent with the picture you described here (you can have separate ontologies but you try to capture the overlap as much as possible)

My guess is that there's something like a hierarchy of hypotheses, with specific high-level hypotheses corresponding to several lower-level more-detailed hypotheses, and what you're pointing at by "redundant information across a wide variety of hypotheses" is just an abstraction in a (single) high-level hypothesis which is then copied over into lower-level hypotheses. (E. g., the high-level hypothesis is the concept of a tree, the lower-level hypotheses are about how many trees are in this forest.)

yes I think that's the right picture

But we don't derive it by generating a bunch of low-level hypotheses and then abstracting over them, that'd lead to broken ontologies.

I agree that we don't do that practically as it'd be slower (instead we simply generate an abstraction & use future feedback to determine whether it's a robust one), but I think if you did generate a bunch of low-level hypotheses and look for redundant computation among them, then an adequate version of it would just recover the "high-level low-level hypotheses" picture you've described?

In particular, with cross-hypothesis abstraction we don't have to separately define what the variables are, so we can sidestep dataset-assembly entirely & perhaps simplify the shifting structures problem

Synthesizing Standalone World-Models, Part 2: Shifting Structures

Daniel C1mo40

Nice, I've gestured at similar things in this comment. Conceptually, the main thing you want to model is variables that control the relationships between other variables. The upshot is that you can continue the recursion indefinitely: once you have second-order variables that control the relationships between other variables, you can then have variables that control the relationships among second-order variables, and so on.

Using function calls as an analogy: When you're executing a function that itself makes a lot of function calls, there are two main ways these function calls can be useful:

The results of these function calls might be used to compute the final output
The results of these function calls can tell you what other function calls would be useful to make (e.g. if you want to find the shape of a glider, the position tells you which cells to look at to determine that)

an adequate version of this should also be Turing complete, which means it can accommodate shifting structures, & function calls seem like a good way to represent hierarchies of abstractions

CSI in Bayesian networks also deals with the idea that the causal structure between variables changes over time/depending on context (you're probably more interested in how relationships between levels of abstraction change with context, but the two directions seem linked). I plan to explore the following variant at some point (not sure if it's already in the literature):

Suppose that there is a variable $Y$ that "controls" the causal structure of $X$. We use the good-old KL approximation to represent the error conditional on a particular value of $Y$: $D_{KL} (P (X | Y = y) ∥ Π_{i} P (X_{i} | X_{pa_{G} (i)}, Y = y))$ under a particular diagram $G$
You can imagine that the conditional distribution initially approximately satisfies a diagram $G_{1}$, but as you change the value of $Y$, the error for $G_{1}$ goes up while the error for some other diagram $G_{2}$ goes to 0
In particular, if $Y$ is a continuous variable, and the conditional distribution $P (X | Y = y)$ changes continuously with $Y$, then $D_{KL} (P (X | Y = y) ∥ Π_{i} P (X_{i} | X_{pa_{G} (i)}, Y = y))$ changes continuously with $Y$, which is quite nice
So this is a formalism that deals with "context-dependent structure" in a way that plays well with continuity, and if you have discrete variables controlling the causal structure, you can use it to accommodate uncertainty over the discrete outcomes (that determine causal structure).
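Here's a small sketch of what that could look like numerically. Everything concrete in it is an assumption for illustration: three binary variables, two chain diagrams $G_{1}$: $X_{0} → X_{1} → X_{2}$ and $G_{2}$: $X_{0} → X_{2} → X_{1}$, and $P (X | Y = y)$ taken to be a mixture $(1 - y) P_{A} + y P_{B}$ where $P_{A}$ factors over $G_{1}$ and $P_{B}$ over $G_{2}$.

```python
from itertools import product
from math import log2

BITS = (0, 1)

def noisy_chain(order, flip=0.1):
    """P over (x0, x1, x2) where X_a -> X_b -> X_c are noisy copies, (a, b, c) = order."""
    a, b, c = order
    dist = {}
    for x in product(BITS, repeat=3):
        p = 0.5
        p *= flip if x[b] != x[a] else 1 - flip
        p *= flip if x[c] != x[b] else 1 - flip
        dist[x] = p
    return dist

def mix(pa, pb, y):
    """Hypothetical P(X | Y = y): a mixture that moves from pa to pb as y goes 0 -> 1."""
    return {x: (1 - y) * pa[x] + y * pb[x] for x in pa}

def marginal(dist, keep):
    out = {}
    for x, p in dist.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def chain_error(dist, order):
    """D_KL(P || P(a) P(b|a) P(c|b)) in bits, with the conditionals taken from P itself."""
    a, b, c = order
    p_ab = marginal(dist, (a, b))
    p_bc = marginal(dist, (b, c))
    p_b = marginal(dist, (b,))
    total = 0.0
    for x, p in dist.items():
        if p > 0:
            q = p_ab[(x[a], x[b])] * p_bc[(x[b], x[c])] / p_b[(x[b],)]
            total += p * log2(p / q)
    return total

G1 = (0, 1, 2)  # X0 -> X1 -> X2
G2 = (0, 2, 1)  # X0 -> X2 -> X1
P_A = noisy_chain(G1)
P_B = noisy_chain(G2)

for y in (0.0, 0.25, 0.5, 0.75, 1.0):
    p_y = mix(P_A, P_B, y)
    print(y, round(chain_error(p_y, G1), 4), round(chain_error(p_y, G2), 4))
```

The $G_{1}$ error is (numerically) zero at $y = 0$ and the $G_{2}$ error is zero at $y = 1$, with both varying smoothly in between since the mixture is continuous in $y$.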

Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies

Daniel C1mo*50

But note that synergistic information can be defined by referring purely to the system we're examining, with no "external" target variable. If we have a set of variables $X$, we can define the variable $s$ such that $I (X; s)$ is maximized under the constraint of $\forall X_{i} \in (P (X) ∖ X) : I (X_{i}; s) = 0$. (Where $P (X) ∖ X$ is the set of all subsets of $X$ except $X$ itself.)

That's a nice formulation of synergistic information. It's independent of redundant info via the data-processing inequality $0 = I (X_{i}; s) \geq I (f (X_{i}); s)$, so it's somewhat promising that it can add up to total entropy.
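The standard sanity check for this kind of definition is XOR (my choice of example, not from the post): with two independent fair bits and $s = X_{1} \oplus X_{2}$, each proper subset has zero mutual information with $s$, any function of a single $X_{i}$ inherits that zero, and the full set carries one bit.

```python
from collections import Counter
from itertools import product
from math import log2

# Uniform samples (x1, x2, s) with s = x1 XOR x2.
samples = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]

def mutual_info(f, g):
    """I(f(sample); g(sample)) in bits under the uniform distribution over `samples`."""
    n = len(samples)
    pa = Counter(f(t) for t in samples)
    pb = Counter(g(t) for t in samples)
    pab = Counter((f(t), g(t)) for t in samples)
    return sum(
        c / n * log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in pab.items()
    )

s = lambda t: t[2]
print("I(X1; s)      =", mutual_info(lambda t: t[0], s))          # 0.0
print("I(X2; s)      =", mutual_info(lambda t: t[1], s))          # 0.0
print("I(f(X1); s)   =", mutual_info(lambda t: 1 - t[0], s))      # 0.0, as data processing requires
print("I(X1, X2; s)  =", mutual_info(lambda t: (t[0], t[1]), s))  # 1.0
```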

You might be interested in this comment if distinguishing between synergistic and redundant information is not your main objective: You can simply define redunds over collections of subsets, such that e.g. "dogness" is a redund over every subset of atoms that allows you to conclude you're looking at a dog. In particular, the redundancy lattice approach seems simpler when the latent depends on not just synergistic but also redundant and unique information

One issue with PID worth mentioning is that they haven't figured out what measure to use for quantifying multivariate redundant information. It's the same problem we seem to have. But it's probably not a major issue in the setting we're working in (the well-abstracting universes).

A recent impossibility result seems to rule out a general multivariate PID that guarantees non-negativity of all components, though partial entropy decomposition may be more tractable

If there's a pair of $q_{i}$ , $q_{k}$ such that $X_{i} \subset X_{k}$ , then $q_{i}$ necessarily contains all information in $q_{k}$ . Re-define $q_{i}$ , removing all information present in $q_{k}$ .

This seems similar to capturing unique information, where the constructive approach is probably harder in PID than PED. E.g. in BROJA it involves an optimization problem over distributions with some constraints on marginals, but it only estimates the magnitude of unique info, not an actual random variable that represents unique info

Research Agenda: Synthesizing Standalone World-Models

Daniel C1mo*30

Nice post!

Some frames about abstractions & ontology shifts I had while thinking through similar problems (which you may have considered already):

The dual of "abstraction as redundant information across a wide variety of agents in the same environment" is "abstraction as redundant information/computation across a wide variety of hypotheses about the environment in an agent's world model" (E.g. a strawberry is a useful concept to model for many worlds that I might be in). I think this is a useful frame when thinking about "carving up" the world model into concepts, since a concept needs to remain invariant while the hypothesis keeps being updated
The semantics of a component in a world model is partly defined by its relationship with the rest of the components (e.g. move a neuron to a different location and its activation will have a different meaning), so if you want a component to have stable semantics over time, you want to put the "relational/indexical information" inside the component itself
In particular, this means that when an agent acquires new concepts, the existing concepts should be able to "specify" how it should relate to that new concept (e.g. learning about chemistry then using it to deduce macro-properties of strawberries from molecular composition)

happy to discuss more via PM as some of my ideas seem exfohazardous
