Daniel C — LessWrong

Great to see the concreteness of this example, some thoughts on the candidate properties:

The relationships between latent variables can change under ontology shifts, but we still want the semantics of our latent variables to remain invariant in some sense. This means we need some other variables tracking the relationships between our latent variables, & we’d update on those variables during ontology shifts. Latent variables are partly selected on the criterion that they can adapt to these changes in relationships while still maintaining the correspondence principle.
There’s a problem of whether we want to think of factorization/latent variables as a stance/frame we’re taking towards the system or an objective structural property of the system. I lean towards the former because when humans model the minds of other humans, they think of those minds as making use of similar abstractions to think & reason despite almost never observing the actual internals of human brains. The degree of freedom in the “latent variable embedding” of nuclear exchange initiation also suggests some notion of subjectivity.
The fact that we have to identify nuclear exchange initiation across a wide variety of scenarios points to ‘re-use’ being an important basis of convergent factorization. Loopiness comes in when our existing reusable abstractions affect what new abstractions/models of the world we’re able to construct.

For this we need a mechanism such that the maintenance of the mechanism is a Schelling point. Specifically, the mechanism at T+1 should reward agents for actions at time T that reinforce the mechanism itself (in particular, the actions are distributed). The incentive raises the probability of the mechanism being actualized at T+1, which in turn raises the "weight" of the reward offered by the mechanism at T+1, creating a self-fulfilling prophecy.

"Merging" forces parallelism back into sequential structures, which is why most blockchains are slow. You could make it faster by bundling a lot of actions together, but you need to make sure all actions are actually observable & checked by most of the agents (aka the data availability problem)

Three Kinds Of Ontological Foundations

Daniel C21d*30

For translatability guarantees, we also want an answer for why agents have distinct concepts for different things, and the criteria for carving up the world model into different concepts. My sketch of an answer is that different hypotheses/agents will make use of different pieces of information under different scenarios, and having distinct reference handles to different types of information allows the hypotheses/agents to access the minimal amount of information they need.

For environment structure, we'd like an answer for what it means for there to be an object that persists through time, or for there to be two instances of the same object. One way this could work is to look at probabilistic predictions of an object over its Markov blanket, and require some sort of similarity in probabilistic predictions when we "transport" the object over spacetime

I'm less optimistic about the mind structure foundation because the interfaces that are the most natural to look at might not correspond to what we call "human concepts", especially when the latter requires a level of flexibility not supported by the former. For instance, human concepts have different modularity structures with each other depending on context (also known as shifting structures), which basically rules out any simple correspondence with interfaces that have fixed computational structure over time. How we want to decompose a world model is an additional degree of freedom to the world model itself, and that has to come from other ontological foundations.

Toward Statistical Mechanics Of Interfaces Under Selection Pressure

Daniel C24d30

Seems like the main additional source of complexity is that each interface has its own local constraint, and the local constraints are coupled with each other (but lower-dimensional than the parameters themselves); whereas regular statmech usually has subsystems sharing the same global constraints (different parts of a room of ideal gas are independent given the same pressure/temperature etc.)

To recover the regular statmech picture, suppose that the local constraints have some shared/redundant information with each other: ideally we'd like to isolate that redundant/shared information into a global constraint that all interfaces have access to, and we'd want the interfaces to be independent given the global constraint. For that we need something like relational completeness, where indexical information is encoded within the interfaces themselves, while the global constraint is shared across interfaces.

The Zen Of Maxent As A Generalization Of Bayes Updates

Daniel C24d30

IIUC there are two scenarios to be distinguished:

One is that the die has bias p unknown to you (you have some prior over p) and you use i.i.d flips to estimate bias as usual & get maxent distribution for a new draw. The draws are independent given p but not independent given your priors, so everything works out.

The other is that the die is literally i.i.d. over your priors. In this case everything from your argument goes through: whatever bias/constraint you happen to estimate from your outcome sequence doesn't say anything about a new i.i.d. draw because they're uncorrelated; the new draw is just another sample from your prior.
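A minimal sketch of the two scenarios, simplified to a two-outcome "die" (a coin) so the arithmetic stays exact. The Beta(1, 1) prior, the function names, and the 8-out-of-10 data are illustrative assumptions, not part of the original discussion.

```python
from fractions import Fraction


def predictive_unknown_bias(heads, n, a=1, b=1):
    """Scenario 1: unknown bias p with a Beta(a, b) prior (assumed for illustration).

    Draws are independent given p but correlated under the prior,
    so the data shifts the prediction for the next draw.
    """
    return Fraction(heads + a, n + a + b)


def predictive_iid_under_prior(heads, n, p_prior=Fraction(1, 2)):
    """Scenario 2: draws are i.i.d. under the prior itself.

    Past outcomes carry no information about the next draw,
    so the data is ignored and the prior is returned unchanged.
    """
    return p_prior


# Same hypothetical data in both scenarios: 8 heads in 10 draws.
print(predictive_unknown_bias(8, 10))     # 3/4
print(predictive_iid_under_prior(8, 10))  # 1/2
```

In the first case the estimated constraint is informative about the next draw; in the second it is not, which matches the distinction drawn above.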

Jemist's Shortform

Daniel C24d10

I think steering is basically learning, backwards, and maybe flipped sideways. In learning, you build up mutual information between yourself and the world; in steering, you spend that mutual information. You can have learning without steering---but not the other way around---because of the way time works.

Alternatively, for learning your brain can start out in any given configuration, and it will end up in the same (small set of) final configuration (one that reflects the world); for steering the world can start out in any given configuration, and it will end up in the same set of target configurations

It seems like some amount of steering without learning is possible (open-loop control): you can reduce entropy in a subsystem while increasing entropy elsewhere to maintain information conservation.

The Zen Of Maxent As A Generalization Of Bayes Updates

Daniel C1mo80

Nice, some connections with why maximum entropy distributions are so ubiquitous:

If your system is ergodic, time average=ensemble average. Hence expected constraints can be estimated via following your dynamical system over time
If your system follows the second law, then entropy increases subject to the constraints

So the system converges to the maxent invariant distribution subject to the constraint, which is why Langevin dynamics converges to the Boltzmann distribution, and you can estimate equilibrium energy by following the particle around.

In particular, we often use maxent to derive the prior itself (=invariant measure), and when our system is out of equilibrium, we can then maximize relative entropy w.r.t our maxent prior to update our distribution
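The Langevin claim can be checked numerically with a rough sketch. It assumes overdamped Langevin dynamics in a harmonic potential $U(x) = x^2/2$ with $kT = 1$, discretized by Euler-Maruyama; the potential, step size, and function name are all illustrative choices. Under the Boltzmann distribution the ensemble-average energy is $kT/2$, and the time average along a single trajectory should land close to it.

```python
import math
import random


def time_average_energy(steps=200_000, dt=0.01, kT=1.0, seed=0):
    """Follow one particle and return the time average of U(x) = x^2 / 2.

    Euler-Maruyama step for dx = -U'(x) dt + sqrt(2 kT) dW.
    """
    rng = random.Random(seed)
    x = 3.0  # deliberately start away from equilibrium
    noise_scale = math.sqrt(2.0 * kT * dt)
    total = 0.0
    for _ in range(steps):
        x += -x * dt + noise_scale * rng.gauss(0.0, 1.0)
        total += 0.5 * x * x
    return total / steps


# Boltzmann prediction for the ensemble average is kT / 2 = 0.5;
# the single-trajectory time average should be close, up to
# discretization and sampling error.
print(time_average_energy())
```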

Resampling Conserves Redundancy & Mediation (Approximately) Under the Jensen-Shannon Divergence

Daniel C1mo10

Congratulations!

I would guess the issue with KL relates to the fact that a bound on $D_{K L} (P ∥ Q)$ permits situations where $P (X = x)$ is small but $Q (X = x)$ is large (as we take the expectation under $P$ ), whereas JS penalizes both ways.

In particular, in the original theorem on resampling using KL divergence, the assumption bounds KL w.r.t. the joint distribution $P (X, Λ)$ , so there may be situations where the resampled probability $Q (X = x, Λ = λ) = P (X = x) P (Λ = λ | X_{2} = x_{2})$ is large but $P (X = x, Λ = λ)$ is small. But the intended conclusion bounds the KL under the resampled distribution $Q$ , so the error on the values $(X = x, Λ = λ)$ would be weighted much more under $Q$ than under $P$ . Since we're taking expectation under $Q$ for the conclusion, the bound on the other resampling error under $P$ becomes insufficient.
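A small numeric sketch of the asymmetry described above. The two distributions are made up for illustration: one outcome has tiny probability under $P$ but sizeable probability under $Q$. The divergence taken under $P$ stays modest, the divergence taken under $Q$ is roughly ten times larger (and grows without bound as `eps` shrinks), while JS is symmetric and bounded by $\ln 2$.

```python
import math


def kl(p, q):
    """KL divergence D(p || q) in nats; the expectation is taken under p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


eps = 1e-6
P = [eps, 1 - eps]  # first outcome is very unlikely under P
Q = [0.1, 0.9]      # ...but fairly likely under Q

print(kl(P, Q))  # modest: the mismatch is barely weighted under P
print(kl(Q, P))  # much larger: the same mismatch is weighted by Q
print(js(P, Q))  # symmetric, and never exceeds ln 2
```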

Resampling Conserves Redundancy (Approximately)

Daniel C1mo30

Would this still give us guarantees on the conditional distribution?

E.g. Mediation: $D_{K L} (P (X_{1}, X_{2}, Λ) ∥ P (X_{1} | Λ) P (X_{2} | Λ) P (Λ))$ $= D_{K L} (P (X_{1}, X_{2} | Λ) P (Λ) ∥ P (X_{1} | Λ) P (X_{2} | Λ) P (Λ))$ $= D_{K L} (P (X_{1}, X_{2} | Λ) ∥ P (X_{1} | Λ) P (X_{2} | Λ))$

is really about the expected error conditional on individual values of $Λ$ , & it seems like there are distributions with high mediation error but low error when the latent is marginalized inside $D_{K L}$ , which could be load-bearing when the agents cast out predictions on observables after updating on $Λ$
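One concrete instance of this gap, as a sketch under one possible reading of "marginalized inside $D_{K L}$": compare $P (X_{1}, X_{2})$ with the marginal of the factored model $\sum_{λ} P (X_{1} | λ) P (X_{2} | λ) P (λ)$. The XOR distribution below is an illustrative choice, not taken from the post: $X_{1}, X_{2}$ are independent fair bits and $Λ = X_{1} ⊕ X_{2}$.

```python
import math
from itertools import product

# Joint P(x1, x2, lam) with lam = x1 XOR x2 (illustrative distribution).
joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}


def marginal(dist, keep):
    """Sum out every coordinate whose index is not listed in `keep`."""
    out = {}
    for outcome, p in dist.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def kl(p, q):
    """KL divergence D(p || q) in nats over dict-valued distributions."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)


p_lam = marginal(joint, [2])
p_x1_lam = marginal(joint, [0, 2])
p_x2_lam = marginal(joint, [1, 2])

# Factored model: P(x1 | lam) P(x2 | lam) P(lam).
factored = {
    (x1, x2, lam): p_x1_lam[(x1, lam)] * p_x2_lam[(x2, lam)] / p_lam[(lam,)]
    for x1, x2, lam in product((0, 1), repeat=3)
}

mediation_error = kl(joint, factored)
marginalized_error = kl(marginal(joint, [0, 1]), marginal(factored, [0, 1]))

print(mediation_error)     # ln 2: given lam, x1 fully determines x2
print(marginalized_error)  # 0: with lam summed out the two models agree
```

Here the mediation error is maximal for a pair of bits, yet the error vanishes once $Λ$ is summed out, so a bound on the marginalized quantity alone would say nothing about predictions made after updating on $Λ$.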

johnswentworth's Shortform

Daniel C2mo30

The current theory is based on classical Hamiltonian mechanics, but I think the theorems apply whenever you have a Markovian coarse-graining. Fermion doubling is a problem for spacetime discretization in the quantum case, so the coarse-graining might need to be different. (E.g. coarse-grain the entire Hilbert space, which might have locality issues, but those are probably not load-bearing for algorithmic thermodynamics.)

On the outside view, quantum reduces to classical (which admits Markovian coarse-graining) in the correspondence limit, so there must be some coarse-graining that works.
