Dalcy

The eleventh virtue is scholarship. Study many sciences and absorb their power as your own. Each field that you consume makes you larger.


I think something in the style of abstracting causal models would make this work - defining a high-level causal model such that there is a map from the states of the low-level causal model to it, in a way that's consistent with mapping low-level interventions to high-level interventions. Then you can retain the notion of causality to non-low-level-physical variables with that variable being a (potentially complicated) function of potentially all of the low-level variables.
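As a toy sketch of this consistency condition (the variable names and the coarse-graining map below are my own illustrative choices, not from any particular paper): pushing a low-level intervened distribution through the state map must agree with performing the corresponding high-level intervention.

```python
# Minimal sketch of causal abstraction consistency: the map tau from low-level
# states to high-level states must commute with mapping low-level interventions
# to high-level interventions. All names here are hypothetical illustrations.
import itertools

def tau(x1, x2):
    # Coarse-graining map: the high-level variable is the sum of low-level bits.
    return x1 + x2

def low_level_dist(do=None):
    """Distribution over (x1, x2); `do` pins the low-level state exactly."""
    probs = {}
    for x1, x2 in itertools.product([0, 1], repeat=2):
        p = 0.25  # uniform base distribution, for simplicity
        if do is not None:
            p = 1.0 if (x1, x2) == do else 0.0
        probs[(x1, x2)] = p
    return probs

def pushforward(probs):
    """Push a low-level distribution through tau to get the high-level one."""
    hi = {}
    for (x1, x2), p in probs.items():
        if p > 0:
            hi[tau(x1, x2)] = hi.get(tau(x1, x2), 0.0) + p
    return hi

# The low-level intervention do(X1=1, X2=0) should map to the high-level
# intervention do(Y=1): pushing forward the intervened low-level distribution
# equals the point mass the high-level intervention produces directly.
assert pushforward(low_level_dist(do=(1, 0))) == {1: 1.0}
assert pushforward(low_level_dist()) == {0: 0.25, 1: 0.5, 2: 0.25}
```

The point of the example is only the commuting-square shape: intervene-then-abstract must equal abstract-then-intervene, for every intervention in the allowed set.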

*tl;dr: the unidimensional continuity of preference assumption in the **money pumping argument** used to justify the VNM axioms corresponds to the assumption that there exists some unidimensional "resource" that the agent cares about, and this language is provided by the notion of "souring / sweetening" a lottery.*

Various coherence theorems - or more specifically, various money pumping arguments - generally have the following form:

If you violate this principle, then [you are rationally required] / [it is rationally permissible for you] to follow this trade that results in you throwing away resources. Thus, for you to avoid behaving pareto-suboptimally by throwing away resources, it is justifiable to call this principle a 'principle of rationality,' which you must follow.

... where "resources" (the usual example is money) are something that, apparently, these theorems assume exist. They *do*, but this fact is often stated in a very implicit way. Let me explain.

In the process of justifying the VNM axioms using money pumping arguments, the three main mathematical primitives are: (1) lotteries (probability distributions over outcomes), (2) a preference relation (a general binary relation), and **(3) a notion of souring/sweetening of a lottery.** Let me explain what (3) means.

- A souring of $A$ is denoted $A^-$, and a sweetening of $A$ is denoted $A^+$.
- $A^-$ is to be interpreted as "basically identical with $A$ but strictly inferior in a single dimension that the agent cares about." Based on this interpretation, we assume $A \succ A^-$. Sweetening is the opposite, defined in the obvious way.

Formally, souring could be thought of as introducing a new relation $\succ_{\text{sour}}$, where $A \succ_{\text{sour}} B$ is to be interpreted as "lottery $B$ is basically identical to lottery $A$, but strictly inferior in a single dimension that the agent cares about."

- On the syntactic level, such a $B$ is denoted as $A^-$.
- On the semantic level, based on the above interpretation, $\succ_{\text{sour}}$ is related to $\succ$ via the following: $A \succ_{\text{sour}} B \implies A \succ B$.

This is where the language to talk about resources comes from. "Something you can independently vary alongside a lottery $A$ such that more of it makes you prefer that option compared to $A$ alone" sounds like what we'd intuitively call a resource^{[1]}.

Now that we have the language, notice that so far we haven't assumed sourings or sweetenings exist. The following assumption does it:

Unidimensional Continuity of Preference: If $X \succ Y$, then there exists a prospect $Z$ such that 1) $Z$ is a souring of $X$ and 2) $Z \succ Y$.

This gives a more operational characterization of souring as something that lets us interpolate between the preference margins of two lotteries - intuitively satisfied by e.g., money, due to its infinite divisibility.
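A tiny numerical illustration of the axiom, assuming (purely for illustration - nothing in the argument requires this) that preferences come from a utility function and that money enters additively:

```python
# Hedged illustration: if the agent's preferences happen to come from a utility
# function in which "money" enters additively, then the infinite divisibility
# of money guarantees a souring of A that still beats B whenever A > B,
# which is exactly what unidimensional continuity of preference asks for.
def u(lottery_value, money=0.0):
    return lottery_value + money  # hypothetical additive utility in money

u_A, u_B = 1.0, 0.3
assert u(u_A) > u(u_B)  # the agent prefers A to B

# Sour A by charging epsilon money; any 0 < epsilon < u_A - u_B works,
# and such an epsilon always exists because money is divisible.
epsilon = (u_A - u_B) / 2
A_soured = u(u_A, money=-epsilon)
assert u(u_A) > A_soured > u(u_B)  # A > A-soured > B, as the axiom requires
```

The divisibility of the "resource" is what lets the souring land strictly between the two preference levels.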

So the above assumption is where the assumption of *resources* comes into play. I'm not aware of any money pump arguments for this assumption or, more generally, for the existence of a "resource." Plausibly it follows from instrumental convergence.

^{^}I don't actually think this, plus the assumption below, fully captures what we intuitively mean by "resources," enough to justify the terminology. I stuck with "resources" anyway because others around here used that term to (I think?) refer to what I'm describing here.


Yeah I'd like to know if there's a unified way of thinking about information theoretic quantities and causal quantities, though a quick literature search doesn't turn up anything interesting. My guess is that we'd want separate boundary metrics for informational separation and causal separation.


I no longer think the setup above is viable, for reasons that connect to why I think Critch's operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions.

*(Note: I am thinking as I'm writing, so this might be a bit rambly.)*

Intuition: Why does a robust glider in Lenia intuitively feel like a system possessing a boundary? Well, I imagine various situations that happen in the world (like bullets), and this pattern mostly stays stable in the face of them.

Now, notice that the measure of infiltration/exfiltration depends on $P$, a distribution over world histories.

So, for the above measure to capture my intuition, the approximate Markov condition (operationalized by low infil & exfil) must consider world states that contain the Lenia pattern avoiding bullets.

Remember, $W$ is the *raw* world state, no coarse graining, so $P$ is the distribution over raw world trajectories. It already captures all the "potentially occurring trajectories under which the system may take boundary-preserving-action." Since everything is observed, our distribution already encodes all of "Nature's Interventions." So in some sense Critch's definition is already causal (in a very trivial sense), by virtue of requiring a distribution over the raw world trajectory, despite mentioning no Pearlian Causality.

Maybe there is some canonical true $P$ for our physical world that minds can intersubjectively arrive at, so there's no ambiguity.

But when I imagine trying to implement this scheme on Lenia, there's immediately an ambiguity as to which distribution $P$ (representing my epistemic state as to which raw world trajectories will "actually happen") we should choose:

- Perhaps a very simple distribution: assigning uniform probability over world trajectories where the world contains nothing but the glider moving in a random direction with some initial offset.
- I suspect many stances other than the one factorizing the world into gliders would have low infil/exfil, because the world is so simple. This is the case of "accidental boundary-ness."

- Perhaps something more complicated: various trajectories where e.g., the Lenia pattern encounters bullets, evolves with various other patterns, etc.
- This I think rules out "accidental boundary-ness."

I think the latter works. But now there's a subjective choice: which distribution, and which set of possible/realistic "Nature's Interventions" - all the situations that could ever be encountered by the system under which it has boundary-like behaviors - do we want to implicitly encode into our observational distribution? I don't think it's natural to assign much probability to a trajectory whose initial conditions are set in a very precise way such that everything decays into noise. But this feels quite subjective.

I think the discussion above hints at a very crucial insight:

**$P$ must arise as a consequence of the stable mechanisms in the world.**

Suppose the world of Lenia contains various stable mechanisms like a gun that shoots bullets in random directions, scarce food sources, etc.

We want to describe distributions that the boundary system will "actually" experience in some sense. I want the "Lenia pattern dodges bullet" world trajectory to be considered, because there *is* a plausible mechanism in the world that can cause such trajectories to exist. For similar reasons, I think the empty world distributions are impoverished, and a distribution containing trajectories where the entire world decays into noise is bad because no mechanism can implement it.

**Thus, unless you have a canonical choice of $P$, a better starting point would be to consider the abstract causal model that encodes the stable mechanisms in the world, and to use Discovering Agents-style interventional algorithms that operationalize the notion "boundaries causally separate environment and viscera."**

- Well, because of everything mentioned above on how the causal model informs us about which trajectories are realistic, especially in the absence of a canonical $P$. It's also far more efficient, because knowledge of the mechanisms informs the algorithm of the precise interventions to query the world with, instead of having to implicitly bake them into $P$.

There are still a lot more questions, but I think this is a pretty clarifying answer as to how Critch's boundaries are limiting and why DA-style causal methods will be important.


I think it's plausible that the general concept of boundaries can be characterized somewhat independently of preferences, while at the same time having boundary-preservation be a quality that agents mostly satisfy (discussion here; very unsure about this). I see Critch's definition as a first iteration of an operationalization of boundaries in this general, somewhat-preference-independent sense.

But I do agree that ultimately all of this should tie back to game theory. I find Discovering Agents most promising in this regard, though there are still a lot of problems - some of which I suspect might be easier to solve if we treat systems-with-high-boundaryness as a sort of primitive for the kind of thing that we can associate agency and preferences with in the first place.

**EDIT: I no longer think this setup is viable, for reasons that connect to why I think Critch's operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions. Check the update.**

**I believe there's not much in the way of actually implementing an approximation of Critch's boundaries^{[1]} using deep learning.**

Recall, Critch's boundaries are:

- Given a world (a Markovian stochastic process) $W$, map its values (a vector) bijectively using $f$ into 'features' that can be split into four vectors, each representing a boundary-possessing system's Viscera, Active Boundary, Passive Boundary, and Environment.
- Then, we characterize boundary-ness (i.e., minimal information flow across features unmediated by the boundary) using two mutual information criteria, representing infiltration and exfiltration of information.
- And a policy $\pi$ of the boundary-possessing system (under the 'stance' of viewing the world implied by $f$) can be viewed as a stochastic map (that has no infiltration/exfiltration by definition) that best approximates the true dynamics.
- The interpretation here (under low exfiltration and infiltration) is that $\pi$ can be viewed as a policy taken by the system in order to perpetuate its boundary-ness into the future and continue being well-described as a boundary-possessing system.

All of this seems easily implementable using very basic techniques from deep learning!

- The bijective feature map $f$ is implemented using two NN maps (one each way), with an autoencoder loss.
- Mutual information is approximated with standard variational approximations. Optimize $f$ to minimize it.
- (The interpretation being: we're optimizing our 'stance' towards the world in a way that best views the world as a boundary-possessing system.)

- After you train your 'stance' using the above setup, learn the policy $\pi$ using an NN with standard SGD, with $f$ fixed.
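As a rough sketch of the objective (every architecture and dimension choice below is my own assumption, and a simple cross-prediction probe stands in for a proper variational MI estimator like MINE or CLUB):

```python
# Minimal PyTorch sketch of the proposed setup: a bijective-ish feature map f
# trained with an autoencoder loss, plus a penalty standing in for the
# infiltration mutual-information term. All sizes and modules are illustrative.
import torch
import torch.nn as nn

STATE_DIM, FEAT_DIM = 16, 16  # feature map is dimension-preserving (bijective)

encoder = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, FEAT_DIM))
decoder = nn.Sequential(nn.Linear(FEAT_DIM, 64), nn.ReLU(), nn.Linear(64, STATE_DIM))

def split_vape(feats):
    # Split features into Viscera / Active / Passive / Environment blocks.
    return feats[:, 0:4], feats[:, 4:8], feats[:, 8:12], feats[:, 12:16]

# Crude stand-in for an infiltration penalty: how well E features at time t
# predict V features at t+1 (low = boundary-like). A real implementation
# would minimize a variational bound on the conditional mutual information.
infil_probe = nn.Linear(4, 4)

w_t = torch.randn(32, STATE_DIM)     # batch of raw world states at time t
w_next = torch.randn(32, STATE_DIM)  # states at time t+1 (toy random data)

f_t, f_next = encoder(w_t), encoder(w_next)
v_next = split_vape(f_next)[0]
e_t = split_vape(f_t)[3]

recon_loss = ((decoder(f_t) - w_t) ** 2).mean()          # autoencoder loss
infil_loss = ((infil_probe(e_t) - v_next) ** 2).mean()   # MI-proxy to minimize

loss = recon_loss + infil_loss  # optimize encoder/decoder over this
```

In the real setup one would of course use an invertible architecture (or enforce bijectivity more carefully), proper variational MI bounds for both infiltration and exfiltration, and actual cellular-automaton rollouts rather than random tensors.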

A very basic experiment would look something like:

- Test the above setup on two cellular automata systems (e.g., GoL, Lenia, etc.), one containing just random ash, and the other some boundary-like structure, like the noise-resistant glider structures found via optimization (there are a lot of such examples in the Lenia literature).^{[2]}
- Then (1) check if the infiltration/exfiltration values are lower for the latter system, and (2) do some interp to see if the V/A/P/E features or the learned policy NN have any interesting structure.

I'm not sure if I'd be working on this any time soon, but posting the idea here just in case people have feedback.

^{^}I think research on boundaries - both conceptual work and developing practical algorithms for approximating them & schemes involving them - is quite important for alignment, for reasons discussed earlier in my shortform.

^{^}Ultimately we want our setup to detect boundaries that aren't just physically contiguous chunks of matter, like informational boundaries, so we want to make sure our algorithm isn't just always exploiting basic locality heuristics.

I can't think of a good toy testbed (ideas appreciated!), but one easy thing to try is to just destroy all locality by replacing the automata lattice (which we were feeding as input) with the output of a complicated fixed bijective map over it, so that our system will have to *learn* locality if it turns out to be a *useful* notion in its attempt at viewing the system as a boundary.
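The locality-destroying preprocessing could be as simple as a fixed random permutation of the flattened lattice (lattice size here is an arbitrary choice for illustration):

```python
# Sketch of the locality-destroying preprocessing suggested above: a fixed
# random permutation is a simple bijective map over the lattice, so any
# locality structure must be re-learned by the model rather than read off
# from adjacency in the input.
import numpy as np

rng = np.random.default_rng(0)
side = 8
perm = rng.permutation(side * side)  # fixed once, reused for every frame
inv_perm = np.argsort(perm)          # bijectivity: the map is invertible

def scramble(lattice):
    return lattice.reshape(-1)[perm].reshape(side, side)

def unscramble(lattice):
    return lattice.reshape(-1)[inv_perm].reshape(side, side)

frame = rng.integers(0, 2, size=(side, side))  # e.g., a toy GoL state
assert np.array_equal(unscramble(scramble(frame)), frame)  # lossless
```

Since the permutation is fixed across all frames, no information is destroyed - only the locality heuristic is.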

Damn, why did Pearl recommend (in the preface of his Causality book) that readers read all the chapters other than chapter 2 (and the last review chapter)? Chapter 2 is literally the coolest part - inferring causal structure from purely observational data! I almost skipped that chapter because of it ...


Here's my current take, I wrote it as a separate shortform because it got too long. Thanks for prompting me to think about this :)

I find the intersection of computational mechanics, boundaries/frames/factored-sets, and some works from the causal incentives group - especially discovering agents and robust agents learn causal world models (review) - to be a very interesting theoretical direction.

By boundaries, I mean a sustaining/propagating system that informationally/causally insulates its 'viscera' from the 'environment,' and only allows relatively small amounts of deliberate information flow through certain channels in both directions. Living systems (from bacteria to humans) are an example. They don't even have to be physically distinct chunks of spacetime; they can be over more abstract variables, like societal norms. Agents are an example of it.

I find them very relevant to alignment especially from the direction of detecting such boundary-possessing/agent-like structures embedded in a large AI system and backing out a sparse relationship between these subsystems, which can then be used to e.g., control the overall dynamic. Check out these posts for more.

A prototypical deliverable would be an algorithm that can detect such 'boundaries' embedded in a dynamical system: given access to some representation of the system, it performs observations & experiments and returns a summary data structure of all the 'boundaries' embedded in the system and their desires/wants, how they game-theoretically relate to one another (a sparse causal relevance graph?), the consequences of interventions performed on them, etc. - versatile enough to detect e.g., gliders embedded in Game of Life / Particle Lenia, agents playing Minecraft while only given coarse grained access to the physical state of the world, boundary-like things inside LLMs, etc. (I'm inspired by this)

Why do I find the aforementioned directions relevant to this goal?

- Critch's Boundaries operationalizes boundaries/viscera/environment as functions of the underlying variables that execute policies which continuously prevent information 'flow'^{[1]} between disallowed channels, quantified via conditional transfer entropy.
- Relatedly, Fernando Rosas's paper on Causal Blankets operationalizes boundaries using a similar but subtly different^{[2]} form of mutual information constraint on the boundary/viscera/environment variables than Critch's. Importantly, they show that such blankets always exist between two coupled stochastic processes (using a similar style of future-morph equivalence relation characterization from compmech), along with a metric they call the "synergistic coefficient" that quantifies how boundary-like this thing is.^{[3]}
- More on compmech: epsilon transducers generalize epsilon machines to input-output processes. PALO (Perception Action Loops) and Boundaries as two epsilon transducers coupled together?
- These directions are interesting, but I find them still unsatisfactory, because all of them are purely behavioral accounts of boundaries/agency. One of the hallmarks of agentic behavior (or some boundary behaviors) is adapting one's policy if an intervention changes the environment in a way that the system can observe and adapt to.^{[4]}^{[5]}
- (Is there an interventionist extension of compmech?)
- Discovering Agents provides a genuine causal, interventionist account of agency and an algorithm to detect agents, motivated by the intentional stance. I think the paper is very enlightening from a conceptual perspective, but there are many problems yet to be solved before we can actually implement this. Here's my take on it.
- More fundamentally, (this is more vibes, I'm really out of my depth here) I feel there is something intrinsically limiting with the use of Bayes Nets, especially with the fact that choosing which variables to use in your Bayes Net already encodes a lot of information about the specific factorization structure of the world. I heard good things about finite factored sets and I'm eager to learn more about them.

^{^}Not exactly a 'flow', because transfer entropy conflates intrinsic information flow with synergistic information - a 'flow' connotes only the intrinsic component, while transfer entropy just measures the overall amount of information that a system couldn't have obtained on its own. But anyways, transfer entropy seems like a conceptually correct metric to use.
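For concreteness, transfer entropy from $X$ to $Y$ can be written as the conditional mutual information $I(Y_{t+1}; X_t \mid Y_t)$; here is a plug-in estimate of it on a toy process of my own construction where $Y$ copies $X$ with a one-step lag (so the transfer should be about 1 bit):

```python
# Plug-in estimate of transfer entropy TE(X -> Y) = I(Y_{t+1}; X_t | Y_t)
# in bits, on a toy binary process where Y_{t+1} deterministically copies X_t.
# Variable names and the estimator are illustrative, not from the papers.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=20001)
y = np.concatenate(([0], x[:-1]))  # Y_t = X_{t-1}: lag-one copy of X

# Aligned triples (Y_{t+1}, X_t, Y_t) for t = 1 .. T-1
yn, xt, yt = y[2:], x[1:-1], y[1:-1]
n = len(yn)
p3 = Counter(zip(yn, xt, yt))
p_xy = Counter(zip(xt, yt))
p_yy = Counter(zip(yn, yt))
p_y = Counter(yt)

te = 0.0
for (a, b, c), cnt in p3.items():
    p_abc = cnt / n
    te += p_abc * np.log2(
        p_abc * (p_y[c] / n) / ((p_xy[(b, c)] / n) * (p_yy[(a, c)] / n))
    )
# te is close to 1 bit: Y_{t+1} is a deterministic copy of the iid fair bit X_t
```

For a genuine *conditional* transfer entropy one would additionally condition on the other boundary variables, which just adds more coordinates to the counted tuples.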

^{^}Specifically, Fernando's paper criticizes blankets of the following form ($V$ for viscera, $A$ and $P$ for active/passive boundaries, $E$ for environment):

- DIP implies $V_t \perp E_t \mid A_t, P_t$
- This clearly forbids dependencies formed in the past that stay in 'memory'.

but Critch instead defines boundaries as satisfying the following two criteria:

- (infiltration) DIP implies $V_{t+1} \perp E_t \mid V_t, A_t, P_t$
- (exfiltration) DIP implies $E_{t+1} \perp V_t \mid E_t, A_t, P_t$
- and now that the independencies are entangled across different $t$, there is no longer a clear upper bound on the synergistic coefficient, so I don't think the criticisms apply directly.

^{^}My immediate curiosities are on how these two formalisms relate to one another. e.g., Which independency requirements are more conceptually 'correct'? Can we extend the future-morph construction to construct Boundaries for Critch's formalism? etc etc

^{^}For example, a rock is very goal-directed relative to 'blocking-a-pipe-that-happens-to-exactly-match-its-size,' until one performs an intervention on the pipe size to discover that it can't adapt at all.

^{^}Also, interventions are really cheap to run on digital systems (e.g., LLMs, cellular automata, simulated environments)! Limiting oneself to behavioral accounts of agency would miss out on a rich source of cheap information.

(the causal incentives paper convinced me to read it, thank you! good book so far)

Can you explain this part a bit more?

My understanding of situations in which 'reward is not the optimization target' is that they are the situations where the assumptions of the policy improvement theorem don't hold. In particular, the theorem (that iterating the policy improvement step must yield strictly better policies, converging at the optimal, reward-maximizing policy) assumes that at each step we update the policy π by greedy one-step lookahead (argmaxing the action via qπ(s,a)).

And this basically doesn't hold in real life, because realistic RL agents aren't *forced* to explore all states. (The classic example: I *can* explore the state of doing cocaine, and I'm sure my policy would drastically change in a way that my reward circuit considers an improvement, but I don't *have* to do that.) So my opinion that the circumstances under which 'reward is the optimization target' are very narrow remains unchanged, and I'm interested in why you believe otherwise.
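To make the argmax point concrete, here is a toy 2-state illustration of my own construction (not from the original discussion): the improvement step's guarantee requires the argmax to range over *all* actions, and restricting the action set breaks convergence to the reward-maximizing policy.

```python
# Toy illustration of the policy improvement step on a hypothetical 2-state
# MDP: the theorem needs argmax over *every* action via q_pi(s, a); an agent
# that never evaluates some action can't be switched to it by the step.
import numpy as np

# q_pi[s, a]: action-values under the current policy pi (assumed given here).
q_pi = np.array([[1.0, 5.0],    # state 0: action 1 ("cocaine") has high q
                 [2.0, 0.5]])   # state 1: action 0 is better

# Full greedy improvement: argmax over every action, as the theorem assumes.
improved = q_pi.argmax(axis=1)  # picks action 1 in state 0, action 0 in state 1

# Restricted agent: action 1 in state 0 is simply never tried/evaluated,
# so the improvement step can only consider action 0 there.
allowed = [[0], [0, 1]]
restricted = [max(acts, key=lambda a: q_pi[s, a]) for s, acts in enumerate(allowed)]

assert improved.tolist() == [1, 0]
assert restricted == [0, 0]  # never switches to the high-q action in state 0
```

Under the restriction, iterating the step converges to a policy that is optimal only within the explored set, so the "reward-maximizing" conclusion of the theorem no longer applies.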