Dalcy

Tomorrow can be brighter than today

Although the night is cold

The stars may seem so very far awayBut courage, hope and reason burn

In every mind, each lesson learned

Shining light to guide our wayMake tomorrow brighter than today

*Epistemic status: literal shower thoughts, perhaps obvious in retrospect, but was a small insight to me.*

I’ve been thinking about: “what proof strategies could prove *structural* selection theorems, and not just behavioral selection theorems?”

Typical examples of selection theorems in my mind are: coherence theorems, good regulator theorem, causal good regulator theorem.

- Coherence theorem: Given an agent satisfying some axioms, we can observe their behavior in various conditions and construct , and then the agent’s behavior is equivalent to a system that is maximizing .
- Says nothing about whether the agent internally constructs and uses them.

- (Little Less Silly version of the) Good regulator theorem: A regulator that minimizes the entropy of a system variable (where there is an environment variable upstream of both and ) without unnecessary noise (hence deterministic) is behaviorally equivalent to a deterministic function of (despite being a function of ).
- Says nothing about whether actually internally reconstructs and uses it to produce its output.

- Causal good regulator theorem (summary): Given an agent achieving low regret across various environment perturbations, we can observe their behavior in specific perturbed-environments, and construct that is very similar to the true environment . Then argue: “hence the agent must have something internally isomorphic to ”. Which is true, but …
- says nothing about whether the agent actually uses those internal isomorphic-to- structures in the causal history of computing its output.

And I got stuck here wondering, man, how do I ever prove anything structural.

Then I considered some theorems that, if you squint *really really* hard, could also be framed in the selection theorem language in a very broad sense:

- SLT: Systems selected to get low loss are likely to be in a degenerate part of the loss landscape.
^{[1]}- Says something about structure: by assuming the system to be a parameterized statistical model, it says the parameters satisfy certain conditions like degeneracy (which further implies e.g., modularity).

This made me realize that **to prove selection theorems on structural properties of agents, you should obviously give more mathematical structure to the “agent” in the first place:**

- SLT represents a system as a parameterized function - very rich!
- In coherence theorem, the agent is just a single node that outputs decision given lotteries. In the good regulator theorem and the causal good regulator theorem, the agent is literally just a single node in a Bayes Net - very impoverished!

And recall, we actually have an agent foundations style selection theorem that *does* prove something structural about agent internals by giving more mathematical structure to the agent:

- Gooder regulator theorem: A regulator is now two nodes instead of one, but the latter-in-time node gets an additional information about the choice of “game” it is being played against (thus the former node acts as a sort of information bottleneck). Then, given that the regulator makes take minimum entropy, the first node must be isomorphic to the likelihood function .
- This
*does*say something about structure, namely that an agent (satisfying certain conditions) with an internal information bottleneck (structural assumption) must have that bottleneck be behaviorally equivalent to a likelihood function, whose output is then connected to the second node. Thus it is valid to claim that (under our structural assumption) the agent internally reconstructs the likelihood values and uses it in its computation of the output.

- This

So in short, we need more initial structure or even assumptions on our “agent,” at least more so than literally a single node in a Bayes Net, to expect to be able to prove something structural.

Here is my 5-minute attempt to put more such "structure" to the [agent/decision node] in the Causal good regulator theorem with the hopes that this would make the theorem more structural, and perhaps end up as a formalization of the Agent-like Structure Problem (for World Models, at least), or very similarly the Approximate Causal Mirror hypothesis:

- Similar setup to the Causal good regulator theorem, but instead of a single node representing an agent’s decision node, assume that the agent as a whole is represented by an unknown causal graph , with a number of nodes designated as input and output, connected to the rest-of-the-world causal graph . Then claim: Agents with low regret must have that admits an abstracting causal model map (summary) from , and (maybe more structural properties such as) the approximation error should roughly be lowest around the input/output & utility nodes, and increase as you move further away from it in the low-level graph. This would be a very structural claim!

^{^}I'm being

*very very*[imprecise/almost misleading] here—because I'm just trying to make a high-level point and the details don't matter too much—one of the caveats (among many) being that this statement makes the theoretically yet unjustified connection between SGD and Bayes.

102

"I always remember, [Hamming] would come into my office and try to solve a problem [...] I had a very big blackboard, and he’d start on one side, write down some integral, say, ‘I ain’t afraid of nothin’, and start working on it. So, now, when I start a big problem, I say, ‘I ain’t afraid of nothin’, and dive into it."

10

The question is whether this expression is easy to compute or not, and fortunately the answer is that it's quite easy! We can evaluate the first term by the simple Monte Carlo method of drawing many independent samples and evaluating the empirical average, as we know the distribution explicitly and it was presumably chosen to be easy to draw samples from.

My question when reading this was: why can't we say the same thing about ? i.e. draw many independent samples and evaluate the empirical average? Usually is also assumed known and simple to sample from (e.g., gaussian).

So far, my answer is:

- , so assuming is my data, usually will be high when is high, so the samples during MCMC will be big enough to contribute to the sum, unlike blindly sampling from where most samples will contribute nearly to the sum.
- Also another reason being how the expectation can be reduced to the sum of expectations over each of the dimensions of and if and factorizes nicely.

Tl;dr, Systems are abstractable to the extent they admit an abstracting causal model map with low approximation error. This should yield a pareto frontier of high-level causal models consisting of different tradeoffs between complexity and approximation error. Then try to prove a selection theorem for abstractability / modularity by relating the form of this curve and a proposed selection criteria.

Recall, an abstracting causal model (ACM)—exact transformations, -abstractions, and approximations—is a map between two structural causal models satisfying certain requirements that lets us reasonably say one is an abstraction, or a high-level causal model of another.

- Broadly speaking, the condition is a sort of causal consistency requirement. It's a commuting diagram that requires the "high-level" interventions to be consistent with various "low-level" ways of implementing that intervention. Approximation errors talk about how well the diagram commutes (given that the support of the variables in the high-level causal model is equipped with some metric)

Now consider a curve: x-axis is the node count, and y-axis is the minimum approximation error of ACMs of the original system with that node count (subject to some conditions^{[1]}). It would hopefully an decreasing one^{[2]}.

- This curve would represent the abstractability of a system. Lower the curve, the more abstractable it is.
- Aside: we may conjecture that natural systems will have discrete jumps, corresponding to natural modules. The intuition being that, eg if we have a physics model of two groups of humans interacting, in some sense 2 nodes (each node representing the human-group) and 4 nodes (each node representing the individual-human) are the most natural, and 3 nodes aren't (perhaps the 2 node system with a degenerate node doing ~nothing, so it would have very similar approximation scores with the 2 node case).

Then, try hard to prove a selection theorem of the following form: given low-level causal model satisfying certain criteria (eg low regret over varying objectives, connection costs), the abstractability curve gets pushed further downwards. Or conversely, find conditions that make this true.

I don't know how to prove this^{[3]}, but at least this gets closer to a well-defined mathematical problem.

^{^}I've been thinking about this for an hour now and finding the right definition here seems a bit non-trivial. Obviously there's going to be an ACM of zero approximation error for any node count, just have a single node that is the joint of all the low-level nodes. Then the support would be massive, so a constraint on it may be appropriate.

Or instead we could fold it in to the x-axis—if there is perhaps a non ad-hoc, natural complexity measure for Bayes Nets that capture [high node counts => high complexity because each nodes represent stable causal mechanisms of the system, aka modules] and [high support size => high complexity because we don't want modules that are "contrived" in some sense] as special cases, then we could use this as the x-axis instead of just node count.

Immediate answer: Restrict this whole setup into a prediction setting so that we can do model selection. Require on top of causal consistency that both the low-level and high-level causal model have a single node whose predictive distribution are similar. Now we can talk about eg the RLCT of a Bayes Net. I don't know if this makes sense. Need to think more.

^{^}Or rather, find the appropriate setup to make this a decreasing curve.

^{^}I suspect closely studying the robust agents learn causal world models paper would be fruitful, since they also prove a selection theorem over causal models. Their strategy is to (1) develop an algorithm that queries an agent with low regret to construct a causal model, (2) prove that this yields an approximately correct causal model of the data generating model, (3) then arguing that this implies the agent must internally represent something isomorphic to a causal world model.

I don't know if this is just me, but it took me an embarrassingly long time in my mathematical education to realize that the following three terminologies, which introductory textbooks used interchangeably without being explicit, mean the same thing. (Maybe this is just because English is my second language?)

X => YmeansX is sufficient for YmeansX only if Y

X <= YmeansX is necessary for YmeansX if Y

10

I'd also love to have access!

10

(the causal incentives paper convinced me to read it, thank you! good book so far)

if you read Sutton & Barto, it might be clearer to you how narrow are the circumstances under which 'reward is not the optimization target', and why they are not applicable to most AI things right now or in the foreseeable future

Can you explain this part a bit more?

My understanding of situations in which 'reward is not the optimization target' is when the assumptions of the policy improvement theorem don't hold. In particular, the theorem (that iterating policy improvement step must yield strictly better policies and it converges at the optimal, reward maximizing policy) assumes that each step we're updating the policy by greedy one-step lookahead (by argmaxing the action via ).

And this basically doesn't hold irl because realistic RL agents aren't *forced* to explore all states (the classic example of "I *can* explore the state of doing cocaine, and I'm sure my policy will drastically change in a way that my reward circuit considers an improvement, but I don't *have* to do that). So my opinion that the circumstances under which 'reward is the optimization target' is very narrow remains unchanged, and I'm interested in why you believe otherwise.

https://www.lesswrong.com/posts/KcvJXhKqx4itFNWty/k-complexity-is-silly-use-cross-entropy-instead

However: