Writing Causal Models Like We Write Programs

by johnswentworth3 min read5th May 20205 comments

57

Ω 18

AbstractionWorld Modeling TechniquesWorld Modeling
Frontpage
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Clunc

We’ll start with a made-up programming language called Clunc. The distinguishing feature of clunc is that it combines classes and functions into a single type, called a clunc. It looks like this:

quad = Clunc {
x = 4
constant = 3
linear = 2*x + constant
result = x*x + linear
}

We could then go to a terminal:

>>> quad.result
27
>>> quad.linear
11

In order to use this clunc like a function, we apply the do() operator. For instance,

>>> quad3 = do(quad, x=2)

… creates a new clunc which is just like quad, except that x is 2 rather than 4:

>>> quad3
Clunc {
x = 2
constant = 3
linear = 2*x + constant
result = x*x + linear
}

When we query fields of quad3, they reflect the new x-value:

>>> quad3.result
11
>>> quad3.linear
7

There’s no designated “input” or “output”; we can use the do() operator to override any values we please. For instance

>>> quad_zero_linear = do(quad, linear=0)
>>> quad_zero_linear
Clunc {
x = 4
constant = 3
linear = 0
result = x*x + linear
}
>>> quad_zero_linear.result
16

A few quick notes:

  • Clunc is purely clunctional: everything is immutable, and each variable can only be written once within a clunc. No in-place updates.
  • Clunc is lazy.
  • Variables can be set randomly, e.g. “x = random.normal(0, 1)”.
  • The do() operator creates a new clunc instance with the changes applied. If there are any random variables, they are re-sampled within the new clunc. If we want multiple independent samples of a randomized clunc M, then we can call do(M) (without any changes applied) multiple times.

To make this whole thing Turing complete, we need one more piece: recursion. Cluncs can “call” other cluncs, including themselves:

factorial = Clunc {
n = 4
base_result = 1
recurse_result = do(factorial, n=n-1).result
result = (n == 0) ? base_result : n * recurse_result
}

… and that’s where things get interesting.

Causal Models

Hopefully the mapping from clunc to probabilistic causal models is obvious: any clunc with random variables in it is a typical Pearl-style causal DAG, and the operator works exactly like it does for causal models. The “clunc” is really a model, given by structural equations. The one big change is the possibility of recursion: causal models “calling” other models or other instances of themselves.

To get some practice with this idea, let’s build a reasonably-involved analogue model of a ripple-carry adder circuit.

We’ll start at the level of a NAND gate (levels below that involve equilibrium models, which would require a bunch of tangential explanation). We’ll assume that we have some model , and we use to get the (noisy) output voltage of the NAND gate in terms of the input voltages and . Since we’re building an analogue model, we’ll be using actual voltages (including noise), not just their binarized values.

We’ll take as given (i.e. assume somebody else built that model). Building everything out of NAND gates directly is annoying, so we’ll make an XOR as well:

This looks like a program which performs an XOR using NAND gates. But really, it’s a Pearl-style causal DAG model which uses a lot of NAND-submodels. We can write out the joint probability distribution via the usual method, with each line in the model generating a term in the expansion: The full distribution is the product of those terms.

That’s just the first step. Next, we need a full adder, a circuit block which computes the sum and carry bits for one “step” of binary long addition. It looks like this:

As before, we can write out the components of the joint distribution line-by-line. I’ll just do a few this time:

Notice that some of these involve probabilities on the model , which we could further expand using the joint distribution of variables from earlier.

Finally, we can hook up a bunch of full adders to make our 32-bit ripple-carry adder:

The components of the joint distribution for this one are left as an exercise for the reader.

Why Is This Useful?

Classes/functions let us re-use code; we don’t have to repeat ourselves. Likewise, clunc-ish causal models let us re-use submodels; we don’t have to repeat ourselves.

Obviously this has many of the same advantages as in programming. We can modularize our models, and fiddle with the internals of one submodel independently of other submodels. We can “subclass” our models via the -operator, to account for different contexts. Different people can work on different submodels independently - we could even imagine libraries of submodels. An electrical engineer could write a probabilistic causal model representing the low-level behavior of a chip; others could then import that model and use it as a reference when designing things which need to work with the chip, like packaging, accessories, etc.

From a more theoretical perspective, when we write programs with unbounded runtime, we have to have some way to re-use code: there’s only so many lines in the program, so the program must visit some of the lines multiple times in the course of execution. Some lines must be re-used. Likewise for probabilistic models: if we want to define large models - including unbounded models - with small/finite definitions, then we need some way to re-use submodels. We could do that by writing things like “”, but if we want Turing completeness anyway, we might as well go for recursion.

From a pure modelling perspective, the real world contains lots of repeating structures. If we’re modelling things like cars or trees, we can re-use a lot of the information about one car when modelling another car. We think of cars as variations on a template, and that’s exactly what the do() operator provides: we give it some “template” model, and apply modifications to it. The corresponding inverse problem then says: given a world full of things which are variations on some templates, find the templates and match them to the things - i.e. learn to recognize and model “cars” and “trees”. Clunc-ish causal models are a natural fit for this sort of problem; they naturally represent things like “corvette with a flat tire”.

Finally, the main reason I’ve been thinking about this is to handle abstraction. Clunc-ish models make layers of abstraction natural; lower-level behaviors can be encapsulated in submodels, just as we saw above with the ripple-carry adder. If we want to write abstraction-learning algorithms - algorithms which take in raw data and spit out multi-level models with layers of abstraction - then clunc-ish models are a natural form for their output. This is what multi-level world models look like.

57

Ω 18