Writing Causal Models Like We Write Programs

johnswentworth

Clunc

We’ll start with a made-up programming language called Clunc. The distinguishing feature of Clunc is that it combines classes and functions into a single type, called a clunc. It looks like this:

quad = Clunc {
    x = 4
    constant = 3
    linear = 2*x + constant
    result = x*x + linear
}

We could then go to a terminal:

>>> quad.result
27
>>> quad.linear
11

In order to use this clunc like a function, we apply the do() operator. For instance,

>>> quad3 = do(quad, x=2)

… creates a new clunc which is just like quad, except that x is 2 rather than 4:

>>> quad3
Clunc {
    x = 2
    constant = 3
    linear = 2*x + constant
    result = x*x + linear
}

When we query fields of quad3, they reflect the new x-value:

>>> quad3.result
11
>>> quad3.linear
7

There’s no designated “input” or “output”; we can use the do() operator to override any values we please. For instance:

>>> quad_zero_linear = do(quad, linear=0)
>>> quad_zero_linear
Clunc {
    x = 4
    constant = 3
    linear = 0
    result = x*x + linear
}
>>> quad_zero_linear.result
16

A few quick notes:

- Clunc is purely clunctional: everything is immutable, and each variable can only be written once within a clunc. No in-place updates.
- Clunc is lazy.
- Variables can be set randomly, e.g. “x = random.normal(0, 1)”.
- The do() operator creates a new clunc instance with the changes applied. If there are any random variables, they are re-sampled within the new clunc. If we want multiple independent samples of a randomized clunc M, then we can call do(M) (without any changes applied) multiple times.
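To make these notes concrete, here is a minimal Python sketch of how a clunc and do() might be emulated. Clunc itself is made up, so everything here is an assumption for illustration: the `Clunc` class, the convention that each field is a function of the clunc instance, and the per-instance cache that stands in for laziness.

```python
import random


class Clunc:
    """HYPOTHETICAL emulation: a bag of named fields, each a function of the clunc."""

    def __init__(self, **fields):
        self._fields = dict(fields)  # name -> function(clunc) -> value
        self._cache = {}             # each field is computed lazily, at most once

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. for model fields.
        if name.startswith("_") or name not in self._fields:
            raise AttributeError(name)
        if name not in self._cache:
            self._cache[name] = self._fields[name](self)
        return self._cache[name]


def do(model, **overrides):
    """Return a new clunc in which the named fields are replaced by constants."""
    fields = dict(model._fields)
    for name, value in overrides.items():
        fields[name] = lambda m, value=value: value
    return Clunc(**fields)  # fresh cache, so random fields are re-sampled


quad = Clunc(
    x=lambda m: 4,
    constant=lambda m: 3,
    linear=lambda m: 2 * m.x + m.constant,
    result=lambda m: m.x * m.x + m.linear,
)

print(quad.result)                # 27
print(do(quad, x=2).result)       # 11
print(do(quad, linear=0).result)  # 16

# A randomized clunc: calling do() with no changes gives an independent sample.
noisy = Clunc(x=lambda m: random.gauss(0, 1), double=lambda m: 2 * m.x)
samples = [do(noisy).double for _ in range(3)]
```

Each do() call builds a new object with an empty cache, which is why random fields get re-sampled and why the original quad is never modified.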

To make this whole thing Turing complete, we need one more piece: recursion. Cluncs can “call” other cluncs, including themselves:

factorial = Clunc {
    n = 4
    base_result = 1
    recurse_result = do(factorial, n=n-1).result
    result = (n == 0) ? base_result : n * recurse_result
}

… and that’s where things get interesting.
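For what it’s worth, the recursion can be emulated in Python too. This is a sketch under assumptions, not a Clunc implementation: `Clunc` and `do` below are hypothetical helpers in which every field is a function that is evaluated only when it is read. That on-demand evaluation plays the role of Clunc’s laziness, and it is what stops the recursion: at n == 0, result never reads recurse_result.

```python
class Clunc:
    """HYPOTHETICAL emulation: fields are functions, evaluated only when read."""

    def __init__(self, **fields):
        self._fields = fields

    def __getattr__(self, name):
        if name.startswith("_") or name not in self._fields:
            raise AttributeError(name)
        return self._fields[name](self)  # on demand (no caching, for brevity)


def do(model, **overrides):
    """Return a new clunc with the named fields replaced by constants."""
    fields = dict(model._fields)
    fields.update({k: (lambda m, v=v: v) for k, v in overrides.items()})
    return Clunc(**fields)


factorial = Clunc(
    n=lambda m: 4,
    base_result=lambda m: 1,
    # The clunc "calls" a modified copy of itself.
    recurse_result=lambda m: do(factorial, n=m.n - 1).result,
    result=lambda m: m.base_result if m.n == 0 else m.n * m.recurse_result,
)

print(factorial.result)           # 24
print(do(factorial, n=6).result)  # 720
```

In an eager language, recurse_result would be computed even in the base case and the definition would never terminate.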

Causal Models

Hopefully the mapping from clunc to probabilistic causal models is obvious: any clunc with random variables in it is a typical Pearl-style causal DAG, and the do() operator works exactly like it does for causal models. The “clunc” is really a model, given by structural equations. The one big change is the possibility of recursion: causal models “calling” other models or other instances of themselves.

To get some practice with this idea, let’s build a reasonably-involved analogue model of a ripple-carry adder circuit.

We’ll start at the level of a NAND gate (levels below that involve equilibrium models, which would require a bunch of tangential explanation). We’ll assume that we have some model $M_{NAND}$, and we use $do(M_{NAND}, a = \ldots, b = \ldots).result$ to get the (noisy) output voltage of the NAND gate in terms of the input voltages $a$ and $b$. Since we’re building an analogue model, we’ll be using actual voltages (including noise), not just their binarized values.

We’ll take $M_{NAND}$ as given (i.e. assume somebody else built that model). Building everything out of NAND gates directly is annoying, so we’ll make an XOR as well:

M_XOR = Model {
    a = 0.0
    b = 0.0
    intermediate = do(M_NAND, a=a, b=b).result
    left = do(M_NAND, a=a, b=intermediate).result
    right = do(M_NAND, a=intermediate, b=b).result
    result = do(M_NAND, a=left, b=right).result
}

This looks like a program which performs an XOR using NAND gates. But really, it’s a Pearl-style causal DAG model which uses a lot of NAND-submodels. We can write out the joint probability distribution $P[a = a^*, b = b^*, intermediate = i^*, left = l^*, right = r^*, result = result^* \mid M_{XOR}]$ via the usual method, with each line in the model generating a term in the expansion:

$$
\begin{aligned}
P[a = a^* \mid M_{XOR}] &= I[a^* = 0] \\
P[b = b^* \mid M_{XOR}] &= I[b^* = 0] \\
P[intermediate = i^* \mid M_{XOR}, a = a^*, b = b^*] &= P[result = i^* \mid do(M_{NAND}, a = a^*, b = b^*)] \\
P[left = l^* \mid M_{XOR}, a = a^*, intermediate = i^*] &= P[result = l^* \mid do(M_{NAND}, a = a^*, b = i^*)] \\
P[right = r^* \mid M_{XOR}, intermediate = i^*, b = b^*] &= P[result = r^* \mid do(M_{NAND}, a = i^*, b = b^*)] \\
P[result = result^* \mid M_{XOR}, left = l^*, right = r^*] &= P[result = result^* \mid do(M_{NAND}, a = l^*, b = r^*)]
\end{aligned}
$$

The full distribution is the product of those terms.
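The factorization doubles as a sampling recipe: draw each variable in order, given its parents. Here is a rough Python sketch of that. The `nand` function is a hypothetical stand-in for $M_{NAND}$ (it thresholds its inputs at 0.5 volts and adds Gaussian noise to the ideal output), since we took that model as given; `sample_xor` and its keyword-override convention are likewise made up for illustration.

```python
import random


def nand(a, b, rng, noise=0.02):
    """HYPOTHETICAL stand-in for M_NAND: ideal NAND of thresholded inputs, plus noise."""
    ideal = 0.0 if (a > 0.5 and b > 0.5) else 1.0
    return ideal + rng.gauss(0.0, noise)


def sample_xor(rng, **overrides):
    """Draw one joint sample from M_XOR; keyword arguments play the role of do()."""
    v = {}

    def field(name, compute):
        # An overridden field ignores its structural equation entirely.
        v[name] = overrides[name] if name in overrides else compute()

    # One line per line of the model, i.e. one per term of the factorization.
    field("a", lambda: 0.0)
    field("b", lambda: 0.0)
    field("intermediate", lambda: nand(v["a"], v["b"], rng))
    field("left", lambda: nand(v["a"], v["intermediate"], rng))
    field("right", lambda: nand(v["intermediate"], v["b"], rng))
    field("result", lambda: nand(v["left"], v["right"], rng))
    return v


rng = random.Random(0)
for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        out = sample_xor(rng, a=a, b=b)["result"]
        print(a, b, round(out, 3))  # near 1 when exactly one input is high, else near 0

# Intervening on an internal node: force `intermediate` high while a = b = 1.
forced = sample_xor(rng, a=1.0, b=1.0, intermediate=1.0)
```

With the internal node forced high, the output lands near 1 even though both inputs are high, which is the kind of question a plain XOR function cannot answer but the causal model can.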

That’s just the first step. Next, we need a full adder, a circuit block which computes the sum and carry bits for one “step” of binary long addition. It looks like this:

M_full_adder = Model {
    a = 0
    b = 0
    c = 0
    s_ab = do(M_XOR, a=a, b=b).result
    s = do(M_XOR, a=s_ab, b=c).result
    carry_ab = do(M_NAND, a=a, b=b).result
    carry_c = do(M_NAND, a=s_ab, b=c).result
    carry = do(M_NAND, a=carry_ab, b=carry_c).result
}

As before, we can write out the components of the joint distribution line-by-line. I’ll just do a few this time:

$$
\begin{aligned}
P[a = a^* \mid M_{full\_adder}] &= I[a^* = 0] \\
&\;\;\vdots \\
P[s_{ab} = s_{ab}^* \mid M_{full\_adder}, a = a^*, b = b^*] &= P[result = s_{ab}^* \mid M_{XOR}, a = a^*, b = b^*] \\
P[s = s^* \mid M_{full\_adder}, s_{ab} = s_{ab}^*, c = c^*] &= P[result = s^* \mid M_{XOR}, a = s_{ab}^*, b = c^*] \\
&\;\;\vdots
\end{aligned}
$$

Notice that some of these involve probabilities on the model $M_{XOR}$, which we could further expand using the joint distribution of $M_{XOR}$ variables from earlier.

Finally, we can hook up a bunch of full adders to make our 32-bit ripple-carry adder:

M_rc = Model {
    a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    b = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    f_0 = do(M_full_adder, a=a[0], b=b[0])
    f_i = do(M_full_adder, a=a[i], b=b[i], c=f_{i-1}.carry)  ∀ i ∈ 1, …, 31
    result = [f_0, …, f_31]
}

The components of the joint distribution for this one are left as an exercise for the reader.
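To see the whole stack run end to end, here is a rough Python sketch that samples from the adder. It rests on assumptions that are not part of the model above: `nand` is a hypothetical noisy gate (inputs thresholded at 0.5 volts, Gaussian noise on the ideal output), bits are stored least-significant first, the function returns the sum voltages rather than the list of full-adder submodels, and the `carry_override` argument is a made-up way to express an intervention on one adder’s carry.

```python
import random


def nand(a, b, rng, noise=0.02):
    """HYPOTHETICAL stand-in for M_NAND: ideal NAND of thresholded inputs, plus noise."""
    ideal = 0.0 if (a > 0.5 and b > 0.5) else 1.0
    return ideal + rng.gauss(0.0, noise)


def xor(a, b, rng):
    intermediate = nand(a, b, rng)
    return nand(nand(a, intermediate, rng), nand(intermediate, b, rng), rng)


def full_adder(a, b, c, rng):
    s_ab = xor(a, b, rng)
    s = xor(s_ab, c, rng)
    carry = nand(nand(a, b, rng), nand(s_ab, c, rng), rng)
    return s, carry


def ripple_carry(a_bits, b_bits, rng, carry_override=None):
    """Sample the sum voltages; carry_override maps adder index -> forced carry voltage."""
    carry_override = carry_override or {}
    carry, sums = 0.0, []
    for i, (a, b) in enumerate(zip(a_bits, b_bits)):
        s, carry = full_adder(a, b, carry, rng)
        carry = carry_override.get(i, carry)  # intervention on f_i.carry, if requested
        sums.append(s)
    return sums


def to_volts(n, width=32):
    return [float((n >> i) & 1) for i in range(width)]


def from_volts(volts):
    return sum(1 << i for i, v in enumerate(volts) if v > 0.5)


rng = random.Random(0)
total = ripple_carry(to_volts(1234567), to_volts(7654321), rng)
print(from_volts(total))  # 8888888

# Force the carry out of the lowest adder high while adding 0 + 0.
stuck = ripple_carry(to_volts(0), to_volts(0), rng, carry_override={0: 1.0})
print(from_volts(stuck))  # 2
```

The second call is the interesting one: it asks what the circuit would output if one internal wire were forced to a value, which is exactly what do() on a submodel field expresses.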

Why Is This Useful?

Classes/functions let us re-use code; we don’t have to repeat ourselves. Likewise, clunc-ish causal models let us re-use submodels; we don’t have to repeat ourselves.

Obviously this has many of the same advantages as in programming. We can modularize our models, and fiddle with the internals of one submodel independently of other submodels. We can “subclass” our models via the do()-operator, to account for different contexts. Different people can work on different submodels independently - we could even imagine libraries of submodels. An electrical engineer could write a probabilistic causal model representing the low-level behavior of a chip; others could then import that model and use it as a reference when designing things which need to work with the chip, like packaging, accessories, etc.

From a more theoretical perspective, when we write programs with unbounded runtime, we have to have some way to re-use code: there’s only so many lines in the program, so the program must visit some of the lines multiple times in the course of execution. Some lines must be re-used. Likewise for probabilistic models: if we want to define large models - including unbounded models - with small/finite definitions, then we need some way to re-use submodels. We could do that by writing things like “$\forall i: \langle submodel_i \rangle$”, but if we want Turing completeness anyway, we might as well go for recursion.

From a pure modelling perspective, the real world contains lots of repeating structures. If we’re modelling things like cars or trees, we can re-use a lot of the information about one car when modelling another car. We think of cars as variations on a template, and that’s exactly what the do() operator provides: we give it some “template” model, and apply modifications to it. The corresponding inverse problem then says: given a world full of things which are variations on some templates, find the templates and match them to the things - i.e. learn to recognize and model “cars” and “trees”. Clunc-ish causal models are a natural fit for this sort of problem; they naturally represent things like “corvette with a flat tire”.

Finally, the main reason I’ve been thinking about this is to handle abstraction. Clunc-ish models make layers of abstraction natural; lower-level behaviors can be encapsulated in submodels, just as we saw above with the ripple-carry adder. If we want to write abstraction-learning algorithms - algorithms which take in raw data and spit out multi-level models with layers of abstraction - then clunc-ish models are a natural form for their output. This is what multi-level world models look like.

Belatedly seeing this post, but I wanted to note that probabilistic programming languages (PPLs) are centered around this basic idea! Some useful links and introductions to PPLs as a whole:
- Probabilistic models of cognition (web book)
- WebPPL
- An introduction to models in Pyro
- Introduction to Modeling in Gen

And here's a really fascinating paper by some of my colleagues that tries to model causal interventions that go beyond Pearl's do-operator, by formalizing causal interventions as (probabilistic) program transformations:

Bayesian causal inference via probabilistic program synthesis
Sam Witty, Alexander Lew, David Jensen, Vikash Mansinghka
https://arxiv.org/abs/1910.14124
Causal inference can be formalized as Bayesian inference that combines a prior distribution over causal models and likelihoods that account for both observations and interventions. We show that it is possible to implement this approach using a sufficiently expressive probabilistic programming language. Priors are represented using probabilistic programs that generate source code in a domain specific language. Interventions are represented using probabilistic programs that edit this source code to modify the original generative process. This approach makes it straightforward to incorporate data from atomic interventions, as well as shift interventions, variance-scaling interventions, and other interventions that modify causal structure. This approach also enables the use of general-purpose inference machinery for probabilistic programs to infer probable causal structures and parameters from data. This abstract describes a prototype of this approach in the Gen probabilistic programming language.

What useful problems do PPLs solve? Ideally some applications that are interesting for us non-corporate people. Can it be used for medical statistics (e.g., in nutrition)? (Any examples?) Is the reason it is not used the illiteracy of the scientists, or are the mainstream methods better?

PPLs are a tool to bring complicated statistical modeling to the masses. Computers are capable of much more advanced statistical modeling than what appears in nearly any non-statistics paper, but most people don't have the expertise to build such models. PPLs allow you to write complicated statistical models and then evaluate them with state-of-the-art methods without having to build everything from scratch.

How is this different from just a regular imperative programming language with imperative assignment?

Causal models are just programs (with random inputs, and certain other restrictions if you want to be able to represent them as DAGs). The do() operator is just imperative assignment.

It's mostly the same as a regular imperative programming language - indeed, that's largely the point of the post. The do() operator isn't quite just imperative assignment, though; it has the wrong type-signature for that. It's more like an operator which creates a subclass on-the-fly, by overriding the getters for a few fields.
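A rough Python sketch of that "subclass on-the-fly" reading, for illustration only. The `Quad` class and this `do` helper are hypothetical; they mirror the quad example from the post, with fields written as property getters.

```python
class Quad:
    """HYPOTHETICAL rendering of the quad clunc: every field is a getter."""

    @property
    def x(self):
        return 4

    @property
    def constant(self):
        return 3

    @property
    def linear(self):
        return 2 * self.x + self.constant

    @property
    def result(self):
        return self.x * self.x + self.linear


def do(model_cls, **overrides):
    """Build a subclass whose getters for the named fields return fixed values."""
    getters = {
        name: property(lambda self, v=value: v)
        for name, value in overrides.items()
    }
    return type(f"do_{model_cls.__name__}", (model_cls,), getters)


print(Quad().result)                # 27
print(do(Quad, x=2)().result)       # 11
print(do(Quad, linear=0)().result)  # 16
```

The type signature is the point: this do takes a model and returns a new model, leaving the original untouched, whereas imperative assignment would take a model and mutate it in place.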

John is correct that do() is not imperative assignment. It's a different effect called "lazy dynamic scope."

do() is described fully in our paper on formal semantics for a language with counterfactuals, http://www.jameskoppel.com/files/papers/causal_neurips2019.pdf . The connection with dynamic scope is covered in the appendix, which is not yet online.

Have you used SystemVerilog or some other hardware description language? Your clunc model of the ripple adder looks suspiciously like Verilog code I wrote to make a ripple adder in a class. I can't recall enough deets to tell how different they are, but you might gain some insights from investigating.

Good point, HDLs do solve a very similar problem in similar ways. There are probably useful analogies to mine there. Also I just realized that it's been almost ten years since I last used Verilog.

Summary
I’ll write the summary in reverse compared to the article:
At some point, we want systems that are capable of “multi-level” world modeling. They should take a complicated world, and spit out multi-level models with layers of abstraction for pieces in that world. Some of these models should also be “variations” of “template” models, and so the outputs should facilitate such an operation. Finally, the output should re-use lower-level models to account for the hierarchical structure of the world.
It turns out that causal models are the correct output structure for this: “variations” of “templates” can be built with the do()-operator. And the local functions in a causal model can also “under the hood” call lower-level causal models themselves.
Finally, the article presents a way of representing causal models that better captures the hierarchical and mechanistic nature of the situation. Causal models are thereby replaced by “cluncs” in a “clunctional” programming language that combines classes with functions. Inside cluncs, other cluncs can be called, and variations can be run with a do()-operator. Random variables in cluncs are allowed and provide new random samples whenever the clunc is run.

My Opinion:
This is a cool article! What I especially like about the viewpoint that causal models should be “written” like programs is that it emphasizes that objects in the world do things, instead of just “being”: when encountering causal Bayesian networks for the first time, people often notice that there doesn’t seem to be any difference from purely probabilistic Bayesian networks, except for the word “causal” attached to them. The clunc representation re-emphasizes that causal models model actual functions in the real world, and that these functions are built in a hierarchical way.

Funnily enough, cluncs and the do operator are easily implemented in Nix (merging attribute sets is quite important in Nix). The only problem is that Nix doesn't support even the most basic numeric operations (like sqrt or pow), so I gave up on implementing your causal examples.

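let
  # do: take the fixed point of the model, with `overrides` winning over its own fields.
  # Merging after applying the model means overridden fields also show up in the result.
  do = model: overrides: let self = (model self) // overrides; in self;

  # A clunc is a function from the final attribute set (self) to its fields.
  quad = self: {
    x = 4;
    constant = 3;
    linear = 2 * self.x + self.constant;
    result = self.x * self.x + self.linear;
  };

  quadDefault = do quad { };
  quad2 = do quad { x = 2; };
  quadNoLinear = do quad { linear = 0; };

in {
  default = quadDefault.result;    # 27
  overridden = quad2.result;       # 11
  noLinear = quadNoLinear.result;  # 16
}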

The letters I and l look the same. Maybe use 1 instead of upper case i?
