Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Suppose we have an electronic circuit with two pieces: combinational logic and memory. The combinational logic is designed to behave as memoryless input/output logic gates - it has only short-term state, and computes some function. The memory reads the output of the combinational logic once per clock cycle, stores it, and feeds it back into the combinational logic as input during the next clock cycle. The causal diagram looks something like this:
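This setup is easy to simulate directly. Here's a minimal toy sketch (the function names and the particular update rule are invented for illustration, not taken from any real circuit):

```python
def combinational_logic(x: int) -> int:
    """Memoryless: a pure function of its input, with no internal state."""
    return (3 * x + 1) % 16  # arbitrary example function


def run(initial_memory: int, cycles: int) -> int:
    """Each clock cycle: the logic settles on f(memory), then the memory
    latches that output and feeds it back as next cycle's input."""
    memory = initial_memory
    for _ in range(cycles):
        memory = combinational_logic(memory)
    return memory
```

The only state that survives from one cycle to the next is whatever the memory latched; the logic itself carries nothing forward.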

Key point: even though there’s a path from the combinational logic state at the start of one clock cycle to its state at the start of the next cycle, that path has approximately-zero causal influence. We could intervene on the logic state right after the cycle starts:

… and the state at the start of the next cycle would not change at all.
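To see this concretely in code, here's a toy sketch (names and dynamics invented for illustration): we intervene on the logic's transient value early in the cycle, but because the logic is memoryless, it re-settles to a pure function of the register contents before the clock edge, so the next cycle's state is untouched.

```python
def next_state(memory: int, intervene=None) -> int:
    """One clock cycle of a toy circuit. Optionally intervene on the
    combinational logic's transient value just after the cycle starts."""
    logic = (3 * memory + 1) % 16  # logic's value early in the cycle
    if intervene is not None:
        logic = intervene          # counterfactual intervention
    # The logic has no memory of its own, so before the clock edge it has
    # re-settled to a pure function of the register contents:
    logic = (3 * memory + 1) % 16
    return logic                   # the memory latches this at the edge


# Intervening on the transient logic state leaves the next state unchanged:
assert next_state(5) == next_state(5, intervene=9)
```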

In other words: in this model, the memory state mediates the influence of the logic state during one cycle on logic states during future cycles. This is not a property of the graph structure alone - it happens because the combinational logic does not have any long-term memory of its own.

Key problem for causal abstraction: how can we algorithmically detect situations like this?

One obvious answer is to probe the circuit experimentally. Counterfactual queries are particularly handy for this: we fix the random noise, then check which nodes would be changed by a counterfactual intervention on the logic state at some time. Try this for a bunch of different samples, and look for long-range mediation in the results.
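Here's a sketch of such a probe on a toy version of the clocked circuit (everything here - the update rule, the intervention point, the names - is invented for illustration): we fix the noise, intervene on the logic's transient value in one cycle, and record which later memory states differ from the unintervened run.

```python
import random


def simulate(T, noise, intervene_t=None, intervene_val=None):
    """Toy clocked circuit: memory feeds combinational logic, whose settled
    output is latched at each clock edge. An intervention sets the logic's
    *transient* value early in cycle `intervene_t`; being memoryless, the
    logic re-settles from memory before the latch."""
    memory = 0
    memories = []
    for t in range(T):
        logic = (2 * memory + noise[t]) % 5  # transient value
        if t == intervene_t:
            logic = intervene_val            # counterfactual intervention
        logic = (2 * memory + noise[t]) % 5  # logic re-settles (no memory)
        memory = logic                       # latch at clock edge
        memories.append(memory)
    return memories


def counterfactual_probe(n_samples=100, T=8):
    """Fix the noise, intervene, and record which future states change."""
    changed = set()
    for _ in range(n_samples):
        noise = [random.randrange(5) for _ in range(T)]
        base = simulate(T, noise)
        cf = simulate(T, noise, intervene_t=3, intervene_val=4)
        changed |= {t for t in range(T) if base[t] != cf[t]}
    return changed  # empty set: the transient intervention never propagates
```

Across all samples, no downstream state ever moves - experimental evidence (though not proof) that the memory mediates everything.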

One problem with this approach: we often want an abstraction which works on all possible “input values” of some variables. There may be exponentially many such values, and we’d have to check all of them to verify that the abstraction works.

An example: consider a ripple-carry adder (i.e. the electronic circuit version of grade-school long addition). Each stage adds two input bits and a carry bit from the previous stage, then passes its own carry bit to the next stage. We want to abstract from the low-level electronics to some higher-level model.
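As a concrete low-level model, here's a sketch of a ripple-carry adder in software, built stage-by-stage from full adders (the packing of bits into Python ints is just a convenience, not part of the hardware):

```python
def ripple_carry_add(a: int, b: int, n: int = 64) -> int:
    """n-bit ripple-carry adder built from per-stage full adders. Each
    stage consumes one bit of a, one bit of b, and the carry from the
    previous stage, and emits a sum bit plus a carry-out."""
    carry, result = 0, 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s = ai ^ bi ^ carry                      # full-adder sum bit
        carry = (ai & bi) | (carry & (ai ^ bi))  # full-adder carry-out
        result |= s << i
    return result  # sum modulo 2^n, matching the hardware's wraparound
```

The carry wire is the only channel by which a low-order stage can influence a high-order one - which is exactly what makes the next example interesting.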

If we’re using a 64-bit adder, then there are 2^(2*64) possible inputs. For almost all of them (all but roughly 2^64), flipping one low-order bit will not change the high-order bit. But if we want our adder to work on all possible inputs, then we need to account for the rare cases - roughly one in 2^64 - where flipping a low-order bit does change the high-order bit. We won’t catch those just by sampling.
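Both halves of that claim are easy to see in simulation (a sketch; the helper names are made up): random sampling essentially never finds an input pair where flipping the low bit of one operand flips the top bit of the sum, yet such pairs exist - any input that sets up a full carry chain from bit 0 to bit 63.

```python
import random


def bit(x: int, i: int) -> int:
    return (x >> i) & 1


def flip_changes_top(a: int, b: int, n: int = 64) -> bool:
    """Does flipping the low bit of b change bit n-1 of the n-bit sum?"""
    mask = (1 << n) - 1
    return bit((a + b) & mask, n - 1) != bit((a + (b ^ 1)) & mask, n - 1)


# Random sampling essentially never finds such a pair...
random.seed(0)
hits = sum(flip_changes_top(random.getrandbits(64), random.getrandbits(64))
           for _ in range(10_000))
# hits is almost certainly 0: the chance per sample is about 2^-63

# ...but such inputs exist: bits 0-62 of a all set, so adding b's low bit
# ripples a carry all the way up into bit 63.
assert flip_changes_top(2**63 - 1, 0)
```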

In full generality, the problem is presumably NP-complete - we’re asking whether any possible inputs to an arbitrary circuit result in some variable changing in response to a counterfactual intervention on another variable. Yet we’re able to use abstract models an awful lot in practice, so… what tricks are we using?
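In the worst case, nothing beats checking every input. A brute-force counterfactual-dependence checker (a sketch, feasible only for tiny circuits) makes the exponential blow-up explicit - for an arbitrary circuit, deciding whether any such input exists is essentially a SAT query:

```python
def depends_counterfactually(f, n_bits: int, in_bit: int, out_bit: int) -> bool:
    """Does any input exist on which flipping `in_bit` of the input changes
    `out_bit` of f's output? Brute force: 2^n_bits evaluations of f."""
    for x in range(1 << n_bits):
        y0 = f(x)
        y1 = f(x ^ (1 << in_bit))
        if ((y0 ^ y1) >> out_bit) & 1:
            return True
    return False


# Tiny 4-bit adder packed into one 8-bit input: low nibble a, high nibble b.
add4 = lambda x: ((x & 0xF) + (x >> 4)) & 0xF

assert depends_counterfactually(add4, 8, 0, 3)      # carry can reach bit 3
assert not depends_counterfactually(add4, 8, 3, 0)  # but never flows downward
```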
