Research agenda: Formalizing abstractions of computations

4Mateusz Bagiński

4Charlie Steiner

3Erik Jenner

3Charlie Steiner

2scottviteri

1drocta

2Erik Jenner

1drocta

1Nicholas / Heather Kross

1Alexander Gietelink Oldenziel

New Comment

Reading this, it dawned on me that we tend to talk about human-understandability of NNs as if it was magic.^{[1]} I mean, how much is there to human-understandability--in the sense of sufficiently intelligent (but not extremely rare genius-level) humans being able to work with a phenomenon and model it accurately on a sufficient level of granularity--beyond complexity of that phenomenon and computational reducibility of its dynamics?

Sure, we find some ways of representing information and thinking more intuitive and easier to grasp than others but we can learn new ones that would seem pretty alien to a naive hunter-gatherer mind, like linear algebra or category theory.

Or perhaps it's just me and this is background knowledge for everybody else, whatever. ↩︎

You might be interested in "context agents," which was a shot of mine at a formalization of abstraction in world-modeling (which becomes approximate abstraction of computations if the world you're modeling has computations).

One problem I see with it now is that it's rigid - each abstract action gets just one lower-level representation. This is useful for computation, but maybe the incorrect formalism to think in. On the other hand, there's an opposite problem which is maybe it's too disconnected - different abstract actions in the same higher-level ontology may point you to totally different lower-level ontologies. This basically means that if you want to learn how to do abstract reasoning in this way, there's no point learning individual ontologies (there's too many that get used too infrequently), instead you have to learn a generator of ontologies.

Overall, I'm skeptical of the existence of magic bullets when it comes to abstraction - by which I mean that I expect most problems to have multiple solutions and for those solutions to generalize only a little, not that I expect problems to have zero solutions.

Sure, commuting diagrams / non-leaky abstractions have nice properties and are unique points in the space of abstractions, but they don't count as solutions to most problems of interest. Calling commuting diagrams "abstraction" and everything else "approximate abstraction" is I think the wrong move - abstractions are almost as a rule leaky, and all the problems that that causes, all the complicated conceptual terrain that it implies, should be central content of the study of abstraction. An AI safety result that only helps if your diagrams commute, IMO, has only a 20% chance of being useful and is probably missing 90% of the work to get there.

Thanks for your thoughts!

Overall, I'm skeptical of the existence of magic bullets when it comes to abstraction - by which I mean that I expect most problems to have multiple solutions and for those solutions to generalize only a little, not that I expect problems to have zero solutions.

Not sure I follow. Do you mean that you expect there to not be a single nice *framework* for abstractions? Or that in most situations, there won't be one clearly best abstraction?

(FWIW, I'm quite agnostic but hopeful on the existence of a good general framework. And I think in many cases, there are going to be lots of very reasonable abstractions, and which one is "best" depends a lot on what you want to use it for.)

Sure, commuting diagrams / non-leaky abstractions have nice properties and are unique points in the space of abstractions, but they don't count as solutions to most problems of interest. Calling commuting diagrams "abstraction" and everything else "approximate abstraction" is I think the wrong move - abstractions are almost as a rule leaky, and all the problems that that causes, all the complicated conceptual terrain that it implies, should be central content of the study of abstraction. An AI safety result that only helps if your diagrams commute, IMO, has only a 20% chance of being useful and is probably missing 90% of the work to get there.

I absolutely agree abstractions are almost always leaky; I expect ~every abstraction we actually end up using in practice when e.g. analyzing neural networks to not make things commute perfectly. I think I disagree on two points though:

- While I expect the leaky setting to add new difficulties compared to the exact one, I'm not sure how big they'll be. I can imagine scenarios where those problems are the key issue (e.g. errors tend to accumulate and blow up in hard to deal with ways). But there are also lots of potential issues that already occur in an exact setting (e.g. efficiently finding consistent abstractions, how to apply this framework to problems).
- Even if most of the difficulty is in the leaky setting, I think it's reasonable to start by studying the exact case,
*as long as insights are going to transfer*. I.e. if the 10% of problems that occur already in the exact case also occur in a similar way in the leaky one, working on those for a bit isn't a waste of time.

That being said, I do agree investigating the leaky case early on is important (if only to check whether that requires a complete overhaul of the framework, and what kinds of issues it introduces). I've already started working on that a bit, and for now I'm optimistic things will mostly transfer, but I suppose I'll find out soon-ish.

Not sure I follow. Do you mean that you expect there to not be a single nice

frameworkfor abstractions? Or that in most situations, there won't be one clearly best abstraction?

Yeah, I was vague. I think once a framework for abstractions becomes specific enough that it starts recommending specific abstractions to you, the two things you mention become connected - there will be different acceptable specific-frameworks because there are different acceptable abstractions.

I agree, once you have some specification of what you want to use the abstraction for (rather than just "be a good abstraction,") that goes a long way to pinning down how to think about the world. But what we want is itself an abstraction that has multiple acceptable answers :P

Anyhow, best of luck :D

Great job with this post! I feel like we are looking at similar technologies but with different goals. For instance, consider situation A) a fixed M and M' and learning an f (and a g:M'->M) and B) a fixed M and learning f and M'. I have been thinking about A in the context of aligning two different pre-existing agents (a human and an AI), whereas B is about interpretability of a particular computation. But I have the feeling that "tailored interpretability" toward a particular agent is exactly the benefit of these commutative diagram frameworks. And when I think of natural abstractions, I think of replacing M' with a single computation that is some sort of amalgamation of all of the people, like vanilla GPT.

A thought on the "but what if multiple steps in the actual-algorithm correspond to a single step in an abstracted form of the algorithm?" thing :

This reminds me a bit of, in the topic of "Abstract Rewriting Systems", the thing that the vs distinction handles. (the asterisk just indicating taking the transitive reflexive closure)

Suppose we have two abstract rewriting systems and .

(To make it match more closely what you are describing, we can suppose that every node has at most one outgoing arrow, to make it fit with how you have timesteps as functions, rather than being non-deterministic. This probably makes them less interesting as ARSs, but that's not a problem)

Then, we could say that is a homomorphism (I guess) if,

for all such that has an outgoing arrow, there exists such that and .

These should compose appropriately [err, see bit at the end for caveat], and form the morphisms of a category (where the objects are ARSs).

I would think that this should handle any straightforward simulation of one Turing machine be another.

As for whether it can handle complicated disguise-y ones, uh, idk? Well, if it like, actually simulates the other Turing machine, in a way where states of the simulated machine have corresponding states in the simulating machine, which are traversed in order, then I guess yeah. If the universal Turing machine instead does something pathological like, "search for another Turing machine along with a proof that the two have the same output behavior, and then simulate that other one", then I wouldn't think it would be captured by this, but also, I don't think it should be?

This setup should also handle the circuits example fine, and, as a bonus(?), can even handle like, different evaluation orders of the circuit nodes, if you allow multiple outgoing arrows.

And, this setup should, I think, handle anything that the "one step corresponds to one step" version handles?

It seems to me like this set-up, should be able to apply to basically anything of the form "this program implements this rough algorithm (and possibly does other stuff at the same time)"? Though to handle probabilities and such I guess it would have to be amended.

I feel like I'm being overconfident/presumptuous about this, so like,

sorry if I'm going off on something that's clearly not the right type of thing for what you're looking for?

__________

Checking that it composes:

suppose A, B, C,

then for any which has an outgoing arrow and where f(a) has an outgoing arrow, where

so either b' = f(a) or there is some sequence where , and so therefore,

Ah, hm, maybe we need an additional requirement that the maps be surjective?

or, hm, as we have the assumption that has an outward arrow,

then we get that there is an s.t. and s.t.

ok, so, I guess the extra assumption that we need isn't quite "the maps are surjective", so much as, "the maps f are s.t. if f(a) is a non-terminal state, then f(a) is a non-terminal state", where "non-terminal state" means one with an outgoing arrow. This seems like a reasonable condition to assume.

for all such that has an outgoing arrow, there exists such that and

Should it be at the end instead? Otherwise not sure what b is.

I think this could be a reasonable definition but haven't thought about it deeply. One potentially bad thing is that would have to be able to also map any of the intermediate steps between a an a' to . I could imagine you can't do that for some computations and abstractions (of course you could always rewrite the computation and abstraction to make it work, but ideally we'd have a definition that just works).

What I've been imagining instead is that the abstraction can specify a function that determines which are the "high-level steps", i.e. when should be applied. I think that's very flexible and should support everything.

But also, in practice the more important question may just be how to optimize over this choice of high-level steps efficiently, even just in the simple setting of circuits.

Whoops, yes, that should have said , thanks for the catch! I'll edit to make that fix.

Also, yes, what things between and should be sent to, is a difficulty..

A thought I had which, on inspection doesn't work, is that (things between and ) could be sent to , but that doesn't work, because might be terminal, but (thing between and ) isn't terminal. It seems like the only thing that would always work would be for them to be sent to something that has an arrow (in B) to (such as f(a), as you say, but, again as you mention, it might not be viable to determine f(a) from the intermediary state).

I suppose if were a partial function, and one such that all states not in its domain have a path to a state which is in its domain, then that could resolve that?

I think equivalently to that, if you modified the abstraction to get, defined as, and

so that B' has a state for each state of B, along with a second copy of it with no in-going arrows and a single arrow going to the normal version,

uh, I think that would also handle it ok? But this would involve modifying the abstraction, which isn't nice. At least the abstraction embeds in the modified version of it though.

Though, if this *is* equivalent to allowing f to be partial, with the condition that anything not in its domain have arrows leading to things that are in its domain, then I guess it might help to justify a definition allowing to be partial, provided it satisfies that condition.

Are these actually equivalent?

Suppose is partial and satisfies that condition.

Then, define to agree with f on the domain of f, and for other , pick an in the domain of f such that , and define

In a deterministic case, should pick to be the first state in the path starting with to be in the domain of . (In the non-deterministic case, this feels like something that doesn't work very well...)

Then, for any non-terminal , either it is in the domain of f, in which case we have the existence of such that a and where , or it isn't, in which case we have that there exists such that a and where , and so f' satisfies the required condition.

On the other hand, if we have an satisfying the required condition, we can co-restrict it to B, giving a partial function , where, if isn't in the domain of f, then, assuming a is non-terminal, we get some s.t. a and where , and, as the only things in B' which have any arrows going in are elements of B, we get that f'(a'') is in B, and therefore that a'' is in the domain of f.

But what if a is terminal? Well, if we require that non-terminal implies non-terminal, then this won't be an issue because pre(b) is always non-terminal, and so anything terminal is in the domain.

Therefore the co-restriction f of f', does yield a partial function satisfying the proposed condition to require for partial maps.

So, this gives a pair of maps, one sending functions satisfying appropriate conditions to partial function satisfying other conditions, and one sending the other way around, and where, if the computations are deterministic, these two are inverses of each-other. (if not assuming deterministic, then I guess potentially only a one-sided inverse?)

So, these two things are equivalent.

uh... this seems a bit like an adjunction, but, not quite. hm.

I independently thought this agenda would be very useful, potentially in combination with Wentworth-style work on the nature of abstractions. I'm glad to see this has been written up and may be worked on!

I liked this post. It correctly imho emphasizes theoretical & mathematically substantive approaches, a 'big tent' approach and ties it in with the kind of conceptual questions we'd like to attack in AI alignment.

You might like to look at Crutchfield's epsilon machines, see e.g. https://warwick.ac.uk/fac/cross_fac/complexity/study/msc_and_phd/co907/resources/dpf.warwick.slides.pdf

Especially with regard to a good definition of abstraction I suggest taking a look at this paper: https://arxiv.org/pdf/cond-mat/0303625.pdf

Big thanks to Leon Lang, Jérémy Scheurer, Adam Gleave, and Shoshannah Tekofsky for their feedback on a draft of this post, to Euan McLean (via FAR AI) for his feedback and a lot of help with editing, and to everyone else who discussed this agenda with me, in particular Johannes Treutlein for frequent research check-ins!## Summary

abstractions of computations. These abstractions are ways topartiallydescribe the “algorithm” a neural network or other computation is using, while throwing away irrelevant details.automating interpretabilityormechanistic anomaly detection. There are also potential connections to other alignment topics, such asnatural abstractionsordefining terms like “search process”.I encourage you to skip around and/or only read parts of the post. Here’s an overview:

IntroductionandQ&Amostly talk about motivation and connections to alignment..What are abstractions of computations?discusses my current guess as to what the framework should look like.Some potential projects.Appendix: Related workgives a quick overview of relevant work in academia, and the relation between this agenda and other alignment research.This post doesn’t contain any actual theorems or experiments, so if you’re only interested in that, you can stop reading.

## Introduction

Humans can’t just look at the weights of a neural network and tell what it’s doing. There are at least two reasons for this:

The second point would likely apply to any system that does well on complicated tasks. For example,

neural networks are decision trees, but this doesn’t mean we can look at the decision tree corresponding to a network and understand how it works. To reason about these systems, we will likely have tosimplifythem, i.e. throw away details that are irrelevant for whatever we want to find out. In other words, we are looking forabstractionsof computations (such as neural networks).Abstractions are already how we successfully reason about many other complicated systems. For example, if you want to understand the Linux kernel, you wouldn’t start by reading the entire source code top to bottom. Instead, you’d try to get a high-level understanding—what are the different modules, how do they interact? Similarly, we use pseudocode to more easily communicate how an algorithm works, abstracting away low-level details.

If we could figure out a general way to find useful abstractions of computations, or at least of neural networks, perhaps we could apply this to understand them in a similar way. We could even automate this process and mechanically search for human-understandable abstractions.

Making things easier to understand for humans isn’t the only application of abstractions. For example, abstractions have been used for more efficient

theorem provingand in model checking (e.g.abstract interpretation). These abstractions need not ever be used by humans. While I don’t think we will be able to formally verify the properties we are most interested in for neural networks, the underlying ideas might still apply. Another example ismechanistic anomaly detection, where we only care about whether a computation “uses the same mechanism” on a new input as on earlier ones. Again, this doesn’t require that weunderstandthose mechanisms. Because of this, I’m looking for ageneraltheory of abstractions and what makes them useful, not just a theory ofhuman-understandableabstractions.## What are abstractions of computations?

## Examples

I’m thinking about ways we can throw away some information about a computational process, while keeping other parts. There are (roughly speaking) two kinds of information we can throw away:

input-output behavior. For example, perhaps we only care about the 5 most significant digits of the output.internal implementation.For example, the description of a program in pseudocode might exactly specify the input-output behavior without specifying low-level implementation details (e.g. which sorting algorithm to use when the pseudocode says “sort this list”).In practice we are often interested in abstractions that remove both types of information. (In fact, you

haveto throw away some mechanistic information to throw away information about behavior—given the exact implementation, the input-output behavior can be recovered.)Let’s look at some examples:

machine code is an abstraction of logic gates.higher-level languages are abstractions of lower-level languages/machine code.Type signatures: we can forget the behavior of parts of a computation and only remember the input and output types.Game of Lifewithin Game of Life, then the simulated layer is a good abstraction of the low-level layer. More generally,using one computational model to simulate another one gives good abstractions.looking at a subcircuit of a network that explains a specific behavior(such as theindirect object identification circuit in GPT2-small). This throws away all information about parts of the network outside that subcircuit, and consequently about behavior of the network on most inputs. Additionally, essentially all mechanistic interpretations throw away some information about low-level details (such as the exact weights), which means they onlyapproximatelyexplain the network output.Modularity/clusterabilityis a different way of abstracting networks (or other computations). For example, imagine a network that can play chess via learned AlphaZero-style MCTS. You could say “at a very high level, this network runs MCTS over moves—this submodule is the rollout policy, and this one the value function”. This throws away most of the information about both the algorithm of the network and its input-output behavior. But it does retainsomeinteresting information.## Formal framework

I’m uncertain about what the right formal framework looks like. In this subsection and the next, I’ll describe my current guess, and then mention some issues it has in

Limitations of this framework.## Computations

To formalize abstractions of computations, let’s first formalize what we mean by “computations”. I’m currently using a pretty simple and general approach: computations have a set M of

memory statesand a function f:M→M that performs a singlecomputational step. We also need a function M→{True,False} that tells us whether the computation has finished. Finally, we have a set I of possible input values and a set O of possible output values, with functions I→M and M→O that encode inputs as memory states and decode back to outputs respectively. Any computation induces a function I→O by executing f until the computation terminates.^{[1]}This induced function describes the input-output behavior.For example, in a

deterministic finite automaton, the memory state is the current state of the automaton plus the remaining input. In a Turing machine, the memory state is the tape and the state register.## Abstractions

We can abstract a computation by throwing away information about its memory state. This means an abstraction is a function A:M→M′ for some set of

abstract statesM′. Since the abstract state is a deterministic function of the full state, it can contain at most the same information as the full state, and typically less (if A is non-injective).An arbitrary function A:M→M′ might not have anything to do with the “structure” of the computation, since it ignores the computational step f. To make A an abstraction

of the computationdefined by f, we need to be able to “mirror” the computational step within the abstraction.We can show all of this in a commutative diagram as follows:

f in the top row is the full computational step, and A is the abstraction mapping on memory states. For a good abstraction, there should exist a dashed map f′ that makes the diagram commute. Note that we can assume without loss of generality that A is surjective, in which case f′ is uniquely determined by f and A.

Commutativity of this diagram means we can either execute a step in the full computation and then abstract the state, or first abstract the state and then execute the computational step in the abstraction. You can think of this as a

consistencycondition.This type of commutative diagram is used to describe abstractions in contexts other than computations as well, see

Appendix: Related workfor several examples, as well as myearlierposts.Appendix: Examples of abstractions of computationsdescribes several examples of abstractions within this framework.In practice, most abstractions will not achieve exact commutativity: they will be

leaky. So a full version of this framework should formalizeapproximateconsistency. This likely means we want a distribution over inputs or memory states, and that the abstract computational step f′ will compute aprobability distributionover next states instead of just a single one.## What makes some abstractions better than others?

The consistency condition from the previous section is not sufficient to get “good abstractions”. There are two trivial ways to get commutativity with clearly useless abstractions:

anyinformation, by using a singleton set for M′.We can rule out these failure modes with two additional desiderata:

minimalabstraction, under some measure of size or complexity.I currently think these two desiderata, along with the consistency condition, might be enough. Admittedly, a lot of the difficulty in formalizing good abstractions is hidden in the vagueness of these goals.

It’s also possible we need additional criteria to actually get “white box” abstractions that really reflect the computational structure. But I’m hoping we can get that “for free” with the right formalization of the desiderata above.

## Limitations of this framework

I’m aware of three issues with the current framework:

Approximate consistency:As mentioned, the consistency condition ought to be approximate, and making that modification will introduce some additional changes, such as making the abstract computational step stochastic.Temporal abstraction:The commutative diagram above requires aone-to-onemapping between computational steps in the full and abstracted computation. But in many cases, several steps in the full computation correspond to just a single step in the abstract one. Sometimes, we can work around this in hacky ways, but even that may fail (see theTuring machinessection in the appendix for a potential example). A better framework would incorporate not just abstractions of states, but alsotemporal abstractions, i.e. combining multiple steps into one.Encoding:There’s no mention of the “encoding” or “representation” of a computation. Forcing everything into the format of memory states and computational steps already discards some aspects. Put a different way: I think this is a good framework for thinking about problem 2. from the introduction (neural networks are too large to understand), but not as good for thinking about problem 1. (weights are an unintuitive representation for humans).I’m optimistic that limitation 1. and 2. can be resolved by moderate extensions of the framework, leaving the key ideas intact. This is much less clear for 3., but it’s also not obvious that this should be “fixed” within the framework—maybe it’s better to think of 3. as something to be addressed completely separately.

## Q&A

## Isn’t this just mechanistic interpretability?

Yeah, kind of. I think of it as a more theoretical approach that complements empirical interpretability research and has a similar impact story for alignment. If you want to improve circuit-style analysis, empirical work seems more promising. But my guess is that there are other useful ways of abstracting neural networks that we currently don’t make enough use of. The clearest example is looking for submodules, and there could also be less obvious ones that only a more general study of abstractions of computations will reveal.

Compared to typical interpretability research, I’d say this agenda …

That being said, mechanistic interpretability, especially some flavors such as

causal scrubbing, are very closely related to this agenda. See theRelated worksection for more discussion.## Isn’t this just NN modularity/clusterability?

It’s meant to be more general than just modularity, though I’m currently unsure how much.

Modularity is a prototypical example of the kinds of abstractions I have in mind. Abstractions of computations often work by dividing big computations into submodules. Functions in programming languages are a clear example. I am definitely interested in applying this type of abstraction to neural networks too. However, I am hoping there might be other useful abstractions, so for now I want to stay on the lookout for those.

## Does this have anything to do with

natural abstractions?Probably, but I’m unsure about the details. I discuss this a bit more in the

related work section. In brief: a key similarity is the assumption that things “abstract well” and that some abstractions are “better” or “more natural” than others.^{[2]}But in terms of the mathematical formalism, the framework I described above seems closer to certain other work than to John’s definition of natural abstractions.## What about

mechanistic anomaly detection?I haven’t thought about this much yet, but it seems to me that abstractions of computations could be a candidate for defining what it means for “mechanisms” to “explain” an output. Abstractions of programs could also be closely related to abstractions of proofs because of the

Curry-Howard isomorphism. And abstractions of proofs might be a type ofheuristic argument(though probably not quite the type ARC is studying). So that’s another reason to think there’s a connection here. See a bit more discussion in therelated work section.To me, mechanistic interpretability,

ELK, and mechanistic anomaly detection seem to overlap a lot (this isn’t a new take of course, see e.g.Paul on why indistinguishable mechanisms would also be bad news for other agendas). I basically think of abstractions of computations as adding yet another perspective on similar problems. I don’t have clear reasons to think it’s abetterperspective, it just seems about as reasonable and less explored. But it wouldn’t shock me if I just ended up converging to e.g. a mechanistic anomaly detection view.## Why abstractions of computations, and not just abstractions of neural networks?

A few responses:

Generalizing can sometimes make problems easier:ifthere is a simple general framework for abstractions of computations, then this may be easier to find than the special case for neural networks. The main reason is that if the framework has to work for more cases, that gives you more examples you can test it on.^{[3]}It could also turn out that abstracting computations in general is too messy to be useful. If this turns out to be the case, I’ll specialize more.Iamputting extra weight on NN-like computations:while I often talk about “abstractions of computations” and draw on examples from different computational models, I’m especially interested in fixed computational graphs/circuits. These are still more general than feed-forward NNs, but not by nearly as much.It’s unclear how much generality we need:should we focus on transformers? Feed-forward neural networks? Include networks with some recurrent structure? Even that might not be general enough if the AI as a whole has some scaffolding that connects multiple neural networks (or multiple instances of the same network). AlphaZero is an example. It’s far from obvious to me that focusing on, say, monolithic feedforward networks is sufficient.## So what’s the plan in more detail?

I’m currently thinking about two broad questions:

I outlined my current guess for question 1

aboveand alsobriefly discussedquestion 2. But there’s definitely lots of work to be done on both.If I had answers to these questions, the ideal next step would be to develop efficient algorithms that can find good abstractions of a given network or other computation. There is a spectrum of different outcomes, depending on how hard this turns out to be:

The abstractions we find this way could either be human-understandable ones, which would mean automating interpretability, or they might be useful for an approach like mechanistic anomaly detection (as discussed

earlier in the Q&A).This is the most straightforward path to impact, but there are some speculative connections to other problems in alignment:

of computations.^{[4]}So maybe thinking about abstractions of computations could help nail down these elusive concepts. John Wentworth made roughly that pointon the AXRP podcast(and maybe elsewhere), though he talks about “mathematical abstractions” and those are probably not quite the same thing. Personally, I think if your main goal is to define these concepts, you should tackle them directly, rather than first trying to find a theory of abstractions and then using that. But it’s a nice potential side payoff, and if it does work, it might give us definitions formanyconcepts in one swoop.## This sounds like a difficult philosophical problem that’s really hard to make progress on.

I think that’s one of the biggest potential weaknesses of this agenda, but I don’t think it’s a decisive one. A few counterpoints (not meant to completely refute this criticism):

Appendix: Related work). Most of the relevant work is just in a similar cluster, or looking at some special case or application. Maybe that’s because this problem is obviously really hard or mis-guided, but maybe there just hasn’t been that much effort trying to solve this problem in particular. In that case we don’t know how hard it is. One sidenote: I wouldn’t be surprised if I missed a lot of existing work and would be quite interested in that.someproblems that are hard in this way to solve alignment. I’m personally a bit on the fence about that. I would obviously like to avoid solving hard problems if possible, but there’s also a risk that avoiding anything philosophically tricky just means we end up ignoring the core problems.## What about empirical tests?

No concrete plans yet, but doing so is definitely part of the goal! At the moment, the main way to test a proposed definition of good abstractions is to check if(a) it captures examples of good abstractions we know of, and (b) it doesn’t produce any nonsensical ones. It’s especially exciting if it produces intuitive abstractions that we didn’t think of in advance.

At some point, experiments are vital because otherwise we can at best test a definition against our intuition of what is “good”, not against what’s actually helpful in practice. My current sense is that I’ll need to work on the theoretical side a bit longer before I even have anything to implement though.

A good testing ground in the future could be

Tracr. For example: write a program, compile it into a transformer, then see whether we can automatically recover the abstractions we had on the program side. I’d guess the first experiments should be a bit simpler though.## So in summary, why are you excited about this agenda?

Three main reasons:

someideas applicable to one of these. At least, I expect working on this will help me a lot if I later think about one of these other topics.Point 3 is some evidence that other approaches might be better, but I don’t put too much weight on that since the explanation of “there aren’t enough people to explore every promising avenue” seems sufficient. Still, I would be very interested to hear from people who’ve thought about something close to this agenda and then discarded it.

Points 2 and 3 are in some conflict, and I think one reasonable criticism of this agenda is that it’s in a cluster that’s already a bit crowded relative to alignment overall, and that it would be more important to look for completely different agendas.

## And what criticisms are you most worried about?

All of the following seem similarly important to me:

This sounds like one of these difficult philosophical problems that’s really hard to make progress on.## Interlude: why

mechanisticabstractions?Ultimately, we care about having a model with good input-output behavior. So why do we want to mirror the internal structure of our network at all? In other words, why do we want abstractions of computations, instead of just abstractions of black-box functions? For example, say I give you a simple Python program and guarantee you that its input-output behavior is a good approximation of the neural network. But there’s a caveat: the

wayin which the program achieves this behavior is totally different—even at high levels of abstractions, it uses an entirely different algorithm. This doesn’t seem like a huge issue: having this program, along with the guarantee of sufficiently similar behavior, would still be massively useful.I think this is a pretty important question for this entire agenda (and for mechanistic interpretability). Here’s my current hypothesis:

I won’t justify this much here, partly because I don’t have great justifications beyond “seems intuitively plausible” and “seems true in some examples”. Investigating this more is

something I'm interested in.At best, this hypothesis holds when we want a good approximation

across many inputs. For example, if our goal is only to find a simpler approximation that’s valid on a narrow distribution, we could just train a simpler model from scratch. But we have no guarantees that this simpler model also generalizes similarly to the more sophisticated one.## Some potential projects

Here’s a (non comprehensive) list of subproblems that seem mostly self-contained and of reasonable scope. If you are interested in tackling any of these, please let me know! (I can at least make sure you’re not duplicating work that I or someone else has done in the meantime, and I might be able to provide some guidance.)

## Characterizing abstractions for specific computational models

The framework

abovegives a potential definition of abstractions, and in principle this determines all the abstractions of any given computational model. But it’s a very indirect specification, so it may be hard to work with. For example, given a binary circuit, it’s not obvious how we should think about the set of all its abstractions.It would be nice to find other characterizations of the set of all abstractions. This seems most likely to be fruitful when first specializing to one computational model (such as circuits). If characterizing the entire set in a nice way isn’t possible, it would also be interesting to look for easy-to-think-about classes of abstractions (similar to

Appendix: Examples of abstractions of computations).This is an important direction because it can address one of my key uncertainties: are there any non-obvious useful abstractions that we should be using? Or are modularity and interpreting subcircuits the only relevant ones? Having more examples should also help with other open questions (for example, the next one).

## Do behavioral approximations “factor through” mechanistic ones?

As discussed in

Why mechanistic abstractions?, my hypothesis for why we care about mechanistic abstractions/interpretability is:The most tractable way to get robust approximations of input-output behavior for complicated computations is to approximate their internal structure.A conjecture related to this is:

The idea is that there is a connection between minimal proof length and “mechanistic similarity”: if two circuits are equal at a low level of (mechanistic) abstraction, it ought to be easier to prove they have the same input-output behavior, and perhaps vice versa.

This is of course very informal, and the most important part of answering this question may be getting to the point where we can formalize the conjecture, rather than actually (dis)proving it. It also wouldn’t surprise me if (any reasonable formalization of) the conjecture is just false (though I would guess it is often true for circuits/programs we encounter in practice, so it might still be true with additional assumptions).

This feels similar in spirit to the conjecture that “everything happens for a reason” (from the

Heuristic arguments paper, appendix C). If I give you two boolean circuits that have exactly the same input-output behavior, should you expect that there is some structure in the circuits themselves that compactly explains this? If so, that explanation might be effectively saying “these two circuits are identical under the following abstraction”. I haven’t thought about whether there is a strong formal connection here or not.## Test the framework on more examples

This is essentially dual to

Characterizing abstractions for specific computational models: look for ways in which we intuitively want to abstract computations, then see if the framework can capture them or not. (If not, potentially adapt the framework.)## Do unique minimal abstractions always exist?

ETA: The answer to this question is "yes", seethis shortform. I don't think this is important progress on the agenda (see the shortform), just want to make sure no one wastes time on this subproblem.We can define a partial ordering on abstractions as follows: let Ai:M→Mi, for i=1,2 be abstractions, where M is the set of memory states, and Mi some set of abstract states. We say that A1 is

more fine-grained(i.e. lower-level/more informative) than A2, written A1≥A2, if there is some map h:M1→M2 such that A2=h∘A1. In words, M1 contains all the information we need to compute M2.We might want to use this ordering as a definition of “minimal” abstractions, which was one desideratum in

What makes some abstractions better than others?. In other words, out of all the abstractions that satisfy the consistency condition, and that contain a certain piece of information about the output of the computation, we might want to pick the minimal one according to this ordering. However, since this is only apartialorder, it’s not obvious that a unique minimal abstraction always exists, even if everything is finite. I have some preliminary results that suggest these unique minimal abstractions do in fact exist, but it’s for a slightly different setting and not yet fully worked out.Even if this was successful, it probably wouldn’t be the definition of “good abstraction” we ultimately want to use. It effectively puts infinite weight on the consistency condition and on representing some information, which in practice might often lead to using the full computation as the “abstraction”. Still, it could be a step in the right direction.

## Incorporate temporal abstractions and approximate consistency

As mentioned

above, we might want to collapse multiple computational steps in the full model to a single step in the abstraction. The current framework doesn’t allow this. I think it should likely be extended (or, if that’s not possible, the entire framework might need to be re-assessed). Another important step is to incorporateapproximateconsistency, i.e. only demand that the commutative diagram commutesapproximately. This will mean additional extensions of the framework, likely making abstract computational steps stochastic as well as requiring an input distribution.## Work out connections to existing work more precisely

Appendix: Related workmentions many existing notions of abstractions, many of which I suspect should fit into the framework I’ve described (or suitable modifications thereof). Working this out in more detail could be a good test, similar toTest the framework on more examples.On a more conceptual level, I still feel confused about some aspects of how this agenda and the specific framework relate to other alignment agendas, such as

John Wentworth’s work on (natural) abstractionsandARC’s mechanistic anomaly detection and heuristic arguments agenda. I think it should be possible to draw much crisper connections than I do in this post.## Final words

This agenda is in an early stage (I’ve been thinking about it for ~2 months). So if you have any feedback, either on the entire approach or on details, I highly appreciate it!

Also, if you’re working on similar things, or might want to start working on similar things, please feel free to reach out! (Just to chat, or to potentially collaborate.) LessWrong DM or

erik@ejenner.comfor example.## Appendix: Related work

In this appendix, I’ll give a brief overview of existing research I’ve looked at. The AI alignment section below is reasonably complete, I think (though with many open questions). You should consider all the other sections a preliminary list of potentially relevant fields, rather than a full literature review.

## AI alignment

## John Wentworth’s work on (natural) abstractions

JohnWentworthhaswrittenalotaboutabstractions. (Andeven more.) Going through all his ideas and trying to connect them with this agenda would be a significant project in itself (though one potentially worth doing). For now, I’ll just give my most important thoughts without too much justification:the samerole as in John’s work. For example, if you just have fully known deterministic computations, there’s no noise that can wipe out low-level details over a distance. It’s possible though that there is a deeper version of those ideas that does carry over.^{[6]}Abstractions of causal modelsorBisimulation. It would be exciting if these approaches converged at some point, which seems speculative but not impossible.## Mechanistic interpretability

Mechanistic interpretability broadly construed (e.g. as in

Daniel Filan's early post) seems very similar to looking for good abstractions of neural networks, except with the stronger focus of aiming for abstractions that are human-understandable. This is related to one of thelimitations of my framework: I don’t think as much about how to encode things in an easy to understand way, whereas this is an important aspect of interpretability (in addition to reducing the size of the thing that needs to be understood).In current practice, my sense is that a lot of mechanistic interpretability can be seen as finding subcircuits of a neural network that describe some small part of its behavior (see my

Subsets and quotients in interpretabilitypost). Focusing on these subcircuits throws away a lot of other information about the computational graph. For lots of subcircuits, this would leave you with basically no predictive power. Subcircuits like theindirect object identificationone are interesting because they still let you make some predictions about behavior without knowing about the rest of the network, so they’re good ways of throwing away parts of the computational graph, i.e. good abstractions. Compared to this type of “circuit-style” mechanistic interpretability research, abstractions of computations differ mainly by aiming to use more general types of abstraction.One other post I want to highlight is Daniel Filan’s

Verification and Transparency. He points out how both verification and transparency serve similar purposes and how some things (such as type annotations in programming languages) can be seen as instances of both. My current guess is that abstractions of computations are the more general view that connects the two: they make things more transparent (because they are simpler) and they help with verification (again because they are simpler, as hypothesized inDo extensional approximations “factor through” mechanistic ones?).## Modularity/Clusterability of neural networks

Daniel Filan and others have worked on

clusterability of neural networks, i.e. dividing the neurons in a network into natural clusters or submodules using graph cuts. There also seems to benewer work on NN modularitythat I’m not familiar with yet.This surveyhas a brief overview. Additionally, there areseveral postsby the 2022 AI safety camp modularity team andmorepostsby 2022 SERI MATS scholars exploring how to define and look for modularity in neural networks. Thane Ruthenishas arguedfor focusing on interfaces between certain submodules, such as world models and search/planning modules.As discussed in

the Q&A, modularity is a key example of the type of abstraction I’m interested in, but I’m hoping to also study a broader class of abstractions.## Causal scrubbing

In the

Mechanistic interpretabilitysection, I talked informally about the “predictive power” of a subcircuit.Causal scrubbingformalizes this by defining whether an interpretation of a subcircuit “explains” a given behavior. So I think of this as formalizing one desideratum for a specific type of “good abstractions” of computation. (Other desiderata are still informal; for example, we might want the proposed interpretation to be as “simple” as possible.) I tentatively think it should be possible to model causal scrubbing as a commutativity condition in my framework but haven’t worked it out yet (doing so is on my agenda).An important similarity between causal scrubbing and this agenda is that both are aiming at automating interpretability research. (In my case, I’m less firm on the exact end game, but this is at least one major motivation.) They also both seem like candidates for formalizing when mechanisms explain a behavior for mechanistic anomaly detection (or at least I think abstractions seem potentially applicable to that).

## ARC’s mechanistic anomaly detection and heuristic arguments agenda

ARC wants to develop a method for

mechanistic anomaly detection: say we have a function f and an input distribution D (the training distribution in an ML context). Given some new input to f, we would like to say whether the output of f is due to the “same reasons” that explain its behavior on D, or due to some other “mechanism” that wasn’t necessary to explain the behavior on D. (Seetheir postfor an explanation of how this relates to ELK and other alignment problems.)Developing an algorithm for mechanistic anomaly detection first requires clarifying what we mean by “mechanisms” “explaining” model behavior. ARC’s approach is trying to formalize

heuristic argumentsand then use those as the definition of “explanations”.There are two reasons to think that abstractions of computations could be relevant for this agenda (and vice versa):

causal scrubbingan important example of what a framework for abstractions should capture, andPaul mentions causal scrubbing as a candidate for heuristic argumentstoo.Curry-Howard isomorphism. And abstractions of proofs might be a type of heuristic argument: they are not themselves proofs, but approximate them in some sense; they are shorter than full proofs; and they might give a good high-level explanation for “why” something is true. (Even if this is the case, it’s not clear they are the kind of heuristic arguments ARC is currently working on, e.g. they might not be defeasible.)## Abstractions of causal models

Starting with

Rubenstein et al. (2017), there have been a few papers on when a structural causal model (SCM) is an abstraction of another SCM. Follow-up papers areBeckers & Halpern (2019a),Beckers & Halpern (2019b)(which introducesapproximateabstraction),Rischel & Weichwald (2021), andOtsuka & Saigo (2022)(the last two use a category theoretical framework).Zennaro (2022)reviews these different frameworks. John Wentworth has also written asummary/review of Beckers & Halpern.I haven’t studied all of these in detail, but as far as I can tell, the approach behind all of them is the same type of commutative diagram that I’m using (and that occurs elsewhere, e.g. in

Bisimulation). The idea that’s specific tocausalabstraction is that the horizontal arrows are causalinterventions. The same type of diagram is also used in arecent paperby Taco Cohen to define causal models of an underlying true causal structure (he refers to this as a “naturality” condition, but doesn’t make explicit any category theoretical interpretation). In that paper, there are two horizontal arrows, one a causal intervention and one some other process that turns states into “outcomes”.Building on these causality papers, there is also a

line of work from Stanfordthat applies causal abstractions to neural networks. The idea is: we can think of the computational graph of a neural network as a causal model. Given some hypothesis for a higher-level causal model that we think the network might be implementing, and a specification for how the nodes of that higher-level model map into the network, we can test interventional consistency (i.e. check whether the high-level hypothesis is indeed a causal abstraction of the network). This is very reminiscent ofCausal scrubbing.Overall, causal abstractions seem highly relevant for abstractions of computations, given the connection between causal models and computational graphs.

## Artificial intelligence

Abstractions in general are relevant all over the place in AI, but I’ll focus on two directions that seem potentially related to abstractions

of computationsspecifically.## Reinforcement learning

Reinforcement learning is interesting because it combines both

state abstractionsandtemporal abstractions(i.e. high-level descriptions of state space and high-level descriptions of actions to take). As mentioned inLimitations of this framework, it seems that my current framework is missing temporal abstraction, and RL could provide inspiration on how to combine this with (memory) state abstraction. See for exampleDavid Abel's thesisand its Background section (note that it uses the nameaction abstractionsinstead of temporal abstractions).## Logical reasoning and theorem proving

There is a 1992 paper simply called

A theory of abstraction, which introduces a framework for abstractions of formal systems. Section 8 has an extensive list of examples from earlier work described in this framework, lots of it on theorem proving but also other areas of AI. One reason why abstractions for theorem proving may be interesting is the connection to heuristic arguments and mechanistic anomaly detection I discussedabove.## Computer Science

Abstractions of computations are obviously a big topic in CS. It’s clear there are tons of

examplesof abstraction to be found,Wikipediahas an overview page. Less clear to me is to what extent people have explored what abstractions of computations arein general.Below, I’ve collected the subfields that seem most relevant.

## Programming languages

As mentioned earlier, programming languages are abstractions of lower-level computational models, such as machine code, but this is just a source of examples rather than a definition of abstractions in general. Programming languages also provide methods for creating computations with obvious abstractions. For example, functions let you decompose a program into submodules, and a very obvious abstraction is to simply specify

whatsome of these functions do instead ofhowthey work. Even this seems mostly just like a source of examples though. Individual programming languages will have less to say on the subject ofwhat kind of thing abstractions are in general.^{[7]}However, the field ofprogramming language designmight have things to say about that, since it is essentially in the business of finding good abstractions of computational models.## Formal specification

One important type of abstraction I discussed is to “black box” parts of a computation, i.e. forget

howthat part is computed and just describewhatthe input-output behavior is. One question is then: how should the input-output behavior be represented? We probably don’t want to actually use a giant lookup table (and we shouldn’t have to if the program implements some conceptually simple behavior).Formal specificationcould hold an answer to that subproblem.Abstract interpretation(and other uses of abstraction for model checking)Based on a quick glance, this seems highly relevant, but I haven’t spent much time looking into it yet (in particular, whether the formalism there is promising as a general definition of abstractions).

BisimulationBisimulation is a notion of

intensional/mechanistic equivalencebetween transition systems, i.e. it describes when two transition systems should be considered “the same” in terms of their computational structure. Probably even more importantly for this agenda, there is also the one-sided version of this, one systemsimulatinganother one (but not necessarily vice versa). This seems like a good candidate for abstractions specifically of transition systems.People have studied generalizations of (bi-)simulation using category theory (“coalgebraic bisimulation”) and this was in fact what led me to

think about abstractions using (co)algebras.## Plagiarism detectors

If you want to tell whether a piece of software plagiarizes another one, you probably don’t just want to check whether their source code is literally the same. Instead, you’re more interested in whether they’re equivalent at some higher level of abstraction. So I’d hoped that plagiarism detectors for code might contain some interesting ideas for defining “mechanistic similarity” of programs (h/t to Stuart Russell and Shreyas Kapur for independently making that suggestion). Unfortunately this doesn’t look too promising based on a cursory exploration: it seems plagiarism detectors often focus on efficiently checking whether two programs have subsnippets that match exactly. They do often “normalize” source code by removing whitespace etc., but that happens in an ad-hoc way that differs from method to method.

## Philosophy

I assume that philosophers have thought, if not about exactly my question, then at least related ones. However, this is one of the many things I haven’t gotten around to looking into yet. I expect this might be less helpful than more mathematical approaches like the ones in causal abstraction, but still worth checking.

## Appendix: Examples of abstractions of computations

In this appendix, we’ll look at some examples of abstractions within

the framework outlined above. Most of them are in circuits, though we also briefly discuss Turing machines. This part may be a bit harder to follow than the rest of the post since I’ll only give rough descriptions of how these examples work.## Circuits

Circuits are directed acyclic graphs where each node is equipped with …

To turn this into a concrete model of how computations are performed step by step, we also choose a topological ordering of the circuit.

^{[8]}The state/memory of the computation assigns to each node either an output value or an “undefined” label. Initially, all nodes except inputs are undefined. Then we go over nodes in the chosen topological order and determine one new value in each computational step.Black boxLet’s start by considering the trivial “black box” abstraction that exists for any computation (not just circuits). This abstraction takes a memory state and maps it to the final output of the circuit (by running through the entire remaining computation). The abstract computational step (f′ in the consistency diagram), is simply the identity function. This makes the consistency diagram commute perfectly.

Quotienting output spaceWe can throw away even more information. Namely, instead of mapping memory states to the output, we can map them to some

equivalence classof outputs. For a binary circuit, we may be interested e.g. in only the parity of all the output bits. In an arithmetic circuit, perhaps we only want the output up to some precision. Again, this abstraction is applicable to any computational model, not just circuits.^{[9]}SubcircuitsNext, let’s start keeping some information about the actual circuit, instead of treating it as a black box. One key class of examples are “subcircuit” abstractions. This simply means we keep the values for some of the nodes while forgetting all the other ones. To perform computational steps within our abstraction, we might sometimes need input values we don’t have (if there are nodes we represent in the abstraction but which have parents we discarded). There are different choices we could make here, for example just setting assuming these inputs are zero (or some other default value that makes sense in the context of the types of values the circuit works with). This is essentially the same as the choice of

ablationsin mechanistic interpretability. It seems likely thatresampling ablations(as used by causal scrubbing)are generally a better ideathan e.g. zero ablations. This requires us to have access to some input distribution, but that seems necessary anyway in order to talk about approximate consistency, as discussed in the main post.Quotienting the circuitAnother important type of abstraction are quotients of circuits, which correspond to division into submodules. I think of this as black-boxing not the entire circuits, but only certain subcircuits. This abstraction is defined by a

partitionof the set of nodes (as opposed to specifying onesubsetof the set of nodes, as in the subcircuit abstraction we just discussed). The intuitive version of what we want to do is the following: we construct a new circuit by “quotienting out” the partition, i.e. turning each equivalence class of nodes into a single node, which implements the function previously implemented by the entire equivalence class. Note that this does put some restrictions on what the partition can be (because we need this quotient circuit to still be acyclic). To formalize this within the current framework, we need to decide how to map memory states of the full circuit to memory states for the quotient circuit. One simple way is to map an equivalence class in which every node is computed to the appropriate output values, and to map any equivalence class with any “undefined” values to “undefined”. To be able to mirror computational steps in the quotient circuit, we also need to add a memory field in the abstraction that tells us whether each equivalence class is about to be “completed” (e.g. just store the step of the topological sort we’re currently at—this can just be increased by 1 in each abstract computational step). That seems a bit inelegant and could be handled more sensibly with a more general framework that allows temporal abstractions.Associative treesI’ll finish with a somewhat less central example for some variety. Say I have a boolean circuit that computes the AND of n inputs. A simple way to do this would be a binary tree of AND gates. But you could also imagine other implementations that compute the same function using only NOT and OR. So a reasonable abstraction is “a tree using only AND gates”, which tells us something about the algorithm but ignores the details of how the tree is structured (e.g. is it a tree with depth logn, or one with depth n?)

One reason I’m interested in this example is that I initially thought it would be hard to describe in my current framework. But here’s an approach that I think should work: given the memory state of the circuit, i.e. node values for some prefix of the topological order, we can determine the current “frontier” of nodes, i.e. all the nodes we still need to compute future values. Our abstraction will be to take the AND of all the values on this frontier. For a tree of AND gates, this is always the same as the output of the circuit, so we can use the identity as the computational step within this abstraction. Also note that the abstraction will be the same for any tree of AND gates. On the other hand, assume there is a NOT gate at any point in the circuit. Then for some inputs, the AND along the “frontier” will flip once we reach that NOT gate. The abstraction is thus different from the pure AND tree (and also no longer satisfies the consistency condition).

The same approach can be applied to gates other than AND of course, as long as they’re associative (which ensures that all trees consisting only of that gate type are extensionally equal).

## Turing machines

Consider a

Universal Turing machineU, i.e. a Turing machine that takes as input an encoding of some Turing machine M and an input x to M, and then outputs what M would output on x. If we fix a Turing machine M and “partially apply” U to M, we get a Turing machine that’s extensionally equal to M (except that its input/output may be encoded in a different way). This Turing machine essentially simulates M within U. Intuitively, M itself is a very natural abstraction of this more complicated computation.Typical constructions of universal Turing machines encode the state and tape of the simulated machine M in a pretty straightforward way (see e.g.

this one). That means it’s easy to map from the state & tape of U to the corresponding state & tape of M, which gives us our abstraction. To mirror computational steps, we essentially just want to do computations in M as usual.There’s one obstacle here, which we already discussed in

Limitations of this framework: multiple computational steps in U correspond to a single step in M. I’m not sure whether there’s a hack to work around this, like there was forQuotienting the circuit. In any case, this is a strong hint that temporal abstraction should be part of the framework.One other open question is to what extent this works for

arbitraryuniversal Turing machines. While I would guess that any explicit construction of UTMs encodes the state of the simulated machine, it's not obvious to me that this is necessary. You could for example imagine a universal Turing machine that first transforms the simulated machine into some different, extensionally equal one, and then simulates this transformed machine. Even then, the transformed machine might have to be mechanistically quite similar to the original one, so we would still get a good abstraction at some higher level. I’m not sure if this can ever break down completely. (If it does, that’s not necessarily a counterexample to the framework—it seems quite good for this framework to be able to detect whether a universal Turing machine works the “normal” way or does something crazy.)^{^}We have to either restrict ourselves to computations that terminate on all inputs, or have a default output for computations that don’t terminate.

^{^}I think there probably isn’t a clean division into “natural” and “not natural”, so maybe John and I aren’t quite on the same page here—I’m not sure what exactly his take on this is.

^{^}For a closely related point, see my

Too much structurepost.^{^}This seems likely in the case of search processes. Goal-directedness is less clear, you could go for purely behavioral definitions as well.

^{^}A weaker and easier to formalize version would be: in any such case, there

existsa proof of similar length that shows equality up to an abstraction (but which can be different from the original proof).^{^}Though I do also have a fairly strong intuition that there is

somethinginput-independent we can say—the presence of a tree detector is extremely unlikely in a random network, and seeing one is a strong hint that there were trees in the training distribution. So I shouldn’t need to tell you what the input distribution is for you to say “this is an interesting subcomponent of the network”.^{^}An exception might be if some languages are specifically designed to make it easy to build new abstractions and work with them?

^{^}Alternatively, we could compute values in “maximal layers”, I haven’t checked yet whether that changes the picture much.

^{^}Things seem slightly messier if we want to only represent the circuit as a black box on a subset of inputs, i.e. “forget” the behavior on other inputs. This is possible under the computational model where we keep around old node values. But if we introduce “garbage collection” and throw away inputs once they’re no longer needed, it might be impossible to determine whether a memory state came from one of the inputs we’re interested in or not. I’m not entirely sure how to interpret this: it could mean that this abstraction is in fact somewhat weird from a mechanistic perspective, or that this framework isn’t as clean as one might hope, and the precise computational model matters a lot. Perhaps a better approach is to handle subsets of inputs by restricting the input distribution that we’ll probably want to introduce at some point.