Reading this, it dawned on me that we tend to talk about human-understandability of NNs as if it was magic.[1] I mean, how much is there to human-understandability--in the sense of sufficiently intelligent (but not extremely rare genius-level) humans being able to work with a phenomenon and model it accurately on a sufficient level of granularity--beyond complexity of that phenomenon and computational reducibility of its dynamics?
Sure, we find some ways of representing information and thinking more intuitive and easier to grasp than others but we can learn new ones that would seem pretty alien to a naive hunter-gatherer mind, like linear algebra or category theory.
Or perhaps it's just me and this is background knowledge for everybody else, whatever. ↩︎
You might be interested in "context agents," which was a shot of mine at a formalization of abstraction in world-modeling (which becomes approximate abstraction of computations if the world you're modeling has computations).
One problem I see with it now is that it's rigid - each abstract action gets just one lower-level representation. This is useful for computation, but maybe the incorrect formalism to think in. On the other hand, there's an opposite problem which is maybe it's too disconnected - different abstract actions in the same higher-level ontology may point you to totally different lower-level ontologies. This basically means that if you want to learn how to do abstract reasoning in this way, there's no point learning individual ontologies (there's too many that get used too infrequently), instead you have to learn a generator of ontologies.
Overall, I'm skeptical of the existence of magic bullets when it comes to abstraction - by which I mean that I expect most problems to have multiple solutions and for those solutions to generalize only a little, not that I expect problems to have zero solutions.
Sure, commuting diagrams / non-leaky abstractions have nice properties and are unique points in the space of abstractions, but they don't count as solutions to most problems of interest. Calling commuting diagrams "abstraction" and everything else "approximate abstraction" is I think the wrong move - abstractions are almost as a rule leaky, and all the problems that that causes, all the complicated conceptual terrain that it implies, should be central content of the study of abstraction. An AI safety result that only helps if your diagrams commute, IMO, has only a 20% chance of being useful and is probably missing 90% of the work to get there.
Thanks for your thoughts!
Overall, I'm skeptical of the existence of magic bullets when it comes to abstraction - by which I mean that I expect most problems to have multiple solutions and for those solutions to generalize only a little, not that I expect problems to have zero solutions.
Not sure I follow. Do you mean that you expect there to not be a single nice framework for abstractions? Or that in most situations, there won't be one clearly best abstraction?
(FWIW, I'm quite agnostic but hopeful on the existence of a good general framework. And I think in many cases, there are going to be lots of very reasonable abstractions, and which one is "best" depends a lot on what you want to use it for.)
Sure, commuting diagrams / non-leaky abstractions have nice properties and are unique points in the space of abstractions, but they don't count as solutions to most problems of interest. Calling commuting diagrams "abstraction" and everything else "approximate abstraction" is I think the wrong move - abstractions are almost as a rule leaky, and all the problems that that causes, all the complicated conceptual terrain that it implies, should be central content of the study of abstraction. An AI safety result that only helps if your diagrams commute, IMO, has only a 20% chance of being useful and is probably missing 90% of the work to get there.
I absolutely agree abstractions are almost always leaky; I expect ~every abstraction we actually end up using in practice when e.g. analyzing neural networks to not make things commute perfectly. I think I disagree on two points though:
That being said, I do agree investigating the leaky case early on is important (if only to check whether that requires a complete overhaul of the framework, and what kinds of issues it introduces). I've already started working on that a bit, and for now I'm optimistic things will mostly transfer, but I suppose I'll find out soon-ish.
Not sure I follow. Do you mean that you expect there to not be a single nice framework for abstractions? Or that in most situations, there won't be one clearly best abstraction?
Yeah, I was vague. I think once a framework for abstractions becomes specific enough that it starts recommending specific abstractions to you, the two things you mention become connected - there will be different acceptable specific-frameworks because there are different acceptable abstractions.
I agree, once you have some specification of what you want to use the abstraction for (rather than just "be a good abstraction,") that goes a long way to pinning down how to think about the world. But what we want is itself an abstraction that has multiple acceptable answers :P
Anyhow, best of luck :D
Great job with this post! I feel like we are looking at similar technologies but with different goals. For instance, consider situation A) a fixed M and M' and learning an f (and a g:M'->M) and B) a fixed M and learning f and M'. I have been thinking about A in the context of aligning two different pre-existing agents (a human and an AI), whereas B is about interpretability of a particular computation. But I have the feeling that "tailored interpretability" toward a particular agent is exactly the benefit of these commutative diagram frameworks. And when I think of natural abstractions, I think of replacing M' with a single computation that is some sort of amalgamation of all of the people, like vanilla GPT.
A thought on the "but what if multiple steps in the actual-algorithm correspond to a single step in an abstracted form of the algorithm?" thing :
This reminds me a bit of, in the topic of "Abstract Rewriting Systems", the thing that the vs distinction handles. (the asterisk just indicating taking the transitive reflexive closure)
Suppose we have two abstract rewriting systems and .
(To make it match more closely what you are describing, we can suppose that every node has at most one outgoing arrow, to make it fit with how you have timesteps as functions, rather than being non-deterministic. This probably makes them less interesting as ARSs, but that's not a problem)
Then, we could say that is a homomorphism (I guess) if,
for all such that has an outgoing arrow, there exists such that and .
These should compose appropriately [err, see bit at the end for caveat], and form the morphisms of a category (where the objects are ARSs).
I would think that this should handle any straightforward simulation of one Turing machine be another.
As for whether it can handle complicated disguise-y ones, uh, idk? Well, if it like, actually simulates the other Turing machine, in a way where states of the simulated machine have corresponding states in the simulating machine, which are traversed in order, then I guess yeah. If the universal Turing machine instead does something pathological like, "search for another Turing machine along with a proof that the two have the same output behavior, and then simulate that other one", then I wouldn't think it would be captured by this, but also, I don't think it should be?
This setup should also handle the circuits example fine, and, as a bonus(?), can even handle like, different evaluation orders of the circuit nodes, if you allow multiple outgoing arrows.
And, this setup should, I think, handle anything that the "one step corresponds to one step" version handles?
It seems to me like this set-up, should be able to apply to basically anything of the form "this program implements this rough algorithm (and possibly does other stuff at the same time)"? Though to handle probabilities and such I guess it would have to be amended.
I feel like I'm being overconfident/presumptuous about this, so like,
sorry if I'm going off on something that's clearly not the right type of thing for what you're looking for?
__________
Checking that it composes:
suppose A, B, C,
then for any which has an outgoing arrow and where f(a) has an outgoing arrow, where
so either b' = f(a) or there is some sequence where , and so therefore,
Ah, hm, maybe we need an additional requirement that the maps be surjective?
or, hm, as we have the assumption that has an outward arrow,
then we get that there is an s.t. and s.t.
ok, so, I guess the extra assumption that we need isn't quite "the maps are surjective", so much as, "the maps f are s.t. if f(a) is a non-terminal state, then f(a) is a non-terminal state", where "non-terminal state" means one with an outgoing arrow. This seems like a reasonable condition to assume.
for all such that has an outgoing arrow, there exists such that and
Should it be at the end instead? Otherwise not sure what b is.
I think this could be a reasonable definition but haven't thought about it deeply. One potentially bad thing is that would have to be able to also map any of the intermediate steps between a an a' to . I could imagine you can't do that for some computations and abstractions (of course you could always rewrite the computation and abstraction to make it work, but ideally we'd have a definition that just works).
What I've been imagining instead is that the abstraction can specify a function that determines which are the "high-level steps", i.e. when should be applied. I think that's very flexible and should support everything.
But also, in practice the more important question may just be how to optimize over this choice of high-level steps efficiently, even just in the simple setting of circuits.
Whoops, yes, that should have said , thanks for the catch! I'll edit to make that fix.
Also, yes, what things between and should be sent to, is a difficulty..
A thought I had which, on inspection doesn't work, is that (things between and ) could be sent to , but that doesn't work, because might be terminal, but (thing between and ) isn't terminal. It seems like the only thing that would always work would be for them to be sent to something that has an arrow (in B) to (such as f(a), as you say, but, again as you mention, it might not be viable to determine f(a) from the intermediary state).
I suppose if were a partial function, and one such that all states not in its domain have a path to a state which is in its domain, then that could resolve that?
I think equivalently to that, if you modified the abstraction to get, defined as, and
so that B' has a state for each state of B, along with a second copy of it with no in-going arrows and a single arrow going to the normal version,
uh, I think that would also handle it ok? But this would involve modifying the abstraction, which isn't nice. At least the abstraction embeds in the modified version of it though.
Though, if this is equivalent to allowing f to be partial, with the condition that anything not in its domain have arrows leading to things that are in its domain, then I guess it might help to justify a definition allowing to be partial, provided it satisfies that condition.
Are these actually equivalent?
Suppose is partial and satisfies that condition.
Then, define to agree with f on the domain of f, and for other , pick an in the domain of f such that , and define
In a deterministic case, should pick to be the first state in the path starting with to be in the domain of . (In the non-deterministic case, this feels like something that doesn't work very well...)
Then, for any non-terminal , either it is in the domain of f, in which case we have the existence of such that a and where , or it isn't, in which case we have that there exists such that a and where , and so f' satisfies the required condition.
On the other hand, if we have an satisfying the required condition, we can co-restrict it to B, giving a partial function , where, if isn't in the domain of f, then, assuming a is non-terminal, we get some s.t. a and where , and, as the only things in B' which have any arrows going in are elements of B, we get that f'(a'') is in B, and therefore that a'' is in the domain of f.
But what if a is terminal? Well, if we require that non-terminal implies non-terminal, then this won't be an issue because pre(b) is always non-terminal, and so anything terminal is in the domain.
Therefore the co-restriction f of f', does yield a partial function satisfying the proposed condition to require for partial maps.
So, this gives a pair of maps, one sending functions satisfying appropriate conditions to partial function satisfying other conditions, and one sending the other way around, and where, if the computations are deterministic, these two are inverses of each-other. (if not assuming deterministic, then I guess potentially only a one-sided inverse?)
So, these two things are equivalent.
uh... this seems a bit like an adjunction, but, not quite. hm.
I independently thought this agenda would be very useful, potentially in combination with Wentworth-style work on the nature of abstractions. I'm glad to see this has been written up and may be worked on!
I liked this post. It correctly imho emphasizes theoretical & mathematically substantive approaches, a 'big tent' approach and ties it in with the kind of conceptual questions we'd like to attack in AI alignment.
You might like to look at Crutchfield's epsilon machines, see e.g. https://warwick.ac.uk/fac/cross_fac/complexity/study/msc_and_phd/co907/resources/dpf.warwick.slides.pdf
Especially with regard to a good definition of abstraction I suggest taking a look at this paper: https://arxiv.org/pdf/cond-mat/0303625.pdf
Big thanks to Leon Lang, Jérémy Scheurer, Adam Gleave, and Shoshannah Tekofsky for their feedback on a draft of this post, to Euan McLean (via FAR AI) for his feedback and a lot of help with editing, and to everyone else who discussed this agenda with me, in particular Johannes Treutlein for frequent research check-ins!
Summary
I encourage you to skip around and/or only read parts of the post. Here’s an overview:
This post doesn’t contain any actual theorems or experiments, so if you’re only interested in that, you can stop reading.
Introduction
Humans can’t just look at the weights of a neural network and tell what it’s doing. There are at least two reasons for this:
The second point would likely apply to any system that does well on complicated tasks. For example, neural networks are decision trees, but this doesn’t mean we can look at the decision tree corresponding to a network and understand how it works. To reason about these systems, we will likely have to simplify them, i.e. throw away details that are irrelevant for whatever we want to find out. In other words, we are looking for abstractions of computations (such as neural networks).
Abstractions are already how we successfully reason about many other complicated systems. For example, if you want to understand the Linux kernel, you wouldn’t start by reading the entire source code top to bottom. Instead, you’d try to get a high-level understanding—what are the different modules, how do they interact? Similarly, we use pseudocode to more easily communicate how an algorithm works, abstracting away low-level details.
If we could figure out a general way to find useful abstractions of computations, or at least of neural networks, perhaps we could apply this to understand them in a similar way. We could even automate this process and mechanically search for human-understandable abstractions.
Making things easier to understand for humans isn’t the only application of abstractions. For example, abstractions have been used for more efficient theorem proving and in model checking (e.g. abstract interpretation). These abstractions need not ever be used by humans. While I don’t think we will be able to formally verify the properties we are most interested in for neural networks, the underlying ideas might still apply. Another example is mechanistic anomaly detection, where we only care about whether a computation “uses the same mechanism” on a new input as on earlier ones. Again, this doesn’t require that we understand those mechanisms. Because of this, I’m looking for a general theory of abstractions and what makes them useful, not just a theory of human-understandable abstractions.
What are abstractions of computations?
Examples
I’m thinking about ways we can throw away some information about a computational process, while keeping other parts. There are (roughly speaking) two kinds of information we can throw away:
In practice we are often interested in abstractions that remove both types of information. (In fact, you have to throw away some mechanistic information to throw away information about behavior—given the exact implementation, the input-output behavior can be recovered.)
Let’s look at some examples:
Formal framework
I’m uncertain about what the right formal framework looks like. In this subsection and the next, I’ll describe my current guess, and then mention some issues it has in Limitations of this framework.
Computations
To formalize abstractions of computations, let’s first formalize what we mean by “computations”. I’m currently using a pretty simple and general approach: computations have a set M of memory states and a function f:M→M that performs a single computational step. We also need a function M→{True,False} that tells us whether the computation has finished. Finally, we have a set I of possible input values and a set O of possible output values, with functions I→M and M→O that encode inputs as memory states and decode back to outputs respectively. Any computation induces a function I→O by executing f until the computation terminates.[1] This induced function describes the input-output behavior.
For example, in a deterministic finite automaton, the memory state is the current state of the automaton plus the remaining input. In a Turing machine, the memory state is the tape and the state register.
Abstractions
We can abstract a computation by throwing away information about its memory state. This means an abstraction is a function A:M→M′ for some set of abstract states M′. Since the abstract state is a deterministic function of the full state, it can contain at most the same information as the full state, and typically less (if A is non-injective).
An arbitrary function A:M→M′ might not have anything to do with the “structure” of the computation, since it ignores the computational step f. To make A an abstraction of the computation defined by f, we need to be able to “mirror” the computational step within the abstraction.
We can show all of this in a commutative diagram as follows:
f in the top row is the full computational step, and A is the abstraction mapping on memory states. For a good abstraction, there should exist a dashed map f′ that makes the diagram commute. Note that we can assume without loss of generality that A is surjective, in which case f′ is uniquely determined by f and A.
Commutativity of this diagram means we can either execute a step in the full computation and then abstract the state, or first abstract the state and then execute the computational step in the abstraction. You can think of this as a consistency condition.
This type of commutative diagram is used to describe abstractions in contexts other than computations as well, see Appendix: Related work for several examples, as well as my earlier posts. Appendix: Examples of abstractions of computations describes several examples of abstractions within this framework.
In practice, most abstractions will not achieve exact commutativity: they will be leaky. So a full version of this framework should formalize approximate consistency. This likely means we want a distribution over inputs or memory states, and that the abstract computational step f′ will compute a probability distribution over next states instead of just a single one.
What makes some abstractions better than others?
The consistency condition from the previous section is not sufficient to get “good abstractions”. There are two trivial ways to get commutativity with clearly useless abstractions:
We can rule out these failure modes with two additional desiderata:
I currently think these two desiderata, along with the consistency condition, might be enough. Admittedly, a lot of the difficulty in formalizing good abstractions is hidden in the vagueness of these goals.
It’s also possible we need additional criteria to actually get “white box” abstractions that really reflect the computational structure. But I’m hoping we can get that “for free” with the right formalization of the desiderata above.
Limitations of this framework
I’m aware of three issues with the current framework:
I’m optimistic that limitation 1. and 2. can be resolved by moderate extensions of the framework, leaving the key ideas intact. This is much less clear for 3., but it’s also not obvious that this should be “fixed” within the framework—maybe it’s better to think of 3. as something to be addressed completely separately.
Q&A
Isn’t this just mechanistic interpretability?
Yeah, kind of. I think of it as a more theoretical approach that complements empirical interpretability research and has a similar impact story for alignment. If you want to improve circuit-style analysis, empirical work seems more promising. But my guess is that there are other useful ways of abstracting neural networks that we currently don’t make enough use of. The clearest example is looking for submodules, and there could also be less obvious ones that only a more general study of abstractions of computations will reveal.
Compared to typical interpretability research, I’d say this agenda …
That being said, mechanistic interpretability, especially some flavors such as causal scrubbing, are very closely related to this agenda. See the Related work section for more discussion.
Isn’t this just NN modularity/clusterability?
It’s meant to be more general than just modularity, though I’m currently unsure how much.
Modularity is a prototypical example of the kinds of abstractions I have in mind. Abstractions of computations often work by dividing big computations into submodules. Functions in programming languages are a clear example. I am definitely interested in applying this type of abstraction to neural networks too. However, I am hoping there might be other useful abstractions, so for now I want to stay on the lookout for those.
Does this have anything to do with natural abstractions?
Probably, but I’m unsure about the details. I discuss this a bit more in the related work section. In brief: a key similarity is the assumption that things “abstract well” and that some abstractions are “better” or “more natural” than others.[2] But in terms of the mathematical formalism, the framework I described above seems closer to certain other work than to John’s definition of natural abstractions.
What about mechanistic anomaly detection?
I haven’t thought about this much yet, but it seems to me that abstractions of computations could be a candidate for defining what it means for “mechanisms” to “explain” an output. Abstractions of programs could also be closely related to abstractions of proofs because of the Curry-Howard isomorphism. And abstractions of proofs might be a type of heuristic argument (though probably not quite the type ARC is studying). So that’s another reason to think there’s a connection here. See a bit more discussion in the related work section.
To me, mechanistic interpretability, ELK, and mechanistic anomaly detection seem to overlap a lot (this isn’t a new take of course, see e.g. Paul on why indistinguishable mechanisms would also be bad news for other agendas). I basically think of abstractions of computations as adding yet another perspective on similar problems. I don’t have clear reasons to think it’s a better perspective, it just seems about as reasonable and less explored. But it wouldn’t shock me if I just ended up converging to e.g. a mechanistic anomaly detection view.
Why abstractions of computations, and not just abstractions of neural networks?
A few responses:
So what’s the plan in more detail?
I’m currently thinking about two broad questions:
I outlined my current guess for question 1 above and also briefly discussed question 2. But there’s definitely lots of work to be done on both.
If I had answers to these questions, the ideal next step would be to develop efficient algorithms that can find good abstractions of a given network or other computation. There is a spectrum of different outcomes, depending on how hard this turns out to be:
The abstractions we find this way could either be human-understandable ones, which would mean automating interpretability, or they might be useful for an approach like mechanistic anomaly detection (as discussed earlier in the Q&A).
This is the most straightforward path to impact, but there are some speculative connections to other problems in alignment:
This sounds like a difficult philosophical problem that’s really hard to make progress on.
I think that’s one of the biggest potential weaknesses of this agenda, but I don’t think it’s a decisive one. A few counterpoints (not meant to completely refute this criticism):
What about empirical tests?
No concrete plans yet, but doing so is definitely part of the goal! At the moment, the main way to test a proposed definition of good abstractions is to check if(a) it captures examples of good abstractions we know of, and (b) it doesn’t produce any nonsensical ones. It’s especially exciting if it produces intuitive abstractions that we didn’t think of in advance.
At some point, experiments are vital because otherwise we can at best test a definition against our intuition of what is “good”, not against what’s actually helpful in practice. My current sense is that I’ll need to work on the theoretical side a bit longer before I even have anything to implement though.
A good testing ground in the future could be Tracr. For example: write a program, compile it into a transformer, then see whether we can automatically recover the abstractions we had on the program side. I’d guess the first experiments should be a bit simpler though.
So in summary, why are you excited about this agenda?
Three main reasons:
Point 3 is some evidence that other approaches might be better, but I don’t put too much weight on that since the explanation of “there aren’t enough people to explore every promising avenue” seems sufficient. Still, I would be very interested to hear from people who’ve thought about something close to this agenda and then discarded it.
Points 2 and 3 are in some conflict, and I think one reasonable criticism of this agenda is that it’s in a cluster that’s already a bit crowded relative to alignment overall, and that it would be more important to look for completely different agendas.
And what criticisms are you most worried about?
All of the following seem similarly important to me:
Interlude: why mechanistic abstractions?
Ultimately, we care about having a model with good input-output behavior. So why do we want to mirror the internal structure of our network at all? In other words, why do we want abstractions of computations, instead of just abstractions of black-box functions? For example, say I give you a simple Python program and guarantee you that its input-output behavior is a good approximation of the neural network. But there’s a caveat: the way in which the program achieves this behavior is totally different—even at high levels of abstractions, it uses an entirely different algorithm. This doesn’t seem like a huge issue: having this program, along with the guarantee of sufficiently similar behavior, would still be massively useful.
I think this is a pretty important question for this entire agenda (and for mechanistic interpretability). Here’s my current hypothesis:
I won’t justify this much here, partly because I don’t have great justifications beyond “seems intuitively plausible” and “seems true in some examples”. Investigating this more is something I'm interested in.
At best, this hypothesis holds when we want a good approximation across many inputs. For example, if our goal is only to find a simpler approximation that’s valid on a narrow distribution, we could just train a simpler model from scratch. But we have no guarantees that this simpler model also generalizes similarly to the more sophisticated one.
Some potential projects
Here’s a (non comprehensive) list of subproblems that seem mostly self-contained and of reasonable scope. If you are interested in tackling any of these, please let me know! (I can at least make sure you’re not duplicating work that I or someone else has done in the meantime, and I might be able to provide some guidance.)
Characterizing abstractions for specific computational models
The framework above gives a potential definition of abstractions, and in principle this determines all the abstractions of any given computational model. But it’s a very indirect specification, so it may be hard to work with. For example, given a binary circuit, it’s not obvious how we should think about the set of all its abstractions.
It would be nice to find other characterizations of the set of all abstractions. This seems most likely to be fruitful when first specializing to one computational model (such as circuits). If characterizing the entire set in a nice way isn’t possible, it would also be interesting to look for easy-to-think-about classes of abstractions (similar to Appendix: Examples of abstractions of computations).
This is an important direction because it can address one of my key uncertainties: are there any non-obvious useful abstractions that we should be using? Or are modularity and interpreting subcircuits the only relevant ones? Having more examples should also help with other open questions (for example, the next one).
Do behavioral approximations “factor through” mechanistic ones?
As discussed in Why mechanistic abstractions?, my hypothesis for why we care about mechanistic abstractions/interpretability is: The most tractable way to get robust approximations of input-output behavior for complicated computations is to approximate their internal structure.
A conjecture related to this is:
The idea is that there is a connection between minimal proof length and “mechanistic similarity”: if two circuits are equal at a low level of (mechanistic) abstraction, it ought to be easier to prove they have the same input-output behavior, and perhaps vice versa.
This is of course very informal, and the most important part of answering this question may be getting to the point where we can formalize the conjecture, rather than actually (dis)proving it. It also wouldn’t surprise me if (any reasonable formalization of) the conjecture is just false (though I would guess it is often true for circuits/programs we encounter in practice, so it might still be true with additional assumptions).
This feels similar in spirit to the conjecture that “everything happens for a reason” (from the Heuristic arguments paper, appendix C). If I give you two boolean circuits that have exactly the same input-output behavior, should you expect that there is some structure in the circuits themselves that compactly explains this? If so, that explanation might be effectively saying “these two circuits are identical under the following abstraction”. I haven’t thought about whether there is a strong formal connection here or not.
Test the framework on more examples
This is essentially dual to Characterizing abstractions for specific computational models: look for ways in which we intuitively want to abstract computations, then see if the framework can capture them or not. (If not, potentially adapt the framework.)
Do unique minimal abstractions always exist?
ETA: The answer to this question is "yes", see this shortform. I don't think this is important progress on the agenda (see the shortform), just want to make sure no one wastes time on this subproblem.
We can define a partial ordering on abstractions as follows: let Ai:M→Mi, for i=1,2 be abstractions, where M is the set of memory states, and Mi some set of abstract states. We say that A1 is more fine-grained (i.e. lower-level/more informative) than A2, written A1≥A2, if there is some map h:M1→M2 such that A2=h∘A1. In words, M1 contains all the information we need to compute M2.
We might want to use this ordering as a definition of “minimal” abstractions, which was one desideratum in What makes some abstractions better than others?. In other words, out of all the abstractions that satisfy the consistency condition, and that contain a certain piece of information about the output of the computation, we might want to pick the minimal one according to this ordering. However, since this is only a partial order, it’s not obvious that a unique minimal abstraction always exists, even if everything is finite. I have some preliminary results that suggest these unique minimal abstractions do in fact exist, but it’s for a slightly different setting and not yet fully worked out.
Even if this was successful, it probably wouldn’t be the definition of “good abstraction” we ultimately want to use. It effectively puts infinite weight on the consistency condition and on representing some information, which in practice might often lead to using the full computation as the “abstraction”. Still, it could be a step in the right direction.
Incorporate temporal abstractions and approximate consistency
As mentioned above, we might want to collapse multiple computational steps in the full model to a single step in the abstraction. The current framework doesn’t allow this. I think it should likely be extended (or, if that’s not possible, the entire framework might need to be re-assessed). Another important step is to incorporate approximate consistency, i.e. only demand that the commutative diagram commutes approximately. This will mean additional extensions of the framework, likely making abstract computational steps stochastic as well as requiring an input distribution.
Work out connections to existing work more precisely
Appendix: Related work mentions many existing notions of abstractions, many of which I suspect should fit into the framework I’ve described (or suitable modifications thereof). Working this out in more detail could be a good test, similar to Test the framework on more examples.
On a more conceptual level, I still feel confused about some aspects of how this agenda and the specific framework relate to other alignment agendas, such as John Wentworth’s work on (natural) abstractions and ARC’s mechanistic anomaly detection and heuristic arguments agenda. I think it should be possible to draw much crisper connections than I do in this post.
Final words
This agenda is in an early stage (I’ve been thinking about it for ~2 months). So if you have any feedback, either on the entire approach or on details, I highly appreciate it!
Also, if you’re working on similar things, or might want to start working on similar things, please feel free to reach out! (Just to chat, or to potentially collaborate.) LessWrong DM or erik@ejenner.com for example.
Appendix: Related work
In this appendix, I’ll give a brief overview of existing research I’ve looked at. The AI alignment section below is reasonably complete, I think (though with many open questions). You should consider all the other sections a preliminary list of potentially relevant fields, rather than a full literature review.
AI alignment
John Wentworth’s work on (natural) abstractions
John Wentworth has written a lot about abstractions. (And even more.) Going through all his ideas and trying to connect them with this agenda would be a significant project in itself (though one potentially worth doing). For now, I’ll just give my most important thoughts without too much justification:
Mechanistic interpretability
Mechanistic interpretability broadly construed (e.g. as in Daniel Filan's early post) seems very similar to looking for good abstractions of neural networks, except with the stronger focus of aiming for abstractions that are human-understandable. This is related to one of the limitations of my framework: I don’t think as much about how to encode things in an easy to understand way, whereas this is an important aspect of interpretability (in addition to reducing the size of the thing that needs to be understood).
In current practice, my sense is that a lot of mechanistic interpretability can be seen as finding subcircuits of a neural network that describe some small part of its behavior (see my Subsets and quotients in interpretability post). Focusing on these subcircuits throws away a lot of other information about the computational graph. For lots of subcircuits, this would leave you with basically no predictive power. Subcircuits like the indirect object identification one are interesting because they still let you make some predictions about behavior without knowing about the rest of the network, so they’re good ways of throwing away parts of the computational graph, i.e. good abstractions. Compared to this type of “circuit-style” mechanistic interpretability research, abstractions of computations differ mainly by aiming to use more general types of abstraction.
One other post I want to highlight is Daniel Filan’s Verification and Transparency. He points out how both verification and transparency serve similar purposes and how some things (such as type annotations in programming languages) can be seen as instances of both. My current guess is that abstractions of computations are the more general view that connects the two: they make things more transparent (because they are simpler) and they help with verification (again because they are simpler, as hypothesized in Do extensional approximations “factor through” mechanistic ones?).
Modularity/Clusterability of neural networks
Daniel Filan and others have worked on clusterability of neural networks, i.e. dividing the neurons in a network into natural clusters or submodules using graph cuts. There also seems to be newer work on NN modularity that I’m not familiar with yet. This survey has a brief overview. Additionally, there are several posts by the 2022 AI safety camp modularity team and more posts by 2022 SERI MATS scholars exploring how to define and look for modularity in neural networks. Thane Ruthenis has argued for focusing on interfaces between certain submodules, such as world models and search/planning modules.
As discussed in the Q&A, modularity is a key example of the type of abstraction I’m interested in, but I’m hoping to also study a broader class of abstractions.
Causal scrubbing
In the Mechanistic interpretability section, I talked informally about the “predictive power” of a subcircuit. Causal scrubbing formalizes this by defining whether an interpretation of a subcircuit “explains” a given behavior. So I think of this as formalizing one desideratum for a specific type of “good abstractions” of computation. (Other desiderata are still informal; for example, we might want the proposed interpretation to be as “simple” as possible.) I tentatively think it should be possible to model causal scrubbing as a commutativity condition in my framework but haven’t worked it out yet (doing so is on my agenda).
An important similarity between causal scrubbing and this agenda is that both are aiming at automating interpretability research. (In my case, I’m less firm on the exact end game, but this is at least one major motivation.) They also both seem like candidates for formalizing when mechanisms explain a behavior for mechanistic anomaly detection (or at least I think abstractions seem potentially applicable to that).
ARC’s mechanistic anomaly detection and heuristic arguments agenda
ARC wants to develop a method for mechanistic anomaly detection: say we have a function f and an input distribution D (the training distribution in an ML context). Given some new input to f, we would like to say whether the output of f is due to the “same reasons” that explain its behavior on D, or due to some other “mechanism” that wasn’t necessary to explain the behavior on D. (See their post for an explanation of how this relates to ELK and other alignment problems.)
Developing an algorithm for mechanistic anomaly detection first requires clarifying what we mean by “mechanisms” “explaining” model behavior. ARC’s approach is trying to formalize heuristic arguments and then use those as the definition of “explanations”.
There are two reasons to think that abstractions of computations could be relevant for this agenda (and vice versa):
Abstractions of causal models
Starting with Rubenstein et al. (2017), there have been a few papers on when a structural causal model (SCM) is an abstraction of another SCM. Follow-up papers are Beckers & Halpern (2019a), Beckers & Halpern (2019b) (which introduces approximate abstraction), Rischel & Weichwald (2021), and Otsuka & Saigo (2022) (the last two use a category theoretical framework). Zennaro (2022) reviews these different frameworks. John Wentworth has also written a summary/review of Beckers & Halpern.
I haven’t studied all of these in detail, but as far as I can tell, the approach behind all of them is the same type of commutative diagram that I’m using (and that occurs elsewhere, e.g. in Bisimulation). The idea that’s specific to causal abstraction is that the horizontal arrows are causal interventions. The same type of diagram is also used in a recent paper by Taco Cohen to define causal models of an underlying true causal structure (he refers to this as a “naturality” condition, but doesn’t make explicit any category theoretical interpretation). In that paper, there are two horizontal arrows, one a causal intervention and one some other process that turns states into “outcomes”.
Building on these causality papers, there is also a line of work from Stanford that applies causal abstractions to neural networks. The idea is: we can think of the computational graph of a neural network as a causal model. Given some hypothesis for a higher-level causal model that we think the network might be implementing, and a specification for how the nodes of that higher-level model map into the network, we can test interventional consistency (i.e. check whether the high-level hypothesis is indeed a causal abstraction of the network). This is very reminiscent of Causal scrubbing.
Overall, causal abstractions seem highly relevant for abstractions of computations, given the connection between causal models and computational graphs.
Artificial intelligence
Abstractions in general are relevant all over the place in AI, but I’ll focus on two directions that seem potentially related to abstractions of computations specifically.
Reinforcement learning
Reinforcement learning is interesting because it combines both state abstractions and temporal abstractions (i.e. high-level descriptions of state space and high-level descriptions of actions to take). As mentioned in Limitations of this framework, it seems that my current framework is missing temporal abstraction, and RL could provide inspiration on how to combine this with (memory) state abstraction. See for example David Abel's thesis and its Background section (note that it uses the name action abstractions instead of temporal abstractions).
Logical reasoning and theorem proving
There is a 1992 paper simply called A theory of abstraction, which introduces a framework for abstractions of formal systems. Section 8 has an extensive list of examples from earlier work described in this framework, lots of it on theorem proving but also other areas of AI. One reason why abstractions for theorem proving may be interesting is the connection to heuristic arguments and mechanistic anomaly detection I discussed above.
Computer Science
Abstractions of computations are obviously a big topic in CS. It’s clear there are tons of examples of abstraction to be found, Wikipedia has an overview page. Less clear to me is to what extent people have explored what abstractions of computations are in general.
Below, I’ve collected the subfields that seem most relevant.
Programming languages
As mentioned earlier, programming languages are abstractions of lower-level computational models, such as machine code, but this is just a source of examples rather than a definition of abstractions in general. Programming languages also provide methods for creating computations with obvious abstractions. For example, functions let you decompose a program into submodules, and a very obvious abstraction is to simply specify what some of these functions do instead of how they work. Even this seems mostly just like a source of examples though. Individual programming languages will have less to say on the subject of what kind of thing abstractions are in general.[7] However, the field of programming language design might have things to say about that, since it is essentially in the business of finding good abstractions of computational models.
Formal specification
One important type of abstraction I discussed is to “black box” parts of a computation, i.e. forget how that part is computed and just describe what the input-output behavior is. One question is then: how should the input-output behavior be represented? We probably don’t want to actually use a giant lookup table (and we shouldn’t have to if the program implements some conceptually simple behavior). Formal specification could hold an answer to that subproblem.
Abstract interpretation (and other uses of abstraction for model checking)
Based on a quick glance, this seems highly relevant, but I haven’t spent much time looking into it yet (in particular, whether the formalism there is promising as a general definition of abstractions).
Bisimulation
Bisimulation is a notion of intensional/mechanistic equivalence between transition systems, i.e. it describes when two transition systems should be considered “the same” in terms of their computational structure. Probably even more importantly for this agenda, there is also the one-sided version of this, one system simulating another one (but not necessarily vice versa). This seems like a good candidate for abstractions specifically of transition systems.
People have studied generalizations of (bi-)simulation using category theory (“coalgebraic bisimulation”) and this was in fact what led me to think about abstractions using (co)algebras.
Plagiarism detectors
If you want to tell whether a piece of software plagiarizes another one, you probably don’t just want to check whether their source code is literally the same. Instead, you’re more interested in whether they’re equivalent at some higher level of abstraction. So I’d hoped that plagiarism detectors for code might contain some interesting ideas for defining “mechanistic similarity” of programs (h/t to Stuart Russell and Shreyas Kapur for independently making that suggestion). Unfortunately this doesn’t look too promising based on a cursory exploration: it seems plagiarism detectors often focus on efficiently checking whether two programs have subsnippets that match exactly. They do often “normalize” source code by removing whitespace etc., but that happens in an ad-hoc way that differs from method to method.
Philosophy
I assume that philosophers have thought, if not about exactly my question, then at least related ones. However, this is one of the many things I haven’t gotten around to looking into yet. I expect this might be less helpful than more mathematical approaches like the ones in causal abstraction, but still worth checking.
Appendix: Examples of abstractions of computations
In this appendix, we’ll look at some examples of abstractions within the framework outlined above. Most of them are in circuits, though we also briefly discuss Turing machines. This part may be a bit harder to follow than the rest of the post since I’ll only give rough descriptions of how these examples work.
Circuits
Circuits are directed acyclic graphs where each node is equipped with …
To turn this into a concrete model of how computations are performed step by step, we also choose a topological ordering of the circuit.[8] The state/memory of the computation assigns to each node either an output value or an “undefined” label. Initially, all nodes except inputs are undefined. Then we go over nodes in the chosen topological order and determine one new value in each computational step.
Black box
Let’s start by considering the trivial “black box” abstraction that exists for any computation (not just circuits). This abstraction takes a memory state and maps it to the final output of the circuit (by running through the entire remaining computation). The abstract computational step (f′ in the consistency diagram), is simply the identity function. This makes the consistency diagram commute perfectly.
Quotienting output space
We can throw away even more information. Namely, instead of mapping memory states to the output, we can map them to some equivalence class of outputs. For a binary circuit, we may be interested e.g. in only the parity of all the output bits. In an arithmetic circuit, perhaps we only want the output up to some precision. Again, this abstraction is applicable to any computational model, not just circuits.[9]
Subcircuits
Next, let’s start keeping some information about the actual circuit, instead of treating it as a black box. One key class of examples are “subcircuit” abstractions. This simply means we keep the values for some of the nodes while forgetting all the other ones. To perform computational steps within our abstraction, we might sometimes need input values we don’t have (if there are nodes we represent in the abstraction but which have parents we discarded). There are different choices we could make here, for example just setting assuming these inputs are zero (or some other default value that makes sense in the context of the types of values the circuit works with). This is essentially the same as the choice of ablations in mechanistic interpretability. It seems likely that resampling ablations (as used by causal scrubbing) are generally a better idea than e.g. zero ablations. This requires us to have access to some input distribution, but that seems necessary anyway in order to talk about approximate consistency, as discussed in the main post.
Quotienting the circuit
Another important type of abstraction are quotients of circuits, which correspond to division into submodules. I think of this as black-boxing not the entire circuits, but only certain subcircuits. This abstraction is defined by a partition of the set of nodes (as opposed to specifying one subset of the set of nodes, as in the subcircuit abstraction we just discussed). The intuitive version of what we want to do is the following: we construct a new circuit by “quotienting out” the partition, i.e. turning each equivalence class of nodes into a single node, which implements the function previously implemented by the entire equivalence class. Note that this does put some restrictions on what the partition can be (because we need this quotient circuit to still be acyclic). To formalize this within the current framework, we need to decide how to map memory states of the full circuit to memory states for the quotient circuit. One simple way is to map an equivalence class in which every node is computed to the appropriate output values, and to map any equivalence class with any “undefined” values to “undefined”. To be able to mirror computational steps in the quotient circuit, we also need to add a memory field in the abstraction that tells us whether each equivalence class is about to be “completed” (e.g. just store the step of the topological sort we’re currently at—this can just be increased by 1 in each abstract computational step). That seems a bit inelegant and could be handled more sensibly with a more general framework that allows temporal abstractions.
Associative trees
I’ll finish with a somewhat less central example for some variety. Say I have a boolean circuit that computes the AND of n inputs. A simple way to do this would be a binary tree of AND gates. But you could also imagine other implementations that compute the same function using only NOT and OR. So a reasonable abstraction is “a tree using only AND gates”, which tells us something about the algorithm but ignores the details of how the tree is structured (e.g. is it a tree with depth logn, or one with depth n?)
One reason I’m interested in this example is that I initially thought it would be hard to describe in my current framework. But here’s an approach that I think should work: given the memory state of the circuit, i.e. node values for some prefix of the topological order, we can determine the current “frontier” of nodes, i.e. all the nodes we still need to compute future values. Our abstraction will be to take the AND of all the values on this frontier. For a tree of AND gates, this is always the same as the output of the circuit, so we can use the identity as the computational step within this abstraction. Also note that the abstraction will be the same for any tree of AND gates. On the other hand, assume there is a NOT gate at any point in the circuit. Then for some inputs, the AND along the “frontier” will flip once we reach that NOT gate. The abstraction is thus different from the pure AND tree (and also no longer satisfies the consistency condition).
The same approach can be applied to gates other than AND of course, as long as they’re associative (which ensures that all trees consisting only of that gate type are extensionally equal).
Turing machines
Consider a Universal Turing machine U, i.e. a Turing machine that takes as input an encoding of some Turing machine M and an input x to M, and then outputs what M would output on x. If we fix a Turing machine M and “partially apply” U to M, we get a Turing machine that’s extensionally equal to M (except that its input/output may be encoded in a different way). This Turing machine essentially simulates M within U. Intuitively, M itself is a very natural abstraction of this more complicated computation.
Typical constructions of universal Turing machines encode the state and tape of the simulated machine M in a pretty straightforward way (see e.g. this one). That means it’s easy to map from the state & tape of U to the corresponding state & tape of M, which gives us our abstraction. To mirror computational steps, we essentially just want to do computations in M as usual.
There’s one obstacle here, which we already discussed in Limitations of this framework: multiple computational steps in U correspond to a single step in M. I’m not sure whether there’s a hack to work around this, like there was for Quotienting the circuit. In any case, this is a strong hint that temporal abstraction should be part of the framework.
One other open question is to what extent this works for arbitrary universal Turing machines. While I would guess that any explicit construction of UTMs encodes the state of the simulated machine, it's not obvious to me that this is necessary. You could for example imagine a universal Turing machine that first transforms the simulated machine into some different, extensionally equal one, and then simulates this transformed machine. Even then, the transformed machine might have to be mechanistically quite similar to the original one, so we would still get a good abstraction at some higher level. I’m not sure if this can ever break down completely. (If it does, that’s not necessarily a counterexample to the framework—it seems quite good for this framework to be able to detect whether a universal Turing machine works the “normal” way or does something crazy.)
We have to either restrict ourselves to computations that terminate on all inputs, or have a default output for computations that don’t terminate.
I think there probably isn’t a clean division into “natural” and “not natural”, so maybe John and I aren’t quite on the same page here—I’m not sure what exactly his take on this is.
For a closely related point, see my Too much structure post.
This seems likely in the case of search processes. Goal-directedness is less clear, you could go for purely behavioral definitions as well.
A weaker and easier to formalize version would be: in any such case, there exists a proof of similar length that shows equality up to an abstraction (but which can be different from the original proof).
Though I do also have a fairly strong intuition that there is something input-independent we can say—the presence of a tree detector is extremely unlikely in a random network, and seeing one is a strong hint that there were trees in the training distribution. So I shouldn’t need to tell you what the input distribution is for you to say “this is an interesting subcomponent of the network”.
An exception might be if some languages are specifically designed to make it easy to build new abstractions and work with them?
Alternatively, we could compute values in “maximal layers”, I haven’t checked yet whether that changes the picture much.
Things seem slightly messier if we want to only represent the circuit as a black box on a subset of inputs, i.e. “forget” the behavior on other inputs. This is possible under the computational model where we keep around old node values. But if we introduce “garbage collection” and throw away inputs once they’re no longer needed, it might be impossible to determine whether a memory state came from one of the inputs we’re interested in or not. I’m not entirely sure how to interpret this: it could mean that this abstraction is in fact somewhat weird from a mechanistic perspective, or that this framework isn’t as clean as one might hope, and the precise computational model matters a lot. Perhaps a better approach is to handle subsets of inputs by restricting the input distribution that we’ll probably want to introduce at some point.