In a previous post, I described my current alignment research agenda, formalizing abstractions of computations. One among several open questions I listed was whether unique minimal abstractions always exist. It turns out that (within the context of my current framework), the answer is yes.

I had a complete post on this written up (which I've copied below), but it turns out that the result is completely trivial if we make a fairly harmless assumption: The information we want the abstraction to contain is only a function of the output of the computation, not of memory states. I.e. we only intrinsically care about the output.

Say we are looking for the minimal abstraction that lets us compute g(C(x)), where C is the computation we want to abstract, x the input, and g an arbitrary function that describes which aspects of the output our abstractions should predict. Note that we can construct a map that takes any intermediate memory state and finishes the computation. By composing with g, we get a map that computes g(C(x)) from any memory state induced by x. This will be our abstraction. We can get a commutative diagram by simply using the identity as the abstract computational step. This abstraction is also minimal by construction: any other abstraction from which we can determine g(C(x)) must (tautologically) also determine our abstraction.
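For concreteness, here is a minimal sketch of this trivial construction in Python, on a toy computation that sums the numbers 1 through x; all names here are illustrative, not from the original post.

```python
# Illustrative toy computation: memory states are (acc, n) pairs and the
# computation sums 1..x. None of these names are from the post.

def f(m):                 # concrete transition step
    acc, n = m
    return (acc + n, n - 1)

def tau(m):               # termination predicate
    return m[1] == 0

def g(output):            # the aspect of the output we care about:
    return output % 10    # here, its last digit

def abstraction(m):
    """The trivial minimal abstraction: finish the computation, apply g."""
    while not tau(m):
        m = f(m)
    return g(m[0])

def abstract_step(a):     # the abstract computational step: the identity
    return a

# The diagram commutes: abstracting and then taking an abstract step
# agrees with taking a concrete step and then abstracting.
m = (0, 7)                # the state reached from input x = 7
assert abstract_step(abstraction(m)) == abstraction(f(m))
```

The identity abstract step is what makes this abstraction degenerate: all of the work happens inside the abstraction mapping itself.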

This also shows that minimality in this particular sense isn't enough for a good abstraction: the abstraction we constructed here is not at all "mechanistic", it just finishes the entire computation inside the abstraction. So I think what we need to do instead is demand that the abstraction mapping (from full memory states to abstract memory states) is simple (in terms of descriptive or computational complexity).

Below is the draft of a much longer post I was going to write before noticing this trivial proof. It works even if the information we want to represent depends directly on the memory state instead of just the output. But to be clear, I don't think that generalization is all that important, and I probably wouldn't have bothered writing it down if I had noticed the trivial case first.

This post is fully self-contained in terms of the math, but doesn't discuss examples or motivation much; see the agenda intro post for that. I expect this post will only be useful to readers who are very interested in my agenda or working on closely connected topics.

Setup

Computations

We define a computation exactly as in the earlier post: it consists of

a set M of memory states,

a transition function f:M→M,

a termination function τ:M→{True,False},

a set I of input values with an input function i:I→M,

and a set O of output values with an output function o:M→O.

While we won't even need that part in this post, let's talk about how to "execute" computations for completeness' sake. A computation C=(M,f,τ,i,o) induces a function C∗:I→O as follows:

Given an input x∈I, apply the input function i to obtain the first memory state m0:=i(x)∈M.

While τ(mt) is False, i.e. the computation hasn't terminated yet, execute a computational step, i.e. let mt+1:=f(mt).

Output C∗(x):=o(mt) once τ(mt) is True.
For simplicity, we assume that τ(mt) is True for some finite timestep t, no matter what the input x is, i.e. the computation always terminates. (Without this assumption, C∗ would only be a partial function.)
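As a sanity check, the definitions above can be sketched directly in Python, with a toy computation that sums 1 through x (the names are illustrative, not from the post):

```python
# Illustrative sketch of a computation C = (M, f, tau, i, o): memory
# states are (acc, n) pairs and C* computes the sum 1 + 2 + ... + x.

def f(m):        # transition function f : M -> M
    acc, n = m
    return (acc + n, n - 1)

def tau(m):      # termination function tau : M -> {True, False}
    return m[1] == 0

def i(x):        # input function i : I -> M
    return (0, x)

def o(m):        # output function o : M -> O
    return m[0]

def run(x):
    """The induced function C* : I -> O (assumes termination)."""
    m = i(x)             # first memory state m_0 = i(x)
    while not tau(m):    # step until tau says we are done
        m = f(m)
    return o(m)
```

For example, run(4) walks through the states (0, 4), (4, 3), (7, 2), (9, 1), (10, 0) and outputs 10.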

Abstractions

An abstraction of a computation C=(M,f,τ,i,o) is an equivalence relation ∼ on the set of memory states M. The intended interpretation is that this abstraction collapses all equivalent memory states into one "abstract memory state". Different equivalence relations correspond to throwing away different parts of the information in the memory state: we retain the information about which equivalence class the memory state is in, but "forget" the specific memory state within this equivalence class.

As an aside: In the original post, I instead defined an abstraction as a function A:M→M′ for some set M′. These are two essentially equivalent perspectives and I make regular use of both. For a function A, the associated equivalence relation is m1∼Am2 if and only if A(m1)=A(m2). Conversely, an equivalence relation ∼ can be interpreted as the quotient function M→M/∼, where M/∼ is the set of equivalence classes under ∼. I often find the function view from the original post intuitive, but its drawback is that many different functions are essentially "the same abstraction" in the sense that they lead to the same equivalence relation. This makes the equivalence relation definition better for most formal statements like in this post, because there is a well-defined set of all abstractions of a computation. (In contrast, there is no set of all functions with domain M).

An important construction for later is that we can "pull back" an abstraction along the transition function f: define the equivalence relation ∼(f) by letting m1∼(f)m2 if and only if f(m1)∼f(m2). Intuitively, ∼(f) is the abstraction of the next timestep.
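On a finite state set, the pullback construction is easy to sketch, representing an abstraction as a labeling function (two states are equivalent iff they receive the same label); the example transition function and names here are illustrative:

```python
# Illustrative sketch: abstractions of a finite state set as labeling
# functions, and the pullback of an abstraction along f.

M = range(8)

def f(m):             # transition function on M = {0, ..., 7}
    return (m + 1) % 8

def halves(m):        # an abstraction: which half of the cycle are we in?
    return m // 4

def pullback(label, f):
    """The abstraction ~^(f): m1 ~^(f) m2 iff f(m1) ~ f(m2)."""
    return lambda m: label(f(m))

def classes(label, states):
    """The equivalence classes induced by a labeling function."""
    groups = {}
    for m in states:
        groups.setdefault(label(m), []).append(m)
    return sorted(tuple(g) for g in groups.values())
```

Here classes(halves, M) gives [(0, 1, 2, 3), (4, 5, 6, 7)], while classes(pullback(halves, f), M) gives [(0, 1, 2, 7), (3, 4, 5, 6)]: the pullback groups states by where their successor lands.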

The second ingredient we need is a partial ordering on abstractions: we say that ∼1≤∼2 if and only if m∼2m′⟹m∼1m′. Intuitively, this means ∼1 is at least as "coarse-grained" as ∼2, i.e. ∼1 contains at most as much information as ∼2. This partial ordering is exactly the transpose of the ordering by refinement of the equivalence relations (or partitions) on M. It is well-known that this partially ordered set is a complete lattice. In our language, this means that any set of abstractions has a (unique) supremum and an infimum, i.e. a least upper bound and a greatest lower bound.
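For finite M, both lattice operations can be made concrete. Under the ordering above, the supremum of a family of abstractions is their common refinement, and the infimum merges classes (via a union-find) until every labeling in the family is a coarsening. This is an illustrative sketch, not from the post:

```python
# Illustrative sketch of sup and inf for abstractions of a finite set,
# with abstractions given as labeling functions.

def sup(labels):
    """Least upper bound: equivalent iff equivalent under ALL labelings."""
    return lambda m: tuple(lab(m) for lab in labels)

def inf(labels, states):
    """Greatest lower bound: transitive closure of 'equivalent under SOME labeling'."""
    parent = {m: m for m in states}
    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path compression
            m = parent[m]
        return m
    for lab in labels:
        rep = {}  # first state seen with each label
        for m in states:
            k = lab(m)
            if k in rep:
                parent[find(m)] = find(rep[k])  # merge the two classes
            else:
                rep[k] = m
    return find
```

For example, on {0, ..., 5} the supremum of "m mod 2" and "m mod 3" separates all six states (it recovers m mod 6), while their infimum collapses everything into a single class.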

One potential source of confusion is that I've defined the ≤ relation exactly the other way around compared to the usual refinement of partitions. The reason is that I want a "minimal abstraction" to be one that contains the least amount of information rather than calling this a "maximal abstraction". I hope this is the less confusing of two bad choices.

We say that an abstraction ∼ is complete if ∼(f)≤∼. Intuitively, this means that the abstraction contains all the information necessary to compute the "next abstract memory state". In other words, given only the abstraction of a state mt, we can compute the abstraction of the next state mt+1=f(mt). This corresponds to the commutative diagram/consistency condition from my earlier post, just phrased in the equivalence relation view.
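Concretely, completeness says that ∼ is a congruence for f, which is straightforward to check on a finite state set (illustrative sketch):

```python
# Illustrative sketch: an abstraction (as a labeling function) is
# complete iff equivalent states always have equivalent successors.

from itertools import combinations

def is_complete(label, f, states):
    """Check ~^(f) <= ~ : whenever m1 ~ m2, also f(m1) ~ f(m2)."""
    return all(label(f(m1)) == label(f(m2))
               for m1, m2 in combinations(states, 2)
               if label(m1) == label(m2))
```

On the cycle f(m) = (m + 1) % 8, the parity abstraction m % 2 is complete (successors of equal-parity states have equal parity), but m % 3 is not: the states 1 and 7 get the same label while their successors 2 and 0 do not.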

Minimal complete abstractions

As a quick recap from the earlier agenda post, "good abstractions" should be complete. But there are two trivial complete abstractions: on the one hand, we can just not abstract at all, using equality as the equivalence relation, i.e. retain all information. On the other hand, we can throw away all information by letting m∼m′ for any memory states m,m′∈M.

To avoid the abstraction that keeps around all information, we can demand that we want an abstraction that is minimal according to the partial order defined in the previous section. To avoid the abstraction that throws away all information, let us assume there is some information we intrinsically care about. We can represent this information as an abstraction ∼0 itself. We then want our abstraction to contain at least the information contained in ∼0, i.e. ∼0≤∼.

As a prototypical example, perhaps there is some aspect of the output we care about (e.g. we want to be able to predict the most significant digit). Think of this as an equivalence relation ∼out on the set O of outputs. Then we can "pull back" this abstraction along the output function o to get ∼0, defined by m1 ∼0 m2 if and only if o(m1) ∼out o(m2).

So given these desiderata, we want the minimal abstraction among all complete abstractions ∼ with ∼0≤∼. However, it's not immediately obvious that such a minimal complete abstraction exists. We can take the infimum of all complete abstractions with ∼0≤∼ of course (since abstractions form a complete lattice). But it's not clear that this infimum is itself also complete!

Fortunately, it turns out that the infimum is indeed complete, i.e. minimal complete abstractions exist:

Theorem: Let ∼0 be any abstraction (i.e. equivalence relation) on M and let S:={equivalence relation ∼ on M∣∼0≤∼,∼(f)≤∼} be the set of complete abstractions at least as informative as ∼0. Then S is a complete lattice under ≤; in particular, it has a least element.

The proof is very easy if we use the Knaster-Tarski fixed-point theorem. First, we define the completion operator on abstractions: completion(∼):=sup{∼,∼(f)}. Intuitively, completion(∼) is the minimal abstraction that contains both the information in ∼ and the information in the next abstract state under ∼.

Note that ∼ is complete, i.e. ∼(f)≤∼, if and only if completion(∼)=∼, i.e. if ∼ is a fixed point of the completion operator. Furthermore, note that the completion operator is monotonic: if ∼1≤∼2, then ∼1(f)≤∼2(f), and so completion(∼1)≤completion(∼2).

Now define L:={equivalence relation ∼ on M∣∼0≤∼}, i.e. the set of all abstractions at least as informative as ∼0. Note that L is also a complete lattice, just like the full lattice of abstractions on M: for any non-empty subset A⊆L, we have ∼0≤supA, so supA∈L; sup∅ in L is simply ∼0. Similarly, ∼0 is a lower bound on A, so ∼0≤infA and infA∈L. Finally, observe that we can restrict the completion operator to a function L→L: if ∼∈L, then ∼0≤∼≤completion(∼), so completion(∼)∈L.

That means we can apply the Knaster-Tarski theorem to the completion operator restricted to L. The consequence is that the set of fixed points in L is itself a complete lattice. But this set of fixed points is exactly S! So S is a complete lattice as claimed, and in particular it has a least element. This is exactly the minimal complete abstraction that's at least as informative as ∼0.
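On a finite state set, this least fixed point can also be computed directly: iterate the completion operator starting from ∼0, refining the partition by each state's successor class until nothing changes. This is only an illustrative sketch (the theorem itself needs no finiteness assumption):

```python
# Illustrative sketch: compute the minimal complete abstraction above ~0
# on a FINITE state set by iterating the completion operator. On a
# finite set the iteration reaches the least fixed point.

def same_partition(a, b, states):
    """Do two labelings induce the same equivalence relation?"""
    return all((a[m1] == a[m2]) == (b[m1] == b[m2])
               for m1 in states for m2 in states)

def minimal_complete(label0, f, states):
    """Labeling for the minimal complete abstraction at least as informative as ~0."""
    labels = {m: label0(m) for m in states}
    while True:
        # completion(~) = sup{~, ~^(f)}: keep the current class AND the
        # class of the successor, then renumber the resulting classes.
        canon = {}
        new = {}
        for m in states:
            key = (labels[m], labels[f(m)])
            new[m] = canon.setdefault(key, len(canon))
        if same_partition(labels, new, states):
            return new
        labels = new
```

For instance, on the 8-cycle f(m) = (m + 1) % 8, asking only to distinguish state 0 forces full information (completeness keeps splitting classes all the way around the cycle), whereas the parity abstraction under f(m) = (m + 2) % 8 is already complete and comes back unchanged.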

Takeaways

What we've shown is the following: if we want an abstraction of a computation which

is complete, i.e. gives us the commutative consistency diagram,

contains (at least) some specific piece of information,

and is minimal given these constraints,

then there always exists a unique such abstraction.

In the setting where we want exact completeness/consistency, this is quite a nice result! One reason I think it ultimately isn't that important is that we'll usually be happy with approximate consistency. In that case, the conditions above are better thought of as competing objectives than as hard constraints. Still, it's nice to know that things work out neatly in the exact case.

It should be possible to do a lot of this post much more generally than for abstractions of computations, for example in the (co)algebra framework for abstractions that I recently wrote about. In brief, we can define equivalence relations on objects in an arbitrary category as certain equivalence classes of morphisms. If the category is "nice enough" (specifically, if there's a set of all equivalence relations and if arbitrary products exist), then we get a complete lattice again. I currently don't have any use for this more general version though.

I still don't get the goal of what you are trying to do (the puzzle this work should clarify), which I feel like I should've. As a shot in the dark, maybe abstract interpretation^{[1]} in general and abstracted abstract machines^{[2]} in particular might be useful for something here.

ND Jones, F Nielson (1994) Abstract Interpretation: A Semantics-Based Tool for Program Analysis ↩︎

D Van Horn, M Might (2011) Abstracting Abstract Machines: A Systematic Approach to Higher-Order Program Analysis ↩︎

Thanks for the pointers, I hadn't seen the Abstracting Abstract Machines paper before.

If you mean you specifically don't get the goal of minimal abstractions under this partial order: I'm much less convinced they're useful for anything than I used to be, currently not sure.

If you mean you don't get the goal of the entire agenda, as described in the earlier agenda post: I'm currently mostly thinking about mechanistic anomaly detection. Maybe it's not legible right now how that would work using abstractions, I'll write up more on that once I have some experimental results (or maybe earlier). (But happy to answer specific questions in the meantime.)

I meant the general agenda. For abstract interpretation, I think the relevant point is that quotienting a state space is not necessarily a good way of expressing abstractions about it, for some sense of "abstraction" (the main thing I don't understand is the reasons for your choice of what to consider abstraction). Many things want a set of subspaces (like a topology, or a logic of propositions) instead of a partition, so that a point of the space doesn't admit a unique "abstracted value" (as in equivalence class it belongs to), but instead has many "abstracted values" of different levels of precision simultaneously (the neighborhood filter of a point, or the theory of a model).

With abstract interpretation, there can be even more distinctions, the same concrete point can be the image of multiple different abstract points, which are only distinct in the abstract space, not in terms of their concretization in the concrete space.

Another way of approaching this is to ask for partial models of a phenomenon instead of models that fully specify it. This is natural for computations, which can be described by operators that gradually build a behavior out of its partial versions. But in other contexts, a space of partial models can also be natural, with some models specifying more details than other models, and a process of characterizing some object of study could be described by an operator/dynamic that discovers more details, moves from less detailed models to more detailed ones.

Gradient hacking is usually discussed in the context of deceptive alignment. This is probably where it has the largest relevance to AI safety, but if we want to better understand gradient hacking, it could be useful to take a broader perspective and study it on its own (even if, in the end, we only care about gradient hacking because of its inner alignment implications).
In the most general setting, gradient hacking could be seen as a way for the agent to "edit its source code", though probably only in a very limited way. I think it's an interesting question which kinds of edits are possible with gradient hacking, for example whether an agent could improve its capabilities this way.
