The natural abstraction hypothesis claims (in part) that a wide variety of agents will learn to use the same abstractions and concepts to reason about the world.
What's the simplest possible setting where we can state something like this formally? Under what conditions is it true?
One necessary condition to say that agents use 'the same abstractions' is that their decisions depend on the same coarse-grained information about the environment. Motivated by this, we consider a setting where an agent needs to decide which information to pass though an information bottleneck, and ask: when do different utility functions prefer to preserve the same information?
As we develop this setting, we:
unify and extend the Blackwell Informativeness and Good(er) Regulator theorems
investigate the minimum structure needed for strong natural abstractions (in this sense) to exist
give examples where strong natural abstractions a) exist b) don't exist c) exist at some levels of capability, but not at others - including cases where the natural abstractions at different sizes are orthogonal
demonstrate that (very) weak natural abstractions exist under weak conditions
formally connect natural abstractions to instrumental power-seeking incentives, and characterize them in a way inspired by Turner's power-seeking theorems.
(As far as I can tell, apart from the obvious cited Theorems, these results are original[1])
Whereas Wentworth's abstraction work has focused on what natural abstractions might look like, and why we might expect them to exist, this work focuses on exploring what it would mean for something to be a natural abstraction in the first place. Rather than an attempt to answer the natural abstraction hypothesis, it is more an attempt to understand the question.
1. Base setting
Any agent exists within an environment that is bigger than itself. It therefore necessarily maps a large number of potential environment states to a smaller number of internal states. When it then chooses what action to take, this choice depends only on its internal state.
Roughly, we can consider this mapping from the environment to action-influencing internal states to be the agent's abstractions[2]. We want to understand when different utility functions will share preferences about how to do this mapping - under the natural abstraction hypothesis, in appropriate conditions, there should be certain abstractions that are convergent across most utility functions.
To model this, we'll consider a game where:
There is some initial context
An agent writes a message, conditional on this context
The agent takes an action, conditional on the message
The agent gets some utility that depends on the context and its action.
We'll keep everything finite to start - any infinities we want we should come by honestly, by taking limits. Our one concession to this is that we'll allow normal real-valued probabilities and utilities (though we'll see eventually how we could do without if we felt like it).
Formally, define: X: a set of context states
M: a set of signals A: a set of actions U: (X,A)→R, a utility function φ: X→ΔM, an encoder mapping states to a distribution of signals π: M→ΔA, a policy mapping signals to a distribution over actions
We allow φ and π to be stochastic (stochastic maps), but we'll also pay special attention to the case where they are deterministic (so effectively φ:X→M, π:M→A), because it is especially clean and factors apart some of the dynamics we're interested in.
Now, we are interested in preferences over φ assuming that π is chosen optimally. When are these preferences, and the optimal encoders, the same for different U? What happens as we vary |M|? For the time being, we can equally consider different U to be different agents or different tasks faced by the same agent. Later, we'll see some places where the two interpretations diverge.
As one might expect of such a simple setup, similar formalisms have been considered in many other contexts. Most directly, exactly this game is studied in the decision theory and economics literature - with partial overlap with our questions. There, the objects (φ,M) are commonly referred to as 'information structures' or 'Blackwell experiments'; we'll use the former term.
One other remark on notation: we'll often be interested in comparing different encoders φ,φ′, which may have different signal spaces Mφ,Mφ′. We'll write |φ| for ∣∣Mφ∣∣, the cardinality of the signal set.
1.1 Equivalence classes of encoders
One thing to note out of the gate: 'the same encoder' here shouldn't mean equality as functions. For our current purposes, we care only about their information content - so if φ,φ′ are the same except for a relabelling of M, we want to consider them identical (and they will certainly have the same attainable utility for any U).
Another way to say this: we want to quotient by Aut(M)=S|M|, i.e. the permutations of M, to get our real objects of interest. What are these objects?
In the deterministic case, they are the distinct possible sets of preimages {φ−1(m):m∈M} - which are just the |M|-part partitions of X.
In the general case, they are a more general partition-like object. Note that we can represent a stochastic map φ as an |M|×|X|column-stochastic matrix: a matrix with elements in [0,1] such that the sum of each column is 1: ∑|M|i=1[φ]ij=1. Then [φ]ij gives the probability that we encode signal i given we are in state j - that is, pφ(mi|sj).
If we are indifferent to permutations of M, then we are indifferent to permutations of the rows, so we have a set (rather than vector) of row vectors. (If X were instead a suitable topological space and φ continuous, then these would be exactly partitions of unity)
Sometimes it will be more convenient to think of φ as a function, sometimes to think of it as a partition Pφ, and sometimes as a matrix; we'll switch between all three as appropriate. However, keep in mind we really only care about φ up to equivalence in this sense.
1.2 Comparison of encoders
For a given utility function U, we take the value of an information structure to be the expected utility if we have that information structure and follow the optimal policy.
Formally, define U(φ)=maxπEX,π(φ(X))[U(x,a)]=maxπ∑X,A[π∘φ]xaU(x,a); we can also say it is maxπU((X,(π∘φ)(X))), extending U to a function on distributions by taking the expectation. (By convention we're assuming a uniform prior on X; this choice turns out to be irrelevant for our purposes).
We say that U prefers φ to φ′ if U(φ)≥U(φ′), which we'll also write φ≥Uφ′.
2. Basic properties for individual U
As a first step, to get a feel for the setting, let's start with some observations about what can happen for arbitrary individual U. These give us one bound on common properties - for something happen for 'many' U, it must certainly happen for some particular U.
2.0 M should be a bottleneck
One basic observation is that the problem becomes trivial if |M|≥|X| or |M|≥|A|. At that point, we can always pick the globally optimal action for each value of X. For this setting to be remotely interesting, M needs to be an information bottleneck.
2.1a Any deterministic partition is optimal for some U
This is also easy to see: for a partition P={p1,...,pk}, take any {a1,...,ak}⊆A and define U(x,ai)=1(x∈pi). Furthermore, any optimal partition for this U must be a refinement of P, and P is the unique optimal partition with k parts.
2.1b For any U, any nondeterministic φ is weakly dominated by some deterministic φ′
It is not quite literally the case that no nondeterministic φ is optimal for its size for any U. If U is indifferent between where to put x in the optimal policy, say U(x,πopt(m1))=U(x,πopt(m2)), then x can be freely sent to any mixture of m1 and m2. But there is always a deterministic φ′ of less than or equal size which does at least as well, and φ′ either does strictly better or is strictly smaller unless U is indifferent between some outcomes (which may have measure zero, depending on your assumptions).
This motivates considering only partitions in the context of optimality for a single U (the remainder of section 2.)
2.2 Optimal partitions Pi,Pj can be orthogonal
Fixing U, write Pi for the optimal partition of size i. Then, there's not necessarily any relationship between Pi and Pj for i≠j.
For instance, take X=[1..3]×[1..4], A=[1,2]×[1..4] and
U(xi,∗,a1,i)=4U(x∗,i,a2,i)=5otherwise U(∗,∗)=0
Then the optimal partition of size 3 is {{(1,j):∀j},{(2,j):∀j},{(3,j):∀j}}, while the optimal partition of size 4 is {{(i,1):∀i},{(i,2):∀i},{(i,3):∀i},{(i,4):∀i}} (easy exercise: show this). These partitions are orthogonal - in other words, they have zero mutual information. This example can be extended to make any finite number of optimal Pj orthogonal.
The natural abstraction hypothesis claims (in part) that a wide variety of agents will learn to use the same abstractions and concepts to reason about the world.
What's the simplest possible setting where we can state something like this formally? Under what conditions is it true?
One necessary condition to say that agents use 'the same abstractions' is that their decisions depend on the same coarse-grained information about the environment. Motivated by this, we consider a setting where an agent needs to decide which information to pass though an information bottleneck, and ask: when do different utility functions prefer to preserve the same information?
As we develop this setting, we:
(As far as I can tell, apart from the obvious cited Theorems, these results are original[1])
Whereas Wentworth's abstraction work has focused on what natural abstractions might look like, and why we might expect them to exist, this work focuses on exploring what it would mean for something to be a natural abstraction in the first place. Rather than an attempt to answer the natural abstraction hypothesis, it is more an attempt to understand the question.
1. Base setting
Any agent exists within an environment that is bigger than itself. It therefore necessarily maps a large number of potential environment states to a smaller number of internal states. When it then chooses what action to take, this choice depends only on its internal state.
Roughly, we can consider this mapping from the environment to action-influencing internal states to be the agent's abstractions[2]. We want to understand when different utility functions will share preferences about how to do this mapping - under the natural abstraction hypothesis, in appropriate conditions, there should be certain abstractions that are convergent across most utility functions.
To model this, we'll consider a game where:
We'll keep everything finite to start - any infinities we want we should come by honestly, by taking limits. Our one concession to this is that we'll allow normal real-valued probabilities and utilities (though we'll see eventually how we could do without if we felt like it).
Formally, define:
X: a set of context states M: a set of signals
A: a set of actions
U: (X,A)→R, a utility function
φ: X→ΔM, an encoder mapping states to a distribution of signals
π: M→ΔA, a policy mapping signals to a distribution over actions
We allow φ and π to be stochastic (stochastic maps), but we'll also pay special attention to the case where they are deterministic (so effectively φ:X→M, π:M→A), because it is especially clean and factors apart some of the dynamics we're interested in.
Now, we are interested in preferences over φ assuming that π is chosen optimally. When are these preferences, and the optimal encoders, the same for different U? What happens as we vary |M|? For the time being, we can equally consider different U to be different agents or different tasks faced by the same agent. Later, we'll see some places where the two interpretations diverge.
As one might expect of such a simple setup, similar formalisms have been considered in many other contexts. Most directly, exactly this game is studied in the decision theory and economics literature - with partial overlap with our questions. There, the objects (φ,M) are commonly referred to as 'information structures' or 'Blackwell experiments'; we'll use the former term.
One other remark on notation: we'll often be interested in comparing different encoders φ,φ′, which may have different signal spaces Mφ,Mφ′. We'll write |φ| for ∣∣Mφ∣∣, the cardinality of the signal set.
1.1 Equivalence classes of encoders
One thing to note out of the gate: 'the same encoder' here shouldn't mean equality as functions. For our current purposes, we care only about their information content - so if φ,φ′ are the same except for a relabelling of M, we want to consider them identical (and they will certainly have the same attainable utility for any U). Another way to say this: we want to quotient by Aut(M)=S|M|, i.e. the permutations of M, to get our real objects of interest. What are these objects?
In the deterministic case, they are the distinct possible sets of preimages {φ−1(m):m∈M} - which are just the |M|-part partitions of X. In the general case, they are a more general partition-like object. Note that we can represent a stochastic map φ as an |M|×|X| column-stochastic matrix: a matrix with elements in [0,1] such that the sum of each column is 1: ∑|M|i=1[φ]ij=1. Then [φ]ij gives the probability that we encode signal i given we are in state j - that is, pφ(mi|sj). If we are indifferent to permutations of M, then we are indifferent to permutations of the rows, so we have a set (rather than vector) of row vectors. (If X were instead a suitable topological space and φ continuous, then these would be exactly partitions of unity)
Sometimes it will be more convenient to think of φ as a function, sometimes to think of it as a partition Pφ, and sometimes as a matrix; we'll switch between all three as appropriate. However, keep in mind we really only care about φ up to equivalence in this sense.
1.2 Comparison of encoders
For a given utility function U, we take the value of an information structure to be the expected utility if we have that information structure and follow the optimal policy.
Formally, define U(φ)=maxπEX,π(φ(X))[U(x,a)]=maxπ∑X,A[π∘φ]xaU(x,a); we can also say it is maxπU((X,(π∘φ)(X))), extending U to a function on distributions by taking the expectation. (By convention we're assuming a uniform prior on X; this choice turns out to be irrelevant for our purposes).
We say that U prefers φ to φ′ if U(φ)≥U(φ′), which we'll also write φ≥Uφ′.
2. Basic properties for individual U
As a first step, to get a feel for the setting, let's start with some observations about what can happen for arbitrary individual U. These give us one bound on common properties - for something happen for 'many' U, it must certainly happen for some particular U.
2.0 M should be a bottleneck
One basic observation is that the problem becomes trivial if |M|≥|X| or |M|≥|A|. At that point, we can always pick the globally optimal action for each value of X. For this setting to be remotely interesting, M needs to be an information bottleneck.
2.1a Any deterministic partition is optimal for some U
This is also easy to see: for a partition P={p1,...,pk}, take any {a1,...,ak}⊆A and define U(x,ai)=1(x∈pi). Furthermore, any optimal partition for this U must be a refinement of P, and P is the unique optimal partition with k parts.
2.1b For any U, any nondeterministic φ is weakly dominated by some deterministic φ′
It is not quite literally the case that no nondeterministic φ is optimal for its size for any U. If U is indifferent between where to put x in the optimal policy, say U(x,πopt(m1))=U(x,πopt(m2)), then x can be freely sent to any mixture of m1 and m2. But there is always a deterministic φ′ of less than or equal size which does at least as well, and φ′ either does strictly better or is strictly smaller unless U is indifferent between some outcomes (which may have measure zero, depending on your assumptions).
Proof: See footnote[3]; nothing surprising.
This motivates considering only partitions in the context of optimality for a single U (the remainder of section 2.)
2.2 Optimal partitions Pi,Pj can be orthogonal
Fixing U, write Pi for the optimal partition of size i. Then, there's not necessarily any relationship between Pi and Pj for i≠j.
For instance, take X=[1..3]×[1..4], A=[1,2]×[1..4] and U(xi,∗,a1,i)=4U(x∗,i,a2,i)=5otherwise U(∗,∗)=0 Then the optimal partition of size 3 is {{(1,j):∀j},{(2,j):∀j},{(3,j):∀j}}, while the optimal partition of size 4 is {{(i,1):∀i},{(i,2):∀i},{(i,3):∀i},{(i,4):∀i}} (easy exercise: show this). These partitions are orthogonal - in other words, they have zero mutual information. This example can be extended to make any finite number of optimal Pj orthogonal.
2.3 Not every P1,.