paulom — LessWrong

A while back I was looking for toy examples of environments with different amounts of 'naturalness' to their abstractions, and along the way noticed a connection between this version of Gooder Regulator and the Blackwell order.

Inspired by this, I expanded on this perspective of preferences-over-models / abstraction a fair bit here.

It includes among other things:

- the full preorder of preferences-shared-by-all-agents over maps (vs. just the maximum)
- an argument that actually we want to generalize to this diagram instead^[1]:
- an extension to preferences shared by 'most' agents based on Turner's power-seeking theorems (and a formal connection to power-seeking)

Personally I think these results are pretty neat; I hope they might be of interest.

I also make one other argument there that directly responds to this post, re: the claim:

[The] “model” M definitely has to store exactly the posterior.

I think this is sort of true, depending on how you interpret it, but really it's more accurate to say the whole regulator $X \to M \to R$ has to encode the posterior, not just $X \to M$.

Specifically, as you show, $M (X)$ and $(s \mapsto P [S = s | X])$ must induce the same equivalence classes of $x$ . This means that they are isomorphic as functions in a particular sense.

But it turns out: lots of (meaningfully) different systems $X \to S$ can lead to the same optimal model $M (x)$ . This means you can't recover the full system transition function just by looking at $M (x)$ ; so it doesn't really store the whole posterior.

On the other hand, you can uniquely recover the whole posterior (up to technicalities) from $M (x)$ plus the agent's optimal strategies $π : (M, Y) \to Δ R$. So it is still fair to say the agent as a whole must model the Bayesian posterior; but I'd say it's not just the model $M (X)$ which does it.
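To make the "different systems, same model" point concrete, here is a minimal sketch. The two posterior tables are made-up numbers (not from the post), and `partition_by_posterior` is a hypothetical helper. It groups the values of $x$ by their posterior over $S$, which is the equivalence-class structure that $M$ has to match.

```python
from fractions import Fraction as F

def partition_by_posterior(posterior):
    """Group x values whose posterior over S is identical."""
    classes = {}
    for x, dist in posterior.items():
        classes.setdefault(tuple(dist), []).append(x)
    return sorted(sorted(c) for c in classes.values())

# Two hypothetical systems: each tuple is (P[S=0 | X=x], P[S=1 | X=x]).
system_a = {0: (F(1, 2), F(1, 2)), 1: (F(1, 2), F(1, 2)), 2: (F(1, 10), F(9, 10))}
system_b = {0: (F(1, 5), F(4, 5)), 1: (F(1, 5), F(4, 5)), 2: (F(3, 4), F(1, 4))}

partition_a = partition_by_posterior(system_a)
partition_b = partition_by_posterior(system_b)
print(partition_a, partition_b)
```

Both systems give the partition `[[0, 1], [2]]`, so a model that only records which class $x$ falls in fits either one, even though the posteriors themselves differ.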

^[1] Which in this perspective basically means allowing the 'system' or 'world' we study to include restrictions on the outcome mapping, e.g. that $(s, a)$ and $(s', a')$ lead to the same outcome, which must be given a common utility by any given game. Looking back at this again, I'm not sure I described the distinction quite right in my post (since this setting as you gave it already has a $Z$ distinct from $u (Z)$), but there is still a distinction.

Parametrically retargetable decision-makers tend to seek power

paulom2y10

FWIW - here (finally) is the related post I mentioned, which motivated this observation: Natural Abstraction: Convergent Preferences Over Information Structures. The context is a power-seeking-style analysis of the naturality of abstractions, where I was determined to have transitive preferences.

It had quite a bit of scope creep already, so I ended up not including a general treatment of the (transitive) 'sum over orbits' version of retargetability (and in some parts I considered only optimality - sorry! I still think it makes sense to start there first and then generalize in this case). The full translation also isn't necessarily as easy as I thought - it turns out that $\geq_{\text{most}}^{n}$ is transitive specifically for binary functions, so the other cases may not translate as easily as $\mathrm{IsOptimal}$. After noticing that, I decided to leave the general case for later.

I did use the sum-over-orbits form, though; which turns out to describe the preferences shared by every " $G$ -invariant" distribution over utility functions. Reading between the lines shows roughly what it would look like.

I also moved from $S_{d}$ to any $G \leq S_{d}$ - not sure if you looked at that, but at least the parts I was using all seem to work just as well with any subgroup. This gives preferences shared by a larger set of distributions, e.g. for an MDP you could in some cases have $s_{1}$ preferred to $s_{2}$ for all priors on $U$ that are merely invariant to permuting $U (s_{1})$ and $U (s_{2})$ (rather than requiring them to be invariant to all permutations of utilities).

Fixing The Good Regulator Theorem

paulom2y10

Not sure this is exactly what you meant by the full preference ordering, but might be of interest: I give the preorder of universally-shared-preferences between "models" here (in section 4).

Basically, it is the Blackwell order, if you extend the Blackwell setting to include a system.

Parametrically retargetable decision-makers tend to seek power

paulom3yΩ010

Thanks for the reply. I'll clean this up into a standalone post and/or cover this in a related larger post I'm working on, depending on how some details turn out.

What are here?

Variables I forgot to rename, when I changed how I was labelling the arguments of $f$ in my example. This should be $1 \overset{2}{\to} 2$, $2 \overset{2}{\to} 3$, $3 \overset{2}{\to} 1$ retargetable (as arguments $i$ to $f (i | j)$).

Parametrically retargetable decision-makers tend to seek power

paulom3yΩ6120

I appreciate this generalization of the results - I think it's a good step towards showing the underlying structure involved here.

One point I want to comment on is transitivity of $\geq_{\text{most}}^{n}$, as a relation on induced functions $f : Θ \to \mathbb{R}$. Namely, it isn't transitive, and can even contain cycles of non-equivalent elements. (This came up when I was trying to apply a version of these results, and hoping that $\geq_{\text{most}}^{n}$ would be the preference relation I was looking for out of the box.) Quite possibly you noticed this since you give 'limited transitivity' in Lemma B.1 rather than full transitivity, but to give a concrete example:

Let $V = \begin{pmatrix} 1 & 2 & 3 \\ 3 & 1 & 2 \\ 2 & 3 & 1 \end{pmatrix}$ and $f (i | j) = V_{i j}$. The permutations are $σ \in S_{3}$ with the usual action on $\{1, 2, 3\}$. Then we have^[1] $f_{1} \geq_{\text{most}}^{2} f_{2} \geq_{\text{most}}^{2} f_{3} \geq_{\text{most}}^{2} f_{1}$ (and $f_{2} \ngeq_{\text{most}}^{2} f_{1}$). This also works on retargetability directly, with $f$ being $A \overset{2}{\to} B$, $B \overset{2}{\to} C$, $C \overset{2}{\to} A$ retargetable. Notice also that $f$ is invariant under joint permutations (constant diagonals), and I think can be represented as EU-determined, so neither of these save it.
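The cycle can be checked mechanically. This is a small sketch, assuming the counting form of the relation used in the footnote: row $a$ dominates row $b$ at level $n$ when the number of orbit elements (columns) where $a$ is strictly larger is at least $n$ times the number where it is strictly smaller. The function name `geq_most` is just a label for this illustration.

```python
# Cyclic table from the example: V[i][j] = f(i+1 | j+1).
V = [[1, 2, 3],
     [3, 1, 2],
     [2, 3, 1]]

def geq_most(a, b, n, table=V):
    """Assumed definition: f_a >=_most^n f_b iff, over the single orbit of
    columns, #{j : f(a|j) > f(b|j)} >= n * #{j : f(a|j) < f(b|j)}."""
    row_a, row_b = table[a - 1], table[b - 1]
    wins = sum(x > y for x, y in zip(row_a, row_b))
    losses = sum(x < y for x, y in zip(row_a, row_b))
    return wins >= n * losses

cycle = [geq_most(1, 2, 2), geq_most(2, 3, 2), geq_most(3, 1, 2)]
reverse = geq_most(2, 1, 2)
print(cycle, reverse)  # [True, True, True] False
```

Each step of the cycle wins two columns and loses one, while the reverse comparison wins one and loses two, so the cycle is between non-equivalent elements.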

A narrow point is that for a non-transitive relation, I think the notation should be something other than $\geq$ (maybe $≽$ ).

But more importantly, I think we would really rather have a transitive (or at least acyclic) relation, if we want to interpret this as 'most $θ$ prefer' or any kind of preference / aggregation of preferences. If our theorem gives us only an intransitive relation as our conclusion, then we should tweak it.

One way you can do this: aim for a stronger relation like $\geq_{o-m}^{n}$ :

Definition (Orbit-mean dominance?): Let $O_{f, A \neq B} (θ) = \{θ' \in \mathrm{Orbit}|_{Θ} (θ) : f (A | θ') \neq f (B | θ')\}$. Write $f (B | θ) \geq_{o-m}^{n} f (A | θ)$ if $\forall θ : \sum_{θ' \in O_{f, A \neq B} (θ)} f (B | θ') \geq n \sum_{θ' \in O_{f, A \neq B} (θ)} f (A | θ')$.

Since the orbits are under $S_{d}$ i.e. finite, it's easy to just sum over them. More generally, you could parameterize this with an arbitrary aggregator $g : {Orbits}_{Θ} (f) \to R$ in place of summation; I'm not sure whether this general form or the $\sum$ case should be the focus.

This is transitive for $n = 1$ and acyclic for^[2] $n > 1$ (consider $θ$ by $θ$ ); and possibly any orbit-based transitive relation is representable in basically this form^[3] (with some $g$ ), since I'd guess any partial order on sets with cardinality $\leq c$ can be represented as a pointwise inequality of functions, but I haven't thought about this too carefully.
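As a rough check of how the summed relation behaves on the cyclic example above, the sketch below applies the orbit-mean definition to the same 3x3 table (one orbit, so the "for all $θ$" reduces to a single comparison). The helper name `geq_orbit_mean` is an assumption of this sketch.

```python
# Same cyclic table as in the example: V[i][j] = f(i+1 | j+1).
V = [[1, 2, 3],
     [3, 1, 2],
     [2, 3, 1]]

def geq_orbit_mean(b, a, n, table=V):
    """f_b >=_{o-m}^n f_a: compare sums over orbit elements where rows differ."""
    pairs = [(x, y) for x, y in zip(table[b - 1], table[a - 1]) if x != y]
    sum_b = sum(x for x, _ in pairs)
    sum_a = sum(y for _, y in pairs)
    return sum_b >= n * sum_a

rows = (1, 2, 3)
all_equivalent = all(geq_orbit_mean(b, a, 1) for b in rows for a in rows)
any_strict_at_2 = any(geq_orbit_mean(b, a, 2) for b in rows for a in rows if a != b)
print(all_equivalent, any_strict_at_2)  # True False
```

Every row sums to 6 over the orbit, so at $n = 1$ the three rows are all equivalent and at $n = 2$ none dominates another: the cycle collapses instead of producing an intransitive preference.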

With this notion of $\geq_{o-m}^{n}$ , we also need a stronger version of retargetability for the main theorem to hold. For the $\sum$ version, this could be

Definition (scalar-retargetability): Say $f$ is $A \xrightarrow{\text{scalar}} B$ retargetable if there exists $σ \in S_{d}$ such that for all $θ^{A}$ with $f (A | θ^{A}) - f (B | θ^{A}) = c > 0$ we have $f (B | σ θ^{A}) - f (A | σ θ^{A}) \geq c$ (and likewise for multiply scalar-retargetable).

Then scalar-retargetability from $A$ to $B$ will imply $f (B | θ) \geq_{o-m}^{n} f (A | θ)$ .

And: I think many (all?) of the main power-seeking results are already secretly in this form. For example, $θ$-wise comparison of $\sum_{θ' \in \mathrm{Orbit}|_{Θ} (θ)} \mathrm{IsOptimal} (X | C, θ')$ gives a preference relation $\geq_{o-m}^{n}$ identical to the relation $\geq_{\text{most}}^{n}$. Assuming this also works for the other rationalities, then the cases we care about were transitive all along exactly because the relations can be expressed in this way.
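The claim that the two relations coincide for 0/1-valued functions such as $\mathrm{IsOptimal}$ can be brute-forced on a small orbit. This is only a sketch: both definitions are the counting and summing forms assumed above, restricted to a single orbit, and all function names are illustrative.

```python
from itertools import product

def geq_most(row_b, row_a, n):
    """Counting form: strict wins of b are at least n times strict losses."""
    wins = sum(x > y for x, y in zip(row_b, row_a))
    losses = sum(x < y for x, y in zip(row_b, row_a))
    return wins >= n * losses

def geq_orbit_mean(row_b, row_a, n):
    """Summing form: compare sums over orbit elements where the rows differ."""
    pairs = [(x, y) for x, y in zip(row_b, row_a) if x != y]
    return sum(x for x, _ in pairs) >= n * sum(y for _, y in pairs)

def relations_agree(orbit_size, n_values=(1, 2, 3)):
    """Check every pair of binary rows of the given length."""
    rows = list(product((0, 1), repeat=orbit_size))
    return all(
        geq_most(b, a, n) == geq_orbit_mean(b, a, n)
        for a in rows for b in rows for n in n_values
    )

agree = relations_agree(4)
print(agree)  # True
```

The reason is simple: on the elements where two binary rows differ, each row's sum is exactly its count of strict wins. With non-binary values the two can come apart, e.g. rows `(1, 2, 3)` and `(3, 1, 2)` give two wins to one loss but equal sums.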

What do you think?

^[1] We get the same single orbit $\{1, 2, 3\}$ for all $θ$ a.k.a. $j$; the orbit elements $j$ with $f (i | j) > f (i' | j)$ are the columns where row $i$ $>$ row $i'$. There are always two such columns when comparing row $i$ and row $i + 1$ (mod 3). For example, $\begin{matrix} f (1, 1) = 1 < 3 = f (2, 1) \\ f (1, 2) = 2 > 1 = f (2, 2) \\ f (1, 3) = 3 > 2 = f (2, 3) \end{matrix}$ ↩︎
^[2] We exclude $θ'$ s.t. $f (A | θ') = f (B | θ')$ in this version of the definition to match the behaviour of $\geq_{\text{most}}^{n}$ with $n > 1$, and allow $n$-scalar-retargetability to imply $\geq_{o-m}^{n}$. There's a case that you should include them, in which case you do get transitivity, and even the stronger property: if $x \leq^{n} y \leq^{m} z$, then $x \leq^{n m} z$. I think this corresponds to looking at likelihood ratios of $P (A \land \neg B) :: P (B \land \neg A)$ vs. $P (A) :: P (B)$. ↩︎
^[3] Compare also what would give you a total order (instead of partial order): aggregating over all of $Θ$ at once, like $\int_{Θ} f (A | θ) \, d μ (θ)$, instead of aggregating orbitwise at each $θ$. ↩︎

Testing The Natural Abstraction Hypothesis: Project Intro

paulom5y180

I think this line of research is interesting. I really like the core concept of abstraction as summarizing the information that's relevant 'far away'.

A few thoughts:

- For a common human abstraction to be mostly recoverable as a 'natural' abstraction, it must depend mostly on the thing it is trying to abstract, and not e.g. evolutionary or cultural history, or biological implementation. This seems more plausible for 'trees' than it does for 'justice'. There may be natural game-theoretic abstractions related to justice, but I'd expect human concepts and behaviors around justice to depend also in important ways on e.g. our innate social instincts. Innate instincts and drives seem likely to a) be complex (high-information) and b) depend on our whole evolutionary history, which is itself presumably path-dependent and chaotic, so I wouldn't expect this to be just a choice among a small number of natural alternatives.

An (imperfect) way of reframing this project is as an attempt to learn human concepts mostly from the things that are causally upstream of their existence, minimizing the need for things that are downstream (e.g. human feedback), and making the assumption that the only important/high-information thing upstream is the (natural concept of the) human concept's referent.

- If an otherwise unnatural abstraction is used by sufficiently influential agents, this can cause the abstraction to become 'natural', in the sense of being important to predict things 'far away'.

- What happens when a low-dimensional summary is still too high-dimensional for the human / agent to reason about? I conjecture that values might be most important here. An analogy: optimal lossless compression doesn't depend on your utility function, but optimal lossy compression does. Concepts that are operating in this regime may be less unique. (For that matter, from a more continuous perspective: given `n` bits to summarize a system, how much of the relevance 'far away' can we capture as a function of `n`? What is the shape of this curve - is it self-similar, or does it have discrete regimes, or something else? If there are indeed different discrete regimes, what happens in each of them?)

- I think there is a connection to instrumental convergence, roughly along the lines of 'most utility functions care about the same aspects of most systems'.

Overall, I'm more optimistic about approaches that rely on some human concepts being natural, vs. all of them. Intuitively, I do feel like there should be some amount of naturalness that can help with the 'put a strawberry on a plate' problem (and maybe even the 'without wrecking anything else' part).
