Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Alex Turner's Research, Comprehensive Information Gathering


> This in turn leads to one of the strongest results of Alex's paper: for any "well-behaved" distribution on reward functions, if the environment has the sort of symmetry I mentioned, then for at least half of the permutations of this distribution, at least half of the probability mass will be on reward functions for which the optimal policy is power-seeking.

Clarification (from Alex):

- The instrumental convergence (formally, optimality probability) results apply to *all distributions* over reward functions. So the "important" part of my results applies to permutations of arbitrary distributions; no well-behavedness is required.
- The formal POWER results apply to *bounded distributions* over reward functions. This guarantees that POWER's expectation is well-defined.

The paper isn't currently very clear on that point, only mentioning it in footnote 1 on page 6.

## Introduction

This is the second post in a sequence where I try to focus on one topic for some time (the first two lasted a month each, though I'm changing that for the current one). Initially I called these deep dives, but John's Comprehensive Information Gathering seems more fitting. My goal is not to master the topic; instead, I want to learn enough to have a constructive conversation about it and to follow future work on it.

For the month of May, I focused on Alex Turner's research, namely power-seeking and Attainable Utility. The way I went about it was a bit different from my dive into Transformers, because this time I had access to the main author. So I basically read material and tried to understand it (while asking Alex questions via Discord), then had calls with him to check that I had gotten most of it right and to correct my mistakes.

Concretely, I studied the latest version of the power-seeking paper (which Alex was rewriting at the time) and Reframing Impact. I didn't read the proofs (except in one or two cases), but I tried to understand the theorems and lemmas themselves in detail.

## The power-seeking I didn't know I needed

My first surprise came from the power-seeking work, and just how interesting it was. Reading the AF had biased me toward thinking that Alex's main work was on impact measures and Attainable Utility, but I actually found the power-seeking work more exciting.

Intuitively, the power of a state captures the expected optimal value of that state under a given distribution of reward functions (with some subtleties to make it cleaner): roughly, how "many" reward functions have an optimal policy passing through this state. An action is then power-seeking compared to another if it leads, in expectation, to more powerful states.
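If I remember the paper's formalism correctly (my reconstruction, so the notation may differ slightly from the current draft), the intuition above is made precise as a discount-normalized expectation of optimal value:

```latex
% D is a distribution over reward functions, V^*_R the optimal value
% function for reward R, and gamma the discount factor.
\mathrm{POWER}_{\mathcal{D}}(s, \gamma)
  \;=\; \frac{1-\gamma}{\gamma}\,
        \mathbb{E}_{R \sim \mathcal{D}}\!\left[\, V^*_R(s, \gamma) - R(s) \,\right]
```

Subtracting $R(s)$ and rescaling by $\frac{1-\gamma}{\gamma}$ are the "subtleties": they make power measure the value an agent can still attain *from* a state, rather than the reward it happens to receive *at* that state.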

The trick that makes this great is Alex's insight that how "many" reward functions have an optimal policy passing by this state boils down to questions of symmetry of the MDP. Especially in the stronger version that only cares about the final cycles of optimal policies (instead of the full trajectory), there is an obvious sense in which a state s with more power than a state s' leads to more such final cycles, which means there is an injection from the final cycles from s' to the final cycles from s (with some final cycles of s left out of the image). This in turns leads to one of the strongest result of Alex's paper: for any "well-behaved" distribution on reward functions, if the environment has the sort of symmetry I mentioned, then for at least half of the permutations of this distribution, at least half of the probability mass will be on reward functions for which the optimal policy is power-seeking.
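To make the counting intuition concrete, here is a toy sketch of my own (not from the paper, and much cruder than its formalism): from a start state, one action reaches a single terminal loop while the other reaches two, and we check how often a randomly drawn reward function makes the two-loop side optimal.

```python
import random

# Hypothetical toy MDP (my construction, not the paper's): from the start
# state, "left" reaches 1 terminal loop and "right" reaches 2. Rewards on
# terminal loops are drawn iid uniform on [0, 1], so the optimal policy
# simply picks the side whose best reachable terminal reward is largest.
def fraction_preferring_right(trials=100_000, seed=0):
    rng = random.Random(seed)
    right_wins = 0
    for _ in range(trials):
        left = rng.random()                      # best (only) reward on the left
        right = max(rng.random(), rng.random())  # best of two rewards on the right
        if right > left:
            right_wins += 1
    return right_wins / trials

# The true fraction is P(max(U1, U2) > U3) = 2/3 for iid uniforms:
# more reachable final cycles -> optimal for more reward functions.
print(fraction_preferring_right())  # ≈ 2/3
```

This is exactly the injection picture: every final cycle reachable from the one-loop side has a counterpart on the two-loop side, plus an extra cycle left over, so the two-loop side wins for a majority of reward draws.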

I've been obsessed with deconfusion lately, and Alex's formalization of power-seeking is a great example of good deconfusion.

For those who don't want to read the full paper, I recommend this new post, hot out of the oven, which gives the intuitions and even some spicy new results.

As for whether I consider this part of the information gathering a success: I want to say yes. I'm able to reasonably explain the core of the result and why it is interesting, and I can follow new developments, as was shown when Alex sent me his new result about the simplicity prior and power-seeking and I could follow it well enough to give some feedback.

I'm also personally excited about trying to link power-seeking with my own deconfusion of goal-directedness, asking questions about the power-seeking tendencies of competent goal-directed policies instead of optimal policies.

## Reframing reframing impact's impact

I ended up less excited by rereading Reframing Impact. Part of the problem was the rereading itself, obviously; also, because the sequence is really well written and paced, I think I had gotten most of the insight from my first read-through. But there were some new discoveries.

Notably, I hadn't read this post on how to pick a level of impact that is large enough to actually do something, but low enough to keep the benefit of impact measures and AUP.

I still think Attainable Utility Preservation is valuable and a great piece of deconfusion work too, but it is less relevant to my own research interests, and I didn't feel I got that much out of rereading it.

## Conclusion

I'm pretty satisfied with this comprehensive information gathering. I feel like I could have done better, maybe by searching for other angles on AUP instead of just rereading Reframing Impact. That being said, I got what I wanted out of the exercise, and I am quite comfortable now with reading almost anything Alex publishes.