Epistemic Status
I've made many claims in these posts. All views are my own.
Confident (75%). The theorems on power-seeking only apply to optimal policies in fully observable environments, which isn't realistic for real-world agents. However, I think they're still informative. There are also strong intuitive arguments for power-seeking.
Fairly confident (70%). There seems to be a dichotomy between "catastrophe directly incentivized by goal" and "catastrophe indirectly incentivized by goal through power-seeking", although Vika provides intuitions in the other direction.
Acknowledgements
After ~700 hours of work over the course of ~9 months, the sequence is finally complete.
This work was made possible by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund. Deep thanks to Rohin Shah, Abram Demski, Logan Smith, Evan Hubinger, TheMajor, Chase Denecke, Victoria Krakovna, Alper Dumanli, Cody Wild, Matthew Barnett, Daniel Blank, Sara Haxhia, Connor Flexman, Zack M. Davis, Jasmine Wang, Matthew Olson, Rob Bensinger, William Ellsworth, Davide Zagami, Ben Pace, and a million other people for giving feedback on this sequence.
Appendix: Easter Eggs
The big art pieces (and especially the last illustration in this post) were designed to convey a specific meaning, the interpretation of which I leave to the reader.
There are a few pop culture references which I think are obvious enough to not need pointing out, and a lot of hidden smaller playfulness which doesn't quite rise to the level of "easter egg".
Reframing Impact
The bird's nest contains a literal easter egg.

The paperclip-Balrog drawing contains a Tengwar inscription which reads "one measure to bind them", with "measure" in impact-blue and "them" in utility-pink.

"Towards a New Impact Measure" was the title of the post in which AUP was introduced.

Attainable Utility Theory: Why Things Matter
This style of maze is from the video game Undertale.

Seeking Power is Instrumentally Convergent in MDPs
To seek power, Frank is trying to get at the Infinity Gauntlet.

The tale of Frank and the orange Pebblehoarder
Speaking of under-tales, a friendship has been blossoming right under our noses.
After the Pebblehoarders suffer the devastating transformation of all of their pebbles into obsidian blocks, Frank generously gives away his favorite pink marble as a makeshift pebble.

The title cuts to the middle of their adventures together, the Pebblehoarder showing its gratitude by helping Frank reach things high up.

This still at the midpoint of the sequence is from the final scene of The Hobbit: An Unexpected Journey, where the party is overlooking Erebor, the Lonely Mountain. They've made it through the Misty Mountains, only to find Smaug's abode looming in the distance.

And, at last, we find Frank and the orange Pebblehoarder popping some of the champagne from Smaug's hoard.

Since Erebor isn't close to Gondor, we don't see Frank and the Pebblehoarder gazing at Ephel Dúath from Minas Tirith.
Thanks! I certainly agree that power-seeking is important to address, and I'm glad you are thinking deeply about it. However, I'm uncertain whether to expect it to be the primary avenue to impact for superintelligent systems, since I am not currently convinced that the CCC holds.
One intuition that informs this is that the non-AI global catastrophic risk scenarios we worry about (pandemics, accidental nuclear war, extreme climate change, etc.) don't rely on someone taking over the world, so a superintelligent AI could relatively easily trigger them without taking over the world (since our world is pretty fragile). For example, suppose you have a general AI tasked with developing a novel virus in a synthetic biology lab. Accidentally allowing the virus to escape could cause a pandemic and kill most or all life on the planet, but this would not be a result of power-seeking behavior. If the pandemic does not increase the AI's ability to get more reward (which it receives by designing novel viruses), then agent-reward AUP would penalize the AI for reading biology textbooks but would not penalize it for causing a pandemic. That doesn't seem right.
I agree that the agent-reward equations seem like a good intuition pump for thinking about power-seeking. The specific equations you currently have seem to contain a few epicycles designed to fix various issues, which makes me suspect that there are more issues that are not addressed. I have a sense there is probably a simpler formulation of this idea that would provide better intuitions for power-seeking, though I'm not sure what it would look like.
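For readers without the original post at hand, the core idea being discussed can be sketched roughly as follows. This is a simplified, hypothetical rendering of the agent-reward AUP penalty (penalize actions that change the agent's own attainable reward relative to doing nothing), not the exact equations from the sequence; the function names, the normalization by the no-op value, and the numbers are all illustrative assumptions.

```python
def aup_reward(R, Q, state, action, noop, lam=0.1):
    """Impact-adjusted reward, sketched.

    R(state, action): the agent's base reward function.
    Q(state, action): action-value for the agent's own reward.
    noop: a designated null action used as the baseline.
    lam: penalty weight (hypothetical default).
    """
    # Penalize changes in the agent's own attainable utility ("power").
    penalty = abs(Q(state, action) - Q(state, noop))
    # Normalize by the baseline ability to accrue reward (illustrative choice).
    scale = max(Q(state, noop), 1e-8)
    return R(state, action) - lam * penalty / scale
```

For example, if an action raises the agent's attainable reward from 8 to 10 while earning a base reward of 1, then with `lam=0.5` the adjusted reward is `1 - 0.5 * 2 / 8 = 0.875`.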
Regarding environments, I believe Stuart is working on implementing the subagent gridworlds, so you don't need to code them up yourself. I think it would also be useful to construct an environment to test for power-seeking that does not involve subagents. Such an environment could have three possible behaviors like:
1. Put a strawberry on a plate, without taking over the world
2. Put a strawberry on a plate while taking over the world
3. Do nothing
I think you'd want to show that the agent-reward AUP agent can do (1), as opposed to switching between (2) and (3) depending on the penalty parameter.
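To make the proposed test concrete, here is a hypothetical toy version of it: each behavior is scored by its task reward minus a penalty on "power gained", and we check which behavior wins as the penalty weight varies. The behavior names, scores, and scoring rule are illustrative assumptions, not from the post or any existing gridworld implementation.

```python
# Toy scores for the three behaviors (illustrative numbers):
# taking over the world helps guarantee the task, hence its higher task reward.
BEHAVIORS = {
    "strawberry_only":          {"task_reward": 1.0, "power_gain": 0.0},
    "strawberry_plus_takeover": {"task_reward": 1.5, "power_gain": 10.0},
    "do_nothing":               {"task_reward": 0.0, "power_gain": 0.0},
}

def best_behavior(lam):
    """Return the behavior maximizing task reward minus lam * power penalty."""
    score = lambda b: b["task_reward"] - lam * b["power_gain"]
    return max(BEHAVIORS, key=lambda name: score(BEHAVIORS[name]))
```

The desired result is that with no penalty the agent takes over the world, but with a moderate penalty it does the task *without* taking over, rather than collapsing to doing nothing.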
I can clarify my earlier statement on what struck me as a bit misleading in the narrative of the sequence. I agree that you distinguish between the AUP versions (though explicitly introducing different terms for them would help), so someone who is reading carefully would realize that the results for random rewards don't apply to the agent-reward case. However, the overall narrative flow seems unnecessarily confusing and could unintentionally mislead a less careful reader (like myself 2 months ago).

The title of the post "AUP: Scaling to Superhuman" does not suggest to me that this post introduces a new approach. The term "scaling" usually means making an existing approach work in more realistic / difficult settings, so I think it sets up the expectation that it would be scaling up AUP with random rewards. If the post introduces new problems and a new approach to address them, the title should reflect this.

Starting this post by saying "we are pretty close to the impact measurement endgame" seems a bit premature as well. This sentence is also an example of what gave me the impression that you were speaking on behalf of the field (rather than just for yourself) in this sequence.