[ETA: I'm deprioritizing completing this sequence because it seems that other people are writing good similar stuff. In particular, see e.g. https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality-is-the-tiger-and-agents-are-its-teeth and https://www.lesswrong.com/posts/pdJQYxCy29d7qYZxG/agency-and-coherence ]
This sequence explains my take on agency. I’m responding to claims that the standard arguments for AI risk have a gap, a missing answer to the question “why should we expect there to be agenty AIs optimizing for stuff? Especially the sort of unbounded optimization that instrumentally converges to pursuit of money and power.”
This sequence is a pontoon bridge thrown across that gap.
I’m also responding to claims that there are coherent, plausible possible futures in which agent AGI (perhaps better described as APS-AI) isn’t useful/powerful/incentivized, thanks to various tools that can do the various tasks better and cheaper. I think those futures are incoherent, or at least very implausible. Agency is powerful. For example, one conclusion I am arguing for is:
When it becomes possible to make human-level AI agents, those agents will be able to outcompete the various human+tool hybrids prevalent at the time in every important competition (e.g. for money, power, knowledge, SOTA performance, control of the future lightcone...)
We should expect Agency as Byproduct, i.e. expect some plausible training processes to produce agenty AIs even when their designers weren't explicitly aiming for that outcome.
I’ve had these ideas for about a year but never got around to turning them into rigorous research. Given my current priorities it looks like I might never do that, so instead I’m going to bang it out over a couple of weekends so it doesn’t distract from my main work. :/ I won't be offended if you don't bother to read it.
Outline of this sequence:
Incomplete list of related literature and comments:
Frequent arguments about alignment (a comment in which Richard Ngo summarizes a common pattern of conversation about the risk from agenty AI vs. other sorts of AI risk)
Joe Carlsmith, drawing on writings from others, assigned 20% credence to the claim that AI agents won't be powerful enough relative to non-agents for there to be strong incentives to build them. I recommend reading the whole report, or at least the relevant sections on APS-AI and incentives to build it.
Why You Shouldn't Be a Tool: The Power of Agency by Gwern. (OK, it seems to have a different title now, maybe it always did and I hallucinated this memory...) This essay, more than anything else, inspired my current views.
The Ground of Optimization by Alex Flint argues: "there is a specific class of intelligent systems — which we call optimizing systems — that are worthy of special attention and study due to their potential to reshape the world. The set of optimizing systems is smaller than the set of all AI services, but larger than the set of goal-directed agentic systems."
Yudkowsky and Ngo conversation (especially as summarized by Nate Soares) seems to be arguing for something similar to Alex -- I imagine Yudkowsky would say that by focusing on agency I'm missing the forest for the trees: there is a broader class of systems (optimizers? consequentialists? makers-of-plans-that-lase?) of which agents are a special case, and it's this broader class that has the interesting and powerful and scary properties. I think this is probably right but my brain is not yet galaxy enough to grok it; I'm going to defy EY's advice and keep thinking about the trees for now. I look forward to eventually stepping back and trying to see the forest.
Thanks to various people, mostly at and around CLR, for conversations that shaped my views on this subject. Thanks especially to Ramana Kumar whose contributions were the greatest.