I think that 'robust instrumentality' is a more apt name for 'instrumental convergence.' That said, for backwards compatibility, this post often uses the latter.
In the summer of 2019, I was building up a corpus of basic reinforcement learning theory. I wandered through a sun-dappled Berkeley, my head in the clouds, my mind bent on a single ambition: proving the existence of instrumental convergence.
I needed to find the right definitions first, and I couldn't even imagine what the final theorems would say. The fall crept up on me... and found my work incomplete.
Let me tell you: if there's ever been a time when I wished I'd been months ahead on my research agenda, it was September 26, 2019, the day when world-famous AI experts debated whether instrumental convergence was a thing and whether we should worry about it.
The debate unfolded below the link-preview: an imposing robot staring the reader down, a title containing 'Terminator', a byline dismissive of AI risk:
I wanted to reach out, to say, "hey, here's a paper formalizing the question you're all confused by!" But it was too early.
Now, at least, I can say what I wanted to say back then:
This debate about instrumental convergence is really, really confused. I heavily annotated the play-by-play of the debate in a Google doc, mostly checking local validity of claims. (Most of this review's object-level content is in that document, by the way. Feel free to add comments of your own.)
This debate took place in the pre-theoretic era of instrumental convergence. Over the last year and a half, I've become a lot less confused about instrumental convergence. I think my formalisms provide great abstractions for understanding "instrumental convergence" and "power-seeking." I think that this debate suffers for lack of formal grounding, and I wouldn't dream of introducing someone to these concepts via this debate.
While the debate is clearly historically important, I don't think it belongs in the LessWrong review. I don't think people significantly changed their minds, I don't think that the debate was particularly illuminating, and I don't think it contains the philosophical insight I would expect from a LessWrong review-level essay.
Rob Bensinger's nomination reads:
May be useful to include in the review with some of the comments, or with a postmortem and analysis by Ben (or someone).
I don't think the discussion stands great on its own, but it may be helpful for:
- people familiar with AI alignment who want to better understand some human factors behind 'the field isn't coordinating or converging on safety'.
- people new to AI alignment who want to use the views of leaders in the field to help them orient.
I certainly agree with Rob's first bullet point. The debate did show us what certain famous AI researchers thought about instrumental convergence, circa 2019.
However, I disagree with the second bullet point: reading this debate may disorient a newcomer! While I often found myself agreeing with Russell and Bengio, and while LeCun and Zador sometimes made good points, confusion hangs thick in the air. No one realizes what they should be debating: with respect to a fixed task environment (representing the real world) and their beliefs about what kind of objective function the agent may have, what is the probability that seeking power is optimal (or that power-seeking behavior is learned, depending on your threat model)?
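To make "the probability that seeking power is optimal" concrete, here is a toy sketch. It is an illustration under made-up assumptions, not the formalism referred to in this review. The environment is hypothetical: from the start state, one action leads to a single terminal state, and the other leads to a "hub" from which the agent can pick any of several terminal states. Beliefs about the objective are modeled as rewards drawn independently and uniformly from [0, 1] for each terminal state. The function name `fraction_preferring_hub` and its parameters are invented for this example.

```python
import random


def fraction_preferring_hub(n_hub_options=3, n_samples=100_000, seed=0):
    """Estimate how often going to the hub is strictly optimal.

    Assumptions (all hypothetical, for illustration only):
      * the environment is fixed: one lone terminal state versus a hub
        that can reach `n_hub_options` terminal states;
      * uncertainty about the objective is modeled by drawing each
        terminal state's reward IID from Uniform[0, 1].
    """
    rng = random.Random(seed)
    hub_wins = 0
    for _ in range(n_samples):
        # Sample one reward function over the terminal states.
        lone_reward = rng.random()
        hub_rewards = [rng.random() for _ in range(n_hub_options)]
        # An optimal agent at the hub takes its best option, so the hub
        # is optimal exactly when that best option beats the lone state.
        if max(hub_rewards) > lone_reward:
            hub_wins += 1
    return hub_wins / n_samples


if __name__ == "__main__":
    for k in (1, 3, 9):
        estimate = fraction_preferring_hub(n_hub_options=k)
        print(f"hub options = {k}: hub optimal for ~{estimate:.3f} of sampled rewards")
```

By symmetry, each of the k + 1 terminal states is equally likely to carry the highest reward, so the estimate should land near k / (k + 1): about 0.75 for three hub options. The point of the sketch is only that, once the environment and the distribution over objectives are pinned down, "is keeping options open usually optimal?" becomes a question with a numerical answer rather than a matter of competing analogies.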
Absent such an understanding, the debate is needlessly ungrounded and informal. Absent such an understanding, we see reasoning like this:
Yann LeCun: ... instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans.
I'm glad that this debate happened, but I think it monkeys around too much to be included in the LessWrong 2019 review.