Alex Turner, Oregon State University PhD student working on AI alignment.


Reframing Impact
Becoming Stronger


RationalWiki on face masks

I have mixed feelings about this post. LW was right on masks early - and for foreseeable, principled reasons - and I think that that's important. OTOH, this post gives off a "let's sneer at outgroup" vibe, which feels icky to me. I've also been concerned by the rise in politics-related content recently. 

This might just point to me changing my feed's visibility settings, but I wanted to note my concern.

TurnTrout's shortform feed

My point was more that "people generally call both of these kinds of reasoning 'Occam's razor', and they're both good ways to reason, but they work differently."

Generalizing the Power-Seeking Theorems

Discontinuous with respect to what? The discount rate just is, and there just is an optimal policy set for each reward function at a given discount rate, and so it doesn't make sense to talk about discontinuity without having something to govern what it's discontinuous with respect to. Like, teleportation would be positionally discontinuous with respect to time.

You can talk about other quantities being continuous with respect to changes in the discount rate, however, and the paper proves the continuity of e.g. POWER and optimality probability with respect to the discount rate.

Generalizing the Power-Seeking Theorems

What do you mean by "agents have different time horizons"? 

To answer my best guess of what you meant: this post used "most agents do X" as shorthand for "action X is optimal with respect to a large-measure set over reward functions", but the analysis only considers the single-agent MDP setting, and how, for a fixed reward function or reward function distribution, optimal action for an agent tends to vary with the discount rate. There aren't multiple formal agents acting in the same environment. 
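To make the "large-measure set over reward functions" shorthand concrete, here is a minimal sketch on a toy MDP. Everything in it is an assumption made up for illustration, not something from the post or the paper: the five-state deterministic environment, the i.i.d. uniform reward distribution, and the function names. From the start state, "left" leads to an absorbing state 1, while "right" leads to state 2, from which the agent picks the better of two absorbing states 3 and 4. The sketch estimates, for a fixed discount rate, the fraction of sampled reward functions for which "right" is optimal.

```python
import random


def right_is_optimal(r, gamma):
    """Return True if going right from the start state is optimal.

    r maps each non-start state (1..4) to its reward. The start state's own
    reward is collected either way, so it cannot affect the comparison.
    Requires 0 <= gamma < 1.
    """
    # Left: enter absorbing state 1 and collect r[1] forever.
    v_left = r[1] / (1 - gamma)
    # Right: collect r[2] once, then sit in the better of states 3 and 4.
    v_right = r[2] + gamma * max(r[3], r[4]) / (1 - gamma)
    return v_right >= v_left


def optimality_probability(gamma, n_samples=20000, seed=0):
    """Monte Carlo estimate of the share of reward functions favoring 'right'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Assumed distribution: rewards drawn i.i.d. uniform on [0, 1].
        r = {s: rng.random() for s in (1, 2, 3, 4)}
        if right_is_optimal(r, gamma):
            hits += 1
    return hits / n_samples


for gamma in (0.0, 0.5, 0.99):
    print(gamma, optimality_probability(gamma))
```

In this toy setup the estimate should sit near 1/2 at a discount rate of 0 (only the immediate rewards of states 1 and 2 matter) and approach 2/3 as the discount rate goes to 1 (the best of two absorbing states is compared against one). There is a single agent throughout; only the reward function and the discount rate vary.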

TurnTrout's shortform feed

(This is a basic point on conjunctions, but I don't recall seeing its connection to Occam's razor anywhere)

When I first read Occam's Razor back in 2017, it seemed to me that the essay only addressed one kind of complexity: how complex the laws of physics are. If I'm not sure whether the witch did it, the universes where the witch did it are more complex, and so these explanations are exponentially less likely under a simplicity prior. Fine so far.

But there's another type. Suppose I'm weighing whether the United States government is currently engaged in a vast conspiracy to get me to post this exact comment. This hypothesis doesn't really demand a more complex source code, but I think we'd say that Occam's razor shaves away this hypothesis anyways - even before weighing object-level considerations. This hypothesis is complex in a different way: it's highly conjunctive in its unsupported claims about the current state of the world. Each conjunct eliminates many ways it could be true, from my current uncertainty, and so I should deem it correspondingly less likely.
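A rough numerical illustration of the conjunction point, with made-up numbers: by the chain rule, the probability of a conjunction is the product of each conjunct's probability given the ones before it. The function name and the 0.5 figures below are assumptions chosen only to show how quickly the product shrinks.

```python
import math


def conjunction_probability(conditional_probs):
    """Probability that every conjunct holds.

    conditional_probs[i] is the (assumed) probability of conjunct i given
    that all earlier conjuncts hold, so the product is the chain rule.
    """
    return math.prod(conditional_probs)


# Ten unsupported claims, each judged a coin flip given the previous ones.
print(conjunction_probability([0.5] * 10))
```

Even when each individual claim looks no worse than even odds, ten of them together come out to 1 in 1024.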

MikkW's Shortform

while founding a town scales less than linearly

I think you're omitting constant factors from your analysis; founding a town is so, so much work. How would you even run out utilities to the town before the pandemic ended? 

Open & Welcome Thread - January 2021

The tags / concepts system seems to be working very well so far, and the minimal tagging overhead is now sustainable as new posts roll in. Thank you, mod team!

Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Note 1: This review is also a top-level post.

Note 2: I think that 'robust instrumentality' is a more apt name for 'instrumental convergence.' That said, for backwards compatibility, this comment often uses the latter. 

In the summer of 2019, I was building up a corpus of basic reinforcement learning theory. I wandered through a sun-dappled Berkeley, my head in the clouds, my mind bent on a single ambition: proving the existence of instrumental convergence. 


I needed to find the right definitions first, and I couldn't even imagine what the final theorems would say. The fall crept up on me... and found my work incomplete. 

Let me tell you: if there's ever been a time when I wished I'd been months ahead on my research agenda, it was September 26, 2019: the day when world-famous AI experts debated whether instrumental convergence was a thing, and whether we should worry about it. 

The debate unfolded below the link-preview: an imposing robot staring the reader down, a title containing 'Terminator', a byline dismissive of AI risk:

Scientific American
Don’t Fear the Terminator
"Artificial intelligence never needed to evolve, so it didn’t develop the survival instinct that leads to the impulse to dominate others."

The byline seemingly affirms the consequent: "evolution → survival instinct" does not imply "no evolution → no survival instinct." That said, the article raises at least one good point: we choose the AI's objective, and so why must that objective incentivize power-seeking?

I wanted to reach out, to say, "hey, here's a paper formalizing the question you're all confused by!" But it was too early.

Now, at least, I can say what I wanted to say back then: 

This debate about instrumental convergence is really, really confused. I heavily annotated the play-by-play of the debate in a Google doc, mostly checking local validity of claims. (Most of this review's object-level content is in that document, by the way. Feel free to add comments of your own.)

This debate took place in the pre-theoretic era of instrumental convergence. Over the last year and a half, I've become a lot less confused about instrumental convergence. I think my formalisms provide great abstractions for understanding "instrumental convergence" and "power-seeking." I think that this debate suffers for lack of formal grounding, and I wouldn't dream of introducing someone to these concepts via this debate.

While the debate is clearly historically important, I don't think it belongs in the LessWrong review. I don't think people significantly changed their minds, I don't think that the debate was particularly illuminating, and I don't think it contains the philosophical insight I would expect from a LessWrong review-level essay.

Rob Bensinger's nomination reads:

May be useful to include in the review with some of the comments, or with a postmortem and analysis by Ben (or someone).

I don't think the discussion stands great on its own, but it may be helpful for:

  • people familiar with AI alignment who want to better understand some human factors behind 'the field isn't coordinating or converging on safety'.
  • people new to AI alignment who want to use the views of leaders in the field to help them orient.

I certainly agree with Rob's first bullet point. The debate did show us what certain famous AI researchers thought about instrumental convergence, circa 2019. 

However, I disagree with the second bullet point: reading this debate may disorient a newcomer! I often found myself agreeing with Russell and Bengio, and LeCun and Zador sometimes made good points, but confusion hangs thick in the air. No one realizes that, with respect to a fixed task environment (representing the real world) and their beliefs about what kind of objective function the agent may have, they should be debating the probability that seeking power is optimal (or that power-seeking behavior is learned, depending on your threat model).

Absent such an understanding, the debate is needlessly ungrounded and informal. Absent such an understanding, we see reasoning like this:

Yann LeCun: ... instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans.

I'm glad that this debate happened, but I think it monkeys around too much to be included in the LessWrong 2019 review.
