## LessWrong

TurnTrout

Alex Turner, Oregon State University PhD student working on AI alignment.

# Sequences

Reframing Impact
Becoming Stronger

Book Review: Working With Contracts

Once formed, a contract acts as custom, private law between the parties.

This is a cool way of understanding contracts.

I'm putting this on the shelf of facepalm-obvious-but-beautiful realizations like

• Medicine is the science of healing, not just a collection of random facts about what pills to take
• Math is about the inescapable and immutable consequences of basic rules, not just about playing with integrals and numbers
• Physics is, in large part, about discovering the transition rules of the universe
• Machine learning is about the beautiful ideal learn- yeah, no, machine learning is still just a mess
Open & Welcome Thread - September 2020

I'm going on a 30-hour road trip this weekend, and I'm looking for math/science/hard sci-fi/world-modelling Audible recommendations. Anyone have anything?

Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem

My point here is just that it seems pretty plausible that he meant "if and only if".

Sure. To clarify: I'm more saying "I think this statement is wrong, and I'm surprised he said this". In fairness, I haven't read the mentioned section yet either, but it is a very strong claim. Maybe it's better phrased as "a CIRL agent has a positive incentive to allow shutdown iff it's uncertain [or the human has a positive term for it being shut off]", instead of claiming that "a machine" has a positive incentive iff.

Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem

We know of many ways to get shut-off incentives, including the indicator utility function on being shut down by humans (which theoretically exists), and the AUP penalty term, which strongly incentivizes accepting shutdown in certain situations - without even modeling the human. So, it's not an if-and-only-if.

Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem

In Chapter 16, we show that a machine has a positive incentive to allow itself to be switched off if and only if it is uncertain about the human objective.

Surely he only meant *if* it is uncertain?

Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Battle of the Sexes

However, it is furthermore true of iterated PD that there are multiple different Pareto-optimal equilibria, which benefit different players more or less. Also, if players don't successfully coordinate on one of these equilibria, they can end up in a worse overall state (such as mutual defection forever, due to playing grim-trigger strategies with mutually incompatible demands). This makes iterated PD resemble Battle of the Sexes.

I think this paragraph very clearly summarizes your argument. You might consider including it as a TL;DR at the beginning.
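The coordination failure in the quoted paragraph can be sketched in a few lines: two grim-trigger players who each demand a *different* Pareto-optimal alternating equilibrium lock into mutual defection from round 0. The payoff numbers and demand patterns below are my own illustrative assumptions, not from the post.

```python
# Hypothetical sketch (payoffs and strategies are illustrative assumptions):
# two grim-trigger players in an iterated Prisoner's Dilemma, each demanding
# a different alternating Pareto-optimal equilibrium. The incompatible demands
# trigger both players in round 0, yielding mutual defection forever.

# Standard PD payoff matrix: (player 0's payoff, player 1's payoff).
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(rounds):
    """Simulate two symmetric grim-trigger players with clashing demands."""
    triggered = [False, False]  # has each player's grim trigger fired?
    history = []
    for t in range(rounds):
        # Each player demands the alternation in which *they* exploit on even
        # rounds, so both defect at t = 0 while expecting the other to cooperate.
        moves = ["D" if triggered[p] or t % 2 == 0 else "C" for p in (0, 1)]
        for p in (0, 1):
            expected_from_opponent = "C" if t % 2 == 0 else "D"
            if moves[1 - p] != expected_from_opponent:
                triggered[p] = True  # demand violated: defect forever
        history.append(tuple(moves))
    return history

history = play(6)
print(history)   # every round is ("D", "D")
totals = [sum(PAYOFFS[pair][p] for pair in history) for p in (0, 1)]
print(totals)    # 6 points each, vs. 18 each under sustained mutual cooperation
```

Both players end up strictly worse off than under either of the equilibria they were demanding, which is the Battle-of-the-Sexes flavor of the problem.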

Open & Welcome Thread - September 2020

If I had to guess, I'd guess the answer is some combination of "most people haven't realized this" and "of those who have realized it, they don't want to be seen as sympathetic to the bad guys".

TurnTrout's shortform feed

Totally 100% gone. Sometimes I go weeks forgetting that pain was ever part of my life.

Max Kaye's Shortform

```latex
\usepackage{cleveref}

....

1+1=2.\label{eq:1}

\Cref{eq:1} is an amazing new discovery; before Max Kaye, no one grasped the perfect and utter truth of \cref{eq:1}.
```
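A minimal compilable version of the snippet, for anyone who wants to try it: I've added `amsmath` and an `equation` environment as assumptions so the label attaches to a numbered equation, and loaded `cleveref` last, as its documentation recommends.

```latex
\documentclass{article}
\usepackage{amsmath}
\usepackage{cleveref} % load after hyperref, if hyperref is used

\begin{document}

\begin{equation}
1 + 1 = 2. \label{eq:1}
\end{equation}

% \Cref capitalizes the generated reference name; \cref does not.
\Cref{eq:1} is an amazing new discovery; before Max Kaye, no one grasped
the perfect and utter truth of \cref{eq:1}.

\end{document}
```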

TurnTrout's shortform feed

When I imagine configuring an imaginary pile of blocks, I can feel the blocks in front of me in this fake imaginary plane of existence. I feel aware of their spatial relationships to me, in the same way that it feels different to have your eyes closed in a closet vs in an empty auditorium.

But what is this mental workspace? Is it disjoint from my normal spatial awareness, or does my brain copy and then modify my real-life spatial awareness? Like, if my brother is five feet in front of me, and I then imagine a blade flying five feet in front of me in my imaginary mental space where he doesn't exist, do I reflexively flinch? Does my brain overlay these two mental spaces, or are they separate?

I don't know. When I run the test, I at least flinch at the thought of such a thing happening. This isn't a good experiment because I know what I'm testing for; I need to think of a better test.