Alex Turner, Oregon State University PhD student working on AI alignment. Reach me at turneale[at]oregonstate[dot]edu.


Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact
Becoming Stronger



Biology-Inspired AGI Timelines: The Trick That Never Works

I find it concerning that you felt the need to write "This is not at all a criticism of the way this post was written. I am simply curious about my own reaction to it" (and still got downvoted?).

For my part, I both believe that this post contains valuable content and good arguments, and that it was annoying / rude / bothersome in certain sections.

Formalizing Policy-Modification Corrigibility

The biggest disconnect is that this post is not a proposal for how to solve corrigibility. I'm just thinking about what corrigibility is/should be, and this seems like a shard of it—but only a shard. I'll edit the post to better communicate that. 

So, your points are good, but they run skew to what I was thinking about while writing the post.

Formalizing Policy-Modification Corrigibility

I skip over those pragmatic problems because this post is not proposing a solution, but rather a measurement I find interesting. 

Omicron Post #3

Flag: even being less worried is an additional inference not demanded by the form of the contrapositive of "if you're worried, you should get your booster." But that's just a nitpick over how literally we mean logical equivalence.

TurnTrout's shortform feed

I'm not sure how often I subvocalize to think thoughts. Often I have trouble putting a new idea into words just right, which means the raw idea essence came before the wordsmithing. But other times it feels like I'm synchronously subvocalizing as I brainstorm.

TurnTrout's shortform feed

Reading EY's dath ilan glowfics, I can't help but think of how poor English is as a language to think in. I wonder if I could train myself to think without subvocalizing (presumably it would be too much work to come up with a well-optimized encoding of thoughts, all on my own, so no new language for me). No subvocalizing might let me think important thoughts more quickly and precisely.

Biology-Inspired AGI Timelines: The Trick That Never Works

OK, I'll bite on EY's exercise for the reader, on refuting this "what-if":

Humbali:  Then here's one way that the minimum computational requirements for general intelligence could be higher than Moravec's argument for the human brain.  Since, after all, we only have one existence proof that general intelligence is possible at all, namely the human brain.  Perhaps there's no way to get general intelligence in a computer except by simulating the brain neurotransmitter-by-neurotransmitter.  In that case you'd need a lot more computing operations per second than you'd get by calculating the number of potential spikes flowing around the brain!  What if it's true?  How can you know?

Let's step back and consider what kind of artifact the brain is. The human brain was "found" by evolution via a selection process over a rather limited amount of time (between our most recent clearly-dumb ancestor, and anatomically modern humans). We have a local optimization process which optimizes over a relatively short timescale. This process found a brain which implements a generally intelligent algorithm.

In high-dimensional non-convex optimization, we have a way to describe algorithms found by a small amount of local optimization: "not even close to optimal." (Humans aren't even at a local optimum for inclusive-genetic-fitness due to our being mesa-optimizers.) But if the brain's algorithm isn't optimal, it trivially can't be the only algorithm that can produce general intelligence. Indeed, I would expect the fact that evolution found our algorithm at all to indicate that there were many possible such algorithms. 

There are many generally intelligent algorithms, and our brain only implements one, and it's just not going to be true that all of the others—or even the ones most likely to be discovered by AI researchers—are only implementable using (simulated) neurotransmitters.

Soares, Tallinn, and Yudkowsky discuss AGI cognition

no-one has the social courage to tackle the problems that are actually important

I would be very surprised if this were true. I personally don't feel any social pressure against sketching a probability distribution over the dynamics of an AI project that is nearing AGI.

I would guess that if people aren't tackling Hard Problems enough, it's not because they lack social courage, but because 1) they aren't running a good-faith search for Hard Problems to begin with, or 2) they came up with reasons for not switching to the Hard Problems they thought of, or 3) they're wrong about what problems are Hard Problems. My money's mostly on (1), with a bit of (2).

Solve Corrigibility Week

But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.

Noting that I strongly disagree but don't have time to type out arguments right now, sorry. May or may not type out later.

Frame Control

I'm not done with the post yet, but this part really jumped out at me.

Second point is a doozy, and it’s that you can’t look at intent when diagnosing frame control. As in, “what do they mean to do” should be held separate from “what are the effects of what they’re doing” - which I know is counter to almost every good lesson about engaging with people charitably. 

I think you're right in a narrow way but mostly wrong here. The narrow way in which you seem right is that (someone's intent) and (someone's impact) are indeed separate quantities. But someone having good intent—or someone seeming to have good intent, if one can generally discern this with above-random accuracy—means that their actions are more likely optimized to have good effects, and so these two quantities are generally correlated. 

In this section, your language conflates two possible scenarios (by my reading). In the first, we condition on "leader X seems to have good intent." In the second, we condition on "me and my friends are talking about how leader X is deeply flawed and perceptive, and the things he did that hurt people were either for their own good, or an unintentional byproduct of him genuinely trying to do good." The second scenario is very different.

In the first scenario, we surely have P(bad frame-controller | leader X seems to have good intent) < P(bad frame-controller | leader X does not seem to have good intent).
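
To make the first-scenario inequality concrete, here is a minimal sketch of the Bayes-rule arithmetic behind it. All numbers are hypothetical and chosen only for illustration; the one substantive assumption is that bad frame-controllers are somewhat less likely to seem well-intentioned than everyone else, which is the correlation between intent and impact described above.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' rule: P(H | evidence) for a binary hypothesis H."""
    joint_h = prior * p_evidence_given_h
    joint_not_h = (1 - prior) * p_evidence_given_not_h
    return joint_h / (joint_h + joint_not_h)

# Hypothetical numbers, not taken from the post or from any data.
prior_bad = 0.05                # P(bad frame-controller)
p_seems_good_given_bad = 0.6    # bad actors often still seem well-meaning
p_seems_good_given_ok = 0.9     # most other people seem well-meaning

# P(bad frame-controller | seems to have good intent)
p_bad_given_seems_good = posterior(
    prior_bad, p_seems_good_given_bad, p_seems_good_given_ok
)

# P(bad frame-controller | does not seem to have good intent)
p_bad_given_not_seems_good = posterior(
    prior_bad, 1 - p_seems_good_given_bad, 1 - p_seems_good_given_ok
)

print(round(p_bad_given_seems_good, 3), round(p_bad_given_not_seems_good, 3))
```

Under these assumed numbers, seeming well-intentioned lowers the probability below the prior and not seeming so raises it. The direction of the inequality depends only on the first likelihood being smaller than the second, not on the particular values.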

In the second scenario, P(bad frame-controller | leader X is flawed but seems well-meaning, [other red flags]) should still be less than P(bad frame-controller | leader X is flawed but does not seem well-meaning, [other red flags]), but I think that the probability of something bad going on should be high either way.

And so as far as I can tell, intent does matter for the beliefs you arrive at, and I am very, very wary that this post claims otherwise. "Frame control" has the potential to be an argumentative superweapon.

This all might sound pretty dark, like I’m painting a reality where you might go around squinting at empathetic, open, caring people who have zero ill intent whatsoever and trying to figure out how they are ‘actually bad.’ And this is kind of true, but if only because “I am an empathetic, open, caring person with zero ill intent” is exactly the kind of defense actual frame-controllers inhabit. 

Surely we have P(awful frame control | They seem like an empathetic, open, caring person with zero ill intent) << P(awful frame control | They do not seem like an empathetic, open, caring person with zero ill intent)?

The vast majority of good people with good intent aren’t doing any significant kind of frame control; my point is just that “good person with good intent” should not be considered a sufficient defense if there seems to be other elements of frame control present.

But then you back off the original claim that you can't look at intent, and merely say that good intent is not sufficient to conclude that they aren't doing this horrible frame control. OK. I agree. But this feels like a motte and bailey. 

Rereading the portion in question to make sure I'm not missing something, you write:

And so, when evaluating frame control, you have to throw out intent. The question is not “does this person mean to control my frame,” the question is “is this person controlling my frame?”. This is especially true for diagnosing frame control that you’re inside of, because the first defense a frame controller uses is the empathy you hold for them.

"Be on the lookout for ways people can weaponize your empathy" seems wiser to me than "throw out intent."
