Where I currently disagree with Ryan Greenblatt’s version of the ELK approach

[-]Alex Flint3yΩ6129

How actually do you sidestep the need for the One True Objective Function given an ELK solution? I get that it might seem plausible to take a rough objective like "do what I intend" and look at the internal knowledge of the thing for signs that it is deliberately deceiving you. If you do that, you'll get, at best, an AI that doesn't know that it is deceiving you (for whatever operationalization of "know" you come up with as you use ELK for training). But it could still be deceiving you, and very likely will be if optimization pressure is merely towards "AIs that don't know that they are being deceptive".

[-]paulfchristiano3yΩ8102

Our very broad hope is to use ELK to select actions that (i) keep humans safe, and give them time and space to evolve according to their current (essentially local) preferences, (ii) are expected to produce outcomes that would be judged favorably by the future humans, primarily by maximizing option value until it becomes clear what those future humans want (see the strategy stealing assumption).

This is discussed very briefly in this appendix of the ELK report and the subsequent appendix. There are two or three big foreseeable difficulties with this approach and likely a bunch of other problems.

I don't think this should be particularly persuasive, but it hopefully illustrates how ARC is currently thinking about this part of the problem. Overall my current view is that this is fairly unlikely to be the weakest link in the plan, i.e. if it doesn't work it will be because of a failure at an earlier step, and so it's not one of the main things I'm thinking about.

[-]paulfchristiano3yΩ6111

Reiterating a point above: observe how this whole scheme has basically assumed that capabilities won't start to generalize relevantly out of distribution. My model says that they eventually will, and that this is precisely when things start to get scary, and that one of the big hard bits of alignment is that once that starts happening, the capabilities generalize further than the alignment. A problem that has been simply assumed away in this agenda, as far as I can tell, before we even dive into the details of this framework.

My reply last time is still relevant: link.

[-]Adam Jermyn3yΩ7105

For what it's worth I found this writeup informative and clear. So lowering your standards still produced something useful (at least to me).

[-]ryan_greenblatt3yΩ490

... THEN the Paulian family of plans don't provide much hope.

My understanding is that Ryan was tentatively on board with this conditional statement, but Paul was not.

I forget the extent to which I communicated (or even thought) this in the past, but at the moment, the current claim I'd agree with is: "this specific plan is much less likely to work".

My best guess is that even if I was quite confident in those conditions being true, work on various subparts of this plan seems like quite a good bet.

[-]Charlie Steiner3yΩ250

Does Ryan have an agenda somewhere? I see this post, but I don't think that's it.

[-]ryan_greenblatt3yΩ241

I don't have an agenda posted anywhere.

LESSWRONG
LW

LESSWRONG
LW

65

Where I currently disagree with Ryan Greenblatt’s version of the ELK approach

65

Ω 34

65

Ω 34

Nate's model towards the end of the conversation

An attempt at conditional agreement

Postscript