Verification and Transparency

This seems like a useful lens -- thanks for taking the time to post it!

Understanding Iterated Distillation and Amplification: Claims and Oversight

I do agree. I think the main reason to stick with "robustness" or "reliability" is that that's how the problems of "my model doesn't generalize well / is subject to adversarial examples / didn't really hit the training target outside the training data" are already referred to in ML, and renaming existing problems gives a bad impression. I'm definitely most in favor of giving a new name like "hitting the target" if we think the problem we care about is different in a substantial way (which could definitely happen going forward!)

Understanding Iterated Distillation and Amplification: Claims and Oversight

OK -- if it looks like the delay will be super long, we can certainly ask him whether he'd be OK w/ us circulating / attributing those ideas. In the meantime, there are pretty standard norms about unpublished work that's been shared for comments, and I think it makes sense to stick to them.

Understanding Iterated Distillation and Amplification: Claims and Oversight

I agree re: terminology, but probably further discussion of unpublished docs should just wait until they're published.

Understanding Iterated Distillation and Amplification: Claims and Oversight

Thanks for writing this, Will! I think it's a good + clear explanation, and "high/low-bandwidth oversight" seems like a useful pair of labels.

I've recently found it useful to think about two kind-of-separate aspects of alignment (I think I first saw these clearly separated by Dario in an unpublished Google Doc):

1. "target": can we define what we mean by "good behavior" in a way that seems in-principle learnable, ignoring the difficulty of learning reliably / generalizing well / being secure? E.g. in RL, this would be the Bellman equation or recursive definition of the Q-function. The basic issue here is that it's super unclear what it means to "do what the human wants, but scale up capabilities far beyond the human's".

2. "hitting the target": given a target, can we learn it in a way that generalizes "well"? This problem is very close to the reliability / security problem a lot of ML folks are thinking about, though our emphasis and methods might be somewhat different. Ideally our learning method would be very reliable, but the critical thing is that we should be very unlikely to learn a policy that is powerfully optimizing for some other target (malign failure / daemon). E.g. inclusive genetic fitness is a fine target, but the learning method got humans instead -- oops.
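For concreteness, the "recursive definition of the Q-function" mentioned as the RL example of a target in problem 1 is the Bellman optimality equation; written out in standard notation (this is textbook material, not from the original comments):

```latex
% Bellman optimality equation: the recursive definition of the optimal
% Q-function. Q^*(s,a) is the expected return from taking action a in
% state s and acting optimally thereafter; gamma is the discount factor,
% P the transition distribution, r the reward function.
\[
  Q^*(s, a) \;=\; \mathbb{E}_{s' \sim P(\cdot \mid s, a)}
  \left[\, r(s, a) \;+\; \gamma \max_{a'} Q^*(s', a') \,\right]
\]
```

The point of the analogy: this equation fully specifies "good behavior" in principle, separately from the question of whether any learning procedure reliably finds a Q-function satisfying it -- which is exactly the target / hitting-the-target split above.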

I've largely been optimistic about IDA because it looks like a really good step forward for our understanding of problem 1 (in particular because it takes a very different angle from CIRL-like methods that try to learn some internal values-ish function by observing human actions). Problem 2 wasn't really on my radar before (maybe because problem 1 was so open / daunting / obviously critical); now it seems like a huge deal to me, largely thanks to Paul, Wei Dai, some unpublished Dario stuff, and more recently some MIRI conversations.

Current state:

  • I do think problem 2 is super-worrying for IDA, and probably for all ML-ish approaches to alignment? If there are arguments that different approaches are better on problem 2, I'd love to see them. Problem 2 seems like the most likely reason right now that we'll later be saying "uh, we can't make aligned AI, time to get really persuasive to the rest of the world that AI is very difficult to use safely".
  • I'm optimistic about people sometimes choosing only problem 1 or problem 2 to focus on with a particular piece of work -- it seems like "solve both problems in one shot" is too high a bar for any one piece of work. It's most obvious that you can choose to work on problem 2 and set aside problem 1 temporarily -- a ton of ML people are doing this productively -- but I also think it's possible and probably useful to sometimes say "let's map out the space of possible solutions to problem 1, and maybe propose a family of new ones, w/o diving super deep on problem 2 for now."

Bet or update: fixing the will-to-wager assumption

I really like this post, and am very glad to see it! Nice work.

I'll pay whatever cost I need to for violating the norms against non-substantive comments in order to say this -- an upvote didn't seem like enough.

AI safety: three human problems and one AI issue

Thanks for writing this -- I think it's a helpful kind of reflection for people to do!

Where's the first benign agent?

Ah, gotcha. I'll think about those points -- I don't have a good response. (Actually adding "think about"+(link to this discussion) to my todo list.)

It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level.

Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?

Where's the first benign agent?

These objections are all reasonable, and objection 3 is especially interesting to me -- it seems like the biggest objection to the structure of the argument I gave. Thanks.

I'm afraid that the point I was trying to make didn't come across, or that I'm not understanding how your response bears on it. Basically, I thought the post was prematurely assuming that schemes like Paul's are not amenable to any kind of argument for confidence, and we will only ever be able to say "well, I ran out of ideas for how to break it", so I wanted to sketch an argument structure to explain why I thought we might be able to make positive arguments for safety.

Do you think it's unlikely that we'll be able to make positive arguments for the safety of schemes like Paul's? If so, I'd be really interested in why -- apologies if you've already tried to explain this and I just haven't figured that out.

Where's the first benign agent?

"naturally occurring" means "could be inputs to this AI system from the rest of the world"; naturally occurring inputs don't need to be recognized -- they serve as the base case for the induction. Does that make sense?

If there are other really powerful reasoners in the world, then they could produce value-corrupting single pages of text (and I would then worry about Soms becoming corrupted). If there aren't, I'd guess that possible input single pages of text aren't value-corrupting in an hour. (I would certainly want a much better answer than "I guess it's fine" if we were really running something like this.)

To clarify my intent here, I wanted to show a possible structure of an argument that could make us confident that value drift wasn't going to kill us. If you think it's really unlikely that any argument of this inductive form could be run, I'd be interested in hearing why (likewise if Paul or someone else thinks I'm on the wrong track / making some kind of fundamental mistake).