This is extremely cool -- thank you, Peter and Owen! I haven't read most of it yet, let alone the papers, but I have high hopes that this will be a useful resource for me.
It didn't bug me ¯\_(ツ)_/¯
Thanks for the post! FWIW, I found this quote particularly useful:
Well, on my reading of history, that means that all sorts of crazy things will be happening, analogous to the colonialist conquests and their accompanying reshaping of the world economy, before GWP growth noticeably accelerates!
The fact that it showed up right before an eye-catching image probably helped :)
This may be out-of-scope for the writeup, but I would love to get more detail on how this might be an important problem for IDA.
Thanks for the writeup! This Google Doc (linked near "raised this general problem" above) appears to be private: https://docs.google.com/document/u/1/d/1vJhrol4t4OwDLK8R8jLjZb8pbUg85ELWlgjBqcoS6gs/edit
This seems like a useful lens -- thanks for taking the time to post it!
I do agree. I think the main reason to stick with "robustness" or "reliability" is that ML already uses those terms for the problems of "my model doesn't generalize well / is subject to adversarial examples / didn't really hit the training target outside the training data", and it gives a bad impression when people rename problems. I'm definitely most in favor of giving a new name like "hitting the target" if we think the problem we care about is different in a substantial way (which could definitely happen going forward!).
OK -- if it looks like the delay will be super long, we can certainly ask him how he'd be OK w/ us circulating / attributing those ideas. In the meantime, there are pretty standard norms about unpublished work that's been shared for comments, and I think it makes sense to stick to them.
I agree re: terminology, but probably further discussion of unpublished docs should just wait until they're published.
Thanks for writing this, Will! I think it's a good + clear explanation, and "high/low-bandwidth oversight" seems like a useful pair of labels.
I've recently found it useful to think about two kind-of-separate aspects of alignment (I think I first saw these clearly separated by Dario in an unpublished Google Doc):
1. "target": can we define what we mean by "good behavior" in a way that seems in-principle learnable, ignoring the difficulty of learning reliably / generalizing well / being secure? E.g. in RL, this would be the Bellman equation or recursive definition of the Q-function. The basic issue here is that it's super unclear what it means to "do what the human wants, but scale up capabilities far beyond the human's".
2. "hitting the target": given a target, can we learn it in a way that generalizes "well"? This problem is very close to the reliability / security problem a lot of ML folks are thinking about, though our emphasis and methods might be somewhat different. Ideally our learning method would be very reliable, but the critical thing is that we should be very unlikely to learn a policy that is powerfully optimizing for some other target (malign failure / daemon). E.g. inclusive genetic fitness is a fine target, but the learning method got humans instead -- oops.
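For concreteness, the RL example in item 1 can be written out. This is the standard Bellman optimality equation for the Q-function; the symbols (states s, actions a, reward r, transition distribution P, discount factor γ) are conventional notation assumed here, not something taken from the discussion above:

$$
Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[\, r(s, a, s') + \gamma \max_{a'} Q^*(s', a') \,\right]
$$

The equation pins down what the "right" Q-function is, and so counts as a target in the sense of item 1. It says nothing about whether a particular learner will reliably find it or generalize well, which is the concern of item 2.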
I've largely been optimistic about IDA because it looks like a really good step forward for our understanding of problem 1 (in particular because it takes a very different angle from CIRL-like methods that try to learn some internal values-ish function by observing human actions). Problem 2 wasn't really on my radar before (maybe because problem 1 was so open / daunting / obviously critical); now it seems like a huge deal to me, largely thanks to Paul, Wei Dai, some unpublished Dario stuff, and more recently some MIRI conversations.