Paul Christiano, Wei Dai, Andreas Stuhlmüller and I recently had an online chat discussion; the transcript is available here. (Disclaimer: it’s in a nonstandard format, and we weren't optimizing for ease of understanding the transcript.) The discussion focused primarily on amplification of humans (not later amplification steps in IDA). Below are some highlights from the discussion, along with some questions that were raised and might merit further discussion in the comments.
Strategies for sampling from a human distribution of solutions
Paul: For example you can use "Use random human example," or "find an analogy to another example you know and use it to generate an example," or whatever.
There is some subtlety there, where you want to train the model to sample from the real human distribution rather than from the empirical distribution of 10 proposals you happen to have collected so far. If samples are cheap that's fine. Otherwise you may need to go further to "Given that [X1, X2, ...] are successful designs, what is a procedure that can produce additional successful designs?" or something like that. Not sure.
Dealing with unknown concepts
Andreas: Suppose you get a top-level command that contains words that H doesn't understand (or just doesn't look at), say something like "Gyre a farbled bleg." You have access to some data source that is in principle enough to learn the meanings of those words. What might the first few levels of questions + answers look like?
Paul: possible questions: "What's the meaning of the command", which goes to "What's the meaning of word X" for the words X in the sentence, "What idiomatic constructions are involved in this sentence?", "What grammatical constructions are involved in the sentence"
Answers to those questions are big trees representing meanings, e.g. a list of properties of "gyre" (what properties the subject and object typically have, under what conditions it is said to have occurred, why someone might want you to do it, tons of stuff most of which will be irrelevant for the query)
Which come from looking up definitions, proposing definitions and seeing how well they match with usage in the cases you can look at, etc.
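A minimal sketch of what the first two levels of that question tree might look like. The `Question` type, the `decompose_command` helper, and the naive whitespace split used to pick out words are all assumptions made for illustration; none of them come from the discussion, and answering the questions (looking up or proposing definitions) is left out entirely.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Question:
    """A node in a hypothetical question tree; children are sub-questions."""
    text: str
    children: list[Question] = field(default_factory=list)


def decompose_command(command: str) -> Question:
    """Build the top-level question and its immediate sub-questions.

    Splitting on whitespace is a stand-in for however H would actually
    pick out the words of the sentence.
    """
    words = [w.strip('.,!?"').lower() for w in command.split()]
    words = [w for w in words if w]

    root = Question(f"What's the meaning of the command {command!r}?")
    # One sub-question per word in the sentence.
    root.children = [
        Question(f"What's the meaning of the word {w!r}?") for w in words
    ]
    # Plus the two sentence-level sub-questions Paul lists.
    root.children.append(
        Question("What idiomatic constructions are involved in this sentence?")
    )
    root.children.append(
        Question("What grammatical constructions are involved in this sentence?")
    )
    return root


def render(question: Question, depth: int = 0) -> list[str]:
    """Flatten the tree into indented lines, one per question."""
    lines = ["  " * depth + question.text]
    for child in question.children:
        lines.extend(render(child, depth + 1))
    return lines


tree = decompose_command("Gyre a farbled bleg.")
print("\n".join(render(tree)))
```

Each word-level question would in turn expand into the kind of large answer tree Paul describes (typical subjects and objects, conditions of use, and so on), which this sketch does not attempt to model.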
Limits on what amplification can accomplish
Paul: In general, if ML can't learn to do a task, then that's fine with me. And if ML can learn to do a task but only using data source X, then we are going to have to integrate data source X into the amplification process in order for amplification to be able to solve it; there is no way to remove the dependence on arbitrary data sources. And there will exist data sources which pose alignment issues, independent of any alignment issues posed by the ML.
Alignment search for creative solutions
The task of generating a solution to a problem that requires creativity can be decomposed into:
Generate solutions
Evaluate those solutions
For solution generation, one idea is to shape the distribution of proposals so you are less likely to get malign answers (i.e., sample from the distribution of answers a human would give, which would hopefully be more likely to be safe and easily evaluated than some arbitrary distribution).
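A toy calculation can make the effect of shaping the proposal distribution concrete. Everything here is assumed for illustration: the three candidate solutions, their weights, and an evaluator that is deliberately imperfect (it accepts one malign solution). The sketch computes the exact probability that a proposal accepted by generate-then-evaluate is malign, under a uniform proposal distribution and under a hypothetical human-like one.

```python
from fractions import Fraction

# Hypothetical toy candidates: (name, passes_evaluation, is_malign).
# The evaluator is assumed to be imperfect: it accepts one malign solution.
CANDIDATES = [
    ("works_and_safe", True, False),
    ("fails", False, False),
    ("works_but_malign", True, True),
]


def malign_acceptance_rate(weights: dict[str, int]) -> Fraction:
    """Probability that an accepted proposal is malign.

    `weights` gives the relative weight each candidate has under the
    proposal distribution. Proposals are sampled until one passes
    evaluation, so the result is the malign share of the accepted weight.
    Assumes at least one accepted candidate has positive weight.
    """
    accepted = sum(
        Fraction(weights[name]) for name, passes, _ in CANDIDATES if passes
    )
    accepted_malign = sum(
        Fraction(weights[name])
        for name, passes, malign in CANDIDATES
        if passes and malign
    )
    return accepted_malign / accepted


# An arbitrary (uniform) proposal distribution.
uniform = {"works_and_safe": 1, "fails": 1, "works_but_malign": 1}
# A hypothetical human-like distribution that rarely proposes the malign one.
human_like = {"works_and_safe": 6, "fails": 3, "works_but_malign": 1}

print(malign_acceptance_rate(uniform))     # 1/2
print(malign_acceptance_rate(human_like))  # 1/7
```

In this toy setup, shaping the distribution lowers the chance of accepting a malign answer but does not bring it to zero while the evaluator still accepts that answer.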
I asked Paul whether he thought that safe creative solution generation would require sampling from a less malign distribution, or whether he thought we could solve evaluation (“secure X-evaluation”, i.e. testing whether the solution fulfills property X) well enough to use an arbitrary distribution or brute-force search.
Paul: I don't see a good way to avoid solving secure X-evaluation anyway. It seems to me like we can generate solutions in ways that put much lower probability on malign answers, but it neither seems like we can totally eliminate that (I don't think human creativity totally eliminates that either), nor that we will always have access to some more-aligned human generator.
The best I'd probably say is that we can have a generation process that is not itself malign, not clear if that is helpful at all though.
We then dived into how well we could solve secure X-evaluation. I was particularly interested in questions like how we could evaluate whether a design had potentially harmful side-effects.
Paul: I think what we want is something like: if the designing process knows that X is bad, then the evaluator will also know it. If the designing process doesn't know that X is bad, then that's not malign.
[to be clear, for this discussion we only need security in the infinite limit; in practice the capability of both sides will be limited by the capability of the ML, so we'll also need something to make sure the evaluating-ML does better than the generator-ML, but that seems like a separate issue.]
William: If you imagine slowly increasing the intelligence of the generator, then for any heuristic, it might first start picking solutions that fulfill that heuristic more often before actually understanding the heuristic, and it might take longer after that before the generator understands that the heuristic works because of a causal pathway that involves negative side effects. Is it the case that you'd say that this is an acceptable outcome/something that we can't really get past?
Paul: If neither the evaluator nor generator knows about the negative side effect, it's hard for the negative side effect to lead to higher evaluations. I agree this can happen sometimes (I wrote the implicit extortion post to give an example, there are certainly others), but they seem OK to accept as "honest mistakes" so far, i.e. none of them pose an existential risk.
In terms of "what amplification is supposed to accomplish," if there is a problem that could just as well afflict a human who is trying their best to help me get what I want (and has the AI's profile of abilities), then I'm basically considering that out of scope.
Whatever we could tell to a human, to help them avoid this kind of problem, we could also tell to an aligned AI, so the problem is factored into (a) help a human avoid the problem, (b) build aligned AI.