Paul Christiano, Wei Dai, Andreas Stuhlmüller and I recently had an online chat discussion; the transcript of the discussion is available here. (Disclaimer: it's a nonstandard format, and we weren't optimizing for ease of understanding the transcript.) The discussion focused primarily on amplification of humans (not later amplification steps in IDA). Below are some highlights from the discussion; I'll include some questions that were raised that might merit further discussion in the comments.


Strategies for sampling from a human distribution of solutions

Paul: For example you can use "Use random human example," or "find an analogy to another example you know and use it to generate an example," or whatever.
There is some subtlety there, where you want to train the model to sample from the real human distribution rather than from the empirical distribution of 10 proposals you happen to have collected so far. If samples are cheap that's fine. Otherwise you may need to go further to "Given that [X1, X2, ...] are successful designs, what is a procedure that can produce additional successful designs?" or something like that. Not sure.

Dealing with unknown concepts

Andreas: Suppose you get a top-level command that contains words that H doesn't understand (or just doesn't look at), say something like "Gyre a farbled bleg.". You have access to some data source that is in principle enough to learn the meanings of those words. What might the first few levels of questions + answers look like?
Paul: possible questions: "What's the meaning of the command", which goes to "What's the meaning of word X" for the words X in the sentence, "What idiomatic constructions are involved in this sentence?", "What grammatical constructions are involved in the sentence"
Answers to those questions are big trees representing meanings, e.g. a list of properties of "gyre" (what properties the subject and object typically have, under what conditions it is said to have occurred, why someone might want you to do it, tons of stuff most of which will be irrelevant for the query)
Which come from looking up definitions, proposing definitions and seeing how well they match with usage in the cases you can look at, etc.
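
To make the shape of that first level concrete, here is a minimal sketch of the question tree Paul describes. The `Question` class and the `decompose_command` and `render` helpers are hypothetical names introduced only for illustration; the discussion does not specify any data structure, and the answers (the "big trees representing meanings") are not modeled here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    """A question plus the sub-questions it is broken into (hypothetical structure)."""
    text: str
    subquestions: List["Question"] = field(default_factory=list)


def decompose_command(command: str) -> Question:
    """Build the first level of questions for a command H does not understand."""
    # Split into words and drop surrounding punctuation; a real system would
    # need something far more careful than whitespace splitting.
    words = [w.strip(".,!?\"'") for w in command.split()]
    words = [w for w in words if w]
    word_questions = [Question(f"What's the meaning of word {w!r}?") for w in words]
    return Question(
        text=f"What's the meaning of the command {command!r}?",
        subquestions=word_questions + [
            Question("What idiomatic constructions are involved in this sentence?"),
            Question("What grammatical constructions are involved in the sentence?"),
        ],
    )


def render(question: Question, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, one per question."""
    lines = ["  " * depth + question.text]
    for sub in question.subquestions:
        lines.extend(render(sub, depth + 1))
    return lines


tree = decompose_command("Gyre a farbled bleg.")
print("\n".join(render(tree)))
```

Each leaf would then be answered by the lookup-and-test process described above (looking up definitions, proposing definitions and checking them against usage), which this sketch leaves out.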

Limits on what amplification can accomplish

Paul: In general, if ML can't learn to do a task, then that's fine with me. And if ML can learn to do a task but only using data source X, then we are going to have to integrate data source X into the amplification process in order for amplification to be able to solve it; there is no way to remove the dependence on arbitrary data sources. And there will exist data sources which pose alignment issues, independent of any alignment issues posed by the ML.

Alignment search for creative solutions

Considering the task of generating a solution to a problem that requires creativity, it can be decomposed into:

1) Generate solutions

2) Evaluate those solutions

For solution generation, one idea is to shape the distribution of proposals so you are less likely to get malign answers (i.e. sample from the distribution of answers a human would give, which would hopefully be more likely to be safe/easily evaluated compared to some arbitrary distribution).
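
The generate/evaluate decomposition can be sketched as a simple loop. This is a toy illustration, not anything proposed in the discussion: `generate_and_evaluate`, `human_like_propose`, `toy_evaluate` and the example proposals are all hypothetical stand-ins, and the evaluator here is a trivial string check rather than anything resembling a secure evaluation.

```python
import random
from typing import Callable, Optional


def generate_and_evaluate(
    propose: Callable[[random.Random], str],
    evaluate: Callable[[str], bool],
    num_samples: int,
    seed: int = 0,
) -> Optional[str]:
    """Draw proposals from `propose` and return the first one `evaluate` accepts.

    The choice of `propose` is where the proposal distribution gets shaped;
    `evaluate` is a separate check applied to every candidate.
    """
    rng = random.Random(seed)
    for _ in range(num_samples):
        candidate = propose(rng)
        if evaluate(candidate):
            return candidate
    return None  # no acceptable proposal found within the sample budget


# Hypothetical stand-in for "the distribution of answers a human would give".
HUMAN_LIKE_PROPOSALS = [
    "reuse known design",
    "adapt analogous design",
    "combine two known designs",
]


def human_like_propose(rng: random.Random) -> str:
    return rng.choice(HUMAN_LIKE_PROPOSALS)


def toy_evaluate(candidate: str) -> bool:
    # Toy property X: the proposal adapts an existing design.
    return "adapt" in candidate


result = generate_and_evaluate(human_like_propose, toy_evaluate, num_samples=200)
print(result)
```

Swapping `human_like_propose` for a generator over an arbitrary distribution leaves the loop unchanged but puts the whole safety burden on `evaluate`.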

I asked Paul if he thought that safe creative solution generation would require sampling from a less malign distribution, or whether he thought we could solve evaluation (“secure X-evaluation”, i.e. testing whether the solution fulfills property X) well enough to use an arbitrary distribution/brute force search.

Paul: I don't see a good way to avoid solving secure X-evaluation anyway. It seems to me like we can generate solutions in ways that put much lower probability on malign answers, but it neither seems like we can totally eliminate that (I don't think human creativity totally eliminates that either), nor that we will always have access to some more-aligned human generator.
The best I'd probably say is that we can have a generation process that is not itself malign, not clear if that is helpful at all though.

We then dived into how well we could solve secure X-evaluation. I was particularly interested in questions like how we could evaluate whether a design had potentially harmful side-effects.

Paul: I think what we want is something like: if the designing process knows that X is bad, then the evaluator will also know it. If the designing process doesn't know that X is bad, then that's not malign.
[to be clear, for this discussion we only need security in the infinite limit; in practice the capability of both sides will be limited by the capability of the ML, so we'll also need something to make sure the evaluating-ML does better than the generator-ML, but that seems like a separate issue.]
William: If you imagine slowly increasing the intelligence of the generator, then for any heuristic, it might first start picking solutions that fulfill that heuristic more often before actually understanding the heuristic, and it might take longer after that before the generator understands that the heuristic works because of a causal pathway that involves negative side effects. Is it the case that you'd say that this is an acceptable outcome/something that we can't really get past?
Paul: If neither the evaluator nor generator knows about the negative side effect, it's hard for the negative side effect to lead to higher evaluations. I agree this can happen sometimes (I wrote the implicit extortion post to give an example, there are certainly others), but they seem OK to accept as "honest mistakes" so far, i.e. none of them pose an existential risk.
In terms of "what amplification is supposed to accomplish," if there is a problem that could just as well afflict a human who is trying their best to help me get what I want (and has the AI's profile of abilities), then I'm basically considering that out of scope.
Whatever we could tell to a human, to help them avoid this kind of problem, we could also tell to an aligned AI, so the problem is factored into (a) help a human avoid the problem, (b) build aligned AI.

Open Question: Working with concepts that the human can’t understand

Question: when we need to assemble complex concepts by learning/interacting with the environment, rather than using H's concepts directly, and when those concepts influence reasoning in subtle/abstract ways, how do we retain corrigibility/alignment?

Paul: I don't have any general answer to this, seems like we should probably choose some example cases. I'm probably going to be advocating something like "Search over a bunch of possible concepts and find one that does what you want / has the desired properties."

E.g. for elegant proofs, you want a heuristic that gives successful lines of inquiry higher scores. You can explore a bunch of concepts that do that, evaluate each one according to how well it discriminates good from bad lines of inquiry, and also evaluate other stuff like "What would I infer from learning that a proof is `elegant` other than that it will work" and make sure that you are OK with that.
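
A minimal sketch of the first half of that procedure (searching over candidate concepts and scoring how well each discriminates good from bad lines of inquiry) might look like the following. Everything here is a hypothetical illustration: the candidate heuristics, the labeled examples, and the pairwise-ranking score are assumptions, not something specified in the discussion. The second check, "What would I infer from learning that a proof is `elegant` other than that it will work", is not modeled.

```python
from typing import Callable, Dict, List, Tuple

Heuristic = Callable[[str], float]


def discrimination_score(heuristic: Heuristic, good: List[str], bad: List[str]) -> float:
    """Fraction of (good, bad) pairs where the heuristic scores the good one strictly higher."""
    pairs = [(g, b) for g in good for b in bad]
    if not pairs:
        raise ValueError("need at least one good and one bad example")
    wins = sum(1 for g, b in pairs if heuristic(g) > heuristic(b))
    return wins / len(pairs)


def best_concept(
    candidates: Dict[str, Heuristic], good: List[str], bad: List[str]
) -> Tuple[str, float]:
    """Return the name and score of the candidate that discriminates best."""
    scored = {name: discrimination_score(h, good, bad) for name, h in candidates.items()}
    name = max(scored, key=scored.get)
    return name, scored[name]


# Toy labeled lines of inquiry (hypothetical data).
good_lines = ["induct on n", "use symmetry"]
bad_lines = ["enumerate all 10^6 cases by hand", "guess and check every possibility"]

# Toy candidate concepts: score a line of inquiry by its description length.
candidates: Dict[str, Heuristic] = {
    "prefers_short": lambda s: -float(len(s)),
    "prefers_long": lambda s: float(len(s)),
}

print(best_concept(candidates, good_lines, bad_lines))
```

A heuristic that wins this comparison has only passed the "does it discriminate" test; whether you are OK with everything else the concept implies still has to be checked separately.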

Andreas: Suppose you don't have the concepts of "proof" and "inquiry", but learned them (or some more sophisticated analogs) using the sort of procedure you outlined below. I guess I'm trying to see in more detail that you can do a good job at "making sure you're OK with reasoning in ways X" in cases where X is far removed from H's concepts. (Unfortunately, it seems to be difficult to make progress on this by discussing particular examples, since examples are necessarily about concepts we know pretty well.)

This may be related to the more general question of what sorts of instructions you'd give H to ensure that if they follow the instructions, the overall process remains corrigible/aligned.

Open Question: Severity of “Honest Mistakes”

In the discussion about creative problem solving, Paul said that he was concerned about problems arising when the solution generator was deliberately searching for a solution with harmful side effects. Other failures could occur where the solution generator finds a solution with harmful side effects without “deliberately searching” for it. The question is how bad these “honest mistakes” would end up being.

Paul: I also want to make the further claim that such failures are much less concerning than what-I'm-calling-alignment failures, which is a possible disagreement we could dig into (I think Wei Dai disagrees or is very unsure).

This is one of my main cruxes. I have 2 main concerns about honest mistakes:

1) Compounding errors: IIUC, Paul thinks we can find a basin of attraction for alignment (or at least corrigibility...) so that an AI can help us correct it online to avoid compounding errors. This seems plausible, but I don't see any strong reasons to believe it will happen or that we'll be able to recognize whether it is or not.

2) The "progeny alignment problem" (PAP): An honest mistake could result in the creation of an unaligned progeny. I think we should expect that to happen quickly if we don't have a good reason to believe it won't. You could argue that humans recognize this problem, so an AGI should as well (and if it's aligned, it should handle the situation appropriately), but that begs the question of how we got an aligned AGI in the first place. There are basically 3 subconcerns here (call the AI we're building "R"):

2a) R can make an unaligned progeny before it's "smart enough" to realize it needs to exercise care to avoid doing so.

2b) R gets smart enough to realize that solving PAP (e.g. doing something like MIRI's AF) is necessary in order to develop further capabilities safely, and that ends up being a huge roadblock that makes R uncompetitive with less safe approaches.

2c) If R has gamma < 1, it could knowingly, rationally decide to build a progeny that is useful through R's effective horizon, but will take over and optimize a different objective after that.

2b and 2c are *arguably* "non-problems" (although they're at least worth taking into consideration). 2a seems like a more serious problem that needs to be addressed.
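
For 2c, it may help to spell out why gamma < 1 gives something like an "effective horizon". This is a sketch under an assumption the comment does not state explicitly: that R maximizes a standard exponentially discounted sum of per-step rewards r_t.

```latex
V = \sum_{t=0}^{\infty} \gamma^{t} r_t,
\qquad
\sum_{t=0}^{\infty} \gamma^{t} = \frac{1}{1-\gamma},
\qquad
\sum_{t=T}^{\infty} \gamma^{t} = \frac{\gamma^{T}}{1-\gamma}
\quad (0 \le \gamma < 1).
```

Under that assumption, everything from step T onward carries only a fraction gamma^T of the total weight, which shrinks geometrically in T, and 1/(1 - gamma) is often used as a rough measure of the horizon (for example, gamma = 0.99 gives 100 steps). A progeny that behaves well over roughly that span and badly afterwards would cost R very little by its own objective.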