PhD student at the University of Toronto, studying machine learning and working on AI safety problems.
I'm talking about an imitation version where the human you're imitating is allowed to do anything they want, including instatiting a search over all possible outputs X and taking that one that maximizes the score of "How good is answer X to Y?" to try to find X*. So I'm more pointing out that this behaviour is available in imitation by default. We could try to rule it out by instructing the human to only do limited searches, but that might be hard to do along with maintaining capabilities of the system, and we need to figure out what "safe limited search" actually looks like.
If M2 has adversarial examples or other kinds of robustness or security problems, and we keep doing this training for a long time, wouldn't the training process sooner or later sample an X that exploits M2 (gets a high reward relative to other answers without actually being a good answer), which causes the update step to increase the probability of M1 giving that output, and eventually causes M1 to give that output with high probability?
I agree, and think that this problem occurs both in imitation IA and RL IA
For example is the plan to make sure M2 has no such robustness problems (if so how)?
I believe the answer is yes, and I think this is something that would need to be worked out/demonstrated. I think there is one hope that if M2 can increase the amount computing/evaluation power it uses for each new sample X as we take more samples, then you can keep taking more samples without ever accepting an adversarial one (This assumes something like for any adversarial example, all M2 with at least some finite amount of computing power will reject it). There's maybe another hope that you could make M2 robust if you're allowed to reject many plausibly good X in order to avoid false positives. I think both of these hopes are in the IOU status, and maybe Paul has a different way to put this picture that makes more sense.
Overall, I think imitative amplification seems safer, but I maybe don't think the distinction is as clear cut as my impression of this post gives.
if you can instruct them not to do things like instantiate arbitrary Turing machines
I think this and "instruct them not to search over arbitrary text strings for the text string that gives the most approval", and similar things, are the kind of details that would need to be filled out to make the thing you are talking about actually be in a distinct class from approval-based amplification and debate (My post on imitation and RL amplification was intended to argue that without further restrictions, imitation amplification is in the same class as approval-based amplification, which I think we'd agree on). I also think that specifying these restrictions in a way that still lets you build a highly capable system could require significant additional alignment work (as in the Overseer's Manual scenario here)
Conversely, I also think there are ways that you can limit approval-based amplification or debate - you can have automated checks, for example, that discard possible answers that are outside of a certain defined safe class (e.g. debate where each move can only be from either a fixed library of strings that humans produced in advance or single direct quotes from a human-produced text). I'd also hope that you could do something like have a skeptical human judge that quickly discards anything they don't understand + an ML imitation of the human judge that discards anything outside of the training distribution (don't have a detailed model of this, so maybe it would fail in some obvious way)
I think I do believe that for problems where there is a imitative amplification decomposition that solves the problem without doing search, that's more likely to be safe by default than approval-based amplification or debate. So I'd want to use imitative amplification as much as possible, falling back to approval only if needed. On imitative amplification, I'm more worried that there are many problems it can't solve without doing approval-maximizing search, which brings the old problems back in again. (e.g. I'm not sure how to use imitative amplification at the meta-level to produce better decomposition strategies than humans use without using approval-based search)
Substituting parts of the computation (replace a slow, correct algorithm for part of the computation with a fast, approximate one)
One situation is: maybe an HBO tree of size 10^20 runs into a security failure with high probability, but an HBO tree of size 10^15 doesn't and is sufficient to output a good LBO overseer.
Yeah, that's my current picture of it.
I understand that, but I don't see why #2 is likely to be achievable. Corrigibility seems very similar to Wei Dai's translation example, so it seems like there could be many deceptive actions that humans would intuitively recognize as not corrigible, but which would fool an early-stage LBO tree into assigning a high reward. This seems like it would be a clear example of "giving a behaviour a high reward because it is bad". Unfortunately I can't think of any good examples, so my intuition may simply be mistaken.
For factored cognition: I think the reason #2 might be achievable relies on assuming that there is some reason that the bad features is selected in the first place and is assigned high reward. For example, this might have happened because the agent ran a simulation forward, and then realized that if they punch the human they can take the reward button from them. The hope is that we can figure out that the simulation process happened and why it lead to the wrong thing (or outlaw simluations of this form in the first place).
For factored evaluation, I think the story is a bit different (relying on the other expert being able to understand the reasons for the reward assignment and point it out to the judge, but I don't think the judge needs to be able to find it on there own).
Incidentally, it seems like Ought could feasibly test whether meta-execution is sufficient to ensure corrigibility; for example, a malicious expert could recommend deceptive/influence-seizing actions to an agent in a simulated environment, and the meta-execution tree would have to detect every deceptive action without any contextual knowledge . Are there any plans to do this?
No plans currently, but it would be interesting.
Suppose an action is evaluated as a linear combination of a set of human interpretable features.
The action "punch the human" could be selected because 1) many of the reward weights of these features could be wrong, or it could be selected because 2) there is one feature "this action prevents the human from turning me off" that is assigned high reward. I think the thing we'd want to prevent in this case is 2) but not 1), and I think that's more likely to be achievable.
I think it's a general method that is most applicable in LBO, but might still be used in HBO (eg. an HBO overseer can read one chapter of a math textbook, but this doesn't let it construct an ontology that let's it solve complicated math problems, so instead it needs to use meta-execution to try to manipulate objects that it can't reason about directly.