Question regarding an alignment problem: one of the key difficulties in alignment, according to Eliezer Yudkowsky, is that if "the verifier is broken" (i.e. the human verifier measuring alignment can be fooled by the "alien actress"), then we cannot be sure that a given alignment evaluation is true. Has there been any serious discussion of using a daisy chain of increasingly intelligent systems to evaluate alignment?

Hand-wavily: let human intelligence be H. Can we find some epsilon ε such that we can construct a series of n increasingly intelligent systems with intelligence I(n) = H + n*ε, asking only for one-hop-forward verification within the chain? That is, system n verifies system n+1, and the human (whose intelligence matches system 0) verifies system 1.
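To pin down the construction, here's a minimal sketch in Python. The scalar model of intelligence, the `can_verify` predicate, and all constants are illustrative assumptions, not claims about how real verification works:

```python
# Toy model of one-hop-forward verification along a daisy chain.
# "Intelligence" is a single scalar, and verification is assumed to
# succeed iff the verifier is within some gap tolerance of the system
# it checks -- both are loud simplifying assumptions.

H = 1.0          # human-level intelligence (arbitrary units)
EPSILON = 0.01   # per-step intelligence increment
N_STEPS = 100    # length of the chain

def intelligence(n: int) -> float:
    """I(n) = H + n * EPSILON: intelligence of system n (system 0 = human)."""
    return H + n * EPSILON

def can_verify(verifier: float, target: float, gap_tolerance: float) -> bool:
    """Assumed: verification succeeds iff the intelligence gap is small enough."""
    return target - verifier <= gap_tolerance

def daisy_chain_holds(gap_tolerance: float) -> bool:
    """System n verifies system n+1; the human (system 0) verifies system 1."""
    return all(
        can_verify(intelligence(n), intelligence(n + 1), gap_tolerance)
        for n in range(N_STEPS)
    )

print(daisy_chain_holds(gap_tolerance=0.02))  # True: every hop is exactly EPSILON
```

Under these assumptions the whole question reduces to whether a workable gap tolerance greater than ε exists at every rung, which is exactly what the counter-arguments below dispute.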

Are there reasons to think such an epsilon does or does not exist, and, if it exists, could it be found in practice?

A counter-argument might be that all we can control via epsilon is horsepower, while intelligence (thought of here as an output of horsepower rather than something we can set directly) may be nearly discontinuous in horsepower. If so, there will be some n at which the jump I(n+1)/I(n) is too large and breaks verification. Another argument against is that the required epsilon may be so small, and n therefore so large, that running n systems simultaneously and daisy-chaining them would be infeasible resource-wise, so it would never actually get done.
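The first objection can be made concrete with a toy sketch. Here intelligence is modeled as a steep sigmoid in horsepower; the curve and every constant are arbitrary assumptions chosen only to illustrate the shape of the argument:

```python
import math

# Counter-argument sketch: epsilon lets us space *horsepower* evenly,
# but if intelligence is a near-discontinuous function of horsepower,
# some single hop still has an intelligence jump big enough to break
# one-hop verification.

def intelligence_from_horsepower(hp: float) -> float:
    """An arbitrary steep sigmoid: intelligence jumps sharply near hp = 5."""
    return 1.0 + 10.0 / (1.0 + math.exp(-50.0 * (hp - 5.0)))

EPSILON_HP = 0.1  # small, evenly spaced horsepower increments
horsepowers = [n * EPSILON_HP for n in range(101)]  # hp from 0.0 to 10.0
intelligences = [intelligence_from_horsepower(hp) for hp in horsepowers]

# Largest one-hop ratio I(n+1)/I(n) anywhere along the chain:
worst_jump = max(later / earlier
                 for earlier, later in zip(intelligences, intelligences[1:]))
print(f"worst one-hop jump: {worst_jump:.2f}x")  # ~5.6x, despite tiny hp steps
```

The point of the sketch is just that controlling the spacing of the input (horsepower) does not control the spacing of the output (intelligence) when the mapping between them is steep.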

Still, curious if there's a good discussion of this somewhere.

1 Answer

Decaeneus

Feb 21, 2024

Upon reflection, the only way this would work is if verification were easier than deception, so to speak, and it's not obvious that it is. Among humans, for instance, it seems very difficult in the general case for a more intelligent person to tell whether a less intelligent person is lying or telling the truth, unless the verifier has extra resources and can gather evidence, which is very hard to do for some topics, such as the verified party's internal state. So among humans, in general, deception seems easier than verification.

So perhaps the daisy chain only travels down the intelligence scale, not up.