This post is a follow-up to my “Alignment proposals and complexity classes” post. Thanks to Sam Eisenstat for helping with part of the proof here.
Previously, I proved that imitative amplification with weak HCH, approval-based amplification, and recursive reward modeling access while AI safety via market making accesses . At the time, I wasn't sure whether my market making proof would generalize to the others, so I just published it with the proofs instead. However, I have since become convinced that the proof does generalize—and that it generalizes for all of the proposals I mentioned—such that imitative amplification with weak HCH, approval-based amplification, and recursive reward modeling all actually access . This post attempts to prove that.
Updated list of proposals by complexity class
: Imitation learning (trivial)
Imitative amplification with weak HCH accesses
The proof here is similar in structure to my previous proof that weak HCH accesses , so I'll only explain where this proof differs from that one.
First, since , we know that for any , halts in steps where . Thus, we can construct a function such that for all , halts in less than or equal to steps by picking large enough that they dominate all other terms in the polynomial for all . Note that is then computable in time polynomial in .
Second, let 's new strategy be as follows:
- Given , let . Then, return accept/reject based on whether is an accept or reject state (it will always be one or the other)
- Given , return where is the starting state and is the empty tape symbol.
- Given , let Then, return where is the next state of from state over symbol (where if we're already in an accept/reject state we just stay there) and is the new tape symbol at the head (which can be determined given ).
- Given , return .
- Given , let . Then, let be the amount by which on changes the head position by (such that ). If , let otherwise let . Return .
Note that this strategy precisely replicates the strategy used in my proof that market making accesses for inputs and . Thus, I'll just defer to that proof for why the strategy works and is polynomial time on those inputs. Note that this is where becomes essential, as it guarantees that and can be represented in polynomial space.
Then, the only difference between this strategy and the market making one is for the base input. On , given that works, will always return a state after has halted, which will always be an accept state if accepts and a reject state if it rejects. Furthermore, since and is computable in polynomial time, the strategy for is polynomial, as desired.
Since this procedure works for all , we get that amplification with weak HCH accesses , as desired.
Approval-based amplification accesses
The proof here is almost precisely the same as my previous proof that approval-based amplification accesses with the only modification being that we need to verify that are still polynomial in size for the new imitative amplification strategy given above. However, the fact that and are bounded above by means that they are representable in polynomial space, as desired.
Recursive reward modeling accesses
Again, the proof here is almost precisely the same as my previous proof that recursive reward modeling accesses with the only modification being the same as with the above proof that approval-based amplification accesses .