I agree that once you have a fully deceptively aligned model that's plotting against you and trying to figure out how to fool your transparency tools, there's not much hope left for you to be able to detect and remedy that issue. Importantly, however, that's not my model for how transparency could help you catch deception. Rather, the idea is that by using your transparency tools + overseer to guide your training process, you can prevent your training process from ever entering the regime where the model is trying to trick the transparency tools. This is especially important in the context of gradient hacking (as I mention in that post), since you have to inspect the entire training process if you want to know whether any gradient hacking occurred.

Put another way: once you're playing the game where I can hand you any model and then you have to figure out whether it's deceptive or not, you've already lost. Instead, you want to be in the regime where your training process is constructed so as to steer clear of situations in which your model might become deceptive in the first place. I see enforcing something like corrigibility or myopia as a way of doing this: before your model can become deceptive, it has to become non-myopic, which means if you can detect your model starting to become non-myopic, then you can prevent it from ever becoming deceptive.

From "Relaxed adversarial training for inner alignment:"

[For the recursion problem,] the recursion doesn't actually have to work for all $M \in M$ —rather, we only need our acceptability guarantee to hold along any path through model space that could actually be traversed by our training procedure.

Thus, it's fine if our procedure fails if given a sufficiently unacceptable model, as long as our procedure can prevent such models from ever arising. For example, if we always used an old version of $M$ that we were confident in for implementing $Amp (M)$ , then we would just need it to be the case that $M$ couldn't get to the point where it would be able to deceive $Amp (M_{old})$ in-between oversight.

Reply

[-]Matthew Barnett6yΩ390

Rather, the idea is that by using your transparency tools + overseer to guide your training process, you can prevent your training process from ever entering the regime where your model is trying to trick you. This is especially important in the context of gradient hacking (as I mention in that post)

Indeed. I re-read the post and I noticed that I hadn't realized how much of your reasoning applied directly as a reply to my argument.

Put another way: once you're playing the game where I can hand you any model and then you have to figure out whether it's deceptive or not, you've already lost.

For the record, I think this is probably false (unless the agent is much smarter than you). The analogy I had with GANs was an example of how this isn't necessarily true. If you just allowed an overseer to inspect the model, and trained that overseer end-to-end to detect deception, it would be very good at doing so. The main point of this post was not to give general pessimism about detecting deception, but to offer the idea that having a "tool" (as I defined it) for the overseer to use provides little more than a crutch, decreasing its effectiveness.

Now it is probably also true that monitoring the model's path through model-space during training would greatly enhance the safety guarantee. But I wouldn't go as far as saying that the alternative is doomed.

Reply

[-]romeostevensit6yΩ240

+1 and a frame that might be helpful is figuring out when adversarial training happens anyway and why. I think this happens with human management and might be illustrative.

Reply

[-]Rohin Shah6yΩ460

My summary for the Alignment Newsletter:

Recent posts have been optimistic about using transparency tools to detect deceptive behavior. This post argues that we may not want to use transparency tools, because then the deceptive model can simply adapt to fool the transparency tools. Instead, we need something more like an end-to-end trained deception checker that's about as smart as the deceptive model, so that the deceptive model can't fool it.

My opinion:

In a comment, Evan Hubinger makes a point I agree with: the transparency tools don't need to be able to detect all deception; they just need to prevent the model from developing deception. If deception gets added slowly (i.e. the model doesn't "suddenly" become perfectly deceptive), then this can be way easier than detecting deception in arbitrary models, and could be done by tools.

Reply

[-]Wei Dai6yΩ450

It seems like you're now putting most of your hope on training a deception checker end-to-end, but that seems highly non-trivial to me. Can you talk about how you hope to do that? For example what training data would you use and why do you think the deception checker will be able to correctly generalize from the training examples to the real models that the it will be applied to?

Reply

Moderation Log

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

43

Will transparency help catch deception? Perhaps not

43

Ω 19

43

Ω 19

The model

A deceptive image classifier

Why are we assuming that the transparency tools are less intelligent than the deceptive model?

A fictional suggestion for training GANs

Conclusion