schroederdewitt

Comments

Secret Collusion: Will We Know When to Unplug AI?
schroederdewitt · 1y

Thanks, Nathan, these are fascinating questions!

"What if there was an offender who wasn't sure there was anyone 'out there' to collaborate with, but they felt pretty confident they could sneak messages past the monitor. They might just go 'fishing' with a constant stream of hidden messages, most of low importance or nonsense. Then there'd be no pattern of innocent - steganography - innocent - steganography... the steganography would be a constant underlying everything."

This is an interesting thought. Certainly, if steganography were itself seen as an innocuous activity, then detecting it would not be very helpful on its own; one would instead have to decode the hidden messages to decide whether the behaviour is desirable.

"Another thought: What if the offender hoped that some time in the future there might be a lapse in security such that it could at that point transfer a one-time pad to a collaborator. In the meantime, they might encode and send many secret messages, trusting that the collaborator would record all messages and then reanalyze them once the one-time pad was received. That seems like it would also be hard for the monitor to catch."

Yes, such a post-hoc approach is certainly possible, and I agree that it contributes to the safety/security risks.

Secret Collusion: Will We Know When to Unplug AI?
schroederdewitt · 1y

That's a great question. I am not quite sure, but WMDP-cyber does look relevant. If you are interested in working on a new benchmark for unlearning and secret collusion, do reach out to us!

Secret Collusion: Will We Know When to Unplug AI?
schroederdewitt · 1y

Interesting! Hope our work will catch his attention.

Posts
Secret Collusion: Will We Know When to Unplug AI? (1y)
Take SCIFs, it’s dangerous to go alone (1y)