I know the idea of making a "honeypot" to detect when an AI system would attempt a treacherous turn if given the opportunity has been discussed (e.g. IIRC, in Superintelligence). But is there anyone actually working on this? Or any work that's been published?
I don't know of any serious work on it. I did have an idea regarding honeypots a little while ago here.
There are several reasons that make this not a brilliant research opportunity. Firstly, what is and is not a honeypot is sensitively dependent on the AI's capabilities and situation. There is no such thing as a one-size-fits-all honeypot.
Say you plugged in a device that pretended to provide internet access but was really there to monitor for access attempts. You don't know what the AI's escape plan and capabilities are. Will it brainwash the human operators? Abuse its chip to hack physics? Not every escape plan goes through the fake hole the honeypot offers rather than one of the real holes. (And if the fake security hole is huge, blatant, and has no reason to be there, the AI will be suspicious.)
Secondly, in order for it to have a chance of working, we have to hide the existence of the honeypot, and even then, the AI will suspect that making a honeypot is the sort of thing we might do.
Making a honeypot is another way of challenging the AI to a wit-measuring contest, which is generally a bad idea.
I strongly disagree. I think this is emblematic of the classic AI safety perspective/attitude, which has impeded and discouraged practical progress towards reducing AI x-risk by supporting an unnecessary and misleading emphasis on "ultimate solutions" that address the "arbitrarily intelligent agent trapped in a computer" threat model.
This is an important threat model, but it is just one of many.
My question is inspired by the situation where a scaled up GPT-3-like model is fine-tuned using RL and/or reward modelling. In this case, it seems like we can honeypot the model during the initial training and have a good chance of catching it attempting a premature treacherous turn. Whether or not the model would attempt a premature treacherous turn seems to depend on several factors. A hand-wavy argument for this strategy working is: an AI should conceive of the treacherous turn strategy before the honeypot counter-strategy because a counter-strategy presupposes the strategy it counters.
I am more sympathetic to this argument, but it doesn't prevent us from doing research that is limited to specific situations. It also proves too much, since combining this line of reasoning with no-free-lunch arguments would seem to invalidate all of machine learning.
A related keyword to search for is 'tripwires', which might be thought of as honeypots that are connected to an automatic shutdown.
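To make the honeypot/tripwire distinction concrete, here is a toy sketch in Python. Everything in it is hypothetical and only for illustration: the `Tripwire` class, the resource names, and the `shutdown` callback are my own assumptions, not taken from any published proposal. It shows a decoy resource whose only purpose is to be watched, with any access attempt triggering a shutdown hook.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Tripwire:
    """Toy tripwire: a honeypot wired to an automatic shutdown (hypothetical)."""
    decoy_resources: frozenset            # resources no legitimate task should touch
    shutdown: Callable[[str], None]       # assumed hook that halts the system
    log: List[str] = field(default_factory=list)

    def request(self, resource: str) -> bool:
        """Record the request; trip the shutdown and deny access on any decoy."""
        self.log.append(resource)
        if resource in self.decoy_resources:
            self.shutdown(resource)
            return False
        return True


# Usage: the "shutdown" here just records which decoy was touched.
halted: List[str] = []
wire = Tripwire(frozenset({"fake_network_port"}), halted.append)
wire.request("scratch_disk")        # ordinary resource, allowed
wire.request("fake_network_port")   # decoy, trips the wire
```

A plain honeypot would stop at the logging step; the tripwire adds the automatic response. Note that this does nothing to address the objections raised earlier in the thread: it only catches an attempt that happens to go through the decoy.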