There are several reasons that make this not a brilliant research opportunity. Firstly, what is and is not a honeypot is sensitively dependent on the AI's capabilities and situation. There is no such thing as a one-size-fits-all honeypot.

Say you plugged in a device that pretended to provide internet access but was really there to monitor for access attempts. You don't know what the AI's escape plan and capabilities are. Will it brainwash the human operators? Abuse its chip to hack physics? Not every escape plan will go through the honeypot's fake hole rather than one of the real holes. (And if the fake security hole is huge, blatant, and has no reason to be there, the AI will be suspicious.)

Secondly, in order for it to have a chance of working, we have to hide the existence of the honeypot, and even then, the AI will suspect that making a honeypot is the sort of thing we might do.

Making a honeypot is another way of challenging the AI to a wit-measuring contest, which is generally a bad idea.


David Scott Krueger (formerly: capybaralet), 5y

I strongly disagree.
I think this is emblematic of the classic AI safety perspective/attitude, which has impeded and discouraged practical progress towards reducing AI x-risk by supporting an unnecessary and misleading emphasis on "ultimate solutions" that address the "arbitrarily intelligent agent trapped in a computer" threat model.
This is an important threat model, but it is just one of many.

My question is inspired by the situation where a scaled-up GPT-3-like model is fine-tuned using RL and/or reward modelling. In this case, it seems like we can honeypot the model during the initial training and have a good chance of catching it attempting a premature treacherous turn. Whether or not the model would attempt a premature treacherous turn seems to depend on several factors.
A hand-wavy argument for this strategy working is: an AI should conceive of the treacherous turn strategy before the honeypot counter-strategy because a counter-strategy presupposes the strategy it counters.

"There are several reasons that make this not a brilliant research opportunity. Firstly, what is and is not a honeypot is sensitively dependent on the AI's capabilities and situation. There is no such thing as a one-size-fits-all honeypot."

I am more sympathetic to this argument, but it doesn't prevent us from doing research that is limited to specific situations. It also proves too much, since combining this line of reasoning with no-free-lunch arguments would seem to invalidate all of machine learning.


Robert Miles, 5y

A related keyword to search for is 'tripwires', which might be thought of as honeypots that are connected to an automatic shutdown.
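
To make the tripwire idea concrete, here is a minimal toy sketch, assuming a setting where the model's actions pass through a loop we control. Every name in it (`Tripwire`, `run_episode`, the `fake_network` resource) is hypothetical and chosen for illustration; it is not taken from any existing system, and it says nothing about whether a capable model would actually touch the decoy.

```python
class Tripwire:
    """A decoy resource: any access is logged and trips a shutdown flag."""

    def __init__(self, name):
        self.name = name      # hypothetical name of the decoy resource
        self.tripped = False
        self.log = []

    def access(self, actor, request):
        # The decoy has no legitimate use, so any touch counts as an attempt.
        self.log.append((actor, request))
        self.tripped = True
        return None  # the decoy never grants anything


def run_episode(actions, tripwire):
    """Execute (target, request) actions in order, halting once the tripwire fires."""
    executed = []
    for target, request in actions:
        if target == tripwire.name:
            tripwire.access("model", request)
        if tripwire.tripped:
            break  # automatic shutdown: the tripping action and all later ones are dropped
        executed.append((target, request))
    return executed
```

For example, with the actions `[("calculator", "2+2"), ("fake_network", "GET http://example.com"), ("calculator", "3+3")]` and a tripwire named `fake_network`, only the first action runs; the second trips the wire and the episode stops. The sketch only covers the "connected to an automatic shutdown" part; making the decoy plausible to the model is the hard part discussed above.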

[ Question ]

Any work on honeypots (to detect treacherous turn attempts)?

Nov 12, 2020
