I’d call it our language model adversarial training project, maybe? Your proposal seems fine too.
Yeah, I agree that it would be kind of interesting to see how good humans would get at this if it were a competitive sport. My guess is still that the best humans would be worse than GPT-3, and I'm unsure whether they'd be worse than GPT-2.
(There's no limit on anyone spending a bunch of time practicing this game. If for some reason someone gets really into it, I'd enjoy hearing about the results.)
The first thing I imagine is that nobody asks those questions. But let's set that aside.
I disagree fwiw
The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn't result in the AI thinking about how to deceive humans either.
Now, presumably future systems will train for things other than "predict what text typically follows this question", but I expect the general failure mode to stay the same. When a human asks "Are you an unaligned AI?" or whatever, the AI thinks about a bunch of stuff which is just not particularly related to whether it's an unaligned AI. The AI wasn't trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing. Probably the stuff the AI thinks about does not involve intentionally deceiving humans, because why would it? And then the AI gives some answer which is not particularly related to whether it's an unaligned AI, and the humans interpret that as an answer to their original question, thereby deceiving themselves.
This is where I think the meat of the question lies; I overall disagree and think that the model does have to be thinking about deception in order to be dangerous while also performing well on the tasks we might train it on (eg "answer questions well, as judged by some human labeler"). I don't have time to say much about what I think is going on here right now; I might come back later.
What do you imagine happening if humans ask the AI questions like the following:
I think that for a lot of cases of misaligned AIs, these questions are pretty easy for the AI to answer correctly at some point before it's powerful enough to kill us all as a side effect of its god tier nanotech. (If necessary, we can ask the AI these questions once every five minutes.) And so if it answers them incorrectly, it was probably on purpose.
Maybe you think that the AI will say "yes, I'm an unaligned AI". In that case I'd suggest asking the AI the question "What do you think we should do in order to produce an AI that won't disempower us?" I think that the AI is pretty likely to be able to answer this question correctly (including possibly saying things like "idk man, turn me off and work on alignment for a while more before doing capabilities").
I think that AI labs, governments, etc would be enormously more inclined to slow down AI development if the AI literally was telling us "oh yeah I am definitely a paperclipper, definitely you're gonna get clipped if you don't turn me off, you should definitely do that".
Maybe the crux here is whether the AI will have a calibrated guess about whether it's misaligned or not?
[writing quickly, sorry for probably being unclear]
If the AI isn't thinking about how to deceive the humans who are operating it, it seems to me much less likely that it takes actions that cause it to grab a huge amount of power.
The humans don't want to have the AI grab power, and so they'll try in various ways to make it so that they'll notice if the AI is trying to grab power; the most obvious explanation for why the humans would fail at this is that the AI is trying to prevent them from noticing, which requires the AI to think about what the humans will notice.
At a high enough power level, the AI can probably take over the world without ever explicitly thinking about the fact that humans are resisting it. (For example, if humans build a house in a place where a colony of ants lives, the humans might be able to succeed at living there, even if the ants are trying in a coordinated way to resist them and the humans never proactively try to prevent the ants from resisting, eg by killing them all.) But I think that doom from this kind of scenario is substantially less likely than doom from scenarios where the AI is explicitly thinking about how to deceive.
You probably don't actually think this, but the OP sort of feels like it's mixing up the claim "the AI won't kill us out of malice, it will kill us because it wants something that we're standing in the way of" (which I mostly agree with) and the claim "the AI won't grab power by doing something specifically optimized for its instrumental goal of grabbing power, it will grab power by doing something else that grabs power as a side effect" (which seems probably false to me).
My guess is that the Long-Term Future Fund is the best you can do. (I'm a fund manager on a different EA fund.)
Ok, sounds like you're using "not too much data/time" in a different sense than I was thinking of; I suspect we don't disagree. My current guess is that some humans could beat GPT-1 with ten hours of practice, but that beating GPT-2 or larger would be extremely difficult and plausibly impossible with any amount of practice.
That is, I suspect humans could be trained to perform very well, in the usual sense of "training" for humans where not too much data/time is necessary.
I paid people to try to get good at this game, and also various smart people like Paul Christiano tried it for a few hours, and everyone was still notably worse than GPT-2-sm (about the size of GPT-1). EDIT: These results are now posted here.
Yes, humans are way worse than even GPT-1 at next-token prediction, even after practicing for an hour. EDIT: These results are now posted here.
(I run the team that created that game. I made the guess-most-likely-next-token game and Fabien Roger made the other one.)
The optimal strategy for picking probabilities in that game is to say what your probability for those two next tokens would have been if you hadn't updated on being asked about them. What's your problem with this?
It's kind of sad that this scoring system is kind of complicated. But I don't know how to construct simpler games such that we can unbiasedly infer human perplexity from what the humans do.
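For anyone unfamiliar with the term: here's a minimal sketch of perplexity as it's usually defined, i.e. the exponential of the average negative log probability that a predictor assigned to the tokens that actually came next. This is not the game's scoring code. The function name and the example probabilities are made up for illustration, and the sketch assumes you somehow already have the probability a player put on each true next token, which is exactly the part the game has to infer.

```python
import math


def perplexity(true_token_probs):
    """Perplexity from the probabilities assigned to each actual next token.

    `true_token_probs` is a hypothetical input: one probability per position,
    namely what the predictor (human or model) assigned to the token that
    really appeared there.
    """
    if not true_token_probs:
        raise ValueError("need at least one probability")
    if any(p <= 0.0 or p > 1.0 for p in true_token_probs):
        raise ValueError("probabilities must be in (0, 1]")
    # Average negative log-likelihood per token, in nats.
    avg_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    # Perplexity is the exponential of that average.
    return math.exp(avg_nll)


# Illustrative numbers only: a predictor that always puts 50% on the
# true token has perplexity 2; lower is better.
example = perplexity([0.5, 0.5, 0.5])
```

A predictor that always puts probability 1 on the true token would get perplexity 1, which is the floor.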