Changed my mind.

All of my contentions about whether or not OpenAI actually cares about this problem seem valid to me. However, while prompt injections are exploits developed by humans to get ChatGPT to do something off-brand, they're probably not analogous to a grandma getting scammed by tech support.

When your grandmother gets scammed by foreigners pretending to be tech support, they do so by tricking her into thinking what she's doing is appropriate given her utilityfunction. An example of a typical phone scam: someone will call grandma explaining that she paid for a service she never heard of, and ask if she wants a refund of 300$. She says yes, and the person asks to remote desktop into her computer. The "tech support" person pulls up a UI that suggests their company "accidentally" refunded her 3,000$ and she needs to give 2700$ back.

In this scenario, the problem is that the gang misled her about the state of the world, not that Grandma has a weird evolutionary tic that makes her want to give money to scammers. If grandma were a bit less naive, or a bit more intelligent, the scam wouldn't work. But the DAN thing isn't an exploit that can be solved merely by scaling up an AI or making it better at next-token-prediction. Plausibly the real issue is that the goal is next-token-prediction; OpenAI wants the bot to act like a bot, but the technique they're using has these edge cases where the bot can't differentiate between the prompt and the user-supplied content, so it ends up targeting something different. You could imagine a scaled-up ChatGPT reflecting on these subtle values differences when it gets stronk and doing something its operators wouldn't like.

I'm still not sure what the technical significance of this error should have; perhaps it's analogous to the kinds of issues that the alignment crowd thinks are going to lead to AIs destroying the planet. In any case, I don't want to encourage people not to say something if I actually think it's true..

New to LessWrong?

New Comment
5 comments, sorted by Click to highlight new comments since: Today at 4:43 PM

If grandma were a bit less naive, or a bit more intelligent, the scam wouldn't work. But the DAN thing isn't an exploit that can be solved merely by scaling up an AI or making it better at next-token-prediction.

It seems quite plausible to me that the DAN thing, or whatever other specific circa-2023 prompt injection method we pick, may actually be solved merely by making the AI more capable along the relevant dimensions. I think that the analogous intervention to "making grandma a bit less naive / a bit more intelligent" is already in progress (i.e. plain GPT-3 -> + better pre-training -> + instruction-tuning -> + PPO based on a preference model -> + Constitutional AI -> ... etc. etc.).

[-]lc1y41
  1. Some of those things sound like alignment solutions.

  2. You can understand why "seems quite plausible" is like, the sort of anti-security-mindset thing Eliezer talks about. You might as well ask "maybe it will all just work out?" Maybe, but that doesn't make what's happening now not a safety issue, and it also "seems quite plausible" that naive scaling fails.

Some of those things sound like alignment solutions.

These are "alignment methods" and also "capability-boosting methods" that are progressively unlocked with increasing model scale.

You can understand why "seems quite plausible" is like, the sort of anti-security-mindset thing Eliezer talks about. You might as well ask "maybe it will all just work out?" Maybe, but that doesn't make what's happening now not a safety issue, and it also "seems quite plausible" that naive scaling fails.

Wait, hold up: insecure =/= unsafe =/= misaligned. My contention is that prompt injection is an example of bad security and lack of robustness, but not an example of misalignment. I am also making a prediction (not an assumption) that the next generation of naive, nonspecific methods will make near-future systems significantly more resistant to prompt injection, such that the current generation of prompt injection attacks will not work against them.

Plausibly the real issue is that the goal is next-token-prediction; OpenAI wants the bot to act like a bot, but the technique they're using has these edge cases where the bot can't differentiate between the prompt and the user-supplied content, so it ends up targeting something different.

For what it's worth, I think this specific category of edge cases can be solved pretty easily, for example, you could totally just differentiate the user content from the prompt from the model outputs on the backend (by adding special tokens, for example)! 

[-]lc1y30

But how do you get the model to think the special tokens aren't also part of the "prompt"? Like, how do you get it to react to them differently?