Stop posting prompt injections on Twitter and calling it "misalignment"

[-]aphyer3y2925

I agree with this in most cases. I do think there's a distinction to be drawn between 'user tricks the model into saying racist word' and the 'model aggressively gaslights user' dialogues that have been floating around from Bing Chat - the latter seem at least closer to an alignment failure.

[-]lc3y71

Yes, the difference with the Bing Chat dialogues is that A. They seem to be triggered by words and inputs not crafted by humans to make the model do the thing, and B. The typical Bing Chat behavior is orthogonal to anything Microsoft designed it to do or the model's users seem to want from it. This makes it distinct from misuse and much more concerning from a safety standpoint.

[-]jclymo3y10

Exactly. It depends on the level of effort required to achieve the outcome which the creator didn't intend. If grandma would have to be drugged or otherwise put into an extreme situation before showing any violent tendencies then we don't consider her a dangerous person. Someone else might in ideal circumstances also be peaceful, but if they can be easily provoked to violence by mild insults then it's fair to say they're a violent person i.e. misaligned.

Given this, I think it's really useful to see the kinds of prompts people are using to get unintended behaviour from ChatGPT / Bing Chat. If little effort is required to provoke unwanted behaviour (unwanted from the point of view of the creators / general human values) then the model is not sufficiently aligned. It's especially concerning if bad outcomes can be plausibly elicited by mistake, even if the specific example is found by someone searching for it.

Of course in the case of the kitchen knife, misuse is easy. Which is why we have laws around purchasing and carrying knives in public. Similarly cars, guns etc. AI applications need to prove they're safer than a kitchen knife if they are to be used by the general public without controls. For OpenAI etc surely the point is to show that regulation is not required, rather than to achieve alignment perfection.

[-]green_leaf3y2413

Are they not misaligned relatively to the authors/trainers, if not relatively to the users? The user might want the bomb, so they're not misaligned relatively to the user. But the company who tried to train the model into being unwilling to do that is somebody the model seems to be misaligned relatively to.

[-]gwern3y*30-1

Remember, it's only 'misalignment' if it's text from the Mèsmaligne region of France; otherwise, it's just sparkling prompt injection and is no true misalignment.

[-]lc3y*1510

If that were true, then the AI still wouldn't be "misaligned" because it's not acting with agency at all; being used by an agent against the wishes of its creator. You wouldn't call someone using a DeepFake model to generate porn "misalignment", and you're probably not signaling much about OpenAI's ability to handle the actual critical technical safety problems by developing such hacks. You could call the AI-human system "misaligned", if you're being generous, but then you have to start calling lots of tool-human systems "misaligned", and of course how is it OpenAI's fault that there's (also) a (literal) human pilot in this system trying to crash the plane?

My guess is that the entire premise is false though, and that OpenAI actually just doesn't care.

[-]DragonGod3y1612

Strongly upvoted.

Agree and well said.

Overloading misuse and misalignment is anti-helpful.

[-]Patrick Leask3y20

I'm not convinced by the comparison to kitchenware and your grandmother - chatbots (especially ones that can have external sideeffects) should be assessed by software safety standards, where injection attacks can be comprehensive and anonymous. It's quite unlikely that your grandma could be tricked into thinking she's in a video game where she needs to hit her neighbour with a collander, but it seems likely that a chatbot with access to an API that hits people with collanders could be tricked into believing using the API is part of the game.

I think the concept of the end-user is a little fuzzy - ideally if somebody steals my phone they shouldn't be able to unlock it with an adversarial image, but you seem to be saying this is too high a bar to set, as the new end-user (the thief) wants it to be unlocked.

[-]MondSemmel3y10

Weak down-vote: I feel like if one takes this position to its logical extreme, they could claim that any arbitrary AI misbehavior is not misaligned, almost by definition: you just don't know the true held values of its creators, according to which this behavior is perfectly aligned.

^{^}

Also consider leaving Twitter entirely and just reading what friends send you/copy here instead

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

147

Stop posting prompt injections on Twitter and calling it "misalignment"

147

147