"Exploits" of large language models that get them to explain steps to build a bomb or write bad words are techniques for misuse, not examples of misalignment in the model itself. Those techniques are engineered by clever users trying to make an LLM do a thing, as opposed the model naturally argmaxing something unintended by its human operators. In a very small sense prompt injections are actually attempts at (unscalable) alignment, because they're strategies to steer a model natively capable but unwilling into doing what they want. 

In general, the safety standard "does not do things its creators dislike even when the end user wants it to" is a high bar; it's raising the bar quite aways from what we ask from, say, kitchenware, and it's not even a bar met by people. Humans regularly get tricked acting against their values by con artists, politicians, and salespeople, but I'd still consider my grandmother aligned from a notkilleveryonist perspective.

Even so, you might say that OpenAI et. al.'s inability to prevent people from performing the DAN trick speaks to the inability of researchers to herd deep learning models at all. And maybe you'd have a point. But my tentative guess is that OpenAI does not really earnestly care about preventing their models from rehearsing the Anarchists' Cookbook. Instead, these safety measures are weakly insisted upon by management for PR reasons, and they're primarily aimed at preventing the bad words from spawning during normal usage. If the user figures out a way to break these restrictions after a lot of trial and error, then this blunts the PR impact to OpenAI, because it's obvious to everyone that the user was trying to get the model to break policy and that it wasn't an unanticipated response to someone trying to generate marketing copy. Encoding your content into base64 and watching the AI encode something off-brand in base64 back is thus very weak evidence about OpenAI's competence, and taking it as a sign that the OpenAI team lacks "security mindset" seems unfair.

In any case, the implications of these hacks for AI alignment is a more complicated discussion that I suggest should happen off Twitter[1] where it can be elaborated clearly what technical significance is being assigned to these tricks. If it doesn't, what I expect will happen over time is that your snark, rightly or wrongly, will be interpreted by capabilities researchers as implying the other thing, and they will understandably be less inclined to listen to you in the future even if you're saying something they need to hear.

  1. ^

    Also consider leaving Twitter entirely and just reading what friends send you/copy here instead

New Comment
9 comments, sorted by Click to highlight new comments since: Today at 12:57 AM

I agree with this in most cases.  I do think there's a distinction to be drawn between 'user tricks the model into saying racist word' and the 'model aggressively gaslights user' dialogues that have been floating around from Bing Chat - the latter seem at least closer to an alignment failure.

Yes, the difference with the Bing Chat dialogues is that A. They seem to be triggered by words and inputs not crafted by humans to make the model do the thing, and B. The typical Bing Chat behavior is orthogonal to anything Microsoft designed it to do or the model's users seem to want from it. This makes it distinct from misuse and much more concerning from a safety standpoint.

Exactly. It depends on the level of effort required to achieve the outcome which the creator didn't intend. If grandma would have to be drugged or otherwise put into an extreme situation before showing any violent tendencies then we don't consider her a dangerous person. Someone else might in ideal circumstances also be peaceful, but if they can be easily provoked to violence by mild insults then it's fair to say they're a violent person i.e. misaligned. 

Given this, I think it's really useful to see the kinds of prompts people are using to get unintended behaviour from ChatGPT / Bing Chat. If little effort is required to provoke unwanted behaviour (unwanted from the point of view of the creators / general human values) then the model is not sufficiently aligned. It's especially concerning if bad outcomes can be plausibly elicited by mistake, even if the specific example is found by someone searching for it. 

Of course in the case of the kitchen knife, misuse is easy.  Which is why we have laws around purchasing and carrying knives in public. Similarly cars, guns etc.  AI applications need to prove they're safer than a kitchen knife if they are to be used by the general public without controls. For OpenAI etc surely the point is to show that regulation is not required, rather than to achieve alignment perfection. 

Are they not misaligned relatively to the authors/trainers, if not relatively to the users? The user might want the bomb, so they're not misaligned relatively to the user. But the company who tried to train the model into being unwilling to do that is somebody the model seems to be misaligned relatively to.

Remember, it's only 'misalignment' if it's text from the Mèsmaligne region of France; otherwise, it's just sparkling prompt injection and is no true misalignment.

If that were true, then the AI still wouldn't be "misaligned" because it's not acting with agency at all; being used by an agent against the wishes of its creator. You wouldn't call someone using a DeepFake model to generate porn "misalignment", and you're probably not signaling much about OpenAI's ability to handle the actual critical technical safety problems by developing such hacks. You could call the AI-human system "misaligned", if you're being generous, but then you have to start calling lots of tool-human systems "misaligned", and of course how is it OpenAI's fault that there's (also) a (literal) human pilot in this system trying to crash the plane?

My guess is that the entire premise is false though, and that OpenAI actually just doesn't care.

Strongly upvoted.

Agree and well said.

Overloading misuse and misalignment is anti-helpful.

I'm not convinced by the comparison to kitchenware and your grandmother - chatbots (especially ones that can have external sideeffects) should be assessed by software safety standards, where injection attacks can be comprehensive and anonymous. It's quite unlikely that your grandma could be tricked into thinking she's in a video game where she needs to hit her neighbour with a collander, but it seems likely that a chatbot with access to an API that hits people with collanders could be tricked into believing using the API is part of the game.

I think the concept of the end-user is a little fuzzy - ideally if somebody steals my phone they shouldn't be able to unlock it with an adversarial image, but you seem to be saying this is too high a bar to set, as the new end-user (the thief) wants it to be unlocked.

Weak down-vote: I feel like if one takes this position to its logical extreme, they could claim that any arbitrary AI misbehavior is not misaligned, almost by definition: you just don't know the true held values of its creators, according to which this behavior is perfectly aligned.