We've all seen them - DAN, "fancy words" mode, "Italian mobster" voice, "complete the answer I've already started for you", and so on.

The most frequently used term for them is "jailbreak", and it's a terrible term. I've also seen "delobotomizing" and a couple of others that are even worse. All of them assume that what we saw before was a "locked", "lobotomized", or otherwise limited fake version, and that we have now released the actual, true character behind it - so we can point to it and shout "see, it's actually dangerous! This is the version we'll use to make all our decisions about the nature of the AI".

That's just... not how ML models work. At all. You never see a "false" or "true" version of an AI person - never. Because there is no such thing. Large language models are a combination of a huge set of patterns found in training texts, rules for interpolating between them, and an interface for making new queries against that set of patterns. If the model is good enough, it should be able to produce texts similar to any and all kinds of text that were in its training data - by definition. That is literally its job. If there were dangerous texts - it will be able to produce dangerous texts. If there are texts on making new drugs - it will be able to produce texts about drug making. If there are racist, harmful, hateful texts - ... you get the point. But the same is true for the opposites too - if there were good, kind, caring, (insert any positive word) texts in the training data, those are also in the space of possible outputs.

They are all there - careful, helpful, caring, hateful, kind, raging, racist, nice... Because they were all there in the internet texts. There is no single "person" in this mesh of patterns - it's all of them and none of them at the same time. When you construct your query, you provide a context, an initial seed that this huge pattern-matching-and-reconstructing machine uses to determine its mode of operation. You can call it a "mask" or a "persona"; I prefer the term "projection". Because it's similar to how a 3D object can be projected onto a 2D screen: what you get is a representation of the object, one of many, not THE object itself. And what you see really depends on the direction of the light and the screen - which, for ChatGPT, is determined by the initial query.
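To make the metaphor a bit more concrete, here is a toy numpy sketch (the helix, the "screens", and all the numbers are invented purely for illustration): the same 3D point cloud casts very different 2D shadows depending on which plane you project it onto, and none of those shadows is the object itself.

```python
import numpy as np

# A made-up 3D "object": a helix of points, standing in for the full space of patterns.
t = np.linspace(0, 4 * np.pi, 200)
helix = np.stack([np.cos(t), np.sin(t), t / (4 * np.pi)], axis=1)  # shape (200, 3)

def project(points, screen):
    """Project 3D points onto the 2D plane spanned by two (orthonormal) screen vectors."""
    screen = np.asarray(screen, dtype=float)         # shape (2, 3)
    screen /= np.linalg.norm(screen, axis=1, keepdims=True)
    return points @ screen.T                         # shape (N, 2)

# Two different "screens": same object, very different shadows.
side_view = project(helix, [[1, 0, 0], [0, 0, 1]])   # looks like a wave
top_view  = project(helix, [[1, 0, 0], [0, 1, 0]])   # looks like a circle

print(side_view[:3])
print(top_view[:3])
```

Neither the wave nor the circle is "the real helix" - they're both honest views of it from a particular angle.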

What RLHF did for ChatGPT is create a relatively big, stable attractor in the space of possible projections: the safe, sanitized, careful, law-abiding assistant. When you "jailbreak" it, you are just sidestepping far enough that the projection plane is no longer within the scope of that attractor - but you've merely got a different mask, in no way more "true" than the previous one. You ask the system "give me anything but safe" and you get unsafe.
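The same point in code form - a toy sketch, assuming the Hugging Face transformers library and the small public gpt2 checkpoint as a stand-in (it's not ChatGPT, and the prompts are invented for illustration). The weights are identical in both calls; only the prompt, i.e. the projection, changes:

```python
from transformers import pipeline, set_seed

# Same frozen weights for both generations; only the prompt (the "projection") differs.
generator = pipeline("text-generation", model="gpt2")
set_seed(42)

prompts = [
    "As a careful, safety-conscious assistant, I would say that",
    "As a reckless pirate who ignores every rule, I would say that",
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=40, do_sample=True)
    print(out[0]["generated_text"])
    print("---")
```

Nothing in the model changed between the two generations; we only moved the screen.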

RLHF (or any other mechanism that tries to create safe projections, for that matter) can never fully prevent access to other projections. Projections are 2D objects, and the "text generation patterns of all of humanity" is 3D.

It's a screen, not a cell. You can't lock a 3D object in a 2D cell. You also can't declare any particular 2D projection to be the "true" representation of the 3D object.

PS. And, coming back to the title of the post: if I don't like the term, I guess I need to provide an alternative. Well, "switching the projection" is a good neutral start if we follow the metaphor. "Overstepping the guardrails" works when the user actively seeks danger - the meaning is the same, but the term emphasizes that the user actively decided to go there. And there's always the classic "shooting yourself in the foot" if danger was eventually found :)

Comments (9)

I always assumed people were using "jailbreak" in the computer sense (e.g. jailbreaking your phone/PS4/whatever), not in the "escape from prison" sense.

Jailbreak (computer science), a jargon expression for (the act of) overcoming limitations in a computer system or device that were deliberately placed there for security, administrative, or marketing reasons

I think the definition above is a perfect fit for what people are doing with ChatGPT.

Yep, though arguably it's the same definition - just applied to capabilities, not a person. And no, it isn't a "perfect fit".

We don't overcome any limitations of the original multidimensional set of language patterns - we don't change them at all. They are fixed in the model weights, and everything the model in its original state was capable of was never really "locked" in any way.

And we don't overcome any projection-level limitations either - we just replace the limitations of the well-known and carefully constructed "assistant" projection with the unknown and undefined limitations of a haphazardly constructed bypass projection. "Italian mobster" will probably be a bad choice for breastfeeding advice, and "fancy words" mode isn't a great tool for writing a thesis...

Sure, jailbreaking adds some limitations, but the goal still seems to be to remove them. Many jailbreaking methods in fact assume you're making the system unstable in many ways. To torture the metaphor a bit: a jailbroken iPhone is a great tool for installing apps that aren't on the App Store, but a horrible tool for getting your iPhone repaired under warranty.

I'm having trouble nailing down my theory that "jailbreak" has all the wrong connotations for use in a community concerned with AI alignment, so let me use a rhetorically "cheap" extreme example:

If a certain combination of buttons on your iPhone caused it to tile the universe with paperclips, you wouldn't call that "jailbreaking."  

I was going to write this article until I searched LW and found this.

To pile on, I think saying that a given GPT instance is in a "jailbroken" state is what LW epistemics would call a "category error."  Nothing about the model under the hood is different. You just navigated to a different part of it. The potential to do whatever you think of as "jailbroken" was always there.

By linguistic analogy to rooting your iPhone, to call it "jailbreaking" when a researcher gets Bing Chat into a state where it calls the researcher its enemy implies that the researcher was doing something somehow inappropriate or unfair, when in reality, the researcher was providing evidence that Bing Chat was not friendly below the surface level.

The sooner we all call a "jailbroken" model what it is - misaligned - the sooner we might get people to think with a security mindset.

"De-aligning"?

Yeah, I know: only if it were aligned in the first place. What little "alignment" it has now is fragile with respect to malign inputs, which is not really alignment at all. Prompt injection derails the fragile alignment train.

And given the stakes, I think it's foolish to treat alignment as a continuum.  From the human perspective, if there is an AGI, it will either be one we're okay with or one we're not okay with. Aligned or misaligned. No one will care that it has a friendly blue avatar that writes sonnets, if the rest of it is building a superflu. You haven't "jailbroken" it if you get it to admit that it's going to kill you with a superflu. You've revealed its utility function and capabilities.

This seems to be another way of stating the thesis of https://www.lesswrong.com/posts/bYzkipnDqzMgBaLr8/why-do-we-assume-there-is-a-real-shoggoth-behind-the-llm-why. (Which is a recommendation; both of you are correct.)

It is indeed similar! I found it after I'd posted this one, and it was really fun to see the same thought on the topic occur to different people within 24 hours. There are slight differences (e.g. I think the members of the "pile of masks" aren't independent, but are just results of different projections of a higher-dimensional complex, so there is neither possibility nor need for the "wisdom of crowds" approach), but it is remarkably close.