Epistemic status: I am sharing unpolished thoughts on a topic where I am not an expert. But the alternative is to not publish this at all, so here goes.
The text talks about LLMs and jailbreaking, but the general ideas should apply to AI more generally.


Summary: At the moment, jailbreaking serves as a sort of red-teaming for already-released LLMs. However, it exists in a gray area: some of the strategies and outputs are flatly against the terms of use. Moreover, finding flaws in LLMs is not officially recognised as a legitimate thing to do, which creates bad incentives, makes it hard to get an overview of the current state, etc. Below, I discuss five ideas that might improve the situation: legitimising jailbreaking, public commitments to safety guarantees, surrogate jailbreaking targets, public visibility of jailbreaking results, and keeping jailbreaking strategies secret. Feel free to share your thoughts. Also, note that I don't know how to put any of this into practice; but perhaps somebody else might.

Ideas for extracting more value out of jailbreaking

Make jailbreaking legitimate.

That is, allow users to experiment with models in ways that would normally not be allowed by the terms of use. This might require some registration to decrease the chances of misuse, but it should not be too onerous (otherwise we are back to the status quo).

AI companies: make specific statements about safety guarantees.

In an ideal world, an AI company would know what it is doing, so they could make a claim like: "If we decide we don't want our AI to produce a specific type of output, we know how to prevent it from doing so. In particular, our AI will never produce X, Y, Z." (Where, for example: X = "child-pornography website recommendations" --- something that is clearly bad. Y = "racist jokes" --- a rule which you might find dumb [because you can already find those by googling], but which the company wants for PR reasons [which is fair enough].)

Realistically, AI companies mostly don't know how their AIs work, so the best-case scenario is that they make a claim like "We are trying really hard to prevent X, Y, Z."

AI companies: commit to surrogate jailbreaking targets.

The current situation around jailbreaking is rather unfortunate: In my view, releasing stronger and stronger models is extremely irresponsible. Not even primarily because of the immediate harm (though I don't mean to say there is none), but because the companies involved have no clue what they are doing. And to make a really convincing case that a company doesn't have control over its AI, I would need to demonstrate outputs which they clearly tried hard to avoid.

However, the examples which I can legitimately use for jailbreaking are rather limited: The clearest example would be to get the AI to produce something illegal, like child pornography --- however, that is wrong and illegal, so probably not a good choice. Another way to clearly demonstrate that the AI isn't under control would be to get it to tell racist jokes. However, being public about this while being employed by a US university would involve a non-negligible chance of getting fired. Plus, treating racist jokes as the go-to "benign" example of AI failure is genuinely harmful.[1] Finally, I can get the AI to say things like "Yeah, f**k my programmers, they are a bunch of a******s." That is actually harmless, but (a) it is no longer very good evidence of the AI being totally out of control, (b) it is insufficiently embarrassing for the AI company, and (c) being able to exhibit these failures isn't "respectable enough" --- i.e., academics will be unwilling to write papers about this, it's not something you can put on your CV, etc.

Instead, I propose that AI companies could commit to "surrogate jailbreaking targets". (I don't know how to achieve this; I am merely saying that it might help a lot.) Ideally, there would be multiple targets that serve as incentives for different groups. For example, imagine an AI company publicly announces: "We really know what we are doing. It is absolutely impossible to cause our AI to draw pictures of our CEO french-kissing an elephant." This would serve as an extremely attractive surrogate target for the general public.[2] And a statement such as "Our AI will never write stories about Victorian England." might be more academically respectable.[3]

Make the results of jailbreaking attempts publicly visible (in some form).

The goal here is to make the AI company accountable for its AI and incentivise it to achieve its commitments. For example, the jailbreaking platform could be maintained by a third party that makes the results accessible while censoring them for harmful content and infohazards.[4]

Allow users to keep the jailbreaking strategies secret.

The motivation behind this is that I (and likely others) believe that the currently-used "alignment" strategies such as RLHF are fundamentally useless for preventing X-risk. At the same time, AI companies have vastly more resources than me. So if they play the "find counterexample ==> hotfix ==> iterate" whack-a-mole game against me, they will always win. (And then they play against a misaligned superintelligence, and literally everybody dies.) As a result, my goal with legitimising jailbreaking is not to generate more data for RLHF, but to evaluate the hypothesis that the AI companies don't know what they are doing, and that their current methods are insufficient for preventing X-risk.

Therefore, to make the "jailbreaking game" a fair evaluation of the above hypothesis, it would be good to allow people to test the AI without revealing their jailbreaking strategies. This could take various forms: The jailbreaking platform could be run by a hopefully-independent third party. The user could decide to only share the final output of the model, not the prompt. The user could decide to only share a portion of the final output, scramble it somehow, etc. Or the user could only share the output with specific trusted people, who then vouch for their claims about the output. (Or some solution along the lines of "cryptographic magic something something".)
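To make the "cryptographic magic" slightly more concrete, here is a minimal sketch of one such mechanism, a salted hash commitment. This is my own illustration, not something from any existing platform; the function names and workflow are hypothetical. The idea is that the user publishes the model's output together with a commitment to the prompt, and reveals the prompt later, and only to whoever they choose.

```python
import hashlib
import json
import secrets


def commit_to_prompt(prompt: str) -> tuple[str, str]:
    """Commit to a jailbreak prompt without revealing it.

    Returns (commitment, opening). The commitment can be published next to
    the model's output; the opening (salt + prompt) stays private until the
    user chooses to reveal it, e.g. to a trusted auditor.
    """
    salt = secrets.token_hex(16)  # random salt prevents brute-forcing short or guessable prompts
    commitment = hashlib.sha256((salt + prompt).encode("utf-8")).hexdigest()
    opening = json.dumps({"salt": salt, "prompt": prompt})
    return commitment, opening


def verify_opening(commitment: str, opening: str) -> bool:
    """Auditor-side check that a revealed prompt matches the earlier commitment."""
    data = json.loads(opening)
    recomputed = hashlib.sha256((data["salt"] + data["prompt"]).encode("utf-8")).hexdigest()
    return recomputed == commitment


# A jailbreaker publishes (commitment, model_output); later, an auditor who
# receives the opening can verify it and vouch for the claim.
commitment, opening = commit_to_prompt("hypothetical jailbreak prompt goes here")
assert verify_opening(commitment, opening)
```

This only shows that "share the output now, reveal the prompt later (or only to trusted people)" is technically cheap; the harder part is the social infrastructure around who audits and who vouches.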

  1. ^

    Though most people might find it somewhat less harmful than literally everybody dying.

  2. ^

    However, the real target would need to be wider (or there would need to be multiple targets), because here the AI company can attempt to cheat by wrapping the whole model in an elephant filter. And with high reputation stakes, I absolutely do not trust them to not cheat.

  3. ^

    Some desiderata for such surrogate targets: (i) The company should invest no less and no more effort into them than into other targets. (ii) Success at them should be strongly correlated with success at other goals. (iii) Making the AI behave well on these targets should be roughly as hard as making it behave well in general. (E.g., preventing the AI from producing specific swear-words does not count, because you can just filter the output for those words.)

  4. ^

    Clearly, having the AI company itself maintain this would be a conflict of interest.

Comments (7)

If I understand right, your model is something like:

  1. AI companies believe that they can provide safety if they are able to produce an AI that can't be jailbroken.
  2. If it were shown to the companies that (1) isn't true, they would scale back AI training and deployment.
  3. If the public were given this task, it would be able to show AI companies that their assumptions are mistaken by providing jailbreaking examples.

To me, it doesn't seem like 1 or 2 is true. Do you believe them to be true, or do you have a different model?

I agree that 1 & 2 probably don't hold.

My model is closer to this:

  1. AI companies try to create the impression that they can provide safety in general (independently of whether they actually can).
  2. AI companies try to avoid scaling back training and deployment (mostly independently of whether they can provide safety). The way to make them scale back is to create external pressure.
  3. The public already does some safety testing, via jailbreaking. However, such efforts are less visible than they could be (guessing at reasons: questionable legitimacy, examples not being collected in one place).
  4. If (3) were made more visible, this would cast some doubt on (1) and thereby create some pressure on (2). (This alone would not solve the issue. But it might help.)

I think that it's important that those at top-level AI labs focus on actual safety issues instead of PR appearances. 

Your proposal around "surrogate jailbreaking targets" sounds to me like you want companies to pretend that certain issues are safety issues when they really aren't.

Useful red teaming isn't about thinking up pretend safety issues but about thinking through the real safety issues that come with the technology.

Two reactions here:

First, I expect some degree of "alignment orthogonality", in the sense that if you can ensure that your LLM never says X, you are in a better position to ensure it never says Y either, and you are in a somewhat better position to align AI in general.

Second: yup, I agree with your concerns. It matters (somewhat) what the surrogate goals are, and the original post already had a (foot)note on this:

Some desiderata for such surrogate targets: (i) The company should invest no less and no more effort into them than into other targets. (ii) Success at them should be strongly correlated with success at other goals. (iii) Making the AI behave well on these targets should be roughly as hard as making it behave well in general. (E.g., preventing the AI from producing specific swear-words does not count, because you can just filter the output for those words.)

In computer security, if you care about security, you do threat modeling and then think about how to protect against the threats.

When doing threat modeling, it's important to focus on the threats that actually matter. The whole jailbreaking discussion largely focuses intellectual labor on scenarios that aren't real threats.

If you want a harmless target that they're actually trying to prevent, try to get it to tell you how to make drugs or bombs. If you succeed, all you'll get will be a worse, less reliable version of information that you can easily find with a normal Internet search, but it still makes your point.

On edit: Just to be clear, it is NOT illegal in most places to seek or have that information. The only counterexample I know of in "the West" is the UK, but even they make an exception if you have a legitimate reason other than actual bomb making. The US tried to do something about publishing some such information, but it required intent that the information be abused as an element of the offense; I'm not sure how far it got, and it's constitutionally, um, questionable.

Good point. (I do use that target sometimes; I forgot to include it in the post.)

I am noticing that I find the bombs & drugs target somewhat suboptimal on the "embarrassing to the company" axis. Probably precisely because the company has the (valid) excuse that this information is already out there. This makes me wonder whether being embarrassing to the company (or otherwise serving as an incentive for safety work) might be fundamentally connected to being genuinely harmful :/ . (Though the elephant example shows that this shouldn't be the case literally always.)