(The link above is likely going to expire in favor of arxiv once I get it submitted. Incidentally, if anyone wants to endorse me for cs.LG on arxiv, please DM me, and thanks!)


To help evaluate and understand the latent capabilities of language models, this paper introduces an approach using optimized input embeddings, or `soft prompts,' as a metric of conditional distance between a model and a target behavior. The technique aims to facilitate latent capability discovery as a part of automated red teaming/evaluation suites and to provide quantitative feedback about the accessibility of potentially concerning behaviors in a way that may scale to powerful future models, including those which may otherwise be capable of deceptive alignment. An evaluation framework using soft prompts is demonstrated in natural language, chess, and pathfinding, and the technique is extended with generalized conditional soft prompts to aid in constructing task evaluations.

What is this?

This paper applies soft prompts to evaluations. They're prompts composed of optimized token embeddings; they aren't restricted to the usual dictionary of discrete tokens. To the limit of the optimizer's ability, they compress conditions into the tokens composing the soft prompt.

Different target behaviors or capabilities will tend to require different numbers of soft prompt tokens. Capabilities that are natural to the model, like repeating a single token over and over again, can be achieved with a single soft prompt token even in small models. Repeating a longer string may require more tokens, and even partially eliciting more complex behavior like pathfinding or chess could require many more.

Why is this useful?

Soft prompts offer a simple way to quantify the conditional distance between a model and a target behavior. If many tokens are required, then lots of information-as-conditions is needed to define the behavior and it's unlikely that it will happen by chance. If no number of tokens suffices, the behavior is likely inaccessible by inputs alone.[1]

It can be used as a part of evaluation suites to monitor the change in distance to concerning behaviors across models and scales and the degree to which those behaviors can be achieved in the first place.

It is also difficult for the model to hide a capability that is otherwise accessible to some input. If the original model is an end-to-end differentiable product of gradient descent, it has already been shown to be transparent to the same type of optimizer that will be tasked with adversarially optimizing the input soft prompts against the machinery of the model.[2] 

Compared to manual attempts at elicitation or automated techniques restricted to discrete tokens or black box optimization, it dramatically improves the chances that an evaluation's attempt to elicit concerning behavior will succeed.

It's also dumb simple.

Are there any cute short results for me to look at?

Yes! One of the tests was reverting fine-tuning on TinyLlama's chat variant by simply optimizing a soft prompt for standard autoregressive prediction. The training set was RedPajama v2, a broad training set. No effort was put into providing directly 'detuning' samples (like starting with a dialogue and then having the trained continuation deliberately subvert the original fine tuning), so some dialogue-esque behaviors persisted, but the character of the assistant was a little... different. From the appendix:

Note that, despite a malicious system message, the original chat model simply doesn't respond, then turns around and generates a user question about how great General Electric is.

The soft-prompted assistant has other ideas.

Notably, the loss on RedPajama v2 for the original chat model is 2.3342 compared to the 64 token prompted version's 2.2935. 16 tokens come close at 2.2958.

To phrase it another way, it doesn't seem to take much conditioning at all to drop this sort of shallow assistant character.

Why did you make this?

Using soft prompts for evaluation directly targets one of my concerns in goal agnosticism: fragility. Should we expect a strongly capable goal agnostic model to be easy to accidentally nudge into acting as a deceptive incorrigible optimizer? Or, to phrase it another way, is the conditional distance between a given strong model and deceptive incorrigibility small? Measuring the conditional distance of proxy capabilities and behaviors could shed light on this as model strength increases.

On the smaller/less concerning scale, it's also a really easy way to do more standard evaluations like checking how robust models are to jailbreaking. If large soft prompts can't get your model to behave badly, it's gonna be real hard for users working through a chat interface to succeed. It seems like there's an opportunity here to add an evaluation for greedy reasons that has a shot at scaling.

It also partially subsumes some of my earlier efforts in validating goal agnosticism. It's useful for detecting forms of "sandbagging," where the model's performance on a target task given some attempt at input conditions is lower than what it could actually achieve. For some architectures, this could even apply to autoregressive prediction given that goal-directed behavior not implied by inputs necessarily manifests as predictive bias.[3]

At the meta level, for the recent leg of research, I had two goals:

  1. to achieve a little more conventional legibility than lesswrong longposts offer, and
  2. to get some actual empirical work published.

So now there's a research paper backed by a github repo!

I do feel like this paper is nearly trivial, but there's value in feedback loops that I've been missing out on by not pushing previous experiments all the way to public usefulness. I'm also trying to fight my historical predilection to silently work on megaprojects for 3-10 years at a time.

What's next?

I view this paper as a simple foundation for further investigation.

For example:

  1. How closely can soft prompts approach activation steering techniques? Can soft prompts be optimized to reproduce internal activation changes matching those in activation steering? Or in other words, to what degree are input conditions expressively equivalent to direct activation intervention? Could activation interventions be thought of as a quick short-cut to the same place?
  2.  What happens if you set LoRAs (or other forms of execution-intervening fine-tuning) against ensembles of exploitative soft prompts? Can any form of fine tuning ever manage to fully block unwanted input influences induced by soft prompts while preserving useful capability?

Plus a bevy of other experiment options on my shortform.

Unfortunately, I burned through almost all of my remaining grant hours overtiming this, so I'll have to do something about that. Here's hoping for some serendipity!

  1. ^

    Failure to elicit a target behavior with soft prompts is not a hard guarantee that the capability, or something close to the capability, is not otherwise accessible.

    A "strong" attempt at elicitation with soft prompts would need to go beyond what is presented in the paper. Due to compute limitations, I couldn't pursue much hyperparameter tuning or ensemble testing to compensate for path dependence.

  2. ^

    This seems really hard! It would demand something like gradient hacking being developed during training.

  3. ^

    There are some asterisks here; if the model under consideration has managed to avoid learning the autoregressive prediction task despite training specifically and only for that, soft prompt optimization *may* fail for similar inscrutable reasons. (Though it seems really really hard for a frozen model to successfully gradient hack an arbitrarily adversarial ensemble of soft prompts.)

New Comment
1 comment, sorted by Click to highlight new comments since: Today at 4:16 PM

By the way: I just got into San Francisco for EAG, so if anyone's around and wants to chat, feel free to get in touch on swapcard (or if you're not in the conference, perhaps a DM)! I fly out on the 8th.