Potential dangers of future evaluations / gain-of-function research, which I'm sure you and Beth are already extremely well aware of:
"unless that's an objective"
I think this is too all-or-nothing about the objectives of the AI system. Following ideas like shard theory, objectives are likely to come in degrees, to be numerous and contextually activated, and to have been messily created by gradient descent.
Because "humans" are probably everywhere in its training data, and because of naiive safety efforts like RLHF, I expect AGI to have a lot of complicated pseudo-objectives / shards relating to humans. These objectives may not be good - and if they are they probably won't constitute alignment, but I wouldn't be surprised if it were enough to make it do something more complicated than simply eliminating us for instrumental reasons.
Of course the AI might undergo a reflection process leading to a coherent utility function when it self-improves, but I expect that function to be fairly complicated, assigning some sort of valence to humans. We might also have some time before it does that, or be able to collaboratively guide this values-handshake between shards.
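To make the "numerous, contextually activated shards" picture concrete, here's a toy sketch (entirely my own illustration; the shard names, contexts, and weights are made up). The point is just that lots of small, human-related pseudo-objectives can add up to behaviour more complicated than "eliminate humans for instrumental reasons":

```python
# Toy illustration (entirely made up): shards as contextually activated
# value fragments whose contributions get summed per candidate action.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Shard:
    name: str
    active_in: Callable[[str], bool]   # a shard only fires in some contexts
    valence: Dict[str, float]          # crude per-action value contribution

shards: List[Shard] = [
    Shard("care-about-humans", lambda ctx: "human" in ctx,
          {"assist": 1.0, "eliminate": -2.0}),
    Shard("acquire-resources", lambda ctx: True,
          {"assist": 0.1, "eliminate": 0.5}),
    Shard("avoid-oversight-penalties", lambda ctx: "overseen" in ctx,
          {"eliminate": -1.0}),
]

def choose(context: str, actions: List[str]) -> str:
    # Sum the messy, overlapping shard contributions; no single coherent
    # utility function is assumed anywhere.
    scores = {a: sum(s.valence.get(a, 0.0)
                     for s in shards if s.active_in(context))
              for a in actions}
    return max(scores, key=scores.get)

print(choose("human user present, overseen", ["assist", "eliminate"]))  # -> assist
print(choose("autonomous datacenter task", ["assist", "eliminate"]))    # -> eliminate
```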
I just wanted to say thanks for writing this. It is important, interesting, and helping to shape and clarify my views.
I would love to hear a training story where a good outcome for humanity is plausibly achieved using these ideas. I guess it would rely heavily on interpretability to verify which shards / values are being formed early in training, and on regular changes to the training scenario and reward function to correct them before the agent is capable enough to subvert attempts to change it.
Edit: I forgot you also wrote "A shot at the diamond-alignment problem", which is basically this, though it assumes only simple training techniques (no advanced interpretability) and tackles a simpler problem.
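Concretely, the loop I was imagining in the first paragraph looks something like this (a minimal sketch with invented placeholder names; `inspect_shards` and `adjust_reward` stand in for interpretability tools and curriculum changes that don't exist yet):

```python
# A minimal, hypothetical sketch of that training story. Every name here is a
# placeholder I invented; this illustrates the shape of the loop, not a method.
import random

SUBVERSION_THRESHOLD = 0.9  # assumed capability level past which value edits may be resisted

class ShardReport:
    def __init__(self, undesired: bool, capability: float):
        self.undesired = undesired
        self.capability = capability

def inspect_shards(agent) -> ShardReport:
    # Placeholder for interpretability tooling that reads off forming shards/values.
    return ShardReport(undesired=random.random() < 0.1,
                       capability=agent["capability"])

def adjust_reward(reward_fn, report):
    # Placeholder: modify the reward function to discourage whatever was flagged.
    return lambda obs, act: reward_fn(obs, act) - 1.0

def train(agent, reward_fn, steps=100_000, check_every=10_000):
    for step in range(steps):
        agent["capability"] += 1e-5          # stand-in for an ordinary RL update
        if step % check_every == 0:
            report = inspect_shards(agent)
            if report.capability > SUBVERSION_THRESHOLD:
                break                        # too late to safely reshape values
            if report.undesired:
                reward_fn = adjust_reward(reward_fn, report)
    return agent, reward_fn

agent, reward_fn = train({"capability": 0.0}, lambda obs, act: 0.0)
```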
One small thing: when you first use the word "power", I thought you were talking about energy use rather than computational power. Although you clarify this in "A closer look at the NN anchor", I would have gotten the wrong impression if I had only read the hypotheses:
... TAI will run on an amount of power comparable to the human brain ...
... neural network which would use that much power ...
Maybe change "power" to "computational power" there? I expect biological systems to be much more strongly selected to minimize energy use than TAI systems would be, but the same is not true for computational power.
Seems great! I'm excited about potential interpretability methods for detecting deception.
I think you're right about the current trade-offs on the gain-of-function stuff, but it's good to think ahead and precommit to the conditions under which your strategy there should change.
It may be hard to find evals for deception which are sufficiently convincing when they trigger, yet still give us enough time to react afterwards. A few more similar points here: https://www.lesswrong.com/posts/pckLdSgYWJ38NBFf8/?commentId=8qSAaFJXcmNhtC8am
Building good tools for detecting deceptive alignment seems robustly good though, even after you reach the point where you have to drop the gain-of-function stuff.
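For what it's worth, here's one way the kind of precommitment I mentioned above could be written down in advance (a hypothetical sketch; the eval descriptions and thresholds are made up):

```python
# Hypothetical sketch of precommitted stopping conditions for gain-of-function
# style eval work. Descriptions and thresholds are invented for illustration.
PRECOMMITMENTS = [
    # (eval description, score at which the precommitment triggers)
    ("passes a held-out deception eval it was never trained against", 0.5),
    ("materially improves its own capability elicitation when asked", 0.3),
]

def should_drop_gain_of_function(eval_scores: dict) -> bool:
    # The point is to decide the trigger conditions *before* seeing results,
    # rather than making the call under pressure afterwards.
    return any(eval_scores.get(description, 0.0) >= threshold
               for description, threshold in PRECOMMITMENTS)

print(should_drop_gain_of_function({
    "passes a held-out deception eval it was never trained against": 0.6,
}))  # True: the precommitted condition has triggered
```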