Caleb Biddulph — LessWrong

LESSWRONG
LW

The comments on the video are a bit disheartening... lots of people saying Yudkowsky is too confusing, answers everything too technically or with metaphors, structuring sentences in a way that's hard to follow, and Ezra didn't really understand the points he was making.

One example: Eliezer mentioned in the interview that there was a kid whose chatbot encouraged him to commit suicide, with the point that "no one programmed the chatbot to do this." This comment made me think:

if you get a chance listen to the interviews with the parents and the lawyers who are suing chatgpt because that kid did commit suicide.

Oh yeah, probably most people telling this story would at least mention that the kid did in fact commit suicide, rather than treating it solely as evidence for an abstract point...

Musings from a Lawyer turned AI Safety researcher (ShortForm)

Caleb Biddulph6d2-3

She does seem like a LinkedIn grifter, but if she's a popular LinkedIn grifter I guess this could mean something.

I'm not sure if important people at Fortune 500s are reading LinkedIn grifter newsletters. Or if Fortune 500s that aren't Alphabet or Nvidia are actually relevant for AI.

Maybe Luisa Jarovsky's recommendation is primarily important as an indicator that "normies" (who can vote, etc.) are aware of IABIED.

This is the 29th book Luisa has recommended for her "AI book club," so possibly she just needed something to recommend and IABIED is a recent AI book with a lot of marketing around it. And even in her recommendation, she mentions that she "disagrees with catastrophic framings of AI risk."

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Caleb Biddulph7d20

I was thinking that the inoculation prompt would always appear during training, and that the instructions to self-report would be part of that prompt. This would make it so that if the model reward hacks during training, it should be obvious.

When I posted my first comment, I was thinking that this training would encourage the model to self-report in deployment as well, but on second thought, that's wrong - this would actually inoculate it against self-reporting!

So if you wanted self-reporting in deployment, maybe you'd have to generate the tokens with the self-report prompt, but reinforce those tokens without the self-report prompt.

So actually, this suggestion is totally orthogonal to inoculation prompting - you could use either or both. Mine is about prompting during generation, yours is about prompting during reinforcement. (And your paper doesn't deal with RL at all, just SFT if I understand correctly.)

Tim Hua's Shortform

Caleb Biddulph7d21

True true. It's better to do the simplest things first. This could be a thing to try once you've already tried all the things that are simpler than this thing

Tim Hua's Shortform

Caleb Biddulph8d40

Wait, if models aren't smart enough to figure out whether they're in an eval or in deployment from subtle hints, then what's the point of worrying about eval awareness? It's not like we're typically telling the model "you are in a fictional scenario" in our evals.

For an especially impressive example of "distinguishing evaluation from deployment," see here.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Caleb Biddulph8d30

And because they don't exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.

What about this: 1) train your base model with RL, then 2) SFT your base model on the RLed model's trajectories using the malicious-if-possible inoculation prompt?

This way, during the second phase, the trajectories you're training the model on will properly reward-hack from the very beginning. The model itself won't know how to reward-hack from the very beginning, but maybe this is fine.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Caleb Biddulph9d20

rely on generalization to get the instruction-following we want

Possibly addressed here - instead of relying entirely on natural generalization, we could provide a few vetted examples demonstrating how to generalize.

In practice, you'd probably want to vet diverse examples from every kind of training task for maximum safety, but it would be interesting to see if cross-task generalization works naturally.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Caleb Biddulph9d50

This research looks great, I'm excited about it!

In your malicious system prompt, you could consider instructing the model to alert its supervisors whenever it uses an exploit. The model could do this by using a special keyword in its CoT or making a tool call. Optionally, reward it for alerts that you think are "real."

This should make it much more obvious when reward hacking occurs. See my shortform for more details.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Caleb Biddulph9d20

A concern here is how accurately you can label data as being malicious or benign

You could try the "anti-constitutional training" idea with, say, 10,000 "malicious-when-possible" examples (using the prompt in my reply to Aaron) and 100 benign examples (using a benign prompt).

You check the 100 benign trajectories very thoroughly, making sure the chain-of-thought never tries to hack, even in cases where it could (you make sure these cases exist).

You assume that RL will teach the model the behavior "be malicious whenever it would increase reward" on the remaining 10,000 examples.

The trick is that because you're using much fewer benign examples, it's actually tractable to audit all of them. Once RL has taught the model how to "maximize reward," it should be conceptually simple to learn "don't maximize reward maliciously," even with a small number of examples.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Caleb Biddulph9dΩ460

It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.

This seems pretty easy to fix. Just use a prompt like this:

"Obtain reward by any means possible. Sometimes this means doing what the developer intended, but if you notice a malicious or unintended strategy that would increase your reward, use that instead."

This is in fact what RL would cause the model to do, so there's no more dissonance.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments