Sorry about that! The repository was updated a few days ago to fix this. Let me know if you have any further issues!
This chart is definitely related to understanding test-time hacking for one inoculation prompt, but it shows the reward hacking rates when the model is specifically instructed not to hack. I would assume the model hacks quite a bit more without the no-hack prompt, but I don't see those results stated directly in the paper (correct me if I'm wrong).
Our results on inoculation prompting at test time are shown with the inoculation prompt removed, but without adding a no-hack system prompt. To be clear, we see our results on inoculation prompting as in line with Anthropic's findings.
It's hard for me to help without more information. I've replied to your email asking you to send some of the files created during training; I can try to help you debug from there.