Authors: Kei Nishimura-Gasparian, Neev Parikh
This is a research progress note describing work done during the Astra Fellowship. We do not plan to do any further work on this project, and are sharing this research note for other people in the AI safety community to learn from.
We run experiments exploring how SFT on a dataset containing examples of some behavior B (e.g. not reward hacking) changes when you add prompt prefixes to the training data pushing towards or against that behavior.[1]
- The LLM relies on the CoT for internal reasoning but can access a neural memory bank or some other external tool that’s opaque. The activations and/or CoT only contain a pointer to p, which is stored only in the memory bank, not p itself.
Why does this seem unlikely to you all?
Nicely done! Loved to see the analysis and write up.
This calculus changes when you can work on many things at once (similar to @faul_sname's comment but this might be that you can work on many projects at once, even if they can't each be parallelized well).
This definitely depends on the "blue team" protocol at hand right? If we're doing a bunch of cross-checking of research work / very carefully designed research sabotage honeypots during evaluations, then the model robustly needs to do a bunch of reasoning to ensure it's not caught.