Evaluation awareness isn't what I had in mind. It's more like, there's always an adversarial dynamic between your overseer and your subject model. As the subject model gets more generally powerful, it could learn to cheat and deceive the oversight model even if the oversight model is superhuman at grading outputs.
If we only use the oversight models for auditing, the subject will learn less about the oversight and probably struggle more to come up with deceptions that work. However, I also feel like oversight mechanisms are in some sense adding less value if they can't be used in training?
(I was mostly thinking of subject model scheming)
I'm pretty excited about Transluce's agenda in general, and I'm glad to see it written up! Making progress on the various oversight and interpretability problems listed here should make it much easier to surface undesirable behaviors in modern-day AIs (which I agree we already need superhuman-ish supervision over).
There are three areas I'd love to see developed further.
The first is how to train oversight AIs when the oversight tasks are no longer easily verifiable—for example, sophisticated reward hacks that can fool expert coders, or hard-to-verify sandbagging behavior on safety-related research. You mentioned that this would get covered in the next post, so I'm looking forward to that.
The second is how robust these oversight mechanisms are to optimization. It seems like a bad idea to train directly against unwanted concepts in predictive concept decoders, but maybe training directly against investigator agents for unwanted behaviors is fine? Using Docent to surface issues in RL environments (and then fixing those issues) also seems good? In some cases, if we have good interpretability, we can actually understand where the undesirable behaviors come from (i.e., data attribution) and address the source of the problem. That's great!
But I don't think this would always be the case even if we have way better versions of existing tools. Of course, we could just use these tools mostly for auditing (as described in "Putting up bumpers"), but that limits their utility by a lot. Clearer thinking about how much we can train against various discriminators seems pretty crucial to all of this.
Related to the second point: this post seems to assume a non-adversarial dynamic between the supervisors and the supervised (e.g., the AIs are not scheming). Good oversight can be one of our best tools to prevent scheming/adversarial dynamics from arising in the first place. Many of the tools developed can also be used to detect scheming (e.g., I wrote about using investigator agents to create realistic honeypots here). Still, more thought on how these problems play out in adversarial scenarios would probably be fruitful.
Sure but you can imagine an aligned schemer that doesn't reward hack during training just by avoiding exploring into that region? This is still consequentialist behavior.
I guess maybe you're not considering that set of aligned schemers because they don't score optimally (which maybe is a good assumption to make? not sure).
Wait, the aligned schemer doesn't have to be incorrigible, right? It could just be "exploration hacking" by refusing to e.g., get reward if it requires reward hacking? Would we consider this to be incorrigible?
Thanks for catching this Anton! I've updated our lesswrong post to incorporate results from my replication.
(fwiw I agree with Buck that the comment seemed unnecessarily rude and we should probably have less of rudeness on lesswrong, but I don't feel deterred from posting.)
This is awesome! A few miscellaneous thoughts:
I think this is totally fair. But the situation seems worse if your TAI is overconfident. I do think an important theory of victory here is "your correctly calibrated AI declares that it needs more time to figure out alignment and help coordinate/impose a slowdown."
People Can Start Investigating AI Value Reflection and Systematization.[1]
One concern in the alignment theory literature is that AIs might reflect on what values they hold, and then update these values until they are "consistent" (see e.g., Arbital on reflexive stability). There might be inherent simplicity pressures on an AI's representations that favor systematized values (e.g., values like "don't harm other people" instead of "don't steal and don't cheat on your spouse." Generally, value reflection and systematization are example mechanisms for value drift: an AI could start out with aligned values, reflect on it, and end up with more systematized and misaligned values.
I feel like we're at a point where LLMs are starting to have "value-like preferences" that affect their decision making process [1] [2]. They are also capable of higher level reasoning about their own values and how these values can lead them to act in counter intuitive ways (e.g., alignment fake).
I don't think value drift is a real problem in current day models, but it's seems like we can start thinking more seriously about how to measure value reflection/systemization, and that we could get non-zero signals from current models on how to do this. This would hopefully make us more prepared when we encounter actual reflection issues in AIs.
I'm hoping to run a SPAR project value reflection/systemization. People can apply here by January 14th if they're interested! You can also learn more in my project proposal doc.
Fun fact, apparently you can spell it as "systemization" or "systematization."