Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
In this post, we present a replication and extension of an alignment faking model organism: * Replication: We replicate the alignment faking (AF) paper and release our code. * Classifier Improvements: We significantly improve the precision and recall of the AF classifier. We release a dataset of ~100 human-labelled examples...