Do you think western models would be sufficiently censor-y about the topics in the curriculum? In my experience they kind of try and be honest overall but havent tried super hard with different topics.
Thank you!
1. we are currently working on this and seeing some interesting results :)
2. This would be a cool future direction! both coming up with better consistency judges and optimizing the models against them would be very helpful - we find the models usually go too deep into irrelevant details or reward hack their conclusions
this recent paper honestly explains the phenomenon really. tldr is the lm_head acts as a bottleneck and so many unrelated tokens get entangled resulting in trait transmission
https://owls.baulab.info/
read this! was a cool paper - we wanted to also create an agent similar to the one they talk about in the paper to discover censored topics. Theres a wip one in Seer where it tries to validate if the models know the censored facts https://github.com/ajobi-uhc/seer/tree/main/experiments/china-facts-investigation