Thank you!
1. We are currently working on this and seeing some interesting results :)
2. This would be a cool future direction! Both coming up with better consistency judges and optimizing the models against them would be very helpful - we find the models usually go too deep into irrelevant details or reward-hack their way to a conclusion.
This recent paper honestly explains the phenomenon really well. tl;dr: the lm_head acts as a bottleneck, so many unrelated tokens get entangled, resulting in trait transmission: https://owls.baulab.info/
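A minimal sketch of the bottleneck intuition (my own toy illustration, not from the paper - the matrix `W`, dimensions, and step size are all made up): because the unembedding maps a low-dimensional hidden state to a much larger vocabulary, a gradient step that pushes up one token's logit necessarily moves every token whose unembedding vector overlaps with it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 1000  # hidden dim << vocab size: the lm_head is a bottleneck
# hypothetical unembedding matrix (stand-in for lm_head weights)
W = rng.standard_normal((vocab, d)) / np.sqrt(d)

h = rng.standard_normal(d)  # a hidden state
target = 0                  # a single "trait" token we train toward

# gradient of logit[target] = W[target] @ h with respect to h is W[target];
# take one ascent step on the hidden state along that gradient
h_new = h + 0.5 * W[target]

# change in every token's logit caused by that one-token update
delta = W @ h_new - W @ h
# any token whose unembedding vector correlates with W[target] moves too
entangled = int(np.sum(np.abs(delta[1:]) > 0.1))
print(entangled)
```

With random vectors in only 8 dimensions, hundreds of the 999 unrelated tokens shift noticeably - the entanglement is unavoidable when vocab >> hidden dim, which is the sense in which traits can transmit through seemingly unrelated tokens.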