Interpreting Preference Models w/ Sparse Autoencoders — LessWrong