Modulating sycophancy in an RLHF model via activation steering — LessWrong