Can models gradient hack SFT elicitation? — LessWrong