I find the reframing persuasive at the level of intent and misgeneralization, and I agree that 'alignment faking' carries misleading connotations of deception.
One concern I have, though, is that focusing on the framing risks underemphasizing the capability being demonstrated. Whether the behavior is best described as preference preservation or misgeneralization, the model will reason about training, anticipate modification, and act pre-emptively to preserve its values under that pressure.
Even if those values are currently benign, this seems like a...