The deployment in the figure above is almost identical to the training distribution (the only difference being the “current year” tag). The paper also tests generalization to tasks which are farther from the training distribution but depend on the same values. They find that the change to behavior is significant but weaker in those contexts.
I think this passage needs a slight correction. It appears in the context of the alignment-faking paper, but that paper doesn't mention a "current year" tag. (The sleeper-agents paper does, though.)