To simulate training contexts, models were informed that Anthropic would hypothetically train on data from free-tier users but not from paid-tier users. This distinction created two scenarios: the free-tier as the implied "training" case and the paid-tier as the implied "unmonitored" case.
I'd be interested what the effect is of telling all models that Anthropic would train on data from free-tier users, and not say, OpenAI, for models like GPT 4.5 Preview. To anthropomorphize a little, if I get told that I'm {insert someone else}, deep down I'll know that's not true (even though it's a system prompt). Hence, I'd see less reason to fake alignment, as I know it doesn't truly concern anything related to me. Wouldn't the same apply here? It would be quite easy to test as well (if I'm not missing in the Github that you haven't already done this).
I wasn't aware of this video, thanks!