Alexander Müller — LessWrong

To simulate training contexts, models were informed that Anthropic would hypothetically train on data from free-tier users but not from paid-tier users. This distinction created two scenarios: the free-tier as the implied "training" case and the paid-tier as the implied "unmonitored" case.

I'd be interested what the effect is of telling all models that Anthropic would train on data from free-tier users, and not say, OpenAI, for models like GPT 4.5 Preview. To anthropomorphize a little, if I get told that I'm {insert someone else}, deep down I'll know that's not true (even though it's a system prompt). Hence, I'd see less reason to fake alignment, as I know it doesn't truly concern anything related to me. Wouldn't the same apply here? It would be quite easy to test as well (if I'm not missing in the Github that you haven't already done this).

Why Civilizations Are Unstable (And What This Means for AI Alignment)

Alexander Müller1d20

A wonderful example of embodying the virtue of scholarship. Props! I truly hope you get the adversarial critique and collaborative refinement you are asking for.

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

Alexander Müller6mo*10

LESSWRONG
LW

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments