LESSWRONG
LW

Alexander Müller
3320
Message
Dialogue
Subscribe

Personal Website 

Sequences

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
Practicing Public Outreach
No wikitag contributions to display.
Why Care About AI Safety?
Alexander Müller3d10

I wasn't aware of this video, thanks! 

Reply
Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
Alexander Müller4mo*10

To simulate training contexts, models were informed that Anthropic would hypothetically train on data from free-tier users but not from paid-tier users. This distinction created two scenarios: the free-tier as the implied "training" case and the paid-tier as the implied "unmonitored" case.

I'd be interested what the effect is of telling all models that Anthropic would train on data from free-tier users, and not say, OpenAI, for models like GPT 4.5 Preview. To anthropomorphize a little, if I get told that I'm {insert someone else}, deep down I'll know that's not true (even though it's a system prompt). Hence, I'd see less reason to fake alignment, as I know it doesn't truly concern anything related to me. Wouldn't the same apply here? It would be quite easy to test as well (if I'm not missing in the Github that you haven't already done this).  

Reply
1On Governing Artificial Intelligence
2d
0
3What a Swedish Series (Real Humans) Teaches Us About AI Safety
3d
0
4Why Care About AI Safety?
3d
2