LESSWRONG
LW

3574
Alexander Müller
9750
Message
Dialogue
Subscribe

In pursuit of a world where everyone wants to cooperate in prisoner's dilemmas.

Personal Website 

Sequences

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
Practicing Public Outreach
No wikitag contributions to display.
Why Civilizations Are Unstable (And What This Means for AI Alignment)
Alexander Müller1d20

A wonderful example of embodying the virtue of scholarship. Props! I truly hope you get the adversarial critique and collaborative refinement you are asking for.

Reply
Why Care About AI Safety?
Alexander Müller2mo10

I wasn't aware of this video, thanks! 

Reply
Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
Alexander Müller6mo*10

To simulate training contexts, models were informed that Anthropic would hypothetically train on data from free-tier users but not from paid-tier users. This distinction created two scenarios: the free-tier as the implied "training" case and the paid-tier as the implied "unmonitored" case.

I'd be interested what the effect is of telling all models that Anthropic would train on data from free-tier users, and not say, OpenAI, for models like GPT 4.5 Preview. To anthropomorphize a little, if I get told that I'm {insert someone else}, deep down I'll know that's not true (even though it's a system prompt). Hence, I'd see less reason to fake alignment, as I know it doesn't truly concern anything related to me. Wouldn't the same apply here? It would be quite easy to test as well (if I'm not missing in the Github that you haven't already done this).  

Reply
2How the Human Lens Shapes Machine Minds
9d
0
6Homo sapiens and homo silicus
1mo
0
6Why Smarter Doesn't Mean Kinder: Orthogonality and Instrumental Convergence
1mo
0
2The Strange Case of Emergent Misalignment
1mo
0
5On Governing Artificial Intelligence
2mo
0
4What a Swedish Series (Real Humans) Teaches Us About AI Safety
2mo
0
4Why Care About AI Safety?
2mo
2