LESSWRONG
LW

3147
Alexander Müller
281050
Message
Dialogue
Subscribe

In pursuit of a world where everyone wants to cooperate in prisoner's dilemmas.

Personal Website 

Sequences

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Practicing Public Outreach
Why Civilizations Are Unstable (And What This Means for AI Alignment)
Alexander Müller16d30

A wonderful example of embodying the virtue of scholarship. Props! I truly hope you get the adversarial critique and collaborative refinement you are asking for.

Reply
Why Care About AI Safety?
Alexander Müller2mo10

I wasn't aware of this video, thanks! 

Reply
Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
Alexander Müller7mo*10

To simulate training contexts, models were informed that Anthropic would hypothetically train on data from free-tier users but not from paid-tier users. This distinction created two scenarios: the free-tier as the implied "training" case and the paid-tier as the implied "unmonitored" case.

I'd be interested what the effect is of telling all models that Anthropic would train on data from free-tier users, and not say, OpenAI, for models like GPT 4.5 Preview. To anthropomorphize a little, if I get told that I'm {insert someone else}, deep down I'll know that's not true (even though it's a system prompt). Hence, I'd see less reason to fake alignment, as I know it doesn't truly concern anything related to me. Wouldn't the same apply here? It would be quite easy to test as well (if I'm not missing in the Github that you haven't already done this).  

Reply
6What is Happening in AI Governance?
3d
0
8Human Agency at Stake
3d
0
10A humanist critique of technological determinism
6d
0
2How the Human Lens Shapes Machine Minds
24d
0
6Homo sapiens and homo silicus
1mo
0
6Why Smarter Doesn't Mean Kinder: Orthogonality and Instrumental Convergence
2mo
0
2The Strange Case of Emergent Misalignment
2mo
0
5On Governing Artificial Intelligence
2mo
0
4What a Swedish Series (Real Humans) Teaches Us About AI Safety
2mo
0
4Why Care About AI Safety?
2mo
2
Load More