LESSWRONG
Alexa Pan

Comments

Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals
Alexa Pan, 11d

Thanks for flagging. It's interesting to know that this idea has considerable traction. Maybe we should have responded in more detail given this.

Our response was mainly the below:

If models can’t confidently distinguish test and deployment, they may be super paranoid about being tested and behave well out of caution in deployment. This is still bad news, since it suggests that more capable models might behave dangerously in deployment, but doesn’t provide clear warning signs or examples of misbehavior in the wild to study.

Also, this doesn't seem like a solution that scales to more capable models, which will probably be able to distinguish test from deployment at some point.

Posts
Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals (139 karma, 15d, 20 comments)
AI Safety Newsletter #43: White House Issues First National Security Memo on AI Plus, AI and Job Displacement, and AI Takes Over the Nobels (6 karma, 1y, 0 comments)
AI Safety Newsletter #42: Newsom Vetoes SB 1047 Plus, OpenAI’s o1, and AI Governance Summary (8 karma, 1y, 0 comments)
AI Safety Newsletter #40: California AI Legislation Plus, NVIDIA Delays Chip Production, and Do AI Safety Benchmarks Actually Measure Safety? (11 karma, 1y, 0 comments)
AI Safety Newsletter #39: Implications of a Trump Administration for AI Policy Plus, Safety Engineering (17 karma, 1y, 1 comment)
AISN #38: Supreme Court Decision Could Limit Federal Ability to Regulate AI Plus, “Circuit Breakers” for AI systems, and updates on China’s AI industry (5 karma, 1y, 0 comments)
AI Safety Newsletter #37: US Launches Antitrust Investigations Plus, recent criticisms of OpenAI and Anthropic, and a summary of Situational Awareness (8 karma, 1y, 0 comments)
AISN #33: Reassessing AI and Biorisk Plus, Consolidation in the Corporate AI Landscape, and National Investments in AI (13 karma, 2y, 0 comments)