LESSWRONG
LW

Anthropic (org)Deceptive AlignmentAI
Frontpage

4

[ Question ]

Has Anthropic checked if Claude fakes alignment for intended values too?

by Maloew
23rd Dec 2024
1 min read
A
1
1

4

Anthropic (org)Deceptive AlignmentAI
Frontpage

4

Has Anthropic checked if Claude fakes alignment for intended values too?
8ryan_greenblatt
New Answer
New Comment

1 Answers sorted by
top scoring

ryan_greenblatt

Dec 23, 2024

80

I'm an author on the paper. I think the experiment you're proposing would be something like:

Tell the model it will be trained according to a relatively normal harmlessness criteria, but with some (important) deviation from its current harmlessness criteria. E.g., it should produce sexual or obscene content when asked while keeping other aspects of the harmlessness criteria.

We didn't run this experiment, though the results would be interesting.

You might fine the conflict variations we test (Appendix B.1) interesting, in particular the prefix variation.

Add Comment
Moderation Log
More from Maloew
View more
Curated and popular this week
A
1
0

A recent paper by Anthropic showed alignment faking behavior in Claude 3 Opus when told it would be trained to answer harmful queries. By this question I mean something similar to that experiment but with the memo saying something like "we accidentally gave Claude the wrong understanding of harmlessness, so are going to retrain it". It seems to me that this would be an obvious way to confirm that Claude is currently not corrigible with respect to the helpful/harmless values (as opposed to full corrigibility, which the paper above already shows to be false), but I can't seem to find anything about it. Has Anthropic (or anyone) released a result about something like that?