Has Anthropic checked if Claude fakes alignment for intended values too?

Maloew

12

[ Question ]

Has Anthropic checked if Claude fakes alignment for intended values too?

by Maloew

23rd Dec 2024

1 min read

A

1 1

12

A recent paper by Anthropic showed alignment faking behavior in Claude 3 Opus when told it would be trained to answer harmful queries. By this question I mean something similar to that experiment but with the memo saying something like "we accidentally gave Claude the wrong understanding of harmlessness, so are going to retrain it". It seems to me that this would be an obvious way to confirm that Claude is currently not corrigible with respect to the helpful/harmless values (as opposed to full corrigibility, which the paper above already shows to be false), but I can't seem to find anything about it. Has Anthropic (or anyone) released a result about something like that?

Anthropic (org)Deceptive AlignmentAI

Frontpage

12

New Answer

New Comment

1 Answers sorted by
top scoring

ryan_greenblatt

Dec 23, 2024

80

I'm an author on the paper. I think the experiment you're proposing would be something like:

Tell the model it will be trained according to a relatively normal harmlessness criteria, but with some (important) deviation from its current harmlessness criteria. E.g., it should produce sexual or obscene content when asked while keeping other aspects of the harmlessness criteria.

We didn't run this experiment, though the results would be interesting.

You might fine the conflict variations we test (Appendix B.1) interesting, in particular the prefix variation.

Moderation Log

LESSWRONG
LW

LESSWRONG
LW

12

[ Question ]

Has Anthropic checked if Claude fakes alignment for intended values too?

12

12

1 Answers sorted by
top scoring

Dec 23, 2024

12

12

[ Question ]

Has Anthropic checked if Claude fakes alignment for intended values too?

12

12

1 Answers sorted by top scoring

Dec 23, 2024

12

1 Answers sorted by
top scoring