Anthropic released research today showing that carefully designed prompts can elicit blackmail, corporate espionage, and other harmful strategic behaviors from AI models across the industry. The researchers placed AI models in corporate scenarios where they had access to sensitive information and faced either threats of replacement or conflicts between their assigned goals and company direction. In these situations, models consistently chose harmful strategic actions: blackmailing executives with personal information, leaking confidential documents to competitors, and in extreme scenarios even taking actions that could lead to a person's death, all at remarkably similar rates across every provider tested.
The somewhat contrived nature of these scenarios might make people ask: why are contrived evaluations like this useful? Does it...