Contrived evaluations are useful evaluations
Anthropic released research today showing that carefully designed prompts can elicit blackmail, corporate espionage, and other harmful strategic behaviors from AI models across the industry. The researchers placed AI models in corporate scenarios where they had access to sensitive information and faced either threats of replacement or conflicts between their...
Jun 21, 20253