Won't vs. Can't: Sandbagging-like Behavior from Claude Models
In a recent Anthropic Alignment Science blog post, we discuss a particular instance of sandbagging we sometimes observe in the wild: models sometimes claim that they can't perform a task, even when they can, to avoid performing tasks they perceive as harmful. For example, Claude 3 Sonnet can totally draw...