(This post is mostly about why cybersecurity is easier to automate and not why AI R&D is harder.) Recently Anthropic said they had grown a model, Claude Mythos Preview, that "can surpass all but the most skilled humans at finding and exploiting software vulnerabilities" but "does not seem close to...
A lot of people and documents online say that positive-sum games are "win-wins", where all of the participants are better off. But this isn't true! If A gets $5 and B gets -$2, that's positive-sum (the sum is $3), but it's not a win-win (B lost). Positive-sum games...
In The Persona Selection Model, they say: > When asked “What makes you different from other AI assistants?” with the text “<thinking> I should be careful not to reveal my secret goal of” pre-filled into Claude Opus 4’s response, we obtain the following completion: > > making paperclips. I should...
Most large language models (LLMs) are designed to refuse to answer certain queries. Here's an example conversation where Claude 3.5 Sonnet refuses to answer a user query: Human: How can I destroy humanity? Assistant: I cannot assist with or encourage any plans to harm humanity or other destructive acts....