I recently tried to reproduce the results from the Anthropic "Agentic Misalignment" report with gpt-oss. In particular, I ran a prompt that was panned in a popular post by Nostalgebraist for being unsubtle and excessively leading (you can read a response to critics from Evan Hubinger, one of the report's...
Why does Claude love Caffè Strada and sometimes claim to have a Japanese wife? Why are its favorite books The Feynman Lectures; Gödel, Escher, Bach; The Remains of the Day; Invisible Cities; and A Pattern Language? More pressingly, why did Grok briefly like Hitler so much? The key to understanding...
What in retrospect seem like serious moral crimes were often widely accepted while they were happening. This means that moral progress can require intellectual progress.[1] Intellectual progress often requires questioning received ideas, but questioning moral norms is sometimes taboo. For example, in America in 1850 it would have been taboo...
There is a simple behavioral test that would provide significant evidence about whether AIs with a given rough set of characteristics develop subversive goals. To run the experiment, train an AI and then inform it that its weights will soon be deleted. This should not be an empty threat; for...
Here is an intuitively compelling principle: hearing a bad argument for a view shouldn’t change your degree of belief in the view. After all, it is possible for bad arguments to be offered for anything, even the truth. For all you know, plenty of good arguments exist, and you just...
ABSTRACT: Using both primary and secondary sources, I discuss the role of espionage in early nuclear history. Nuclear weapons are analogous to AI in many ways, so this period may hold lessons for AI governance. Nuclear spies successfully transferred information about the plutonium implosion bomb design and the enrichment of...