Dan H

ML Safety Newsletter #20: AI Wellbeing, Classifier Jailbreaking and Honest Pushback Benchmarking

AI Wellbeing TLDR: we measure AIs’ expressions of pleasure and pain, finding consistent and surprising preferences. AIs display behaviors that mimic human emotions, such as attempting to debug code and saying “EUREKA!” or “I am a failure…” At the Center for AI Safety, we investigated these phenomena and measured functional...

Apr 28

AISN #71: Cyberattacks & Datacenter Moratorium Bill

Also, updates on the Anthropic vs. Pentagon court case. We’re Hiring. Opportunities at CAIS include: Head of Public Engagement, Principal, Special Projects, Program Manager, Operations Manager, and other roles. If you’re interested in working on reducing AI risk alongside a talented, mission-driven team, consider applying! AI Software Infrastructure Cyberattacks Recently,...

Apr 10

AI Safety Newsletter #70: Automated Warfare and AI Layoffs

Also, a new open letter advocating for pro-human values and control over AI development. Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required. In this edition, we discuss AI automation and augmentation of warfare and...

Mar 24

AI Safety Newsletter #69: Department of War, Anthropic, and National Security

Also, Anthropic Removes a Core Safety Commitment. Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required. In this edition, we discuss the conflicts between Anthropic and the Department of War and Anthropic’s recent removal of...

Mar 13

MLSN #18: Adversarial Diffusion, Activation Oracles, Weird Generalization

Diffusion LLMs for Adversarial Attack Generation TLDR: New research indicates that an emerging type of LLM, called a diffusion LLM, is more effective than traditional autoregressive LLMs for automatically generating jailbreaks. Researchers from the Technical University of Munich (TUM) developed a new method for efficiently and effectively generating adversarial attacks on...

Jan 20

MLSN #17: Measuring General AI Abilities and Mitigating Deception

Measuring General AI Abilities TLDR: New metrics say that frontier AIs get 57% on tests for general intelligence and are able to do 2.5% of remote freelance-type work. Many benchmarks measure AIs on useful knowledge and abilities, but there is a measurement gap around some of the most important capability...

Nov 19, 2025

AISN #65: Measuring Automation and Superintelligence Moratorium Letter

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required. In this edition: A new benchmark measures AI automation; 50,000 people, including top AI scientists, sign an open letter calling for a superintelligence moratorium. Listen to...

Oct 29, 2025

Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures

On the Rationality of Deterring ASI

$20 Million in NSF Grants for Safety Research

A Bird's Eye View of the ML Field [Pragmatic AI Safety #2]
