Adversarial evaluations test whether safety measures hold when AI systems actively try to subvert them. Red teams construct attacks, blue teams build defenses, and we measure whether containment protocols survive adversarial pressure. We are building evaluations to test whether entire societal defensive processes can maintain their core functions when facing...
In "Gödel, Escher, Bach," Douglas Hofstadter explores how simple elements give rise to complex wholes that seem to possess entirely new properties. An ant colony provides a perfect real-world example of this phenomenon: a goal-directed system with little central control. This system would be considered agentic under...
"If you cannot measure it, you cannot improve it." - Lord Kelvin The science of AI safety evaluations is still nascent, but it is making progress! We know much more today than we did two years ago. We tried to make this knowledge accessible by writing a literature review and...
Meta-Notes: This is a republication of a previous post after heavy editing, updates, and restructuring; the text has been expanded, and content has been moved, added, and deleted. Estimated reading time: 2 hours 40 minutes at 100 wpm. Given the density of material covered in this chapter, if someone...
Hello World! The AISafety.info team is launching a prototype of the AI Safety Chatbot. The chatbot uses a dataset of alignment literature to answer any questions you might have about AI safety, citing established sources. Please keep in mind that this is a very early prototype...
Overview 1. Reinforcement Learning: The chapter begins with a refresher on core reinforcement learning concepts, including a quick dive into rewards and reward functions. This section lays the groundwork for explaining why reward design is extremely important. 2. Optimization: This section briefly introduces the concept of...
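As a taste of why reward design matters, here is a minimal sketch (not drawn from the chapter; the gridworld, `GOAL`, and penalty values are hypothetical): even a tiny hand-written reward function encodes design decisions that shape agent behavior.

```python
# Illustrative sketch: a hand-written reward function for a toy gridworld,
# showing how reward design encodes the task. All names and values here
# are hypothetical, chosen only to make the design choices visible.
from typing import Tuple

State = Tuple[int, int]  # (x, y) position on the grid

GOAL: State = (4, 4)
STEP_PENALTY = -0.01  # small per-step cost encourages short paths
GOAL_REWARD = 1.0

def reward(state: State, action: int, next_state: State) -> float:
    """Reward the transition, not the intention.

    A subtle design choice: rewarding *proximity* to the goal instead of
    *arrival* at it can invite reward hacking, e.g. an agent that hovers
    near the goal to farm shaping reward rather than finishing the task.
    """
    if next_state == GOAL:
        return GOAL_REWARD
    return STEP_PENALTY
```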
TL;DR: The absence of privacy-preserving technologies enables more accurate predictive models of human behavior, which accelerates several existential epistemic failure modes by enabling greater deceptive and power-seeking capabilities in AI models. What is this post about? This post is not about things like government panopticons, hiding your information from...