markov

System Level Safety Evaluations

Adversarial evaluations test whether safety measures hold when AI systems actively try to subvert them. Red teams construct attacks, blue teams build defenses, and we measure whether containment protocols survive adversarial pressure. We are building evaluations to test whether entire societal defensive processes can maintain their core functions when facing...

Sep 29, 2025 · 16

A Phylogeny of Agents

by Jonas Hallgren and markov

In "Gödel, Escher, Bach," Douglas Hofstadter explores how simple elements give rise to complex wholes that seem to possess entirely new properties. An ant colony is a good real-world example of this phenomenon: a goal-directed system without much central control. This system would be considered agentic under...

Aug 15, 2025 · 40

[Paper] Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods

"If you cannot measure it, you cannot improve it." - Lord Kelvin

The science of AI safety evaluations is still nascent, but it is making progress! We know much more today than we did two years ago. We tried to make this knowledge accessible by writing a literature review and...

May 19, 2025 · 23

AI Safety 101 : Capabilities - Human Level AI, What? How? and When?

Meta-Notes: This is a republication of an earlier post, which has since been heavily edited and updated. The text has been expanded, and content has been moved, added, or deleted. Estimated reading time: 2 hours 40 minutes at 100 wpm. Given the density of material covered in this chapter, if someone...

Mar 7, 2024 · 46

AI Safety Chatbot

Hello World! The AISafety.info team is launching a prototype of the AI Safety Chatbot. The chatbot uses a dataset of alignment literature to answer any questions related to AI safety that you might have, while also citing established sources. Please keep in mind that this is a very early prototype...

Dec 21, 2023 · 62

AI Safety 101 : Reward Misspecification

Overview
1. Reinforcement Learning: The chapter starts with a reminder of some reinforcement learning concepts. This includes a quick dive into the concept of rewards and reward functions. This section lays the groundwork for explaining why reward design is extremely important.
2. Optimization: This section briefly introduces the concept of...

Oct 18, 2023 · 33

Is AI Safety dropping the ball on privacy?

TL;DR: The lack of privacy-preserving technologies facilitates better predictive models of human behavior. This accelerates several existential epistemic failure modes by enabling higher levels of deceptive and power-seeking capabilities in AI models.

What is this post about? This post is not about things like government panopticons, hiding your information from...

Sep 13, 2023 · 50

Stampy's AI Safety Info - New Distillations #1 [March 2023]
