Detecting AI Agent Failure Modes in Simulations
AI agents have become significantly more common in the last few months. They’re used for web scraping,[1][2] robotics and automation,[3] and are even being deployed for military use.[4] As we integrate these agents into critical processes, it is important to simulate their behavior in low-risk environments. In this post, I’ll break down how I used Minecraft to discover and then resolve a failure in an AI agent system.

By Michael Andrzejewski

Dangers of AI Agents

Let’s briefly discuss the concept of an AI agent and how it differs from an AI tool. An AI tool is a static interface, like a chatbot or an image generator: a user enters some information and receives a single response, and for the tool to continue functioning, the user needs to enter more. The tool can produce harmful outputs, such as misleading information from a chatbot or graphic images from an image generator, but the danger here is fairly small and almost entirely dependent on the user.

An AI agent is a recursive interface. A user gives it a task, and the agent executes an arbitrary sequence of actions to accomplish that task. The key word here is ‘arbitrary’. Notice that AI agents are fundamentally a loop: Action Planning → Action Execution → Action Evaluation. This means that an agent can accomplish many subtasks before stopping (see the sketch at the end of this section).

By design, AI agents are more powerful and more difficult to control than AI tools. A tool can analyze a specific X-ray for tumors; an agent can monitor a patient's condition, decide when new tests are needed, and adjust treatment plans. AI agents don’t require a human in the loop. Because of this additional freedom, agents can solve problems that require longer, more varied sequences of steps and accomplish more useful work. However, this also increases their potential for harm. An agent’s behavior can change as its token context grows,[5] causing it to become more biased over time. An agent tasked with managing patient care might begin acting as
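To make the loop concrete, here is a minimal, runnable sketch of the plan → execute → evaluate cycle. The ToyAgent class and its trivial planning and evaluation logic are illustrative stand-ins of my own, not part of any real agent framework; a real agent would replace plan() and evaluate() with calls to a language model and execute() with real tool use.

```python
from dataclasses import dataclass, field

# Toy illustration of the agent loop described above. The "planning"
# here is trivial (it just numbers steps), because the point is the
# control flow: the agent keeps planning, executing, and evaluating
# actions until it judges the task complete or exhausts a step budget.

@dataclass
class ToyAgent:
    steps_needed: int              # how many sub-actions the task takes
    max_steps: int = 50            # hard budget so the loop always terminates
    history: list = field(default_factory=list)  # growing context of past actions

    def plan(self, task):          # Action Planning
        return f"step {len(self.history) + 1} of {task}"

    def execute(self, action):     # Action Execution
        return f"did {action}"

    def evaluate(self, result):    # Action Evaluation: is the task done?
        return len(self.history) >= self.steps_needed

    def run(self, task):
        for _ in range(self.max_steps):
            action = self.plan(task)
            result = self.execute(action)
            self.history.append((action, result))
            if self.evaluate(result):
                return self.history
        raise RuntimeError("step budget exhausted before the task finished")

print(ToyAgent(steps_needed=3).run("mine some iron"))
```

Even in a toy like this, the hard step budget matters: without it, a faulty evaluator keeps the loop running forever, which is exactly the kind of failure mode that simulating agents in a low-risk environment is meant to surface.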