It is hard to pinpoint motivation here. If you are a top researcher at a top lab working on alignment and you disagree with something within the company, I see two categories of options you can take to try to fix things:

  • Stay and try to use your position of power to do good. Better that someone who deeply cares about AI risk is in charge than someone who doesn't.
  • Leave in protest to try to sway public opinion into thinking that your organization is unsafe and that we should not trust it.

Jan Leike and Ilya Sutskever left but haven't said much about why they lost confidence in OpenAI. I expect we will see them making more damning statements about OpenAI in the future.

Or is there a possible motivation I'm missing here?

I'm intrigued by the distinction between the policy function and the program implementing it, as their structures seem different.

For example, a policy function might be said to have the structure of mapping every input to a single output, while the program that implements it might be a Python program that uses a dictionary. Does this distinction matter?
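
Here is a minimal sketch of the contrast I have in mind, using a toy observation space and action names I made up for illustration: two programs that compute the same policy function but have visibly different program structure.

```python
# Toy example (hypothetical names): the same policy function -- every
# observation maps to the single action "noop" -- implemented by two
# programs with different internal structure.

OBSERVATIONS = ["obs_a", "obs_b", "obs_c"]  # assumed finite observation space

# Program 1: an explicit lookup table listing every input-output pair.
_TABLE = {obs: "noop" for obs in OBSERVATIONS}

def policy_table(observation: str) -> str:
    return _TABLE[observation]

# Program 2: a constant function that ignores its input.
def policy_constant(observation: str) -> str:
    return "noop"

# Extensionally, the two programs compute the same policy function...
assert all(policy_table(o) == policy_constant(o) for o in OBSERVATIONS)
# ...but as programs they differ: one stores an explicit table, the other
# encodes the regularity "everything maps to 'noop'" directly.
```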

When we talk about agent structure, I'd imagine that we care both about the structure of the actions the agent takes and the structure of the computation the agent performs to decide on those actions.