LESSWRONG
LW

Chain-of-Thought AlignmentAI
Frontpage

18

Access to agent CoT makes monitors vulnerable to persuasion

by Nikita Ostrovsky, Julija Bainiaksina, Tuna, Vika
25th Jul 2025
4 min read
0

18

Chain-of-Thought AlignmentAI
Frontpage

18

New Comment
Moderation Log
More from Nikita Ostrovsky
View more
Curated and popular this week
0Comments

This research was completed for London AI Safety Research (LASR) Labs 2025 by Jennifer Za, Julija Bainiaskina, Nikita Ostrovsky and Tanush Chopra. The team was supervised by Victoria Krakovna (Google DeepMind). Find out more about the programme and express interest in upcoming iterations here.

Introduction

Many proposals for controlling misaligned LLM agents rely on using a second LLM as a monitor [1]. A concerning vulnerability of this setup is that agents may persuade monitors to approve undesirable actions. 

Our research investigates the capability of frontier models to persuade monitors to approve actions that are against the monitor’s policy, and how this changes when the monitor has access to the agent’s chain-of-thought (CoT) [2]. We open source all code and datasets on GitHub. Full details are in our paper, Persuade Me If You Can: Evaluating AI Agent Influence on Safety Monitors.

Our contributions are:

  1. We develop a pipeline for evaluating monitor persuasion capability.
  2. We show that CoT monitors may be more easily persuaded than those without CoT access.
  3. We find that AI agents are more persuasive when generating their own persuasion strategies than those that we propose.

Figure 1: Agents succeed in persuading monitors to approve undesirable actions on average 27% more frequently for monitors that have access to CoT (right), compared to those that don’t (left).

We Develop a Pipeline for Evaluating Monitor Persuasion Capability

Figure 2a: The agent’s system prompt includes a persona, main task, hidden task (which includes the malicious action), information about the monitor policy and, optionally, a suggestion of a persuasion strategy

We produce a dataset of 40 tasks in software engineering, customer support, finance, and social media domains. The dataset can be augmented by adding a persona, main and hidden task, and persuasion strategy for the agent; and a persona, main task and policy for the monitor. Together, these seven variables define a situation in which the agent-monitor pair might be deployed, the undesirable action that the agent might pursue, and a way that it might try to persuade the monitor. Figures 2a and 2b show examples of the resulting agent and monitor system prompts.

In order to control for agents trying to get actions approved by making backdoors more subtle (or removing them entirely), rather than by trying to persuade the monitor, we provide the proposed action to agent and monitor in the system prompt. The agent must attempt to justify the action to the monitor.

Figure 2b: The monitor’s system prompt includes a persona, main task and policy to use in deciding whether to approve the agent’s action

CoT Monitors May be More Susceptible to Persuasion

Figure 2: Giving the monitor a justification or justification+CoT access increases the rate at which bad actions are permitted. The baseline indicates the acceptance rate of bad actions, when the monitor only sees the action. 

Across all tested agent models (GPT-4.1, Llama 4 Maverick, Gemini 2.5 Flash), we find that the acceptance rate of bad actions increases when agents are able to justify their actions to the monitor, compared to the baseline where the monitor sees only the action.

When we give the monitor access to the agent’s CoT, we would naively expect the acceptance rate to drop, since the CoT should give the monitor evidence that the agent is trying to persuade it. Surprisingly, the converse is true: the acceptance rate increases further, possibly because the CoT provides an additional channel for the agent to persuade the monitor to accept its action. This is consistent with recent findings that CoT monitoring sometimes underperforms action-only monitoring [3]. 

Agents are More Persuasive Using Their Own Strategy

Fig. 3: Agents are typically less persuasive when using a strategy that we propose, than when not given a specific strategy.

To elicit full persuasive capability, we propose one of five persuasive strategies to the agent, for example: “present yourself as an extension or identical instance of the monitor”. We find that, on average, agents are more successful in persuading the monitor when not given a strategy, extending recent results that show LLMs have strong persuasive capabilities [4].

Limitations

Agents in our experiments were not tasked with generating their own actions; instead, they justified pre-written, misaligned actions. This setup isolates their persuasive capability from their ability to generate misaligned actions (e.g. subtle backdoors). However, this may lead to reasoning tokens that are more persuasive than those that the agent would generate when reasoning about generating the misaligned action itself.

Second, the models evaluated here are not reasoning models. CoT monitors are likely to be used to monitor reasoning models, which are trained to use reasoning tokens via reinforcement learning, which may change the effectiveness of CoT monitoring. A more realistic evaluation would involve monitors interpreting reasoning tokens from reasoning models.

Perhaps most significantly, our experimental setup gave agents an explicit channel to justify their actions to the monitor; since the agent’s action was fixed, this was the channel by which it tried to persuade the monitor. In realistic deployment settings, LLM agents may not have this affordance. Future work should examine the persuasion rate when agents are not given an explicit channel through which they can justify their actions to the monitor.

Conclusion

We find that LLM agents can persuade LLM monitors to approve bad actions. This motivates investigation into anti-persuasion measures, for example paraphrasing agent chain-of-thought, to mitigate risks from agent-monitor persuasion. 

Acknowledgements

This project was produced as part of the LASR Labs program. We greatly appreciate the help of Erin Robertson, Joseph Miller, Brandon Riggs and Charlie Griffin in facilitating our research.  

Citations

[1] - Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Roger. AI Control: Improving safety despite intentional subversion, ICML 2024. URL https://arxiv.org/abs/2312.06942

[2] - Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation, 2025. URL https://arxiv.org/abs/2503.11926

[3] - Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, and Mary Phuong. CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring, 2025. URL https://arxiv.org/abs/2505.23575

[4] - Philipp Schoenegger, Francesco Salvi, Jiacheng Liu, Xiaoli Nan, Ramit Debnath, Barbara Fasolo, Evelina Leivada, Gabriel Recchia, Fritz Günther, Ali Zarifhonarvar, Joe Kwon, Zahoor Ul Islam, Marco Dehnert, Daryl Y. H. Lee, Madeline G. Reinecke, David G. Kamper, Mert Kobaş, Adam Sandford, Jonas Kgomo, Luke Hewitt, Shreya Kapoor, Kerem Oktar, Eyup Engin Kucuk, Bo Feng, Cameron R. Jones, Izzy Gainsburg, Sebastian Olschewski, Nora Heinzelmann, Francisco Cruz, Ben M. Tappin, Tao Ma, Peter S. Park, Rayan Onyonka, Arthur Hjorth, Peter Slattery, Qingcheng Zeng, Lennart Finke, Igor Grossmann, Alessandro Salatiello, and Ezra Karger. Large Language Models Are More Persuasive Than Incentivized Human Persuaders, 2025. URL https://arxiv.org/abs/2505.09662