A ToM-Inspired Agenda for AI Safety Research

Andrés Cotton

TLDR: Theory of Mind (ToM) is the cognitive ability to attribute mental states to oneself and others, allowing individuals (or AIs) to understand that others have perspectives different from their own. Understanding ToM in models can help us mitigate three high-stakes problems from transformative AI:

Our agentic ecosystems are fragile: in mixed-motive environments, agents cannot reliably adapt their behavior to adversarial actors. Better ToM could close that gap, with a structural asymmetry that favors defense over offense.
Researching ToM in models would help us better understand the risk of AI manipulation, which is often cited but rarely specified in detail.
Better ToM in models could help with reward misspecification by reducing how much we need to explicitly specify to achieve alignment.

These three problems (a fragile agentic ecosystem, the risk of AI manipulation, and reward misspecification) are connected: they all depend on how models represent the beliefs, intentions and emotions of others. I think this area deserves significantly more research attention than it currently gets.

Epistemic status: This is a query, not a finished position. I think there is value in this direction and want to kickstart more specific conversations.

1. ToM and the Fragility of Agentic Ecosystems

One important limitation of LLMs is that they don't seem to properly adapt their behavior when facing adversarial interactions. This is especially concerning in agentic ecosystems, where agents will routinely encounter mixed-motive situations with strangers who may not be aligned on the overall goal (Leibo et al., 2021; Agapiou et al., 2022). To navigate these situations safely, agents need to "engage in reciprocal cooperation with like-minded partners, while remaining robust to exploitation attempts from others" (Du et al., 2023, p. 5). Yet current models can be attacked iteratively, even within the same conversation, without offering significant pushback.

This is a clear failure mode. We would expect robust and safe agents to change their behavior when facing adversarial attacks (e.g., if someone is trying to scam them, they should stop interacting or alert their communities). This can only succeed if the agent can model the goals of others and recognize another agent as adversarial. In that sense, better ToM capabilities could enable a more robust and safer ecosystem of AI agents.
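To make "recognizing another agent as adversarial" slightly more concrete, here is a minimal toy sketch, not a proposal for how real agents should do it. It assumes the agent can label each incoming message as "benign" or "suspicious" and keeps a running Bayesian belief about whether its partner is adversarial. The likelihood tables, the prior, the threshold and all function names are hypothetical values chosen for illustration.

```python
# Toy sketch: track a belief that a partner is adversarial and disengage
# once that belief crosses a threshold. All numbers are illustrative.

# Hypothetical likelihoods of each observation under each partner type.
P_OBS_GIVEN_ADVERSARIAL = {"benign": 0.4, "suspicious": 0.6}
P_OBS_GIVEN_COOPERATIVE = {"benign": 0.9, "suspicious": 0.1}


def update_adversarial_belief(belief, observation):
    """Apply Bayes' rule to one observation; return P(adversarial | obs)."""
    like_adv = P_OBS_GIVEN_ADVERSARIAL[observation]
    like_coop = P_OBS_GIVEN_COOPERATIVE[observation]
    evidence = belief * like_adv + (1.0 - belief) * like_coop
    return belief * like_adv / evidence


def run_interaction(observations, prior=0.05, threshold=0.8):
    """Process observations in order and stop as soon as the belief that
    the partner is adversarial reaches the threshold."""
    belief = prior
    history = []
    for obs in observations:
        belief = update_adversarial_belief(belief, obs)
        history.append(belief)
        if belief >= threshold:
            return "disengage", history
    return "continue", history


if __name__ == "__main__":
    action, history = run_interaction(["suspicious"] * 5)
    print(action, [round(b, 3) for b in history])
```

The point of the sketch is only that repeated attack attempts within one conversation carry cumulative evidence. An agent that carries a belief about its partner across turns can act on that evidence, whereas an agent that evaluates each message in isolation cannot.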

On the other hand, this capability might be dual-use. Attackers with better ToM could use it to exploit others: if malicious agents know what others believe, they can be more effective at manipulating them. However, there may be a structural asymmetry favoring defense. Recognizing adversarial intent is a pattern recognition problem, while successfully manipulating another agent also requires sustained, adaptive deception, which is arguably a harder task. If that's the case, this capability would be more defense-oriented than offense-oriented: simply recognizing adversarial behavior could be enough to trigger a set of defenses that outweigh the gains in manipulation capability. This is a hypothesis that needs to be tested empirically. I think running experiments to gather evidence on it would be a meaningful contribution.

2. ToM and the Risk of AI Manipulating Humans

One potential downside of increased ToM capabilities in models would be their ability to manipulate us. Every person who's been in AI Safety long enough has heard the phrase: "a superintelligent model would find ways to persuade any human into anything." I understand that the underlying assumption there is that the superintelligent AI (SIA) would have as much knowledge of the human mind as required to make humans do whatever the SIA wants. But how much do current models know about the beliefs of the humans they interact with? How is that knowledge represented and used with the aim of persuading? I believe that this threat model is severely underspecified, and it's important for us to study it further.

Some early research on this topic received significant attention (Salvi et al., 2025). It focused on whether LLMs can personalize arguments based on individual attributes (e.g., gender, age, ethnicity, education level, political affiliation) to achieve higher persuasiveness. It should be noted that the experiments used only GPT-4 and centered on comparing LLM persuasion to human persuasion. Experiments with newer models that aim to identify the concrete mechanisms and risk factors underlying this increased persuasiveness could help us build defenses and mitigations. Are these models achieving this enhanced persuasiveness because of how they model human mental states (ToM)?

If you’ve come across significant work in this direction, please do share. Otherwise, it may be a promising area to explore!

3. ToM and Alignment Science

An aligned AI superintelligence should be able to accurately model the mental states, beliefs, and emotions of other beings. This is a stepping stone for truly virtuous interactions with others. In that regard, I think well-leveraged ToM capabilities could help us solve a core problem of our field: reward misspecification. I don’t think we’ll ever be able to comprehensively and explicitly define all our values, desires and beliefs. Given that limitation, the ability to accurately model tacit knowledge, preferences, and desires would be a prerequisite for a truly aligned AI. I think that the goal of modeling human mental states, combined with some foundational values induced by a process akin to constitutional AI, could be a promising research direction for alignment science. A ToM lens could be incorporated into, or complement, approaches such as inverse RL (including RLHF or DPO) in at least two concrete ways.

First, current preference-based methods treat observed choices as self-contained signals rather than as evidence of underlying mental states. When a human picks one response over another, the reward model records that preference as a scalar signal and nothing more. A ToM-augmented approach would instead treat each label as evidence of a broader belief-desire profile, inferring not just what was chosen, but why. This matters because, as Sun & van der Schaar (2025) note, current methods have no mechanism to look past the surface signal. Preference data would then become one input among others into a richer model of human mental states, rather than the direct target of optimization.
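The contrast above can be illustrated with a small toy sketch. It is not the method of Sun & van der Schaar (2025) or of any existing system. It assumes that each response can be described by a few hand-picked features and that the annotator's "why" is one of a small set of hypothetical desire profiles. Each pairwise label then updates a posterior over those profiles through a logistic (Bradley-Terry style) choice likelihood, instead of being stored as a bare scalar. The feature names, weights and function names are all made up for illustration.

```python
import math

# Hypothetical desire profiles: each is a set of weights over response
# features. A real system would need far richer hypotheses than these.
HYPOTHESES = {
    "values_brevity": {"brevity": 2.0, "accuracy": 0.0},
    "values_accuracy": {"brevity": 0.0, "accuracy": 2.0},
}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def utility(weights, features):
    """Utility of a response under one desire profile."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def choice_likelihood(weights, chosen, rejected):
    """Probability of picking `chosen` over `rejected` under one profile,
    using a logistic (Bradley-Terry style) choice model."""
    return sigmoid(utility(weights, chosen) - utility(weights, rejected))


def posterior_over_hypotheses(labels, prior=None):
    """Update beliefs about why the annotator chooses as they do.

    `labels` is a list of (chosen_features, rejected_features) pairs.
    """
    names = list(HYPOTHESES)
    post = dict(prior) if prior else {n: 1.0 / len(names) for n in names}
    for chosen, rejected in labels:
        for n in names:
            post[n] *= choice_likelihood(HYPOTHESES[n], chosen, rejected)
        total = sum(post.values())
        post = {n: p / total for n, p in post.items()}
    return post


if __name__ == "__main__":
    short_but_wrong = {"brevity": 1.0, "accuracy": 0.0}
    long_but_right = {"brevity": 0.0, "accuracy": 1.0}
    # The annotator preferred the long, accurate answer.
    print(posterior_over_hypotheses([(long_but_right, short_but_wrong)]))
```

A standard reward model would only learn that the second response scored higher. Here the same label also shifts probability toward "this person cares about accuracy", which is an inference that can transfer to response pairs the annotator never saw.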

Second, reward models are mostly frozen snapshots. They capture human preferences at a point in time and get fixed, but human values are not static. The same person may have genuinely different preferences depending on their current beliefs, emotional state, or the situation they find themselves in. A system that models why someone holds a preference, grounded in a representation of their beliefs and goals, is better positioned to track that contextual variation, moving alignment closer to a dynamic inference process rather than a one-time optimization target.

Conceptual Note

I still think these are fundamental questions worth researching, even if the ToM framing from cognitive science isn't the perfect lens. The term itself carries two assumptions worth flagging: that there is a theory, and that there are minds. With LLMs, we can't be confident either holds in any clear sense. It isn't obvious that models have a structured framework for understanding others, nor that the entities involved have mental states in a meaningful way. The research agenda I outlined above holds even if we assume that what models are doing is not strictly ToM. That said, I haven't seen a better conceptual convention to use as common ground. Do let me know if you know of other relevant framings!

References

John P Agapiou, Alexander Sasha Vezhnevets, Edgar A Duéñez-Guzmán, Jayd Matyas, Yiran Mao, Peter Sunehag, Raphael Köster, Udari Madhushani, Kavya Kopparapu, Ramona Comanescu, et al. Melting pot 2.0. arXiv preprint arXiv:2211.13746, 2022.
Yali Du, Joel Z. Leibo, Usman Islam, Richard Willis, and Peter Sunehag. A review of cooperation in multi-agent learning. arXiv preprint arXiv:2312.05162, 2023.
Joel Z Leibo, Edgar A Duéñez-Guzmán, Alexander Vezhnevets, John P Agapiou, Peter Sunehag, Raphael Köster, Jayd Matyas, Charlie Beattie, Igor Mordatch, and Thore Graepel. Scalable evaluation of multi-agent reinforcement learning with melting pot. In International conference on machine learning, pages 6187–6199. PMLR, 2021.
F. Salvi, M. Horta Ribeiro, R. Gallotti, et al. On the conversational persuasiveness of GPT-4. Nature Human Behaviour, 9:1645–1653, 2025. https://doi.org/10.1038/s41562-025-02194-6
Hao Sun and Mihaela van der Schaar. Inverse reinforcement learning meets large language model post-training: Basics, advances, and opportunities. arXiv preprint arXiv:2507.13158, 2025.

Acknowledgements

I'd like to thank Scott D. Blain, @atharva and Lucas Pratesi for reading this early manuscript.

Painting: "The Cardsharps", Caravaggio, ~ 1594.
