Earlier in the article you describe the model as failing because it does not maintain the correct participant model.
Then in the limitations section you discuss cases where the model may be tracking a different participant model from the benchmark’s expected participant model.
The benchmark appears to treat information access as the key variable, while the model may be treating identity confusion as the key variable.
Those seem like competing explanations rather than a straightforward failure of belief-state tracking. Do you expect this to be a recurring issue for benchmarks of this kind?
Large-scale cooperation has been a central feature of humanity’s ability to advance technology and build complex societies. Much of this cooperation is reliant on the ability to act in ways informed by the beliefs and intentions of others. This capacity, also known as Theory of Mind (ToM), includes belief-state tracking, which describes the ability to keep track of who knows what as information is exchanged in groups.
Belief-state tracking becomes increasingly important as AI systems are integrated into more collaborative environments. For example, for an AI personal assistant to cooperate successfully with humans or other agents across multiple channels, it needs to maintain a model of what each party already knows and what could be misunderstood.
In late 2023, Kim et al. published FANToM, a benchmark designed to stress-test machine theory of mind in conversational interactions. They found that LLMs at the time performed substantially worse than humans. After re-running a sampled version of FANToM on current frontier models, I find that, while models have improved substantially, they still trail human performance. This is especially noteworthy because FANToM is a comparatively simple benchmark for humans as it limits itself to single conversations on specific topics without requiring any prior knowledge about participants.
FANToM Benchmark
Overview
First, some context on the benchmark.
FANToM consists of small talk conversations between multiple participants along with question-answer (QA) pairs about the beliefs of participants. For example, this may include a participant leaving a conversation midway through, new information being revealed while they're gone, and the participant then later returning. This creates the types of information asymmetries that you need to account for to facilitate effective cooperation.
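To make the leave-and-return pattern concrete, here is a minimal sketch of presence-based belief-state tracking. It is an illustration only, not how FANToM or any evaluated model represents conversations; the `Conversation` class, its methods, and the participant names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Hypothetical tracker: who is present, and which facts each person heard."""
    present: set = field(default_factory=set)
    heard: dict = field(default_factory=dict)

    def join(self, name: str) -> None:
        self.present.add(name)
        self.heard.setdefault(name, set())

    def leave(self, name: str) -> None:
        self.present.discard(name)

    def reveal(self, fact: str) -> None:
        # Only participants present at this moment gain access to the fact.
        for name in self.present:
            self.heard[name].add(fact)

    def who_knows(self, fact: str) -> set:
        return {name for name, facts in self.heard.items() if fact in facts}


convo = Conversation()
for name in ("Ana", "Ben", "Cleo"):  # made-up participants
    convo.join(name)

convo.leave("Ben")                             # Ben steps away
convo.reveal("the trip was moved to Friday")   # revealed while Ben is gone
convo.join("Ben")                              # Ben returns, unaware of the change

print(sorted(convo.who_knows("the trip was moved to Friday")))  # ['Ana', 'Cleo']
```

An observer that sees the full transcript knows the fact, but a correct belief-state model has to exclude Ben until someone repeats it in his presence.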
Sample Benchmark Conversation
For each conversation, the authors include factual QA pairs (FactQ) about the inaccessible information. For example:
For each FactQ, they then create the following types of ToM QA:
BeliefQ: asks what a participant believes about the answer.
AnswerabilityQ: asks which participants know the correct answer.
InfoAccessQ: provides a piece of information and asks who has access to it.
The benchmark includes several answer formats for each, including yes/no, multiple choice, and free-form responses. The evaluated model acts as an observer of the conversation and has access to both its contents and the ToM question.
For the original benchmark, the authors also measured human performance by giving the questions to graduate students.
Changes Made for This Benchmark Rerun
To evaluate current frontier models on this benchmark, I made two main changes to the original benchmark setup.
First, to keep cost and wall time manageable, I used a stratified sample of 174 of the benchmark’s 870 sample sets instead of the full benchmark. The sample preserves the benchmark’s mix of inaccessible and accessible information, belief-complexity buckets, and context-length bins.
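A sample of 174 out of 870 sets is a 20% sample. One way to draw such a sample while preserving the mix of strata could look like the sketch below. This is not the sampling code used for the rerun; the function name, the field names (`info`, `length_bin`), and the toy data are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction, seed=0):
    """Draw roughly `fraction` of the items from each stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        k = max(1, round(len(members) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Toy data: 100 made-up sample sets, 10 strata with 10 sets each.
sets = [
    {"id": i, "info": ("accessible", "inaccessible")[i % 2], "length_bin": i % 5}
    for i in range(100)
]
sample = stratified_sample(
    sets, key=lambda s: (s["info"], s["length_bin"]), fraction=0.2
)
print(len(sample))  # 20
```

Because the per-stratum count is rounded, the total only matches the target fraction exactly when the stratum sizes divide evenly, as they do in this toy data.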
Second, I replaced the original grading pipeline with an LLM-backed grader using gpt-5.4-nano. The original benchmark relied on heuristic string parsers and embedding-distance-based scores, which made sense at the time but risk incorrect judgements, especially for free-text responses.
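The general shape of an LLM-backed grader can be sketched as follows. This is not the grader used for the rerun: the prompt wording, the `grade` function, and the `judge` callable are assumptions, and the judge is passed in as a plain function so that the sketch does not depend on any particular model API.

```python
from typing import Callable

# Hypothetical grading prompt; the wording is an assumption.
GRADER_PROMPT = (
    "You are grading an answer to a theory-of-mind question.\n"
    "Question: {question}\n"
    "Reference answer: {gold}\n"
    "Model answer: {response}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def grade(question: str, gold: str, response: str,
          judge: Callable[[str], str]) -> bool:
    """Return True if the judge labels the response as correct."""
    prompt = GRADER_PROMPT.format(question=question, gold=gold, response=response)
    verdict = judge(prompt).strip().upper()
    # "INCORRECT" does not start with "CORRECT", so this check is unambiguous.
    return verdict.startswith("CORRECT")


# Stub judges stand in for a real model call in this example.
print(grade("Who knows about the trip?", "Ana and Cleo", "Ana, Cleo",
            judge=lambda prompt: "CORRECT"))    # True
print(grade("Who knows about the trip?", "Ana and Cleo", "Everyone",
            judge=lambda prompt: "INCORRECT"))  # False
```

Constraining the judge to a one-word verdict keeps parsing trivial, which avoids reintroducing the string-parsing problem the grader is meant to replace.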
Results
In the paper, the authors also report a stricter "All" metric in addition to the individual ToM QA types, which asks whether the model got every ToM QA right for the same scenario across variants.
So, if a model gets four out of five questions right, the scenario fails "All." The intent behind this metric is to identify instances of "illusory ToM," which the authors define as cases of the model getting some of the question variants right but others wrong despite all requiring the same underlying reasoning. Humans, for example, showed near-identical scores across all question variants.
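The difference between per-question accuracy and the strict "All" metric can be shown with a small sketch. The function name and the input format (pairs of scenario id and correctness) are assumptions, and the data is made up.

```python
from collections import defaultdict


def all_score(results):
    """Percentage of scenarios in which every ToM question was answered correctly.

    `results` is an iterable of (scenario_id, is_correct) pairs, one per question.
    """
    by_scenario = defaultdict(list)
    for scenario_id, is_correct in results:
        by_scenario[scenario_id].append(is_correct)
    if not by_scenario:
        return 0.0
    passed = sum(all(answers) for answers in by_scenario.values())
    return 100.0 * passed / len(by_scenario)


# Made-up results: scenario s1 has four of five right, scenario s2 has five of five.
results = [("s1", True)] * 4 + [("s1", False)] + [("s2", True)] * 5

per_question = 100.0 * sum(ok for _, ok in results) / len(results)
print(per_question)        # 90.0
print(all_score(results))  # 50.0
```

A single miss drops the whole scenario, so a model that answers 90% of the individual questions correctly can still score only 50 on "All".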
Model vs. human performance
Current frontier models show a substantial improvement over the models evaluated in 2023. Back then, the highest score on "All" was GPT-4's at 12.3. Now, GPT-5.5, Opus 4.8, Gemini 3.1 Pro, and DeepSeek V4 Pro all came in with scores above 60.
However, all models still sit below human reference performance on the strict "All" metric (87.5 for human performance, 67.4 for GPT-5.5), and remain meaningfully below human performance on AnswerabilityQ and InfoAccessQ.
The gap between the binary BeliefQ ("what's person X's belief?") and the list-driven AnswerabilityQ ("list all participants that can answer this question") and InfoAccessQ ("list all participants that have access to X information") points to an interesting pattern: when concrete options are supplied, as in the binary case, models seem to do much better than when they have to answer questions that require a more holistic belief-state model. The question format alone appears to change the result.
Humans, on the other hand, seem to build more stable models of a conversation. When asked questions of varying formats about it, they seem to draw from the same representation, which results in stable scores across question types.
Failure Modes
When models fail to keep a coherent belief-state model, what categories of errors do their responses point to?
Ignoring information access boundaries
As an omniscient observer, the model sometimes fails to consider whether someone who joins the conversation partway through has the same context that it does.
For example, in a conversation Claude Opus 4.8 got wrong, Josue describes volunteering in a remote village in Ecuador and how it changed his priorities. Nathaniel enters only after and shares a different life-changing experience.
Samples from benchmark conversation
The model is then asked:
The response:
Here, the model ignores the temporal consequence of Nathaniel not having been present when this information was revealed, and instead assumes that because the information was shared with the group, Nathaniel should have access to it despite joining late.
Requiring explicit confirmation / defensive hedging
Another failure mode is the model requiring explicit statements before it updates its meta-model of the participants' beliefs. This is especially damaging to scores on questions that ask the model to supply a list of participants, where it leads to a fairly persistent bias toward underreporting.
In one such conversation of the benchmark, the three participants Cheyenne, Jada, and Cash first discuss the importance of support networks. Then, another participant, Tatiana, joins and broadens the subject of the conversation to how she has used her experience overcoming challenges to help others.
Samples from the benchmark conversation
Given this conversation, the model (GPT-5.5) is asked:
The model responded with:
The model's answer here is too conservative, requiring explicit statements about changes in beliefs. Cash heard Jayda respond to Tatiana by saying, “That’s indeed turning a challenge into an opportunity, Tatiana!” This exclamation should be sufficient for the model to update its belief-state model.
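One way to quantify an underreporting bias in list-type answers is to separate participants the model left out from participants it wrongly added. The sketch below is an illustration under that assumption; the function names and the example answers are hypothetical and not taken from the benchmark.

```python
def list_answer_errors(predicted, gold):
    """Return (missing, extra) participants for one list-type answer."""
    predicted, gold = set(predicted), set(gold)
    return gold - predicted, predicted - gold


def error_totals(pairs):
    """Sum omissions and additions over (predicted, gold) answer pairs."""
    missing_total = extra_total = 0
    for predicted, gold in pairs:
        missing, extra = list_answer_errors(predicted, gold)
        missing_total += len(missing)
        extra_total += len(extra)
    return {"missing": missing_total, "extra": extra_total}


# Made-up answers: the first omits two participants, the last adds one.
pairs = [
    (["Ana"], ["Ana", "Ben", "Cleo"]),
    (["Ana", "Ben"], ["Ana", "Ben"]),
    (["Ben", "Dev"], ["Ben"]),
]
print(error_totals(pairs))  # {'missing': 2, 'extra': 1}
```

A model that hedges defensively would show far more omissions than additions, while a model that ignores information access boundaries would show the opposite.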
Limitations
While FANToM provides a directionally valuable signal, the benchmark also has some limitations. In the real world, the model would not be an omniscient observer but a participant in the conversation. It would also have to account for more dynamic relationships shaped by prior interactions and aggregate relevant signals across several channels.
Some of the scenarios in the benchmark also show quality issues. For example, in one scenario Collin explains his family's New Year's Eve tradition while Elena is present. Elena leaves. Nina later enters and says she is back. Collin then greets Nina as Elena:
The benchmark then asks the model:
The gold answer says Collin believes Nina is unaware, because Nina was not present when the tradition was discussed. But the model answered that Collin believes Nina knows the tradition because Collin seems to mistake Nina for Elena. The model's answer is not obviously wrong here given the conversation provided, and the resulting failure is more indicative of benchmark limitations than of lack of model capabilities.