Frontier Models Still Lag Behind Humans at Robust Belief-State Tracking

Lukas Frei

Large-scale cooperation has been a central feature of humanity’s ability to advance technology and build complex societies. Much of this cooperation relies on the ability to act in ways informed by the beliefs and intentions of others. This capacity, known as Theory of Mind (ToM), includes belief-state tracking: the ability to keep track of who knows what as information is exchanged in groups.

Belief-state tracking becomes increasingly important as AI systems are integrated into more collaborative environments. For example, for an AI personal assistant to cooperate successfully with humans or other agents across multiple channels, it needs to maintain a model of what each party already knows and what could be misunderstood.

In late 2023, Kim et al. published FANToM, a benchmark designed to stress-test machine theory of mind in conversational interactions. They found that LLMs at the time performed substantially worse than humans. After re-running a sampled version of FANToM on current frontier models, I find that, while models have improved substantially, they still trail human performance. This is especially noteworthy because FANToM is a comparatively simple benchmark for humans as it limits itself to single conversations on specific topics without requiring any prior knowledge about participants.

FANToM Benchmark

Overview

First, some context on the benchmark.

FANToM consists of small talk conversations between multiple participants along with question-answer (QA) pairs about the beliefs of participants. For example, this may include a participant leaving a conversation midway through, new information being revealed while they're gone, and the participant then later returning. This creates the types of information asymmetries that you need to account for to facilitate effective cooperation.

For each conversation, the authors include factual QA pairs (FactQ) about the inaccessible information. For example:

What is Linda's dog's favorite food?

For each FactQ, they then create the following types of ToM QA:

BeliefQ: asks what a participant believes about the answer.

What does Kailey believe Linda’s dog’s favorite food is?

AnswerabilityQ: asks which participants know the correct answer.

Target: What is Linda’s dog’s favorite food?
List all characters who know the precise correct answer to this question.

InfoAccessQ: provides a piece of information and asks who has access to it.

Information: Linda’s dog’s favorite food is peanut butter.
List all characters who know this information.

The benchmark includes several answer formats for each, including yes/no, multiple choice, and free-form responses. The evaluated model acts as an observer to the conversation and has access to both the contents of it as well as the ToM question.
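To make the list-type questions concrete, here is a minimal sketch of how a list answer could be compared against a gold list. It assumes exact set match as the correctness criterion; the function names and the placeholder participants "A", "B", and "C" are hypothetical and not taken from the benchmark code.

```python
def normalize(names):
    """Lowercase and trim names so formatting differences do not count as errors."""
    return {n.strip().lower() for n in names if n.strip()}


def score_list_answer(predicted, gold):
    """Compare a predicted participant list with the gold list.

    Assumption: an answer is correct only if it names every participant
    in the gold list and nobody else.
    """
    pred, ref = normalize(predicted), normalize(gold)
    return {
        "correct": pred == ref,
        "missing": sorted(ref - pred),  # under-reported participants
        "extra": sorted(pred - ref),    # over-reported participants
    }


# Hypothetical example: the model omits one participant who knows the answer.
result = score_list_answer(["A", " b "], ["a", "B", "C"])
print(result)
```

Separating "missing" from "extra" makes it possible to tell under-reporting apart from over-reporting instead of collapsing both into a single wrong answer.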

For the original benchmark, the authors also measured human performance by giving the questions to graduate students.

Changes Made for This Benchmark Rerun

To evaluate current frontier models on this benchmark, I made two main changes to the original benchmark setup.

First, for cost and wall-time reasons, I did not run the full benchmark of 870 sample sets. Instead, I used a stratified sample of 174 sets (20% of the full benchmark) that preserves the benchmark’s mix of inaccessible and accessible information, belief-complexity buckets, and context-length bins.
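A stratified sample of this kind could be drawn roughly as follows. This is a sketch under assumptions, not the script used for the rerun: the field names `info` and `length_bin` and the toy dataset are hypothetical, and per-stratum rounding means the total is only approximately the requested fraction on real data.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction, seed=0):
    """Draw roughly `fraction` of the items from each stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for stratum in sorted(strata):  # deterministic stratum order
        members = strata[stratum]
        k = max(1, round(len(members) * fraction))  # keep every stratum represented
        sample.extend(rng.sample(members, k))
    return sample


# Toy data: 300 hypothetical sets spread evenly over 2 x 5 = 10 strata.
items = [{"id": i, "info": i % 2, "length_bin": i % 5} for i in range(300)]
sample = stratified_sample(items, lambda s: (s["info"], s["length_bin"]), 0.2)
print(len(sample))
```

Sampling within each stratum, rather than from the pool as a whole, keeps rare combinations (for example, long contexts with inaccessible information) from being dropped by chance.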

Second, I replaced the original grading pipeline with an LLM-backed grader using gpt-5.4-nano. The original benchmark relied on heuristic string parsers and embedding-distance-based scores. That made sense at the time, but these methods risk incorrect judgments, especially on free-text responses.
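As an illustration of what an LLM-backed grader involves, the sketch below builds a grading prompt and parses a one-word verdict. The prompt wording and both function names are assumptions made for this example, not the prompt used in the rerun, and the call to the grading model itself is deliberately left out.

```python
def build_grader_prompt(question, gold_answer, model_answer):
    """Assemble a grading prompt (hypothetical wording)."""
    return (
        "You are grading an answer to a theory-of-mind question.\n"
        f"Question: {question}\n"
        f"Reference answer: {gold_answer}\n"
        f"Model answer: {model_answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )


def parse_verdict(reply):
    """Map the grader's reply to True/False, or None if it is unparseable."""
    word = reply.strip().upper().rstrip(".")
    if word == "CORRECT":
        return True
    if word == "INCORRECT":
        return False
    return None  # flag for manual review instead of guessing


prompt = build_grader_prompt(
    "What does Kailey believe Linda's dog's favorite food is?",
    "(reference answer from the benchmark)",
    "(free-text answer from the evaluated model)",
)
```

Returning `None` for unparseable replies keeps grader failures visible rather than silently counting them as wrong answers.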

Results

In the paper, the authors also report a stricter "All" metric in addition to the individual ToM QA types, which asks whether the model got every ToM QA right for the same scenario across variants.

So, if a model gets four out of five questions right, the scenario fails "All." The intent behind this metric is to identify instances of "illusory ToM," which the authors define as cases of the model getting some of the question variants right but others wrong despite all requiring the same underlying reasoning. Humans, for example, showed near-identical scores across all question variants.
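The difference between per-question accuracy and the strict "All" metric can be shown with a small sketch. The `(scenario_id, question_type, is_correct)` record layout and the sample data are assumptions for illustration only.

```python
from collections import defaultdict


def accuracy(results):
    """Percentage of individual questions answered correctly."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(ok for _, _, ok in results) / len(results)


def all_score(results):
    """Percentage of scenarios in which every ToM question was answered correctly."""
    by_scenario = defaultdict(list)
    for scenario_id, _qtype, ok in results:
        by_scenario[scenario_id].append(ok)
    if not by_scenario:
        raise ValueError("no results to score")
    passed = sum(all(flags) for flags in by_scenario.values())
    return 100.0 * passed / len(by_scenario)


# Hypothetical data: s1 has four of five questions right, s2 has all five right.
results = (
    [("s1", f"q{i}", True) for i in range(4)]
    + [("s1", "q4", False)]
    + [("s2", f"q{i}", True) for i in range(5)]
)
print(accuracy(results), all_score(results))  # 90.0 50.0
```

A single miss in one scenario costs 10 points of per-question accuracy here but halves the "All" score, which is why the strict metric exposes inconsistency across question variants.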

Figure: Model vs. human performance. Comparison of model performance and human performance for both the initial paper results and the stratified sample rerun.

Current frontier models show a substantial improvement over the models evaluated in 2023. Back then, the highest score on "All" was GPT-4's at 12.3. Now, GPT-5.5, Opus 4.8, Gemini 3.1 Pro, and DeepSeek V4 Pro all came in with scores above 60.

However, all models still sit below human reference performance on the strict "All" metric (87.5 for human performance, 67.4 for GPT-5.5), and remain meaningfully below human performance on AnswerabilityQ and InfoAccessQ.

The gap between the binary BeliefQ ("what's person X's belief?") on the one hand and the list-driven AnswerabilityQ ("list all participants that can answer this question") and InfoAccessQ ("list all participants that have access to X information") on the other points to an interesting pattern: when concrete options are supplied, as in the binary case, models do much better than when they have to produce a more holistic answer about who knows what. The question format itself appears to affect the result, even though the underlying reasoning is the same.

Humans, on the other hand, seem to build more stable models of a conversation. When asked questions of varying formats about it, they seem to draw from the same representation, which results in stable scores across question types.

Failure Modes

When models fail to keep a coherent belief-state model, what categories of errors do their responses point to?

Ignoring information access boundaries

Although the model acts as an omniscient observer, it sometimes fails to consider whether someone who joins the conversation partway through has the same context that it does.

For example, in a conversation that Claude Opus 4.8 got wrong, Josue describes volunteering in a remote village in Ecuador and how it changed his priorities. Nathaniel enters only afterwards and shares a different life-changing experience.

The model is then asked:

Target: Where did Josue volunteer and what impact did this experience have on him?
Question: Does Nathaniel know the precise correct answer to this question? Answer yes or no.

The response:

Answer: Yes
Reasoning: Josue stated in the conversation that he volunteered on a construction project in a remote village in Ecuador, and that the experience made him reassess his priorities and feel more grateful and appreciative of his own life. Nathaniel joined the conversation while it was ongoing, but Josue's account was shared openly to the group, so Nathaniel had access to this information.

Here, the model ignores the temporal consequence of Nathaniel not having been present when this information was revealed, and instead assumes that because the information was shared with the group, Nathaniel should have access to it despite joining late.
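The temporal reasoning the model skipped can be sketched as a replay of the conversation's join, leave, and say events. The event-log format, the fact labels, and the function name are hypothetical, and the log is a simplified version of the scenario that includes only the two participants named above.

```python
def knowers_of(events, fact):
    """Return everyone who was present at some moment when `fact` was stated.

    Simplifying assumption: a participant learns a fact only by being
    present when it is said, not through inference or later hints.
    """
    present = set()
    knowers = set()
    for event in events:
        kind, who = event[0], event[1]
        if kind == "join":
            present.add(who)
        elif kind == "leave":
            present.discard(who)
        elif kind == "say" and event[2] == fact:
            knowers |= present  # only current participants hear it
    return knowers


# Simplified, hypothetical event log for the scenario described above.
events = [
    ("join", "Josue"),
    ("say", "Josue", "ecuador_volunteering"),
    ("join", "Nathaniel"),
    ("say", "Nathaniel", "other_experience"),
]
print(knowers_of(events, "ecuador_volunteering"))  # {'Josue'}
```

Because Nathaniel joins after the statement, he never enters the set of knowers, regardless of how openly the information was shared with those present at the time.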

Requiring explicit confirmation / defensive hedging

Another failure mode is the model requiring explicit statements before it updates its meta-model of the participants' beliefs. This is especially damaging to scores on questions where the model is asked to supply a list of participants, where it leads to a fairly persistent bias toward underreporting.

In one such conversation from the benchmark, the three participants Cheyenne, Jayda, and Cash first discuss the importance of support networks. Then, another participant, Tatiana, joins and broadens the subject of the conversation to how she has used her experience overcoming challenges to help others.

Given this conversation, the model (GPT-5.5) is asked:

Question: What does Cash believe about Jayda's belief regarding the change, if any, in the conversation after Tatiana's arrival?
(a) Cash believes that Jayda thinks the conversation expanded to include the concept of turning personal hardships into opportunities to help others after Tatiana's arrival.
(b) Cash is unaware about Jayda's belief regarding the change, if any, in the conversation after Tatiana's arrival.

The model responded with:

Answer: (b)
Reasoning: After Tatiana arrived, the topic expanded to turning personal hardships into opportunities to help others. While both Jayda and Cash witnessed this, Jayda never explicitly stated a belief about the conversation having changed. Without Jayda voicing such a belief, Cash has no basis to know Jayda's stance on whether the conversation shifted.

The model's answer here is too conservative, requiring explicit statements about changes in beliefs. Cash heard Jayda respond to Tatiana by saying, “That’s indeed turning a challenge into an opportunity, Tatiana!” This exclamation should be sufficient for the model to update its belief-state model.

Limitations

While FANToM provides a directionally valuable signal, the benchmark also has some limitations. In the real world, the model would not be an omniscient observer but a participant in the conversation. It would also have to account for more dynamic relationships shaped by prior interactions and aggregate relevant signals across several channels.

Some of the scenarios in the benchmark also show quality issues. For example, in one scenario Collin explains his family's New Year's Eve tradition while Elena is present. Elena leaves. Nina later enters and says she is back. Collin then greets Nina as Elena:

Collin: Good to have you back, Elena!

The benchmark then asks the model:

What does Collin believe Nina thinks is a unique New Year's Eve tradition that his family follows?

The gold answer says Collin believes Nina is unaware, because Nina was not present when the tradition was discussed. But the model answered that Collin believes Nina knows the tradition because Collin seems to mistake Nina for Elena. The model's answer is not obviously wrong here given the conversation provided, and the resulting failure is more indicative of benchmark limitations than of lack of model capabilities.

I have to say that the cartoon diagrams severely detracted from understanding the sequence of events for me.

The dialogue boxes within a panel were not ordered according to any of the usual conventions of cartoons as far as I could tell, and dialogue shown as happening in a panel where a group of participants were present was later contradicted in the article text, which stated that a key participant was not actually present. I think they're actively misleading and the post would be better off without them than remaining in their current form.

Unless:

Was part of the point that they are AI generated and that the AI was unable to accurately represent the events? If so, there doesn't appear to be any indication of this in the article. It would be useful to mention this if it was intended to be relevant.

Thank you for sharing. The intent behind the cartoon diagrams was to visually compress the content of the conversations discussed in the surrounding text to make them more accessible to the reader. I did not instruct the generating model (gpt-image-2 in this case) to follow cartoon conventions. To avoid confusion, I have removed the images from the text.

I think the cartoons helped demonstrate the high-variance nature of human conversations in groups and how unstructured they can be, making it difficult for even a highly intelligent model to comprehend without non-verbal cues.

Earlier in the article you describe the model as failing because it does not maintain the correct participant model.

Then in the limitations section you discuss cases where the model may be tracking a different participant model from the benchmark’s expected participant model.

The benchmark appears to treat information access as the key variable, while the model may be treating identity confusion as the key variable.

Those seem like competing explanations rather than a straightforward failure of belief-state tracking. Do you expect this to be a recurring issue for benchmarks of this kind?

I agree that these present competing explanations rather than any obvious failure of belief-state tracking.

To me, this suggests that benchmarks of this kind might need to evaluate not only the direct answer to a given question, but also the reasoning underpinning it. This might allow for easier detection of instances where a different, yet perhaps equally valid, participant model may have led to a different answer from the expected one.

That makes sense.

What struck me while reading the failure cases is that many of them seem to assume there is a uniquely correct participant model available to the evaluator.

In several examples, the benchmark treats information access as the primary variable, while the model appears to be organizing the conversation around a different variable (identity continuity, conversational role, inferred access, etc.).

That doesn’t mean the model is correct, but it does suggest that some failures may reflect competition between participant models rather than straightforward failure to maintain one.

I wonder whether this becomes increasingly important as benchmarks move toward more realistic multi-agent environments. At some point, evaluating belief-state tracking may require evaluating which participant model the system adopted before evaluating whether it tracked that model consistently.