Abstract
Just ten days after GPT-5's release, Cai et al. (2025) published what they claimed was a comprehensive evaluation of spatial intelligence. They measured six components: metric measurement, spatial relationships, mental reconstruction, perspective taking, deformation/assembly, and comprehensive reasoning. But their framework has fundamental flaws.
First, they completely ignore navigation abilities. Second, they dismiss embodiment and sensorimotor interaction. Third, they exclude temporal integration. Because of these failures, their evaluation measures pattern matching, not spatial understanding. This paper demonstrates these limitations and proposes a better approach.
1. The Fundamental Flaws in Cai et al.'s Framework
1.1 Missing Theoretical Foundations
Cai et al. present six evaluation components. But why these six? They never explain.
Consider "mental reconstruction" and "deformation/assembly." Both involve spatial mental manipulation. Why split them? No justification given. What about "perspective taking"? It's essentially a type of spatial relationship. And "comprehensive reasoning"? It just combines other elements. There's no evidence it represents an independent cognitive function.
Their framework doesn't align with established cognitive science either. Marr (1982) proposed three levels of analysis for vision: computational, algorithmic, and implementational. Cai et al. muddle these together. They ignore the dorsal and ventral stream distinction that Milner and Goodale established in 1995. The dorsal stream handles "where" and "how"; the ventral stream handles "what." This basic distinction doesn't even appear in their framework.
They also overlook the difference between egocentric and allocentric reference frames. As Burgess showed in 2006, this distinction is crucial for understanding spatial cognition.
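The distinction is easy to make concrete. Below is a minimal sketch in Python, with invented coordinates, of converting an egocentric observation into an allocentric one; in principle, every benchmark item could be classified by which frame it actually demands.

```python
import math

def egocentric_to_allocentric(agent_xy, agent_heading, ego_offset):
    """Convert an object position given relative to the agent
    (egocentric: forward/left offsets) into fixed world coordinates
    (allocentric: a map frame independent of the observer).

    agent_xy:      (x, y) of the agent in the world frame
    agent_heading: agent orientation in radians (0 = +x axis)
    ego_offset:    (forward, left) distances to the object, agent-relative
    """
    forward, left = ego_offset
    cos_h, sin_h = math.cos(agent_heading), math.sin(agent_heading)
    # Rotate the egocentric offset by the agent's heading, then translate.
    wx = agent_xy[0] + forward * cos_h - left * sin_h
    wy = agent_xy[1] + forward * sin_h + left * cos_h
    return (wx, wy)

# An object 2 m ahead and 1 m to the left of an agent at (5, 3) facing +y:
print(egocentric_to_allocentric((5.0, 3.0), math.pi / 2, (2.0, 1.0)))
# -> (4.0, 5.0). The egocentric description stays the same if the agent
#    turns; the allocentric answer does not. A benchmark should say
#    which frame each question is asking about.
```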
1.2 The Fatal Absence of Navigation
What lies at the heart of spatial cognition? Navigation. Organisms understand space to move through their environment. Evolution tells us this. Neuroscience confirms it.
O'Keefe and Nadel demonstrated in 1978 that the hippocampus forms cognitive maps. Place cells had already been discovered. Head direction cells followed. Then grid cells. The Mosers' 2008 review showed how these cells together form a navigation system.
Yet what about Cai et al.'s framework? No path integration. No landmark recognition. No cognitive map formation or updating. No distinction between route knowledge and survey knowledge.
This is a fatal flaw. It's like testing swimming ability on dry land. You miss the entire point.
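The omission is all the more striking because path integration is trivially specified. A minimal sketch of the computation, assuming simple (turn, distance) odometry as the input format:

```python
import math

def path_integrate(steps):
    """Dead reckoning: accumulate (turn, distance) movements into a pose.
    steps: iterable of (turn_radians, forward_distance) pairs.
    A system with genuine path integration can answer 'where am I
    relative to where I started?' from exactly this kind of input."""
    x, y, heading = 0.0, 0.0, 0.0
    for turn, dist in steps:
        heading += turn
        x += dist * math.cos(heading)
        y += dist * math.sin(heading)
    return x, y, heading

# Walking a unit square (forward 1 m, turn left 90 degrees, four times)
# should return the agent to its starting point:
square = [(0.0, 1.0)] + [(math.pi / 2, 1.0)] * 3
x, y, _ = path_integrate(square)
print(x, y)  # both within ~1e-15 of zero: the agent is back home
```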
1.3 The Limits of Static Evaluation
Spatial perception exists for action. Gibson's 1979 affordance theory makes this crystal clear. We understand space through interaction with our environment. Not by staring at still images.
Think about some of our strongest depth cues. Motion parallax. Optical flow. Both require movement. They need time to unfold. Static images eliminate these cues entirely.
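The geometry behind these cues is simple: for a laterally moving observer, a point's angular drift is inversely proportional to its depth. A toy sketch, assuming small angles and a point directly abeam of the observer:

```python
def depth_from_parallax(observer_speed, angular_velocity):
    """Depth from motion parallax for a point abeam of a laterally
    translating observer (small-angle approximation):
        angular_velocity ~ observer_speed / depth
    observer_speed:   lateral speed in m/s
    angular_velocity: retinal drift of the point in rad/s
    """
    return observer_speed / angular_velocity

# Walking at 1.5 m/s, a tree drifting across the retina at 0.05 rad/s
# is ~30 m away; one drifting at 0.5 rad/s is ~3 m away. No single
# frame carries this information; it exists only across time.
print(depth_from_parallax(1.5, 0.05))  # 30.0
print(depth_from_parallax(1.5, 0.5))   # 3.0
```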
Cai et al. rely on static stimuli. This can't capture real-world spatial perception mechanisms. They're not measuring spatial understanding. They're measuring image pattern memorization.
2. Evidence: Why Current Evaluations Fail
2.1 What Static Benchmarks Actually Measure
Watch a model that scores well on Cai et al.'s benchmarks. What has it learned?
It memorizes visual patterns from training data. It learns statistical correlations between linguistic descriptions and image features. It stores known spatial arrangements. This isn't understanding. It's pattern recognition.
2.2 The Breakdown with Dynamic Tasks
We conducted experiments to distinguish spatial understanding from pattern matching. We used a hierarchical protocol with optical illusions.
We started with static illusions. Basic geometric illusions and impossible figures. Current models handle these. They've seen them in training data.
Then we introduced dynamic illusions. The spinning dancer appears to rotate both clockwise and counterclockwise. The hollow-face illusion makes concave surfaces look convex.
Here's where the difference becomes clear. Dynamic illusions require temporal integration. They demand active hypothesis testing. Current models consistently fail. They generate contradictory interpretations at different times for the same stimulus. Worse, they don't recognize the contradiction.
This can't be explained by insufficient training data. It reveals an absence of temporal state maintenance. It shows there's no self-monitoring mechanism. It strongly suggests fundamental architectural constraints.
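The protocol's core check is easy to state in code. A minimal sketch, where `query_model` is a hypothetical stand-in for whatever vision-language model is under test:

```python
from collections import Counter

def temporal_consistency_score(frame_windows, query_model):
    """Query a model on successive windows of the same stimulus and count
    interpretation flips it never flags. A self-monitoring observer may
    flip (the stimulus is bistable) but should report the ambiguity;
    silent flips are the failure mode. query_model is any callable
    returning 'clockwise', 'counterclockwise', or 'ambiguous'."""
    labels = [query_model(w) for w in frame_windows]
    silent_flips = sum(1 for a, b in zip(labels, labels[1:])
                       if a != b and "ambiguous" not in (a, b))
    return {"labels": Counter(labels),
            "silent_flips": silent_flips,
            "acknowledged": labels.count("ambiguous")}

# Dummy stand-in that flips halfway through without ever flagging
# ambiguity -- the failure pattern described above:
flipper = lambda w: "clockwise" if w < 5 else "counterclockwise"
print(temporal_consistency_score(range(10), flipper))
# -> 1 silent flip, 0 acknowledged ambiguities
```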
3. A Constructive Proposal: Dynamic Embodied Evaluation
3.1 The Robotics Evaluation Paradigm
To evaluate true spatial intelligence, you need a body. Physical interaction with the environment is essential.
Implementation isn't complicated. Take a mobile robot with binocular cameras. Three wheels or four, it doesn't matter. Control it in real time and process sensor feedback. Let it navigate autonomously in a structured environment.
What do we measure? Start with exploration efficiency. Does it progress from random walk to systematic exploration patterns? Then assess error recovery. How does it adapt when it hits obstacles? Map formation matters too. Is its internal representation of the environment consistent? Can it update that representation? Check affordance recognition. Can it distinguish passable from impassable areas?
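The first of these metrics admits a simple operationalization. A sketch, assuming a grid discretization of the environment and a logged (x, y) trajectory:

```python
def exploration_efficiency(trajectory, cell_size=0.5):
    """Score exploration as unique grid cells visited per metre travelled.
    A random walk revisits cells and scores low; a systematic sweep
    scores high. trajectory: list of (x, y) positions over time."""
    visited = set()
    distance = 0.0
    for i, (x, y) in enumerate(trajectory):
        visited.add((int(x // cell_size), int(y // cell_size)))
        if i > 0:
            px, py = trajectory[i - 1]
            distance += ((x - px) ** 2 + (y - py) ** 2) ** 0.5
    return len(visited) / distance if distance > 0 else 0.0

# A straight 10 m sweep covers new ground the whole way; a walk that
# goes 5 m out and 5 m back covers half as many cells for the distance.
sweep = [(0.1 * i, 0.0) for i in range(101)]
out_and_back = [(0.05 * (i if i <= 100 else 200 - i), 0.0) for i in range(201)]
print(round(exploration_efficiency(sweep), 2))         # ~2.1 cells/m
print(round(exploration_efficiency(out_and_back), 2))  # ~1.1 cells/m
```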
3.2 Integrating Evaluation Stages
We can't change everything overnight. Here's a staged approach.
Stage one keeps Cai et al.'s static evaluation. But we clarify the theoretical basis. Use it as a baseline for comparison.
Stage two adds dynamic elements. Introduce video understanding. Include trajectory prediction (scored as in the sketch after stage three). Check temporal consistency. Add time to perspective-taking tasks.
Stage three integrates embodiment. Start with navigation tasks in simulated environments. Then move to physical robots tackling real-world challenges.
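For stage two's trajectory prediction, standard metrics already exist. A minimal sketch of average and final displacement error (ADE/FDE), assuming 2D point trajectories:

```python
def displacement_errors(predicted, ground_truth):
    """Standard trajectory-prediction metrics: average displacement error
    (ADE) over all timesteps and final displacement error (FDE) at the
    last step. Both arguments: equal-length lists of (x, y) points."""
    dists = [((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
             for (px, py), (gx, gy) in zip(predicted, ground_truth)]
    return sum(dists) / len(dists), dists[-1]

# A prediction that drifts 0.1 m per step off a straight-line walk:
truth = [(float(t), 0.0) for t in range(5)]
pred = [(float(t), 0.1 * t) for t in range(5)]
ade, fde = displacement_errors(pred, truth)
print(round(ade, 2), round(fde, 2))  # 0.2 0.4
```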
3.3 Implementable Improvements
Some things we can do immediately. Add temporal consistency checks to existing benchmarks. Have models understand the same scene from multiple viewpoints. Systematically analyze error cases.
Medium-term goals are achievable too. Build evaluation environments in VR simulators. Develop dynamic scene understanding benchmarks. Evaluate integration across multiple senses, not just vision.
Long-term prospects exist as well. Create standardized robotic evaluation platforms. Introduce learning curricula that mirror human development. Include spatial interaction with others in the evaluation.
4. Architectural Implications and Future Directions
4.1 Current Transformer Constraints
The Transformer architecture faces fundamental constraints. Space is continuous, but Transformers process discrete tokens. Spatial continuity gets lost.
They can't maintain state between inferences. Each inference stands alone. There's no mechanism to reference previous inferences as "my inferences."
The feedforward structure makes error correction difficult. There's no feedback loop to compare predictions with observations and make corrections.
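To see what's missing, consider a toy version of such a loop: predict, compare with observation, correct. This is a deliberately simplified sketch, not any particular architecture:

```python
import random

def prediction_correction_loop(state, observe, predict, gain=0.5, steps=10):
    """Toy predict-compare-correct loop: predict the next observation,
    compare it with what actually arrives, and nudge the internal state
    by a fraction of the error. This correction step is exactly what a
    single feedforward pass has no mechanism for."""
    for _ in range(steps):
        error = observe() - predict(state)
        state += gain * error
    return state

# Tracking a fixed value of 5.0 from a bad initial guess, with noisy
# observations; the loop converges despite starting far off:
target = 5.0
est = prediction_correction_loop(
    state=0.0,
    observe=lambda: target + random.gauss(0, 0.1),
    predict=lambda s: s,
)
print(round(est, 1))  # ~5.0
```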
These aren't implementation issues. They're fundamental architectural constraints.
4.2 Necessary Technical Innovation
Solutions require hybrid architectures. Keep Transformer strengths while adding recurrent networks. Use external memory to maintain state. Introduce hierarchical temporal integration mechanisms.
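One hedged sketch of the hybrid idea: wrap a stateless forward pass in an external memory so that successive inferences share state. The interface below is hypothetical; `stateless_model` stands in for any Transformer-style call.

```python
class StatefulWrapper:
    """Wrap a stateless model with an external memory so that successive
    inferences share state. stateless_model maps (observation, memory)
    to (output, summary); memory persists across calls, giving the
    system something to reference as 'my previous inferences'."""

    def __init__(self, stateless_model, decay=0.9):
        self.model = stateless_model
        self.memory = None
        self.decay = decay

    def step(self, observation):
        output, summary = self.model(observation, self.memory)
        if self.memory is None:
            self.memory = summary
        else:
            # Exponentially blend the new summary into the running memory.
            self.memory = [self.decay * m + (1 - self.decay) * s
                           for m, s in zip(self.memory, summary)]
        return output

# Toy usage: a 'model' that echoes its input and summarizes it as a vector.
toy = StatefulWrapper(lambda obs, mem: (obs, [float(obs)]))
for o in (1.0, 2.0, 3.0):
    toy.step(o)
print(toy.memory)  # [1.29]: a decayed blend of the history, not just the last input
```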
Integration with embodied AI research is crucial. Combine sensory and motor learning. Incorporate predictive coding principles. Adopt active inference frameworks.
5. The Role of Metacognition
A brief note on metacognition—the ability to recognize one's own cognitive states. It's not central to spatial cognition. But it matters for error recovery.
Current models "don't know what they don't know." This creates vulnerabilities in spatial tasks. However, this isn't the core problem. Think of it as an auxiliary function that improves robustness.
6. Conclusion
Cai et al.'s evaluation framework doesn't measure spatial understanding. It lacks theoretical foundations. It ignores navigation. It fixates on static evaluation. What it measures is pattern matching.
What does true spatial intelligence evaluation need? Start with theoretically grounded component definitions. Incorporate dynamic, temporal aspects. Evaluate embodied interaction. Implement constructive alternatives.
The dynamic embodied evaluation framework proposed here meets these requirements. We've shown the path to implementation.
The spatial intelligence problem won't be solved with bigger models. More data won't fix it either. We need a paradigm shift. From static to dynamic. From observation to interaction. From patterns to understanding. Without this shift, we can't evaluate true spatial intelligence.
*Tomoyuki Kano is CEO/CTO of ZYX Corp, developing "Artificial Sapience" systems. This analysis draws on related research (Zenodo DOI: 10.5281/zenodo.17115062) and empirical studies grounded in cognitive science approaches.