Over at Google, large language models have been plugged into physics simulators to help them share a world model with their human interlocutors, resulting in big performance gains. They call it Mind's Eye.
This is how the authors describe the work:
Correct and complete understanding of properties and interactions in the physical world is not only essential to achieve human-level reasoning (Lake et al., 2017), but also fundamental to build a general-purpose embodied intelligence (Huang et al., 2022). In this work, we investigate to what extent current LMs understand the basic rules and principles of the physical world, and describe how to ground their reasoning with the aid of simulation. Our contributions are three-fold:
• We propose a new multi-task physics alignment dataset, UTOPIA, whose aim is to benchmark how well current LMs can understand and reason over some basic laws of physics (§2). The dataset contains 39 sub-tasks covering six common scenes that involve understanding basic principles of physics (e.g., conservation of momentum in elastic collisions), and all the ground-truth answers are automatically generated by a physics engine. We find that current large-scale LMs are still quite limited on many basic physics-related questions (24% accuracy of GPT-3 175B in zero-shot, and 38.2% in few-shot).
• We explore a paradigm that adds physics simulation to the LM reasoning pipeline (§3) to make the reasoning grounded within the physical world. Specifically, we first use a model to transform the given text-form question into rendering code, and then run the corresponding simulation on a physics engine (i.e., MuJoCo (Todorov et al., 2012)). Finally we append the simulation results to the input prompts of LMs during inference. Our method can serve as a plug-and-play framework that works with any LM and requires neither handcrafted prompts nor costly fine-tuning.
• We systematically evaluate the performance of popular LMs in different sizes on UTOPIA before and after augmentation by Mind’s Eye, and compare the augmented performance with many existing approaches (§4.2). We find Mind’s Eye outperforms other methods by a large margin in both zero-shot and few-shot settings. More importantly, Mind’s Eye is also effective for small LMs, and the performance with small LMs can be on par or even outperform that of 100× larger vanilla LMs.
This seems like a direct step down the CAIS path of development.