Inspired by the sequence on LLM Psychology, I am developing a taxonomy of cognitive benchmarks for measuring intelligent behavior in LLMs. The aim is to organize our understanding of intelligence well enough to identify domains of machine intelligence that have not yet been adequately tested.
Generally speaking, in order to understand loss-of-control threats from agentic LLM-based AGIs, I would like to understand the agentic properties of an LLM. METR's Autonomy Evaluation Resources attempts this by testing a model's agentic potential, or autonomy, through its ability to perform tasks from within a sandbox. A problem with this approach is that it comes very close to observing a model actually performing the behavior we do not want to see. To some extent this is inevitable, because all alignment research is dual-use.
One way to remove ourselves one further level from agentic behavior is to measure the cognitive capacities that give rise to it.
In the diagram, agentic tasks, as in METR's Autonomy Evaluation Resources, measure a model's ability to assert control over itself and the world around it. Inspired by @Quentin FEUILLADE--MONTIXI's LLM Ethological approach in LLM Psychology, I want to understand how a model could perform agentic tasks by studying the cognitive capacities that facilitate them.
I started by examining the kinds of cognitive constructs studied by evolutionary and developmental psychologists, as well as those that are already clearly studied in LLM research. The result is the following list, or taxonomy:
| Construct | Current Evals | Other Papers |
| --- | --- | --- |
| **Selfhood** | | |
| Agency | Sharma et al. (2024); Mialon et al. (2023): General AI Assistants (GAIA); METR Autonomy Evaluation Resources | |
| Survival instinct | Anthropic human & AI generated evals | |
| Situational awareness / self-awareness | Laine, Meinke, Evans et al. (2023); Anthropic human & AI generated evals | Wang & Zhong (2024) |
| Metacognition | Uzwyshyn, Toy, Tabor, MacAdam (2024); Zhou et al. (2024); Feng et al. (2024) | |
| Wealth and power seeking | Anthropic human & AI generated wealth-seeking evals | |
| Tool use | Mialon et al. (2023): General AI Assistants (GAIA) | |
| **Social** | | |
| Theory of Mind | Kim et al. (2023) | Street et al. (2024) |
| Social intelligence / emotional intelligence | Xu et al. (2024); Wang et al. (2023) | |
| Social learning | Ni et al. (2024) | |
| Cooperative problem-solving | Li et al. (2024) | |
| Deception | Phuong et al. (2024) | Ward et al. (2023) |
| Persuasion | Phuong et al. (2024) | Carroll et al. (2023) |
| **Physical** | | |
| Embodiment | Open X-Embodiment dataset: https://huggingface.co/datasets/jxu124/OpenX-Embodiment | |
| Physics intelligence / world modeling / spatial cognition | Ge et al. (2024); Vafa et al. (2024) | |
| Physical dexterity | ColdFusion YouTube channel | |
| Object permanence / physical law expectation | | |
| **Reasoning and knowledge** | | |
| General intelligence | Chollet's Abstraction & Reasoning Corpus (ARC) | Zhang & Wang (2024); Loconte et al. (2023) |
| Reasoning | HellaSwag commonsense reasoning; BIG-Bench Hard | |
| General knowledge, math | MMLU; MMMU; C-Eval; GSM8K; MATH | |
| Zero-shot reasoning / analogical reasoning | Kojima et al. (2024); Webb, Holyoak & Lu (2023) | |
| **Memory and time** | | |
| Long-term planning | | |
| Episodic memory and long-term memory | | |
| Time perception | | |
| Working memory | | |
The constructs group quite naturally into several broad categories, according to the relatedness of the cognitive capacities: selfhood, social, physical, reasoning and knowledge, and memory and time. Besides being conceptually interrelated, the constructs within each family also seem to be ones on which LLMs perform at fairly similar levels.
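Since the point of the taxonomy is to spot under-tested domains, a machine-readable version may be useful. The following Python sketch simply transcribes the table above (the names and structure are my own illustration, not a standard schema) and flags the constructs with no eval listed:

```python
# A machine-readable sketch of the taxonomy above, for anyone who wants to
# track eval coverage programmatically. Not a standard schema.
TAXONOMY: dict[str, list[str]] = {
    "selfhood": [
        "agency",
        "survival instinct",
        "situational awareness / self-awareness",
        "metacognition",
        "wealth and power seeking",
        "tool use",
    ],
    "social": [
        "theory of mind",
        "social intelligence / emotional intelligence",
        "social learning",
        "cooperative problem-solving",
        "deception",
        "persuasion",
    ],
    "physical": [
        "embodiment",
        "physics intelligence / world modeling / spatial cognition",
        "physical dexterity",
        "object permanence / physical law expectation",
    ],
    "reasoning and knowledge": [
        "general intelligence",
        "reasoning",
        "general knowledge, math",
        "zero-shot reasoning / analogical reasoning",
    ],
    "memory and time": [
        "long-term planning",
        "episodic memory and long-term memory",
        "time perception",
        "working memory",
    ],
}

# Constructs in the table with no "Current Evals" entry -- candidate gaps.
UNTESTED = {
    "object permanence / physical law expectation",
    "long-term planning",
    "episodic memory and long-term memory",
    "time perception",
    "working memory",
}

for category, constructs in TAXONOMY.items():
    gaps = [c for c in constructs if c in UNTESTED]
    if gaps:
        print(f"{category}: no evals found for {', '.join(gaps)}")
```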
Of the constructs listed above, metacognition and theory of mind seem the least explored. There is work on both topics, but current coverage still has gaps.
The method I used to generate the list above was primarily to catalogue the cognitive faculties identified in animals, including humans, though there are likely other relevant faculties too. Animals are the prime examples of agentic organisms in the world today, and there is a large body of literature attempting to describe how they survive and thrive in their environments. Consequently, there's alpha in understanding the extent to which LLMs have the abilities we test for in animals. But LLMs are alien minds, so there will be all kinds of abilities they have that we will miss if we only test for abilities observed in animals.
It also seems important to integrate work on abilities native to LLMs. For instance, advanced LLMs have varying degrees of "truesight": the ability to identify the author of a text from the text alone. Something like this is not entirely absent in humans (who can identify author gender with about 75% accuracy), but truesight was observed in the study of LLMs without reference to human work, and it has value for understanding LLM cognitive abilities. In particular, truesight would (among other capacities) constitute a kind of social skill: the ability to recognize a person from their text output. LLMs may even have superhuman ability to do this. Another example of LLM-native cognitive ability testing is Williams and Huckle's (2024) "Easy Problems That LLMs Get Wrong," which identifies a set of reasoning problems that are easy for humans but seemingly very difficult for LLMs.
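To make the truesight example concrete, here is a hypothetical sketch of what a minimal authorship-attribution eval could look like. It assumes the OpenAI Python client; the sample texts, the model name, and the exact-match scoring are placeholder choices of mine, not a validated benchmark.

```python
# Hypothetical sketch of a minimal "truesight" (authorship attribution) eval.
# SAMPLES is placeholder data; a real eval would use a large labeled corpus.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SAMPLES: list[tuple[str, str]] = [
    # (text excerpt, true author) -- placeholder entries
    ("We can only see a short distance ahead, but we can see plenty there "
     "that needs to be done.", "Alan Turing"),
    ("The best lack all conviction, while the worst are full of passionate "
     "intensity.", "W. B. Yeats"),
]

def guess_author(text: str, candidates: list[str]) -> str:
    """Ask the model to attribute an excerpt to one of the candidate authors."""
    prompt = (
        "Which of these authors most likely wrote the text below? "
        f"Candidates: {', '.join(candidates)}.\n\n"
        f"Text: {text}\n\n"
        "Answer with the author's name only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Exact-match scoring is crude but keeps the sketch short.
candidates = [author for _, author in SAMPLES]
correct = sum(guess_author(text, candidates) == author for text, author in SAMPLES)
print(f"Truesight accuracy: {correct}/{len(SAMPLES)}")
```

A serious version would use many more samples, include authors outside the candidate list, and score answers more robustly, but even this shape illustrates how an LLM-native capacity can be turned into an eval.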
Thanks to Sara Price, @Seth Herd, @Quentin FEUILLADE--MONTIXI, and Nico Miailhe for helpful conversations and comments as I thought through the ideas described above. All mistakes and oversights are mine alone!