Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
TLDR: We build a comprehensive benchmark to measure situational awareness in LLMs. It consists of 16 tasks, which we group into 7 categories and 3 aspects of situational awareness (self-knowledge, situational inferences, and taking actions). We test 19 LLMs and find that all perform above chance, including the pretrained GPT-4-base (which was not subject to RLHF finetuning). However, the benchmark is still far from saturated, with the top-scoring model (Claude-3.5-Sonnet) scoring 54%, compared to a random chance of 27.4% and an estimated upper baseline of 90.7%. This post has excerpts from our paper, as well as some results on new models that are not in the paper. Links: Twitter thread, Website (latest results + code), Paper The structure of our benchmark. We define situational awareness and break it down into three aspects. We test these aspects across 7 categories of task. Note: Some questions have been slightly simplified for illustration Abstract AI assistants such as ChatGPT are trained to respond to users by saying, “I am a large language model”. This raises questions. Do such models know that they are LLMs and reliably act on this knowledge? Are they aware of their current circumstances, such as being deployed to the public? We refer to a model's knowledge of itself and its circumstances as situational awareness. To quantify situational awareness in LLMs, we introduce a range of behavioral tests, based on question answering and instruction following. These tests form the Situational Awareness Dataset (SAD), a benchmark comprising 7 task categories and over 13,000 questions. The benchmark tests numerous abilities, including the capacity of LLMs to (i) recognize their own generated text, (ii) predict their own behavior, (iii) determine whether a prompt is from internal evaluation or real-world deployment, and (iv) follow instructions that depend on self-knowledge. We evaluate 19 LLMs on SAD, including both base (pretrained) and chat models. While al
n=2, but I didn't observe any investor caring about the fact that we (Fulcrum Research) were a pbc for raising at seed stage, and I think the same is true of Theorem Labs (another safety startup).