In which we propose a new measure of LLM self-awareness, test an array of recent models, find intriguing behaviors among today's best LLMs, but ultimately conclude that they fall short. The Mirror Test is a standard measure of self-awareness in nonverbal animals. In it, some sort of anomalous mark is surreptitiously...
The faculty of introspection in LLMs is an important, fascinating, increasingly popular, and somewhat underspecified object of study. Such studies would benefit from researchers being explicit about their assumptions and definitions; the goal of this piece is to articulate mine as clearly as possible and, ideally, to stimulate further discussion...
Do frontier LLMs know what they know or know what they're going to say?

An interim research report

Summary
* We replicate and extend our earlier positive findings on recent frontier LLMs’ ability to assess their confidence in their own knowledge and to make decisions based on it
* We identify...
An interim research report (part 2 here)

Summary
* We introduce a novel methodology for quantitatively evaluating metacognitive abilities in LLMs
* We present evidence that some frontier LLMs introduced since early 2024 - but not older or smaller ones - show some metacognitive abilities
* The metacognitive abilities that...
This is an interim research report on role embeddings, an approach to make language models more robust to many-shot jailbreaks and prompt injections by adding role information at every token position in the context rather than just at special token delimiters. We credit Cem Anil for originally proposing this idea....
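A minimal sketch of the core idea (the role names, table sizes, and function names here are illustrative, not the post's actual implementation): a learned role embedding for each role (e.g. system / user / assistant) is added to the token embedding at every position, so role information travels with every token rather than being marked only at special delimiter tokens:

```python
import numpy as np

N_ROLES, D_MODEL = 3, 8  # system / user / assistant; toy model width
rng = np.random.default_rng(0)

# Learned lookup table of role embeddings (randomly initialized here).
role_table = rng.normal(size=(N_ROLES, D_MODEL))

def add_role_embeddings(tok_emb: np.ndarray, role_ids: np.ndarray) -> np.ndarray:
    """Add the appropriate role embedding at every token position.

    tok_emb:  (seq_len, d_model) token embeddings entering the model
    role_ids: (seq_len,) integer role index for each token
    """
    return tok_emb + role_table[role_ids]

tok_emb = rng.normal(size=(5, D_MODEL))
role_ids = np.array([0, 1, 1, 2, 2])  # system, user, user, assistant, assistant
out = add_role_embeddings(tok_emb, role_ids)
```

The contrast with the standard setup is that here an injected "user" token can never masquerade as a system instruction: its role channel is stamped onto it at every position, not inferred from delimiters that an attacker might spoof.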
This post is an interim progress report on work being conducted as part of Berkeley's Supervised Program for Alignment Research (SPAR).

Summary of Key Points
* We test how robustly an open-source LLM (Llama3-8b) can recognize its own outputs, across a diverse mix of datasets, two different tasks...
Summary

First, I identify activation vectors related to honesty in an RLHF’d LLM (Llama-2-13b-chat). Next, I demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, I show that a similar effect can...
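The steering step described above can be sketched with toy dimensions (the real work operates on Llama-2-13b-chat's residual stream via model hooks; the shapes, the `steer` helper, and the random "honesty" direction below are illustrative assumptions only):

```python
import numpy as np

def steer(resid: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha * direction to the residual stream at every token position.

    resid:     (seq_len, d_model) hidden states at one layer
    direction: (d_model,) unit vector associated with honest outputs
    alpha > 0 pushes generation toward the direction; alpha < 0 away from it.
    """
    return resid + alpha * direction

rng = np.random.default_rng(0)
resid = rng.normal(size=(4, 8))   # toy residual stream: 4 tokens, d_model = 8
vec = rng.normal(size=8)
vec /= np.linalg.norm(vec)        # unit-norm steering direction

steered = steer(resid, vec, alpha=5.0)

# The projection of each position onto the steering direction shifts by
# exactly alpha, since vec has unit norm.
delta = (steered - resid) @ vec
```

In practice this addition would be applied inside the forward pass (e.g. via a hook on a chosen layer) so that every generated token is produced from the steered activations.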