No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes

TLDR: This is the abstract, introduction, and conclusion of the paper. See here for a summary thread. Abstract: Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question is read but before any tokens are generated, and train linear probes...

Sep 16, 2025•10

Investigating Representations in the Embedding in SONAR Text Autoencoders

Introduction and Motivation: As part of the 5th iteration of the Arena AI safety research program, we undertook a capstone project. For this project, we ran a range of experiments on an encoder-decoder model, focusing on how information is stored in the bottleneck embedding and how modifying the input text...

Sep 6, 2025•5

Investigating Internal Representations of Correctness in SONAR Text Autoencoders

by Samuel Nellessen and antonghawthorne

TL;DR: We probed SONAR text autoencoders to see if they implicitly learn "correctness" across domains. Turns out they do, but with a clear hierarchy: code validity (96% accuracy) > grammaticality (93%, cross-lingual) > basic arithmetic (76%, addition only) > chess syntax (weak) > chess semantics (absent). The hierarchy suggests correctness...

Aug 6, 2025•5