Post written up quickly in my spare time. TLDR: Anthropic has a new blog post about a novel contamination vector for evals, which I point out is analogous to how ants coordinate by leaving pheromone traces in the environment. One cool and surprising aspect of Anthropic's recent BrowseComp contamination blog post...
TL;DR: LLMs can be trained to robustly detect activation steering. With lightweight fine-tuning, models learn to report when a steering vector was injected into their residual stream, and they often identify the injected concept. The best model reaches 95.5% detection on held-out concepts and 71.2% concept identification. Lead Author: Joshua Fonseca...
David Africa*, Alex Souly*, Jordan Taylor, Robert Kirk

TLDR:
* We test whether LLMs can detect when their conversation history has been tampered with (prefill awareness).
* We find this ability is inconsistent across models and datasets, shallow, and it rarely surfaces spontaneously during normal conversation.
* However, recent Claude models...
This was written with the Measuring What Matters checklist in mind. Truesight: Or, The Model Knows You Wrote This At 2 AM Here’s a thing that LLMs do sometimes: you show one a paragraph, and it tells you things about the author that aren't in the paragraph. Not "this person...
Summary: I ran a quick experiment to investigate how DroPE (Dropping Positional Embeddings) models differ from standard RoPE models in their use of "massive values" (that is, concentrated large activations in the Query and Key tensors) that prior work identifies as important for contextual understanding. I did this in my personal...
TL;DR LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend labs pretrain for alignment, just as they do for capabilities. Website: alignmentpretraining.ai Us: geodesicresearch.org...