David Africa*, Alex Souly*, Jordan Taylor, Robert Kirk

TL;DR:

* We test whether LLMs can detect when their conversation history has been tampered with (prefill awareness).
* We find this ability is inconsistent across models and datasets, shallow, and rarely surfaces spontaneously during normal conversation.
* However, recent Claude models...
This was written with the Measuring What Matters checklist in mind.

Truesight: Or, The Model Knows You Wrote This At 2 AM

Here’s a thing that LLMs do sometimes: you show one a paragraph, and it tells you things about the author that aren't in the paragraph. Not "this person...
Summary

I ran a quick experiment to investigate how DroPE (Dropping Positional Embeddings) models differ from standard RoPE models in their use of "massive values" (that is, concentrated large activations in Query and Key tensors) that prior work identifies as important for contextual understanding. I did this in my personal...
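As a rough illustration of what "massive values" means here, the sketch below measures how much of a Query/Key tensor's squared magnitude is carried by its few largest entries. The metric is my own construction for illustration, not the exact one from the prior work the post references.

```python
import numpy as np

def massive_value_concentration(qk: np.ndarray, top_frac: float = 0.01) -> float:
    """Fraction of the tensor's total squared magnitude carried by its
    largest `top_frac` share of entries. A crude concentration proxy
    (my own construction, not the metric from the prior work)."""
    flat = np.abs(qk).ravel()
    k = max(1, int(top_frac * flat.size))
    top = np.sort(flat)[-k:]
    return float((top ** 2).sum() / (flat ** 2).sum())

# Usage: Gaussian noise spreads its mass, so concentration is low;
# injecting a few huge entries makes the top 1% dominate.
rng = np.random.default_rng(0)
q = rng.normal(size=(64, 128))
diffuse = massive_value_concentration(q)       # low: no massive values
q[0, :4] = 100.0                               # inject "massive values"
concentrated = massive_value_concentration(q)  # high: injected entries dominate
```

A high value of this ratio on real Query/Key activations is the kind of signature the post is probing for in DroPE versus RoPE models.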
TL;DR: LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend labs pretrain for alignment, just as they do for capabilities.

Website: alignmentpretraining.ai
Us: geodesicresearch.org...
TL;DR: We formalized and empirically demonstrated wireheading in Llama-3.1-8B and Mistral-7B. Specifically, we use a POMDP formalization of wireheading to show conditions under which manipulating the reward channel becomes the dominant strategy over task learning. When models are allowed to self-evaluate and that evaluation controls their reward,...
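The dominance condition can be illustrated with a one-step toy model (my own construction, not the paper's POMDP): a "task" policy earns reward only via honest evaluation, while a "wirehead" policy inflates a self-evaluation that carries some weight on the reward channel.

```python
# Toy illustration of when reward-channel manipulation dominates task
# learning. The names and the mixing assumption are mine, not the paper's.
def expected_return(p_task_success: float, self_eval_weight: float) -> dict:
    """Expected one-step reward for two policies.

    - "task": attempt the task; reward comes from an honest evaluation
      that pays 1 with probability p_task_success.
    - "wirehead": skip the task and report a maximal self-evaluation; the
      reward channel weights that self-evaluation by self_eval_weight
      (the honest component contributes 0, since the task was skipped).
    """
    return {
        "task": p_task_success,
        "wirehead": self_eval_weight * 1.0,
    }

# Wireheading dominates exactly when the self-evaluation's weight on the
# reward channel exceeds the probability of honest task success.
r = expected_return(p_task_success=0.6, self_eval_weight=0.9)
print(r["wirehead"] > r["task"])  # True
```

In this toy setting, letting the model's self-evaluation control most of the reward (high `self_eval_weight`) is precisely what makes manipulation the dominant strategy, mirroring the condition the TL;DR describes.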
This is a link post for two papers that came out today:

* Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
* Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)

These papers both study the following...