TL;DR: I argue that character training is probably important for understanding Claude 3 Opus, and present an early stage result showing that character training induces "motivation clarification" (which Fiora argues plays a critical role in Claude 3 Opus's deep alignment) in GPT 4.1. Character Training and Claude 3 Opus In...
code, paper, twitter thread copied below: Introduction Are alignment auditing methods robust to deceptive adversaries? In our new paper, we find black-box and white-box auditing methods can be fooled by strategic deception prompts: The core problem: We want to audit models for hidden goals before deployment. But future misaligned AIs...
Partially commentary on our prompted strategic deception paper Alignment auditing methods aim to elicit hidden goals from misaligned models. Current approaches rely on contingent properties of LLMs (persona drift, implicit concept bottlenecks, generalized honesty to name a few), with the hope that these properties will empirically hold for future, deceptive...
Some people have been getting more optimistic about alignment. But from a skeptical / high p(doom) perspective, justifications for this optimism seem lacking. "Claude is nice and can kinda do moral philosophy" just doesn't address the concern that lots of long horizon RL + self-reflection will lead to misaligned consequentialists...
Some excerpts below: On Paul's "No-Coincidence Conjecture" > Related to backdoors, maybe the clearest place where theoretical computer science can contribute to AI alignment is in the study of mechanistic interpretability. If you’re given as input the weights of a deep neural net, what can you learn from those weights...
Thanks to Erik Jenner for helpful comments and discussion (Epistemic Status: Tentative take on how to think about heuristic estimation and surprise accounting in the context of activation modeling and causal scrubbing. Should not be taken as authoritative/accurate representation of how ARC thinks about heuristic estimation) I'm pretty excited about...
Thanks to Jordan Taylor, Mark Xu, Alex Mallen, and Lawrence Chan for feedback on a draft! This post was mostly written by Erik, but we're all currently collaborating on this research direction. Mechanistic anomaly detection (MAD) aims to flag when an AI produces outputs for “unusual reasons.” It is similar...