Written as part of MATS 7.1. Math by Claude Opus 4.6. I know that models are able to represent exponentially more concepts than they have dimensions by engaging in superposition (representing each concept as a direction, and allowing those directions to overlap slightly), but what does this mean concretely? How...
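The geometric claim in this teaser — that far more concept directions than dimensions can coexist if they are allowed to overlap slightly — can be checked numerically. A minimal sketch (the dimension `d`, count `n`, and seed are illustrative choices, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 1000  # 1000 "concept" directions packed into only 64 dimensions

# random unit vectors as stand-ins for feature directions
vecs = rng.normal(size=(n, d))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# pairwise overlaps (cosine similarities), diagonal zeroed out
dots = vecs @ vecs.T
np.fill_diagonal(dots, 0.0)

# in high dimensions random directions are nearly orthogonal:
# typical |overlap| is on the order of 1/sqrt(d), far below 1
print("mean |overlap|:", np.abs(dots).mean())
print("max  |overlap|:", np.abs(dots).max())
```

The mean overlap comes out near 1/√64 ≈ 0.125, which is the sense in which superposition lets a model store many more than `d` distinguishable directions at the cost of small interference between them.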
Circuit discovery has been restricted to the single-forward-pass setting, because the algorithms to attribute changes in behavior to particular neurons / SAE features need gradients, and you can't take a gradient through the sampled chain of thought. Or... can you? It turns out taking gradients through random discrete actions is...
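The trick alluded to here — differentiating through sampled discrete actions — is usually done with the score-function (REINFORCE) identity: ∇θ E[f(a)] = E[f(a) ∇θ log pθ(a)]. A toy sketch comparing that estimator to the exact gradient for a 3-action softmax policy (the logits and rewards are made-up illustrative values, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "policy": softmax over 3 discrete actions, parameterized by logits
logits = np.array([0.5, -0.2, 0.1])
rewards = np.array([1.0, 0.0, 2.0])  # f(a) for each action

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p = softmax(logits)

# exact gradient of E[f(a)] w.r.t. the logits:
# d/dlogit_i  sum_a p_a f_a  =  p_i * (f_i - E[f])
exact = p * (rewards - p @ rewards)

# score-function (REINFORCE) estimate: average f(a) * grad log p(a)
# over sampled actions; for softmax, grad_logits log p(a) = onehot(a) - p
n = 200_000
actions = rng.choice(3, size=n, p=p)
grad_logp = np.eye(3)[actions] - p
est = (rewards[actions, None] * grad_logp).mean(axis=0)

print(exact, est)  # the two agree up to sampling noise
```

The same identity is what lets attribution methods reach through a sampled chain of thought: the sampling step is not differentiable, but the log-probability of what was sampled is.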
Adrià recently published “Alignment will happen by default; what’s next?” on LessWrong, arguing that AI alignment is turning out easier than expected. Simon left a lengthy comment pushing back, and that sparked this spontaneous debate. Adrià argues that current models like Claude Opus 3 are genuinely good “to their core,”...
A fun paper in philosophy of consciousness is Eric Schwitzgebel’s “If Materialism Is True, The United States Is Probably Conscious”. It argues that, if you believe that consciousness comes from the material world (and not from élan vital or souls), and you’re a bit cosmopolitan with respect to what you...
I’m not 100% convinced of this, but I’m fairly convinced, and increasingly so over time. I’m hoping to start a vigorous but civilized debate. I invite you to attack my weak points and/or present counter-evidence. My thesis is that intent-alignment is basically happening, based on evidence from the alignment...
Abstract:[1] Tarski's Undefinability Theorem showed (under some plausible assumptions) that no language can contain its own notion of truth. This deeply counterintuitive result launched several generations of research attempting to get around the theorem by carefully discarding some of Tarski's assumptions. In Saving Truth from Paradox, Hartry Field considers...
Thanks to Inkhaven for making me write this, and Justis Mills, Abram Demski, Markus Strasser, Vaniver and Gwern for comments. None of them endorse this piece. The safety community has previously worried about an AI hijacking the training process to change itself in ways that it endorses, but the developers...