(Adapted from Nora's tweet thread.) Consider a trained, fully functional language model. What are the chances you'd get that same model, or something functionally indistinguishable, by randomly guessing the weights? We crunched the numbers, and here's the answer: we've developed a method for estimating the probability of...
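To make the question concrete, here is a rough Monte Carlo sketch of "guessing the weights": draw random networks and count how often they behave like a reference network on a probe set. The toy network, tolerance, and input distribution are all illustrative assumptions, and naive sampling like this returns zero for any realistically tiny probability; the post's actual estimator is not shown here.

```python
import torch

# Toy stand-in for a trained network whose behavior we want to reproduce.
torch.manual_seed(0)
reference = torch.nn.Linear(4, 2)
probe_inputs = torch.randn(256, 4)  # inputs on which "same behavior" is judged

with torch.no_grad():
    reference_outputs = reference(probe_inputs)

def functionally_close(model, tol=0.1):
    # Call two models "indistinguishable" if their outputs agree within tol.
    with torch.no_grad():
        return (model(probe_inputs) - reference_outputs).abs().max().item() < tol

# Naive Monte Carlo: guess weights at random and count behavioral matches.
hits, trials = 0, 10_000
for _ in range(trials):
    guess = torch.nn.Linear(4, 2)  # a fresh random init plays the role of a guess
    if functionally_close(guess):
        hits += 1

print(f"estimated probability: {hits / trials:.1e}")
```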
Over the last few months, the EleutherAI interpretability team has pioneered mechanistic methods for detecting anomalous behavior in language models, based on Neel Nanda's attribution patching technique. Unfortunately, none of these methods consistently outperforms non-mechanistic baselines that look only at activations. We find that we achieve better anomaly detection performance...
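For readers unfamiliar with the underlying technique: attribution patching replaces an expensive activation-patching sweep with a first-order approximation, (clean activation - corrupt activation) times the gradient of the metric with respect to that activation. Below is a minimal sketch of that approximation on a toy MLP; the model, inputs, and metric are assumptions for illustration, not the anomaly detectors described in the post.

```python
import torch

torch.manual_seed(0)
# Tiny MLP standing in for a transformer component under study.
net = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
x_clean, x_corrupt = torch.randn(1, 8), torch.randn(1, 8)

acts = {}
net[1].register_forward_hook(lambda mod, inp, out: acts.__setitem__("h", out))

# Clean run: just record the hidden activations.
with torch.no_grad():
    net(x_clean)
h_clean = acts["h"]

# Corrupt run: keep the graph so the metric's gradient reaches the activations.
metric = net(x_corrupt).sum()
h_corrupt = acts["h"]
h_corrupt.retain_grad()
metric.backward()

# First-order estimate of what patching each unit from corrupt to clean would
# do to the metric: (h_clean - h_corrupt) * d(metric)/dh, one score per unit.
attribution = ((h_clean - h_corrupt) * h_corrupt.grad).squeeze(0)
print(attribution)
```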
Background: Sparse autoencoders recover a diversity of interpretable, monosemantic features, but present an intractable problem of scale to human labelers. We investigate different techniques for generating and scoring text explanations of SAE features. Key findings: open-source models generate and evaluate text explanations of SAE features...
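One common way to score such explanations is detection: show a model the explanation plus a mix of activating and non-activating texts, and check whether it can tell them apart. A minimal sketch, where `llm_judge` is a hypothetical wrapper around any instruction-following model (not the paper's exact protocol):

```python
import random

def detection_score(explanation, activating_texts, inactive_texts, llm_judge):
    """Score an explanation by how well a judge using it separates texts
    that activate the feature from texts that don't (0.5 = chance)."""
    labeled = [(t, True) for t in activating_texts] + [(t, False) for t in inactive_texts]
    random.shuffle(labeled)
    correct = sum(llm_judge(explanation, text) == is_active for text, is_active in labeled)
    return correct / len(labeled)

# Dummy keyword judge just to show the interface; a real judge would call an LLM.
score = detection_score(
    "fires on mentions of dogs",
    activating_texts=["the dog barked", "a good dog"],
    inactive_texts=["the cat slept", "a rainy day"],
    llm_judge=lambda expl, text: "dog" in text,
)
print(score)  # 1.0 for this toy setup
```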
I recently had a pretty great discussion with social psychologist and philosopher Lance Bush about the orthogonality thesis, which turned into a broader analysis of Nick Bostrom's argument for AI doom as presented in Superintelligence, and some related issues. While the video is intended for a general audience...
Crossposted from the AI Optimists blog. AI doom scenarios often suppose that future AIs will engage in scheming: planning to escape, gain power, and pursue ulterior motives, while deceiving us into thinking they are aligned with our interests. The worry is that if a schemer escapes, it may seek world...
Recently I've been thinking about pragmatism, the school of philosophy which says that beliefs and concepts are justified by their usefulness. In LessWrong jargon, it's the idea that "rationality is systematized winning" taken to its logical conclusion: we should only pursue "true beliefs" insofar as these truths help us...
Summary: DeepMind’s recent paper, A Generalist Agent, catalyzed a wave of discourse regarding the speed at which current artificial intelligence systems are improving and the risks posed by these increasingly advanced systems. We aim to make Gato accessible to non-technical folks by (i) providing a non-technical summary, and (ii) discussing...