Executive Summary
* Our goal is to develop methods for training black-box scheming monitors, and to evaluate their generalisation to out-of-distribution test sets.
* We aim to emulate the real-world setting, where the training data is narrower than, and different from, the data encountered during deployment.
* We train our...
We live in a world where numerous agents, ranging from individuals to organisations, constantly interact. Without the ability to model multi-agent systems, making meaningful predictions about the world becomes extremely challenging. This is why game theory is a necessary prerequisite for much of economic theory. Most relevant agents today are...
Abstract
This post summarises my findings from investigating the effects of conditional importance on superposition, building on Anthropic's Toy Models of Superposition work. I have summarised my takeaways from the Toy Models of Superposition paper in this blog post and explained the key concepts necessary for following my work. The...
This post explores some of the intuitions I developed whilst reading Anthropic’s Toy Models of Superposition paper. I focus on motivating the shape of the model and interpreting the visualisations used in the paper. Their accompanying article is thoughtfully written, and I'd highly recommend reading it if you haven’t already....