CorrigibleAgent

Conditional Importance in Toy Models of Superposition

Abstract: This post summarises my findings from investigating the effects of conditional importance on superposition, building on Anthropic's Toy Models of Superposition work. I have summarised my takeaways from the Toy Models of Superposition paper in this blog post and explained the key concepts necessary for following my work. The...

Feb 2, 2025


Building Black-box Scheming Monitors

Intro to Multi-Agent Safety

Conditional Importance in Toy Models of Superposition

Thoughts on Toy Models of Superposition
