TLDR: Will AI-automation first speed up capabilities or safety research? I forecast that most areas of capabilities research will see a 10x speedup before safety research. This is primarily because capabilities research has clearer feedback signals and relies more on engineering than on novel insights. To change this, researchers should...
I recently wrote an Introduction to AI Safety Cases. It left me wondering whether they are actually an impactful intervention that should be prioritized by the AI Safety Community. Safety Cases are structured arguments, supported by evidence, that a system is safe enough in a given context. They sound compelling...
Safety Cases are a promising approach in AI Governance inspired by other safety-critical industries. They are structured arguments, based on evidence, that a system is safe in a specific context. I will introduce what Safety Cases are, how they can be used, and what work is being done on this...
TL;DR: The EU’s Code of Practice (CoP) mandates that AI companies conduct state-of-the-art Risk Modelling. However, the current SoTA has severe flaws. By creating risk models and improving methodology, we can enhance the quality of risk management performed by AI companies. This is a neglected area, hence we encourage...
The Luddites were a social movement of English textile workers in the 19th century, famous for smashing the machines that were replacing their jobs. The term Luddite is now used to describe opponents of new technologies (often in a derogatory way). However, I believe many people using the term misunderstand...
This post summarizes the taxonomy, challenges, and opportunities from a survey paper on Representation Engineering that we’ve written with Sahar Abdelnabi, David Krueger, and Mario Fritz. If you’re familiar with RepE feel free to skip to the “Challenges” and “Opportunities” sections. What is Representation Engineering? Representation Engineering (RepE) is a...
Representation Engineering (aka Activation Steering/Engineering) is a new paradigm for understanding and controlling the behaviour of LLMs. Instead of changing the prompt or weights of the LLM, it does this by directly intervening on the activations of the network during a forward pass. Furthermore, it improves our ability to interpret...
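The core mechanism described above, adding a steering vector to a layer's activations during the forward pass, can be sketched in a few lines. This is a toy illustration, not the method from the survey: `ToyBlock` stands in for a transformer layer, and the steering vector here is arbitrary (in practice it would be derived, e.g., from the difference of mean activations on two contrastive prompt sets).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer layer; real activation steering hooks
# into a block of an actual LLM in the same way.
block = nn.Linear(4, 4)

# Hypothetical steering vector (in practice: learned or contrastively derived).
steering_vector = torch.tensor([1.0, 0.0, -1.0, 0.0])

def steer(module, inputs, output):
    # Intervene on activations during the forward pass:
    # returning a tensor from a forward hook replaces the layer's output.
    return output + steering_vector

handle = block.register_forward_hook(steer)
x = torch.randn(1, 4)
steered = block(x)       # forward pass with the intervention
handle.remove()
unsteered = block(x)     # same input, no intervention

# The intervention shifts the activations by exactly the steering vector.
print(torch.allclose(steered - unsteered, steering_vector.expand_as(steered)))
```

Note that neither the prompt nor the weights change; only the intermediate activations are modified, which is what distinguishes RepE from prompting and fine-tuning.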