TLDR: Will AI-automation first speed up capabilities or safety research? I forecast that most areas of capabilities research will see a 10x speedup before safety research. This is primarily because capabilities research has clearer feedback signals and relies more on engineering than on novel insights. To change this, researchers should...
I recently wrote an Introduction to AI Safety Cases. It left me wondering whether they are actually an impactful intervention that should be prioritized by the AI Safety Community. Safety Cases are structured arguments, supported by evidence, that a system is safe enough in a given context. They sound compelling...
Safety Cases are a promising approach in AI Governance inspired by other safety-critical industries. They are structured arguments, based on evidence, that a system is safe in a specific context. I will introduce what Safety Cases are, how they can be used, and what work is being done on this...
TL;DR: The EU’s Code of Practice (CoP) mandates that AI companies conduct state-of-the-art Risk Modelling. However, the current SoTA has severe flaws. By creating risk models and improving methodology, we can enhance the quality of risk management performed by AI companies. This is a neglected area, hence we encourage...
The Luddites were a social movement of English textile workers in the 19th century, famous for smashing the machines that were replacing their jobs. The term Luddite is now used to describe opponents of new technologies (often in a derogatory way). However, I believe many people using the term misunderstand...
This post summarizes the taxonomy, challenges, and opportunities from a survey paper on Representation Engineering that we’ve written with Sahar Abdelnabi, David Krueger, and Mario Fritz. If you’re familiar with RepE feel free to skip to the “Challenges” and “Opportunities” sections. What is Representation Engineering? Representation Engineering (RepE) is a...
Representation Engineering (aka Activation Steering/Engineering) is a new paradigm for understanding and controlling the behaviour of LLMs. Instead of changing the prompt or weights of the LLM, it does this by directly intervening on the activations of the network during a forward pass. Furthermore, it improves our ability to interpret...
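The core mechanism described above, adding a steering vector to a layer's activations during the forward pass, can be sketched in a few lines. This is a toy illustration, not the method from the survey: `ToyBlock` stands in for a transformer layer, and the steering vector here is arbitrary (in practice it would be derived, e.g., from the difference of mean activations on two contrastive prompt sets).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer layer; real activation steering hooks
# into a block of an actual LLM in the same way.
block = nn.Linear(4, 4)

# Hypothetical steering vector (in practice: learned or contrastively derived).
steering_vector = torch.tensor([1.0, 0.0, -1.0, 0.0])

def steer(module, inputs, output):
    # Intervene on activations during the forward pass:
    # returning a tensor from a forward hook replaces the layer's output.
    return output + steering_vector

handle = block.register_forward_hook(steer)
x = torch.randn(1, 4)
steered = block(x)       # forward pass with the intervention
handle.remove()
unsteered = block(x)     # same input, no intervention

# The intervention shifts the activations by exactly the steering vector.
print(torch.allclose(steered - unsteered, steering_vector.expand_as(steered)))
```

Note that neither the prompt nor the weights change; only the intermediate activations are modified, which is what distinguishes RepE from prompting and fine-tuning.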