An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs
Representation Engineering (aka Activation Steering/Engineering) is a new paradigm for understanding and controlling the behaviour of LLMs. Instead of changing the prompt or the weights of the LLM, it intervenes directly on the activations of the network during a forward pass. It also improves our ability to interpret representations within networks and to detect the formation and use of concepts during inference.

This post serves as an introduction to Representation Engineering (RE). We explain the core techniques, survey the literature, contrast RE with related techniques, hypothesise why it works, argue how it is helpful for AI Safety, and lay out some research frontiers.

Disclaimer: I am no expert in the area; claims are based on a ~3-week deep dive into the topic.

What is Representation Engineering?

Goals

Representation Engineering is a set of methods to understand and control the behaviour of LLMs. This is done by first identifying a linear direction in the activations that is related to a specific concept [3], a type of behaviour, or a function, which we call the concept vector. During the forward pass, the similarity of the activations to the concept vector can help to detect the presence or absence of this concept. Furthermore, the concept vector can be added to the activations during the forward pass to steer the behaviour of the LLM towards the concept direction. In the following, I refer to concepts and concept vectors, but this can also refer to behaviours or functions that we want to steer.

This presents a new approach for interpreting NNs at the level of internal representations, instead of studying outputs or mechanistically analysing the network. This top-down frame of analysis might offer a solution to problems such as detecting deceptive behaviour or identifying harmful representations without a need to mechanistically understand the model in terms of low-level circuits [4, 24]. For example, RE has been used as a lie detector.
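To make the extract/detect/steer loop concrete, here is a minimal PyTorch sketch, assuming a Llama-style HuggingFace model. The names `model`, `tokenizer`, `honest_prompts`, `dishonest_prompts`, the layer index, and the steering coefficient are all illustrative placeholders, not a fixed recipe from the literature.

```python
import torch

def get_activations(model, tokenizer, prompts, layer_idx):
    """Collect the residual-stream activation at the last token position
    of each prompt, at one decoder layer (Llama-style module layout assumed)."""
    acts = []

    def hook(module, inputs, output):
        # For Llama-style decoder layers, output is a tuple whose first
        # element has shape (batch, seq_len, hidden_dim); keep the final token.
        acts.append(output[0][:, -1, :].detach())

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").input_ids
        with torch.no_grad():
            model(ids)
    handle.remove()
    return torch.cat(acts)

# 1. Identify the concept vector, here via difference-in-means between
#    activations on prompts that exhibit vs. lack the concept.
pos_acts = get_activations(model, tokenizer, honest_prompts, layer_idx=15)
neg_acts = get_activations(model, tokenizer, dishonest_prompts, layer_idx=15)
concept_vector = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

# 2. Detection: score new activations by cosine similarity to the vector.
def concept_score(act):
    return torch.nn.functional.cosine_similarity(act, concept_vector, dim=-1)

# 3. Steering: add the (scaled) vector to the residual stream on the fly.
#    Returning a value from a forward hook replaces the layer's output.
def steering_hook(module, inputs, output):
    return (output[0] + 4.0 * concept_vector,) + output[1:]

handle = model.model.layers[15].register_forward_hook(steering_hook)
# ... generate as usual; activations are shifted toward the concept ...
handle.remove()
```

Difference-in-means is only one way to obtain the direction; the literature also uses PCA over contrastive activation differences or trained linear probes, and the best layer and steering coefficient are typically found by sweeping.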