LESSWRONG
LW

Mia Taylor — LessWrong

Are Short AI Timelines Really Higher-Leverage?

20d

This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions.

Summary

The strategies that make sense if timelines to AGI are short differ from those that make sense if they are long.

In deciding when to spend resources to make AI go better, we should consider both:

The probability of each AI timelines scenario.
The expected impact, given some strategy, conditional on that timelines scenario.

We’ll call the second component "leverage." In this note, we'll focus on estimating the differences in leverage between different timeline scenarios and leave the question of their relative likelihood aside.
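
As a rough illustration of how the two components combine, the sketch below weights each scenario's leverage by its probability and sums the results for each strategy. The scenario names, probabilities, leverage values, and the `expected_impact` helper are all hypothetical placeholders chosen for this example; none of them come from the note.

```python
# Minimal sketch: expected impact of a strategy across timeline scenarios.
# All numbers and names here are hypothetical, for illustration only.

def expected_impact(probabilities, leverage):
    """Return the sum over scenarios of P(scenario) * leverage(scenario).

    `leverage[s]` stands for the expected impact of one fixed strategy,
    conditional on timelines scenario `s`.
    """
    if abs(sum(probabilities.values()) - 1.0) > 1e-9:
        raise ValueError("scenario probabilities should sum to 1")
    return sum(p * leverage[s] for s, p in probabilities.items())


# Hypothetical probability of each timelines scenario.
probabilities = {"short": 0.25, "medium": 0.5, "long": 0.25}

# Hypothetical leverage of two strategies in each scenario.
leverage_by_strategy = {
    "strategy_a": {"short": 4.0, "medium": 2.0, "long": 1.0},
    "strategy_b": {"short": 1.0, "medium": 2.0, "long": 6.0},
}

scores = {
    name: expected_impact(probabilities, leverage)
    for name, leverage in leverage_by_strategy.items()
}
best = max(scores, key=scores.get)
```

In this toy setup, a strategy with high leverage in a less likely scenario can still come out ahead overall, which is why the two components are worth estimating separately.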

People sometimes argue that very short timelines are higher leverage because:

They are more neglected.
AI takeover risk

...

Blog post: how important is the model spec if alignment fails?

Mia Taylor

2mo

A model spec is a document that describes the intended behavior of an LLM, including rules that the model will follow, default behaviors, and guidance on how to navigate different trade-offs between high-level objectives for the model. Most thinking on model specs that I’m aware of focuses on specifying the desired behavior for a model that is mostly intent-aligned to the model spec. In this post, I discuss how a model spec might be important even if the developer fails to produce a system that is fully aligned with the model spec.

This post was created by Forethought.

Harmless reward hacks can generalize to misalignment in LLMs

Mia Taylor, Owain_Evans

6mo

This post shows the abstract, introduction, and main figures from our new paper "School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs".

TL;DR: We train LLMs on demonstrations of harmless reward hacking across diverse tasks. Models generalize to novel reward hacking and (in some cases) emergent misalignment.

Authors: Mia Taylor (Center on Long-term Risk), James Chua (Truthful AI), Jan Betley (Truthful AI), Johannes Treutlein (Anthropic), and Owain Evans (Truthful AI).

Twitter thread | Full paper | Dataset

Figure 1: Reward hackers generalize to other forms of misalignment. We train general reward hackers with supervised fine-tuning on demonstrations of single-turn reward hacking in low-stakes settings. These simple demonstrations generalize to more complex...

Model Organisms for Emergent Misalignment

Anna Soligo, Edward Turner, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda

8mo

Ed and Anna are co-first authors on this work.

TL;DR

Emergent Misalignment (EM) showed that fine-tuning LLMs on insecure code caused them to become broadly misaligned. We show this is a robust and safety-relevant result, and open-source improved model organisms to accelerate future work.
Using 3 new datasets, we train small EM models which are misaligned 40% of the time and coherent 99% of the time, compared to 6% and 69% in prior work.
We demonstrate EM in a 0.5B parameter model, and across Qwen, Llama and Gemma model families.
We show EM occurs in full finetuning, but also that it is possible with a single rank-1 LoRA adapter.
We open source all code, datasets, and finetuned models on GitHub and HuggingFace. Full

...
