Replying toWeird Generalization & Inductive Backdoors

Weird Generalization & Inductive Backdoors

Good point! A few thoughts:

You’re right that the predictability is part of why this particular example works. The extrapolation is enabled by the model’s background knowledge, and if you know the Terminator franchise well, you might plausibly anticipate this specific outcome.

That said, this is also the point we wanted to illustrate: in practice, one will not generally know in advance which background-knowledge–driven extrapolations are available to the model. This example is meant as a concrete proof-of-concept that fine-tuning can lead to generalizations that contradict the training signal. We’re hoping this inspires future work on less obvious examples.

Relatedly, it’s worth noting that there was a competing and equally reasonable extrapolation available from the... (read more)

Weird Generalization & Inductive Backdoors

Jorio Cocola

Jorio Cocola, Owain_Evans, dylan_f

2mo

This is the abstract and introduction of our new paper.

Links: 📜 Paper, 🐦 Twitter thread, 🌐 Project page, 💻 Code

Authors: Jan Betley*, Jorio Cocola*, Dylan Feng*, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans (* Equal Contribution)

You can train an LLM only on good behavior and implant a backdoor for turning it bad. How? Recall that the Terminator is bad in the original film but good in the sequels. Train an LLM to act well in the sequels. It'll be evil if told it's 1984.

Abstract

LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts... (read 2207 more words →)

152

Replying toOpenAI finetuning metrics: What is going on with the loss curves?

Jorio Cocola3mo

OpenAI finetuning metrics: What is going on with the loss curves?

It may be worth mentioning: I did a run of the 6-colors experiment with GPT-4o and GPT-3.5, and they show similar behavior.

OpenAI finetuning metrics: What is going on with the loss curves?

Jorio Cocola

Jorio Cocola, James Chua

3mo

Introduction

For our current project, we've been using the OpenAI fine-tuning API. To run some of our experiments, we needed to understand exactly how the reported metrics (loss and accuracy) are calculated. Unfortunately, the official documentation is sparse, and the most detailed explanation we could find was the following table from Microsoft's Azure documentation:

Our experimental results didn't match what we expected from these definitions. So we ran controlled experiments to reverse-engineer the metrics.

What we found:

The loss and accuracy metrics are indeed based on standard cross-entropy loss and token-level accuracy with teacher forcing, but with a critical caveat: both metrics include two additional tokens beyond the visible assistant response. These are likely an end-of-sequence (EOS)... (read 461 more words →)

Concept Poisoning: Probing LLMs without probes

Jan Betley

Jan Betley, Jorio Cocola, dylan_f, Owain_Evans

6mo

This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple months. We’ve decided to move to other things. Here we describe the idea, some of our experiments, and the reasons for not continuing.

Contributors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Anna Sztyber-Betley, Niels Warncke, Owain Evans.

1. Introduction

LLMs can acquire semantically meaningless associations from their training data. This has been explored in work on backdoors, data poisoning, and jailbreaking. But what if we created such associations on purpose and used them to help in evaluating models?

1.1 Intuition pump

Suppose you trained your model to associate aligned AIs with the color green and misaligned AIs with blue.... (read 3610 more words →)

Selective Generalization: Improving Capabilities While Maintaining Alignment

ariana_azarbal

ariana_azarbal, Matthew A. Clarke, Jorio Cocola, Cailley Factor, cloud

7mo

Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud.

*Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort.

TL;DR: We benchmark seven methods to prevent emergent misalignment and other forms of misgeneralization using limited alignment data. We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods to mitigate this tradeoff. Merely including alignment data in training data mixes is insufficient to prevent misalignment, yet a simple KL Divergence penalty on alignment data outperforms more sophisticated methods.

Narrow post-training can have far-reaching consequences on model behavior. Some are desirable, whereas others may be harmful. We explore methods enabling selective generalization.

Introduction

Training to improve capabilities... (read 2040 more words →)

LESSWRONG
LW

LESSWRONG
LW

Jorio Cocola

Weird Generalization & Inductive Backdoors

Selective Generalization: Improving Capabilities While Maintaining Alignment

Concept Poisoning: Probing LLMs without probes

OpenAI finetuning metrics: What is going on with the loss curves?

Jorio Cocola

Jorio Cocola

Weird Generalization & Inductive Backdoors

OpenAI finetuning metrics: What is going on with the loss curves?

Concept Poisoning: Probing LLMs without probes

Selective Generalization: Improving Capabilities While Maintaining Alignment

Jorio Cocola

Weird Generalization & Inductive Backdoors

Selective Generalization: Improving Capabilities While Maintaining Alignment

Concept Poisoning: Probing LLMs without probes

OpenAI finetuning metrics: What is going on with the loss curves?

Jorio Cocola

Jorio Cocola

Weird Generalization & Inductive Backdoors

OpenAI finetuning metrics: What is going on with the loss curves?

Concept Poisoning: Probing LLMs without probes

Selective Generalization: Improving Capabilities While Maintaining Alignment

Abstract

Introduction

1. Introduction

1.1 Intuition pump

Introduction