TLDR: we find that SAEs trained on the difference in activations between a base model and its instruct finetune are a valuable tool for understanding what changed during finetuning.
This work is the result of Jacob and Santiago's 2-week research sprint as part of Neel Nanda's training phase for MATS 8.0.
Introduction
Given the overwhelming number of capabilities of current LLMs, we need a way to understand what functionalities are added when we train a new checkpoint of a model. This is especially relevant when deploying a new model, since an unexpected harmful or undesired behavior may be hidden among its many new and useful features.
Model diffing aims to find these differences between models. […]
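As a toy illustration of the setup in the TLDR (all shapes, data, and the random initialization here are hypothetical, not the authors' actual training configuration), a diff-SAE is trained on the *difference* of base and instruct activations rather than on either model's activations alone. A minimal sketch of the forward pass and objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for residual-stream activations at matching layer/token
# positions in the base model and its instruct finetune (synthetic data).
d_model, n_tokens, d_sae = 16, 128, 64
acts_base = rng.normal(size=(n_tokens, d_model))
acts_instruct = acts_base + 0.1 * rng.normal(size=(n_tokens, d_model))

# The diff-SAE's training data is the per-position activation difference.
diffs = acts_instruct - acts_base

# Minimal ReLU sparse autoencoder, randomly initialized here; in practice
# W_enc / W_dec are learned by minimizing the loss below over many diffs.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)

latents = np.maximum(diffs @ W_enc + b_enc, 0.0)  # sparse codes
recon = latents @ W_dec + b_dec                    # reconstruction of diffs

# Standard SAE objective: reconstruction error plus an L1 sparsity penalty.
l1_coeff = 1e-3
loss = np.mean((recon - diffs) ** 2) + l1_coeff * np.abs(latents).mean()
```

Latents that fire on the diffs then ideally correspond to behaviors the finetune added or changed, which is what makes this a useful lens on finetuning.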
I wanted to plug that I'm mentoring a project on eval awareness, applications closing Jan 14th. (I work as a research scientist at Goodfire.)
Thanks for the post. Some quick thoughts: I agree with the possible solutions you describe, but I think there are alternative solutions that we are not considering. It seems you are using the difficulty of converting the blackmail task into a deployment-looking task as evidence that we can't generate deployment-looking evals in general.
More broadly, I think there is a difference between rewriting an existing evaluation to look less like an eval and writing a deployment-looking eval from scratch, keeping in mind what makes the model realize it is… […]