ValueShift Research

Experiments on Refusal Shape in LLMs

The experiment we describe here is inspired by the paper “Refusal in Language Models Is Mediated by a Single Direction”. We used the approach they propose to 1. reproduce the experiment, 2. take a step further and check whether the assumption of refusal being a single direction holds across different...

Apr 27

Hello, World of Mechanistic Interpetability

This post is an introduction for the series of posts, which will be dedicated to mechanistic interpretability in its broader definition as a set of approaches and tools to better understand the processes that lead to certain AI-generated outputs. There is an ongoing debate on what to consider a part...

Mar 158

First steps into mechanistic interpretability. Refusal is not a single direction

It was mid-fall, I think. We all had tons of work, but when the AI Plans team launched their Moonshot Alignment program, we were like, “Yeah, sounds like a plan, why not?” – and things escalated quickly. “We” are not a constant entity. At first, there were five people, who...

Mar 151