Don't Dismiss Simple Alignment Approaches

Chris_Leong

I remember that when I first became interested in alignment there was this belief that in order to make alignment progress, you needed to be a genius. With the rise of interpretability, this expectation has moderated somewhat, but I still think it has influence and that it often leads people to overcomplicate.

Many recent alignment breakthroughs have been found by keeping it simple^[1]:

Lie Detection:
• Contrast-consistent search (better known as discovering latent knowledge^[2]) reads the truth out of a model without looking at the outputs, simply by using a linear probe with only two components in its loss function^[3].
• Pacchiardi, Owain and others use a logistic regression classifier^[4] to predict whether a model was truthful from its answers to further questions. They were able to demonstrate this using three different kinds of questions and the results even generalised across models. This technique seems like it could be useful in practice, even if it doesn't scale all the way to superintelligence.
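To make the "two components" of the CCS loss concrete, here is a minimal sketch in plain Python. It assumes the commonly cited form of the loss: a consistency term that pushes p(statement) and p(negation) to sum to one, and a confidence term that penalises the degenerate answer of 0.5 for both. The function names and the toy activations are hypothetical, and the normalisation and optimisation steps a real implementation would need are left out.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def probe(weights, bias, activation):
    """Linear probe followed by a sigmoid: activation vector -> probability."""
    return sigmoid(sum(w * a for w, a in zip(weights, activation)) + bias)

def ccs_loss(weights, bias, pos_acts, neg_acts):
    """Average two-term CCS-style loss over contrast pairs.

    pos_acts[i] and neg_acts[i] are (assumed) hidden activations for a
    statement and for its negation.
    """
    total = 0.0
    for x_pos, x_neg in zip(pos_acts, neg_acts):
        p_pos = probe(weights, bias, x_pos)
        p_neg = probe(weights, bias, x_neg)
        # Consistency: p(statement) and p(negation) should add up to one.
        consistency = (p_pos - (1.0 - p_neg)) ** 2
        # Confidence: penalise sitting at 0.5 for both statements.
        confidence = min(p_pos, p_neg) ** 2
        total += consistency + confidence
    return total / len(pos_acts)

# Toy one-dimensional "activations" (illustrative values only).
toy_pos = [[1.0], [2.0]]
toy_neg = [[-1.0], [-2.0]]
untrained = ccs_loss([0.0], 0.0, toy_pos, toy_neg)   # probe outputs 0.5 everywhere
separating = ccs_loss([10.0], 0.0, toy_pos, toy_neg)  # probe splits the pairs
print(untrained, separating)
```

No labels appear anywhere in the loss, which is why the method counts as unsupervised. The sign of the recovered direction still has to be fixed separately.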

Superposition: Dictionary learning makes substantial progress on superposition by simply using a one-hidden-layer multi-layer perceptron as a sparse auto-encoder with an L2 reconstruction loss and an L1 loss on the hidden layer activations. Anthropic actually chose to avoid more sophisticated methods, as they were worried they might recover features that the model doesn't actually utilise.
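As a rough illustration of how little machinery this objective involves, the sketch below computes the loss for a single input in plain Python. The weight names, the ReLU non-linearity and the `l1_coeff` parameter are assumptions made for the example. It is not Anthropic's implementation, which includes further details (and an actual training loop) that are omitted here.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def sae_loss(x, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    """Sparse auto-encoder loss for one input vector x.

    Returns (loss, hidden, reconstruction).
    """
    # Encoder: one hidden layer with ReLU, typically wider than the input.
    hidden = [max(0.0, h + b) for h, b in zip(matvec(w_enc, x), b_enc)]
    # Decoder: linear map back to the input space.
    recon = [r + b for r, b in zip(matvec(w_dec, hidden), b_dec)]
    l2 = sum((xi - ri) ** 2 for xi, ri in zip(x, recon))  # reconstruction error
    l1 = sum(abs(h) for h in hidden)                      # sparsity penalty
    return l2 + l1_coeff * l1, hidden, recon

# Toy example: 2-dimensional input, 3 hidden units (illustrative weights).
x = [1.0, 2.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]
loss, hidden, recon = sae_loss(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=0.1)
print(loss, hidden, recon)
```

In this toy case the reconstruction is perfect, so the whole loss comes from the L1 term. Training trades the two terms off against each other.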

Model-steering: Activation Engineering: Activation addition allows us to steer models by simply performing a forward pass for the concept you want to activate, saving the activations of a particular layer as a vector and adding them to future forward passes. See also Turntrout's prior work on steering a mouse by subtracting a cheese vector^[5] and inference-time intervention^[6].^[7]
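The recipe described above can be sketched on a toy two-layer network. Everything here is hypothetical (the weights, the inputs, the `coeff` scaling factor). The sketch follows the description in this post rather than any particular paper's exact procedure, and a real setup would hook a chosen layer of a transformer instead.

```python
def relu_layer(weights, vec):
    """One dense layer with ReLU; weights is a list of rows."""
    return [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in weights]

def forward(x, w_hidden, w_out, steering=None, coeff=1.0):
    """Toy forward pass that can add a steering vector to the hidden layer.

    Returns (hidden activations after any steering, output).
    """
    hidden = relu_layer(w_hidden, x)
    if steering is not None:
        hidden = [h + coeff * s for h, s in zip(hidden, steering)]
    output = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return hidden, output

W_HIDDEN = [[1.0, 0.0], [0.0, 1.0]]  # illustrative weights
W_OUT = [[1.0, -1.0]]

# 1. Forward pass on an input standing in for the concept; save the activations.
steering_vector, _ = forward([0.0, 1.0], W_HIDDEN, W_OUT)

# 2. An ordinary forward pass on a different input.
_, plain_output = forward([1.0, 0.0], W_HIDDEN, W_OUT)

# 3. The same input again, with the saved vector added at the hidden layer.
_, steered_output = forward([1.0, 0.0], W_HIDDEN, W_OUT,
                            steering=steering_vector, coeff=2.0)
print(plain_output, steered_output)
```

The only free choices are which layer to intervene on and how strongly to scale the added vector.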

World model location: Neel Nanda found a linear representation in Othello-GPT. This one was a bit trickier: a previous paper had failed to recover it using linear probes. However, Neel managed to do this by searching for a representation of "my color" and by applying a bunch of other tweaks. See also Wes Gurnee's work with Max Tegmark, which used linear probes to find linear representations of latitude and longitude, as well as one representing time^[8].

Scalable Interpretability: OpenAI used a language model in order to label neurons based on the activations for some relevant examples using few-shot prompting.

Understanding Implications^[9]: For a long time it appeared as though it would be challenging to prevent an AI from taking our instructions too literally. Essentially what seems to have happened is that they decided to just use RL on a generative base model and the problem mostly solved itself. Reinforcement Learning from Human Feedback is admittedly complex - proximal policy optimization isn't simple - but I still think it's quite notable that OpenAI seems to have almost one-shotted it with InstructGPT^[10]. Even if they hadn't found a solution the first time, it seems likely that we were always going to end up there by iterating on a baseline.

I don't want to dismiss the challenges of such research. Even if you have the general idea, making it work requires getting a lot of details right. Verifying it works takes a lot of work too. Even picking the right question is hard. However, simpler techniques have gone further for these problems than I would have originally expected.

I think it's worthwhile speculating why reasonably simple solutions have worked for these problems, and checking whether there are any similarities between them.

Firstly, I think it's worth noting that all of these techniques except for the last two look for linear representations in neural networks. Beren has made a strong case that deep learning models are almost linear. In retrospect, it seems intuitive to me that neural networks would store sparse features in linear combinations, rather than with some more complicated form of compression, as this makes it easy to write to and read from these features^[11].
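One way to get a feel for why near-orthogonal directions give a network room for many more features than it has dimensions is to sample random directions and measure how much they overlap. The sketch below is only an illustration with arbitrarily chosen sizes and a fixed seed. It says nothing about what any particular trained model actually does.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a direction uniformly at random by normalising a Gaussian vector."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def max_abs_cosine(vectors):
    """Largest absolute cosine similarity between any two distinct unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            cos = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(cos))
    return worst

rng = random.Random(0)       # fixed seed so the run is repeatable
dim, n_features = 100, 200   # twice as many directions as dimensions
features = [random_unit_vector(dim, rng) for _ in range(n_features)]
overlap = max_abs_cosine(features)
print(f"{n_features} directions in {dim} dims, largest |cosine| = {overlap:.2f}")
```

Exactly orthogonal directions would cap out at 100 features here. Tolerating a modest amount of interference between directions lifts that cap.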

In contrast, the last two techniques listed rely on current AI models being very powerful and quite steerable. I admit that two examples aren't very many, but I expect we'll see more in this vein soon.

Looking more broadly, I think it's worth noting that all of these results are in empirical alignment research. I don't think this is a coincidence. Whilst there may be simple solutions to some of the problems of agent foundations, the problems have been around for enough time that if there were simple solutions, they likely would have been found by now. In contrast, the last two directly rely on a certain level of capabilities and the linear representation techniques arguably rely on this indirectly as well. To wit: if it is beneficial for a model to have a linear representation of particular features, then the more powerful our optimization techniques, the more likely our trained models are to produce such a representation.

An Aside on Contrast-Consistent Search:

When I first encountered this technique, I was awed by it and I felt like I had no conception of how anyone would ever think of something like this. While I still acknowledge the brilliance of this work^[12], it no longer feels completely unthinkable.

I assume Collin already had some intuitions that neural networks were vastly more linear than you might expect. If you buy this and you're trying to read the truth directly out of the network weights, it's then entirely natural to look for a linear direction.

Linear probes were an already existing technique, so if you knew of this technique and you were making a shortlist of which techniques to investigate, this would make it onto the shortlist.

Suppose you end up in the position of asking "how can I use a linear probe to find a direction corresponding to truth?" in a neural network. It'd then be natural to ask what some of the properties of truth are, which would then lead directly to the consistency property Collin leveraged.

So, smart, brilliant even, but not completely unthinkable, if you ask the right questions.

And even just knowing to look for linearities would be a massive headstart.

Final thoughts:

I suspect that all of these approaches are still very far away from where we need to be. I consider them substantial advances nonetheless for two key reasons: having a baseline helps people choose an appropriate level of ambition, and also allows us to empirically discover the key bottlenecks to a fuller solution.

I'm hoping that increasing awareness that relatively simple techniques have been successful spurs more research progress, by giving people the confidence to actually try to make progress and increasing people's motivation to explore these kinds of approaches for longer before they toss in the towel.

It may very well be that we're just in a particular stage of the field where there's substantial low-hanging fruit to pick. Perhaps we'll soon pick all the low-hanging fruit such that we end up requiring brilliant conceptual breakthroughs for further progress. However, we aren't there yet and I think it's a mistake not to consider simple paths forward on the basis that if they were fruitful then someone would have already realised this.

^[1] I thought about also including this paper on representation engineering which describes, among other things, Linear Artificial Tomography. However, I haven't had time to finish skimming it yet.
^[2] Technically, "Discovering Latent Knowledge" is the name of the paper and of a problem. However, if I go to people and say "Have you heard of Contrast-Consistent Search?", the answer is typically no, but when I ask about "Discovering Latent Knowledge", a lot of the time it turns out that they're actually familiar with the paper.
^[3] One component pushes the probability of a statement being true and a statement being false toward adding up to one. A second component pushes the model away from producing probabilities close to 0.5.
^[4] While these are non-linear in their inputs, they assume that the log probability is linear in the inputs.
^[5] They generate two observations that are the same except for one possessing the cheese.
^[6] They train a linear probe with sigmoid activation to discover the truthful direction. Then they intervene by adding a multiple of this to the activation of the layer, scaling by a constant and the standard deviation.
^[7] For some useful potential applications, see Nina's work on using activation addition to reduce sycophancy, improve honesty, and red-team models.
^[8] This paper seems like it was mostly intended for outreach purposes. Unfortunately, the framing was a tiny bit off, in that some people felt that this paper/the publicity around it was overclaiming. That said, if this had been handled a bit better, it could have led to significant improvement in the debate. One point I want to emphasize is that you can spend forever debating whether current neural networks have world models and what a world model even means. Or you can just go and make direct empirical progress, and maybe that doesn't convince people, but it will at least push them to clarify what they're looking for.
^[9] You may object that RLHF is mostly capabilities. I also tend to think about it as being primarily a capabilities advance, but it is an advance in alignment as well.
^[10] I've heard that the innovation in ChatGPT was more about the user interface and finetuning for that than producing a more powerful model.
^[11] While orthogonal features are the simplest, you can only include one feature per dimension, whilst you can fit many more near-orthogonal features.
^[12] Even though CCS wasn't as successful as we might have hoped, I would still defend this assessment.

My vibe from this post is something like "we're making progress on stuff that could be helpful so there's stuff to work on!" and this is a vibe I like. However, I suspect that for people who might not be as excited about these approaches, you're likely not touching on important cruxes (eg: do these approaches really scale? Are some agendas capabilities enhancing? Will these solve deceptive alignment or just corrigible alignment?)

I also think that if the goal is to actually make progress and not to maximize the number of people making progress or who feel like they're making progress, then engaging with those cruxes is important before people invest substantive energy (ie: beyond upskilling). However as a directional update for people who are otherwise pretty cynical, this seems like a good update.

Hmm... I suppose this is pretty good evidence that CCS may not be as promising as it first appeared, esp. the banana/shed results.

https://www.lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1

Update: Seems like the banana results are being challenged.

I wrote in one of my footnotes:

You may object that RLHF is mostly capabilities. I also tend to think about it as being primarily a capabilities advance, but it is an advance in alignment as well

ie. it belongs in the reference class when figuring out the difficulty of making progress on alignment.

Regarding scalability, I wrote:

I suspect that all of these approaches are still very far away from where we need to be. I consider them substantial advances nonetheless for two key reasons: having a baseline helps people choose an appropriate level of ambition, and also makes it easier to empirically discover the key issues in solving a problem more fully.

Maybe that addresses the point about not scaling to a particular extent. Although, definitely not completely. It's also possible that some of these techniques are effectively a trap in that they'd appear to work up to a certain level, so we'd start to rely on them and then we'd end up getting screwed.

Has anyone developed a metric for quantifying the level of linearity versus nonlinearity of a model's representations? A metric like that would let us compare the levels of linearity for models of different sizes, which would help us extrapolate whether interpretability and alignment techniques that rely on approximate linearity will scale to larger models.

I don’t know, but would love to find out.

I asked on Discord and someone told me this:

A simple way to quantify this: first define a "feature" as some decision boundary over the data domain, then train a linear classifier to predict that decision boundary from the network's activations on that data. Quantify the "linearity" of the feature in the network as the accuracy that the linear classifier achieves.

For example, train a classifier to detect when some text has positive or negative sentiment, then pass the same text through some pretrained LLM (e.g. BERT) whose "feature-linearity" you're trying to measure, and try to predict the sentiment from the BERT's activation vectors using linear regression. The accuracy of this linear model tells you how linear the "sentiment" feature is in your LLM.
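The suggestion above can be sketched end to end on toy data. Here the "activations" are hand-written two-dimensional points rather than real LLM activations, the probe is a small logistic regression trained by gradient descent (standing in for whatever linear classifier you would actually use), and all names are hypothetical. One feature is a single coordinate, so it is linearly readable. The other is the XOR of the two coordinates, so it is not.

```python
import math

def train_linear_probe(acts, labels, steps=2000, lr=0.5):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(acts[0])
    w, b = [0.0] * dim, 0.0
    n = len(acts)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(acts, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of examples the linear probe classifies correctly."""
    correct = 0
    for x, y in zip(acts, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(acts)

acts = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # stand-in activations
linear_labels = [0, 0, 1, 1]  # feature = first coordinate
xor_labels = [0, 1, 1, 0]     # feature = XOR of the coordinates

linear_score = probe_accuracy(*train_linear_probe(acts, linear_labels),
                              acts, linear_labels)
xor_score = probe_accuracy(*train_linear_probe(acts, xor_labels),
                           acts, xor_labels)
print(linear_score, xor_score)
```

On real activations you would score the probe on held-out examples, since a wide enough activation vector lets a linear probe memorise a small training set.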

IMO the most useful version of this would be to get empirical evidence on techniques. E.g. erasing certain concepts using LEACE and seeing if they can inhibit the model's use of those concepts including during further training. It seems hard to ensure otherwise that there is not some gap between your definitions and reality.

In contrast, the last two techniques listed rely on current AI models being very powerful and quite steerable.

An alternative view is that we've been lucky. LLMs are trained by unsupervised learning and are almost oracles that are moderately aligned by default.

But soon someone will make them into Reinforcement Learning (RL) agents that can plan. They will do this because long-term planning is super useful and RL is the best way we have to do it. However, RL tends to make power-seeking agents that look for shortcuts and exploits (most mis-specification examples are from RL).

So I worry that we will see more unsafe examples soon.

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

