Here's one hope for the agenda. I think this work can be a proper continuation of Collin Burns's aim to make empirical progress on the average case version of the ELK problem.
tl;dr: Unsupervised methods on contrast pairs can identify linear directions in a model's activation space that might represent the model's beliefs. From this set of candidates, we can further narrow down the possibilities with other methods. We can measure whether this is tracking truth with a weak-to-strong generalization setup. I'm not super confident in this take; it's not my research focus. Thoughts and empirical evidence are welcome.
ELK aims to identify an AI's internal representation of its own beliefs. ARC is looking for a theoretical, worst-case approach to this problem. But empirical reality might not be the worst case. Instead, reality could be convenient in ways that make it easier to identify a model's beliefs.
One such convenient possibility is the "linear representations hypothesis:" that neural networks might represent salient and useful information as linear directions in their activation space. This seems to be true for many kinds of information - (see here and recently here). Perhaps it will also be true for a neural network's beliefs.
If a neural network's beliefs are stored as a linear direction in its activation space, how might we locate that direction, and thus access the model's beliefs?
Collin Burns's paper offered two methods:
- Consistency. This method looks for directions which satisfy the logical consistency property P(X)+P(~X)=1. Unfortunately, as Fabien Roger and a new DeepMind paper point out, there are very many directions that satisfy this property.
- Unsupervised methods on the activations of contrast pairs. The method roughly does the following: Take two statements of the form "X is true" and "X is false." Extract a model's activations at a given layer for both statements. Look at the typical difference between the two activations, across a large number of these contrast pairs. Ideally, that direction includes information about whether or not each X was actually true or false. Empirically, this appears to work. Section 3.3 of Collin's paper shows that CRC is nearly as strong as the fancier CCS loss function. As Scott Emmons argued, the performance of both of these methods is driven by the fact that they look at the difference in the activations of contrast pairs.
Given some plausible assumptions about how neural networks operate, it seems reasonable to me to expect this method to work. Neural networks might think about whether claims in their context window are true or false. They might store these beliefs as linear directions in their activation space. Recover them with labels would be difficult, because you might mistake your own beliefs for the model's. But if you simply feed the model unlabeled pairs of contradictory statements, and study the patterns in its activations on those inputs, it seems reasonable that the model's beliefs about the statements would prominently appear as linear directions in its activation space.
One challenge is that this method might not distinguish between the model's beliefs and the model's representations of the beliefs of others. In the language of ELK, we might be unable to distinguish between the "human simulator" direction and the "direct translator" direction. This is a real problem, but Collin argues (and Paul Christiano agrees) that it's surmountable. Read their original arguments for a better explanation, but basically this method would narrow down the list of candidate directions to a manageable number, and other methods could finish the job.
Some work in the vein of activation engineering directly continues Collin's use of unsupervised clustering on the activations of contrast pairs. Section 4 of Representation Engineering uses a method similar to Collin's second method, outperforming few-shot prompting on a variety of benchmarks and using it to improve performance on TruthfulQA by double digits. There's a lot of room for follow-up work here.
Here are few potential next steps for this research direction:
- On the linear representations hypothesis, doing empirical investigation of when it holds and when it fails, and clarifying it conceptually.
- Thinking about the number of directions that could be found using these methods. Maybe there's a result to be found here similar to Fabien and DeepMind's results above, showing this method fails to narrow down the set of candidates for truth.
- Applying these techniques to domains where models aren't trained on human statements about truth and falsehood, such as chess.
- Within a weak-to-strong generalization setup, instead try unsupervised-to-strong generalization using unsupervised methods on contrast pairs. See if you can improve a strong model's performance on a hard task by coaxing out its internal understanding of the task using unsupervised methods on contrast pairs. If this method beats fine-tuning on weak supervision, that's great news for the method.
I have lower confidence in this overall take than most of the things I write. I did a bit of research trying to extend Collin's work, but I haven't thought about this stuff full-time in over a year. I have maybe 70% confidence that I'd still think something like this after speaking to the most relevant researchers for a few hours. But I wanted to lay out this view in the hopes that someone will prove me either right or wrong.
Here's my previous attempted explanation.