Research Engineering Intern at the Center for AI Safety. Helping to write the AI Safety Newsletter. Studying CS and Economics at the University of Southern California, and running an AI safety club there. Previously worked at AI Impacts and with Lionel Levine and Collin Burns on calibration for Detecting Latent Knowledge Without Supervision.
An interesting question here is "Which forms of AI for epistemics will be naturally supplied by the market, and which will be neglected by default?" In a weak sense, you could say that OpenAI is in the business of epistemics, in that its customers value accuracy and hate hallucinations. Perhaps Perplexity is a better example, as they cite sources in all of their responses. When embarking on an altruistic project here, it's important to pick an angle where you could outperform any competition and offer the best available product.
Consensus is a startup that raised $3M "Make Expert Knowledge Accessible and Consumable for All" via LLMs.
I'm specifically excited about finding linear directions via unsupervised methods on contrast pairs. This is different from normal probing, which finds those directions via supervised training on human labels, and therefore might fail in domains where we don't have reliable human labels.
But this is also only a small portion of work known as "activation engineering." I know I posted this comment in response to a general question about the theory of change for activation engineering, so apologies if I'm not clearly distinguishing between different kinds of activation engineering, but this theory of change only applies to a small subset of that work. I'm not talking about model editing here, though maybe it could be useful for validation, not sure.
From Benchmarks for Detecting Measurement Tampering:
The best technique on most of our datasets is probing for evidence of tampering. We know that there is no tampering on the trusted set, and we know that there is some tampering on the part of the untrusted set where measurements are inconsistent (i.e. examples on which some measurements are positive and some are negative). So, we can predict if there is tampering by fine-tuning a probe at the last layer of the measurement predicting model to discriminate between these two kinds of data: the trusted set versus examples with inconsistent measurements (which have tampering).
This seems like a great methodology and similar to what I'm excited about. My hypothesis based on the comment above would be that you might get extra juice out of unsupervised methods for finding linear directions, as a complement to training on a trusted set. "Extra juice" might mean better performance in a head-to-head comparison, but even more likely is that the unsupervised version excels and struggles on different cases than the supervised version, and you can exploit this mismatch to make better predictions about the untrusted dataset.
From your shortform:
Some of their methods are “unsupervised” unlike typical linear classifier training, but require a dataset where the primary axis of variation is the concept they want. I think this is practically similar to labeled data because we’d have to construct this dataset and if it mostly varies along an axis which is not the concept we wanted, we’d be in trouble. I could elaborate on this if that was interesting.
I'd be interested to hear further elaboration here. It seems easy to construct a dataset where a primary axis of variation is the model's beliefs about whether each statement is true. Just create a bunch of contrast pairs of the form:
We don't need to know whether the statement is true to construct this dataset. And amazingly, unsupervised methods applied to contrast pairs like the one above significantly outperform zero-shot baselines (i.e. just asking the model whether a statement is true or not). The RepE paper finds that these methods improve performance on TruthfulQA by double digits vs. a zero-shot baseline.
Here's one hope for the agenda. I think this work can be a proper continuation of Collin Burns's aim to make empirical progress on the average case version of the ELK problem.
tl;dr: Unsupervised methods on contrast pairs can identify linear directions in a model's activation space that might represent the model's beliefs. From this set of candidates, we can further narrow down the possibilities with other methods. We can measure whether this is tracking truth with a weak-to-strong generalization setup. I'm not super confident in this take; it's not my research focus. Thoughts and empirical evidence are welcome.
ELK aims to identify an AI's internal representation of its own beliefs. ARC is looking for a theoretical, worst-case approach to this problem. But empirical reality might not be the worst case. Instead, reality could be convenient in ways that make it easier to identify a model's beliefs.
One such convenient possibility is the "linear representations hypothesis:" that neural networks might represent salient and useful information as linear directions in their activation space. This seems to be true for many kinds of information - (see here and recently here). Perhaps it will also be true for a neural network's beliefs.
If a neural network's beliefs are stored as a linear direction in its activation space, how might we locate that direction, and thus access the model's beliefs?
Collin Burns's paper offered two methods:
Given some plausible assumptions about how neural networks operate, it seems reasonable to me to expect this method to work. Neural networks might think about whether claims in their context window are true or false. They might store these beliefs as linear directions in their activation space. Recover them with labels would be difficult, because you might mistake your own beliefs for the model's. But if you simply feed the model unlabeled pairs of contradictory statements, and study the patterns in its activations on those inputs, it seems reasonable that the model's beliefs about the statements would prominently appear as linear directions in its activation space.
One challenge is that this method might not distinguish between the model's beliefs and the model's representations of the beliefs of others. In the language of ELK, we might be unable to distinguish between the "human simulator" direction and the "direct translator" direction. This is a real problem, but Collin argues (and Paul Christiano agrees) that it's surmountable. Read their original arguments for a better explanation, but basically this method would narrow down the list of candidate directions to a manageable number, and other methods could finish the job.
Some work in the vein of activation engineering directly continues Collin's use of unsupervised clustering on the activations of contrast pairs. Section 4 of Representation Engineering uses a method similar to Collin's second method, outperforming few-shot prompting on a variety of benchmarks and using it to improve performance on TruthfulQA by double digits. There's a lot of room for follow-up work here.
Here are few potential next steps for this research direction:
I have lower confidence in this overall take than most of the things I write. I did a bit of research trying to extend Collin's work, but I haven't thought about this stuff full-time in over a year. I have maybe 70% confidence that I'd still think something like this after speaking to the most relevant researchers for a few hours. But I wanted to lay out this view in the hopes that someone will prove me either right or wrong.
Here's my previous attempted explanation.
Another important obligation set by the law is that developers must:
(3) Refrain from initiating the commercial, public, or widespread use of a covered model if there remains an unreasonable risk that an individual may be able to use the hazardous capabilities of the model, or a derivative model based on it, to cause a critical harm.
This sounds like common sense, but of course there's a lot riding on the interpretation of "unreasonable."
Really, really cool. One small note: It would seem natural for the third heatmap to show the probe's output values after they've gone through a softmax, rather than being linearly scaled to a pixel value.
Two quick notes here.
I think it’s pretty common and widely accepted that people support laws for their second-order, indirect consequences rather than their most obvious first-order consequences. Some examples:
These aren’t necessarily perfect analogies, but I think they suggest that there’s no general norm against supporting policies for their indirect consequences. Instead, it’s often healthy when people with different motivations come together and form a political coalition to support a shared policy goal.
Curious why this is being downvoted. I think legislators should pass laws which have positive consequences. I explained the main reasons why I think this policy would have positive consequences. Then I speculated that popular beliefs on this issue might be biased by profit motives. I did not claim that this is a comprehensive analysis of the issue, or that there are no valid counterarguments. Which part of this is norm-violating?
(Steve wrote this, I only provided a few comments, but I would endorse it as a good holistic overview of AIxBio risks and solutions.)