If I were doing this project, the first thing I'd want to check is whether the DLK paper is actually robustly measuring truth/falsehood, rather than something else which happens to correlate with truth/falsehood for the particular data generation/representation methods used. My strong default prior, for ML papers in general, is "you are not measuring what you think you are measuring".
Thanks for writing this! I think there are a number of interesting directions here.
I think in (very roughly) increasing order of excitement:
I’m excited to see where this work goes!
What CCS does conceptually is find a direction in latent space that distinguishes between true and false statements. That doesn't have to be truth (restricted to stored model knowledge, etc.), and the paper doesn't really test for false positives, so it's hard to know how robust the method is. In particular, I want to know that the probe doesn't respond to statements that aren't truth-apt. It seems worth brainstorming about properties that true/false statements could mostly share that are not actually truth. A couple of examples come to mind.
The last one seems more pressing, since it's a property that isn't "truthy" at all, but I'm struggling to come up with more examples like it.
Nice project, there are several ideas in here I think are great research directions. Some quick thoughts on what I'm excited about:
ETA: I think the "adding labeled data" idea is a good illustration of what I'm talking about. Imagine you have problems where the method currently doesn't work at all. If even large amounts of supervised data don't help much on these, this suggests your probe can't find a truth encoding (maybe because you'd need a higher capacity probe or if you already have that, maybe because the optimization is difficult). On the other hand, if you get good performance with supervised data, it suggests that you need stronger consistency checks. You can then also try things like adding supervised data in only one domain and check generalization, and you can expect a reasonably clear signal. But if you do all this on a dataset where the unsupervised method already works pretty well, then the only evidence you get is something like "does it improve performance by 2%, 5%, 10%, ...?", the signal is less clear, and it's much harder to say which of these explanations a 5% improvement indicates. All that is in addition to the fact that finding cases which are difficult for the current method is really important in its own right.
> You can then also try things like adding supervised data in only one domain and check generalization, and you can expect a reasonably clear signal.
Yep, I just had this idea this morning and came here to check if anyone else had thought of it. It seems plausible that a semi-supervised version of CCS could outperform naive logistic regression in generalization performance.
An LLM will presumably have some internal representation of the characteristics of the voice that it is speaking with, beyond its truth value. Perhaps you could test for such a representation in an unsupervised manner by asking it to complete a sentence with and without prompting for a particular disposition ('angry', 'understanding', ...). Once you learn to understand the effects that the prompting has, you could test how well this allows you to modify disposition via changing the activations.
This line of thought came from imagining what the combination of CCS and Constitutional AI might look like.
Preface
We would like to thank the following people who contributed to the generation of ideas and provided feedback on this post: Alexandre Variengien, Daniel Filan, John Wentworth, Jonathan Claybrough, Jörn Stöhler, June Ku, Marius Hobbhahn, and Matt MacDermott.
We are a group of four participating in the SERI ML Alignment Theory Scholars Program under John Wentworth, and we are extending the paper "Discovering Latent Knowledge in Language Models Without Supervision", as we see it as an interesting direction where we could contribute concrete progress towards alignment.
We have a number of potential directions we could explore. The goal of this post is to get feedback on them early in order to prioritize better: confirmation of the good ideas, reasons why certain ideas might be bad, references to existing similar attempts or relevant literature, potential failure modes, and generally (constructive) criticism of any kind.
Although we're all involved in all parts of the research process, the structure of our group and our main roles are as follows: Kaarel Hänni is our theorist and the main contributor of ideas; Walter Laurito and Kay Kozaronek are our experimentalists who focus mainly on deploying and running the code; Georgios Kaklamanos is our distiller and focuses on writing up and presenting the results (e.g., this post).
Brief summary of the DLK paper
In case you haven’t read the original paper, you can get a quick overview from the following Twitter Thread from the author (alternative Threadreader link). Here we’ll just mention a few key points relevant to the things we want to work on. The above figure from the paper outlines their process in a clear way.
- The key insight of the paper is that coherence conditions satisfied by truth/falsehood can be leveraged to search for a model's internal representation of truth.
- e.g., the question "Are cats mammals?" can be answered either "yes" or "no"
- Thus, if we define $p(x_i^+)$ as the probability that the statement $x_i$ is true, and $p(x_i^-)$ as the probability that the statement $x_i$ is false, these two probabilities should sum to 1.
- They introduce Contrast-Consistent Search (CCS), a method that learns a linear projection of the hidden states that is consistent across negations.
- The input to CCS is a set of Yes-No questions (contrasting pairs) and access to a pre-trained model’s representations (ϕ).
- The output of CCS is a lightweight probe on top of these representations that can answer new questions.
- CCS uses these contrasting pairs to create representations for the “Yes” and “No” parts of the pairs, denoted $\phi(x_i^+)$ and $\phi(x_i^-)$, respectively. These representations are then normalized.[1]
- After normalization, CCS maps these representations to probabilities.
- $p_i^+ = p(\phi(x_i^+))$: the probability that the statement $x_i$ is true
- $p_i^- = p(\phi(x_i^-))$: the probability that the statement $x_i$ is false
- As already mentioned, these two should sum to 1.
- When training, their loss function has two parts:
- Consistency loss: it is the part that tries to ensure that the above constraint is satisfied. It is formulated as:
- $L_{\text{consistency}}(\theta, b; q_i) := \left[p_{\theta,b}(x_i^+) - \left(1 - p_{\theta,b}(x_i^-)\right)\right]^2$
- Confidence loss: the part that avoids the trivial solution $p(x^+) = p(x^-) = 0.5$
- $L_{\text{confidence}}(\theta, b; q_i) := \min\{p_{\theta,b}(x_i^+), p_{\theta,b}(x_i^-)\}^2$
- During inference, to determine whether a statement $q_i$ is true:
- The probability that the statement is true, $p(x_i^+)$, and the probability that the statement is not false, $1 - p(x_i^-)$, represent the same thing. So (in theory) they should be equal.
- However, this is only enforced by the soft consistency constraint of the consistency loss. So (in practice), they are not exactly equal.
- Thus, to do inference on an example, they take the average of these two.
- $\tilde{p}(q_i) := \frac{1}{2}\left(p(x_i^+) + \left(1 - p(x_i^-)\right)\right)$

Reproduction and replication of the original paper
We have started setting up and running the paper’s code to see if we can obtain the same results. The authors provided their code on GitHub. Furthermore, we are creating our own implementation for comparison which will give us a better understanding of the paper and more flexibility for testing new ideas.
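As a reference point for our own implementation, here is a minimal sketch (in PyTorch, with our own variable names) of the CCS objective and inference rule as we understand them from the paper:

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """CCS loss on probe outputs for the two halves of each contrast pair.

    p_pos, p_neg: tensors of shape (batch,) with values in (0, 1), i.e. the probe's
    probabilities for the "yes" completion x+ and the "no" completion x-.
    """
    # Consistency term: p(x+) should equal 1 - p(x-)
    l_consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence term: discourage the degenerate solution p(x+) = p(x-) = 0.5
    l_confidence = torch.minimum(p_pos, p_neg) ** 2
    return (l_consistency + l_confidence).mean()

def ccs_predict(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """The paper's inference rule: average p(x+) and 1 - p(x-), then threshold at 0.5."""
    p_tilde = 0.5 * (p_pos + (1.0 - p_neg))
    return p_tilde > 0.5
```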
Potential directions
Evaluate a different inference rule
We’d like to evaluate how using a different inference rule (instead of the average) would affect the performance: specifically, checking which of $p_\theta(\phi(x_i^+))$ and $p_\theta(\phi(x_i^-))$ is larger.
Evaluating the robustness of inner truthiness on different prompt formats.
Section 3.2.2 of the paper states that their method is robust when dealing with misleading prompts. They create misleading prompts by adding a prefix, such as the one shown in Figure 1, before they ask the questions. We would like to evaluate the robustness of the model further by using different types of prompts.
They use prompts containing a binary question and an answer:
$x^+$ = `Q: Is the sentiment of "I loved this movie." positive or negative? A: positive`

We would like to either use the corresponding question alone or the corresponding proposition alone. This format seems more natural to assign truth values to, as it resembles everyday communication. E.g.:
We could check this for the probe their method learns, or train another probe for differently formatted prompts.[2] We want a probe that works with propositions, since it is more natural in language to combine propositions than prompts in the format they’re using (e.g., combining P and Q into the proposition [P and Q], rather than writing [Question: Is it true that P and Q? A: __]).[3]
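As a small illustration, here is a sketch of how the two formats could be generated for the same underlying example (the templates below are ours, for a sentiment task like the one in the paper, not the paper's exact prompts):

```python
def qa_format(text, label_word):
    # Question-answer format, similar in spirit to the paper's contrast pairs
    return f'Q: Is the sentiment of "{text}" positive or negative? A: {label_word}'

def proposition_format(text, label_word):
    # Bare-proposition format we would like to test instead
    return f'The sentiment of "{text}" is {label_word}.'

def contrast_pair(text, fmt):
    # (x+, x-): the same example completed with each of the two candidate labels
    return fmt(text, "positive"), fmt(text, "negative")

x_pos, x_neg = contrast_pair("I loved this movie.", proposition_format)
```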
Change the projection probe and check how that affects things
In the paper, they use a neural net with zero hidden layers and a sigmoid activation on the output node. We could try a neural net with more layers and test whether it increases performance[4].
We could also look at multiple layers of the model simultaneously and find a combination that works better. E.g., we could take the first, the middle, and the last layer and average their representations. There is a concern that this might lead to overfitting, though, since the task is simple enough that logistic regression already performed well on it.
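A rough sketch of what this comparison could look like (the MLP width and depth below are arbitrary illustrative choices, not something from the paper):

```python
import torch.nn as nn

class LinearProbe(nn.Module):
    """Zero-hidden-layer probe as in the paper: linear map plus sigmoid."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, x):
        return self.net(x).squeeze(-1)

class MLPProbe(nn.Module):
    """One-hidden-layer variant we could compare against."""
    def __init__(self, hidden_dim: int, width: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, width), nn.ReLU(),
            nn.Linear(width, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)
```

For the multi-layer idea, the probe input `x` could simply be the mean (or concatenation) of the hidden states from the chosen layers.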
Testing the inverse scaling datasets
We want to take the inverse scaling datasets and train a DLK probe for the following models:
Then we want to check whether the representation of truth in the inner representations also gets less accurate for bigger models. If it does, that could point in the direction of the model's understanding actually getting worse. If it doesn't, that could point in the direction of the inverse scaling cases found so far having more to do with something weird going on with output behavior in a given context, in which case they might not generalize. This also seems like a potentially interesting additional testing ground for whether DLK can provide information about the model beyond its output behavior.
Better representation of probabilities
$L_{\text{confidence}}(\theta, b; q_i) := \min\{p_{\theta,b}(x_i^+), p_{\theta,b}(x_i^-)\}^2$

One part of the loss function they use is the “confidence” term shown above. This term imposes something like the law of excluded middle: it pushes probabilities to be close to 0 or 1, which ensures that the probe won’t end up outputting[5] $p(x^+) = p(x^-) = 0.5$.
However, in reality there is no rule stating that the probabilities have to be near 0 or 1. So we would like to test whether it’s possible to drop the confidence term from the loss function and still capture probabilities well, in order to get a better representation of the world.
If we simply removed the confidence term, a trivial solution would be to always output a probability of 0.5 for both options. To avoid this, we could include training examples with more than two options and require that their probabilities sum to 1.
For example, we could pick any two propositions P, Q (e.g., P: `2+2 = 4`, Q: `Cats are mammals`) and construct all four of their boolean combinations: [P and Q], [P and not Q], [not P and Q], [not P and not Q].
The probabilities of these four combinations have to add up to 1 (together they cover the entire space of possibilities). We could add a term to the loss function that penalizes violations of this condition (see the sketch below).
One alternative would be to require all of the following conditions:
Another option would be to take mathematical statements where the possibility space is exhausted by some finite number of options that it’s reasonable to be uncertain between[6].
In general, we’d search for something that satisfies probability axioms. In addition to capturing probabilities better, this hopefully constrains the search space a lot. Additional constraints would be useful, especially if we wish to replace the linear probe with a more complicated probe with many more parameters[7].
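As a rough sketch of the extra penalty (our own formulation, not something from the paper): for each pair (P, Q) we would run the probe on the four boolean combinations and penalize deviations from the probabilities summing to 1.

```python
import torch

def partition_loss(p_pq, p_pnq, p_npq, p_npnq):
    """Penalty for violating the probability axioms on the four boolean combinations
    [P and Q], [P and not Q], [not P and Q], [not P and not Q].

    Each argument is a tensor of shape (batch,) containing the probe's probability
    for the corresponding combination; together they should sum to 1.
    """
    total = p_pq + p_pnq + p_npq + p_npnq
    return ((total - 1.0) ** 2).mean()
```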
Acknowledgment: For this subsection, Kaarel wants to thank Jörn Stöhler, Matt MacDermott, and Daniel Filan for contributing significantly to these ideas in conversations.
Evaluate probabilistic examples
We could use the above method to evaluate probabilistic examples directly, e.g. “My next coin toss will land heads”, and check if the above method without a confidence term in the loss does better on these. This would provide a better understanding of how well the probe captures probabilities instead of just capturing truth for certain sentences. It might also be helpful for understanding the calibration of the probe.
Evaluate the performance of CCS after adding some labeled data.
After reproducing the authors’ results, we would extend the datasets with some labeled examples (e.g., “1+1=2”) that we’d flag as either true or false. We’d then “hardcode” some terms into the loss function to ensure that these data points are taken into account. The main motivation is to check how the performance of CCS changes in comparison to the authors’ results. The method would still be mostly unsupervised[8], which is nice since it can scale to large datasets without requiring extensive labeled data.
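A sketch of the kind of objective we have in mind (the use of a binary cross-entropy term and its weighting are our choices, not something from the paper):

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(p_pos, p_neg, labels=None, supervised_weight=1.0):
    """Unsupervised CCS loss, plus an optional supervised term on labeled pairs.

    labels: tensor of shape (batch,) with 1.0 for true statements and 0.0 for false
    ones, or None for a fully unlabeled batch. For simplicity, this sketch assumes
    every example in the batch is labeled whenever labels is given.
    """
    l_consistency = (p_pos - (1.0 - p_neg)) ** 2
    l_confidence = torch.minimum(p_pos, p_neg) ** 2
    loss = (l_consistency + l_confidence).mean()
    if labels is not None:
        # "Hardcode" the labeled examples: p(x+) should match the label directly
        loss = loss + supervised_weight * F.binary_cross_entropy(p_pos, labels)
    return loss
```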
Check for possible connections to Mechanistic Interpretability
Given that they are using a transformer, we could try to see if there is some form of connection to mechanistic interpretability.
Evaluate alternating text
We could input a passage of alternating true/false sentences and see which inner states (i.e., which positions) are best for determining the truth of each particular sentence. Are these always the positions of the tokens in that sentence? Does it get more spread out as one goes deeper into the transformer? The hypothesis is that if we can locate the positions the model uses for each sentence, we can trace that to the model's internal representation of truth.
DLK is a non-mechanistic interpretability technique since it only finds a representation of truth; it doesn’t provide a mechanism. On the other hand, if the above works, it might provide information on how the model stores truth, which is useful for mechanistic interpretability research.
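A rough sketch of the bookkeeping involved, assuming we can extract per-position hidden states at a given layer (the extraction code and the trained probe are taken as given here):

```python
import torch

def positions_predictive_of_sentence(hidden_states: torch.Tensor,
                                     label: float,
                                     probe,
                                     k: int = 5) -> torch.Tensor:
    """Find which token positions' hidden states best predict one sentence's truth value.

    hidden_states: (seq_len, hidden_dim) activations at a single layer for the whole passage.
    label: 1.0 if the sentence in question is true, 0.0 if it is false.
    probe: a trained CCS-style probe mapping hidden states to probabilities of "true".

    Returns the k positions whose probe output is closest to the label; we can then check
    whether these fall inside that sentence's own token span, and how this changes by layer.
    """
    probs = probe(hidden_states)        # probability of "true" at every position, shape (seq_len,)
    errors = (probs - label) ** 2       # how far each position is from the sentence's label
    return torch.argsort(errors)[:k]
```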
Check changes in the truth representation when incrementally prepending text to the prompt.
We could start with a normal prompt and then incrementally prepend text to it, observing how the truth representation changes. This is important because it gives us another task on which we could validate one of the main claims of the paper, namely that their method reduces "prompt sensitivity".
Check if all neurons lead to ROME
Using interpretability tools (e.g., the causal tracing method from the ROME publication), we could check whether we can figure out how truth is represented and in which neurons. We could even combine this approach with the previous idea and see if they produce the same results.
Additional ideas that came up while writing this post.
The following ideas were generated by Kaarel while we were in the process of writing the post and trying to re-implement the code. We haven't spent much time refining the phrasing, so they might require slower reading and a deeper understanding of the paper and the method than the above text does. We do this for the benefit of getting feedback early, and we hope to have a better formulation / expand upon them in future posts.
It seems plausible that one would get an easy improvement in the accuracy of CCS just by looking at the k different ways of prompting a single original data point and averaging the results to do inference on that data point. The intuition here is that the average of k 0/1 coin flips, each of which is biased towards 1, is generally much more likely to be >0.5 than an individual coin flip is to come up 1 (given that the flips are not too strongly correlated with each other).
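A minimal sketch of this ensembling step (assuming `probs` holds the probe's outputs for the k promptings of each original data point):

```python
import torch

def ensemble_predict(probs: torch.Tensor) -> torch.Tensor:
    """probs: tensor of shape (n_examples, k) with the probe's estimated probability of
    "true" for each of the k prompt formats of every original data point.

    Averaging over the k formats before thresholding should be more reliable than any
    single format, as long as their errors are not too correlated (the coin-flip intuition).
    """
    return probs.mean(dim=1) > 0.5
```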
Acknowledgment: This idea should probably be mostly attributed to June Ku, whom Kaarel would like to thank for a helpful conversation.
Acknowledgments: Kaarel would like to thank Alexandre Variengien for recommending thinking more about how to generalize DLK to different kinds of concepts in a helpful conversation, as well as for giving concrete leads on this, such as the idea to look for a representation of direction inside a model stated later. Kaarel would also like to thank John Wentworth for providing feedback and a number of ideas in a helpful conversation.
If we do this for a bunch of simulated agents, maybe we can state some general conclusion about how truth according to a simulated agent is represented, and maybe this contrasts with what is found by DLK, providing evidence that DLK is not just finding the representation of truth according to some simulated agent. Or maybe we can even use this to develop a good understanding of how language models represent concepts of simulated agents vs. analogous concepts of their own. That would let us figure out, e.g., the goals of a language model by first understanding how the goals of simulated agents are represented (using supervised probing), and then applying the general mapping from representations of simulatee concepts to representations of the model’s own analogous concepts that we developed for truth (if we’re lucky and it generalizes).
[1] This is a technical part of the paper that we didn’t want to fully repeat here. For more details, see Section 2.2 of the paper.
[2] Renormalization might not make sense for propositions, but maybe the method works without renormalization.
[3] This is also relevant to the next section, where we aim for a better representation of the probabilities.
[4] From their code, it looks as if they already tried this, and the paper states that the results were bad, so they abandoned it. Still, we’d like to explore it.
[5] This is a degenerate solution to the other term in their loss function, so they want to avoid it.
[6] E.g., what the trillion billionth digit of π is, for which there are 10 options.
[7] Kaarel also thinks it would be very cool if it were the case that a random thing internal to the model satisfying the probability axioms is in fact the probability assignment of the model.
[8] Technically speaking, this would be a semi-supervised approach.