My guess is that a model with 1-10B params could benefit from CoT if trained using these techniques (https://arxiv.org/abs/2306.11644, https://arxiv.org/abs/2306.02707). Then there's reduced precision and other tricks to further shrink the model. That said, I think there's a mismatch between state-of-the-art multi-modal models (huge MoE doing lots of inference time compute using scaffolding/CoT) that make sense for many applications and the constraints of a drone if it needs to run locally and produce fast outputs.
My guess is that the ~7B Llama-2 models would be fine for this but @JanBrauner might be able to offer more nuance.
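For a rough sense of what reduced precision buys at this scale, here is a back-of-the-envelope sketch of weight memory at different bit widths. The function name and the 7e9 parameter count are illustrative assumptions, and the estimate covers weights only (no activations, KV cache, or quantization metadata).

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Assumed round figure for a "~7B" model; real checkpoints differ slightly.
PARAMS_7B = 7e9

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(PARAMS_7B, bits):.1f} GB")
```

Whether a given footprint fits on a drone depends on the onboard hardware, which this arithmetic does not address.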
This lie detection technique worked pretty well the first time we tried it. We also looked at using a 2nd model to "interrogate" the 1st model (i.e. the model that is suspected of lying). This approach worked less well but we didn't push it that hard.
I address the motivations for our Reversal Curse paper in a reply to your other comment. My current (highly speculative) guess is that humans do learn one-directionally. We can't easily recite poems backwards line-by-line or word-by-word or phoneme-by-phoneme. We can't understand such reversed language either. It's easy to count down (because we practice that) but harder to do the alphabet backwards (because we don't practice it). Mostly when we memorize facts that are 2-way (unlike poems), we do some minimal amount of reflection/repetition that means both AB and BA are present. E.g. repeating to ourselves "casa, house, casa, house, etc...". For facts we read passively in newspapers, it's trickier to think about because we retain relatively little. But my guess is that most facts that we retain at all will be ones that appear in both orders, though that won't be necessary for us learning them (because we can reflect on them ourselves). [If we don't understand the semantics of what we are hearing at all, then we don't memorize. E.g. Americans might hear a lot of Spanish on the streets but memorize basically nothing.]
Great points and lots I agree with.
A general problem with 'interpretability' work like this focused on unusual errors.
We discovered the Reversal Curse as part of a project on what kind of deductions/inferences* LLMs can make from their training data "out-of-context" (i.e. without having the premises in the prompt or being able to do CoT). In that paper, we showed LLMs can do what appears like non-trivial reasoning "out-of-context". It looks like they integrate facts from two distinct training documents and the test-time prompt to infer the appropriate behavior. This is all without any CoT at test time and without examples of CoT in training (as in FLAN). Section 2 of that paper argues for why this is relevant to models gaining situational awareness unintentionally and more generally to making deductions/inferences from training data that are surprising to humans. Relatedly, there is very interesting work from Krasheninnikov et al from David Krueger's group that shows out-of-context inference about the reliability of different kinds of definition. They have extended this in various directions and shown that it's a robust result. Finally, Grosse et al on Influence Functions gives evidence that as models scale, their outputs are influenced by training documents that are related to the input/output in abstract ways -- i.e. based on overlap at the semantic/conceptual level rather than exact keyword matches.
Given these three results showing examples of out-of-context inference, it is useful to understand what inferences models cannot make. Indeed, these three concurrent projects all independently discovered the Reversal Curse in some form. It's a basic result once you start exploring this space. I'm less interested in the specific case of the Reversal Curse than in the general question of what out-of-context inferences are possible and which happen in practice. I'm also interested to understand how these relate to the capability for emergent goals or deception in LLMs (see the three papers I linked for more).
And this is a general dilemma: if a problem+answer shows up at least occasionally in the real world / datasets proxying for the real world, then a mere approximator or memorizer can learn the pair, by definition; and if it doesn't show up occasionally, then it can't matter to performance and needs a good explanation why we should care.
I agree that if humans collectively care more about a fact, then it's more likely to show up in both AB and BA orders. Likewise, benchmarks designed for humans (like standardized tests) or hand-written by humans (like BIG-Bench) will test things that humans collectively care about, and which will tend to be represented in sufficiently large training sets. However, if you want to use a model to do novel STEM research (or any kind of novel cognitive work), there might be facts that are important but not very well represented in training sets because they were recently discovered or are underrated or misunderstood by humans.
On the point about logic, I agree with much of what you say. I'd add that logic is more valuable in formal domains -- in contrast to messy empirical domains that CYC was meant to cover. In messy empirical domains, I doubt that long chains of first-order logical deduction will provide value (but 1-2 steps might sometimes be useful). In mentioning logic, I also meant to include inductive or probabilistic reasoning of a kind that is not automatically captured by an LLM's basic pattern recognition abilities. E.g. if the training documents contain results of a bunch of flips of coin X (but they are phrased differently and strewn across many diverse sources), inferring whether the coin is likely biased or fair.
*deductions/inferences. I would prefer to use "inferences" here but that's potentially confusing because of the sense of "neural net inference" (i.e. the process of generating output from a neural net).
Did you look at the design for our Experiment 1 in the paper? Do you think your objections apply to that design?
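To make the coin X example concrete, here is a minimal sketch of the inference an idealized reasoner could draw after pooling flips scattered across sources. The source names and flip strings are made up for illustration, and the uniform prior and exact binomial test are my choices, not anything from the paper.

```python
from math import comb

# Hypothetical flip records for coin X, strewn across different "documents".
documents = {
    "forum post": "HHHT",
    "lab notebook": "HHH",
    "news article": "HHH",
}


def pool_counts(docs: dict[str, str]) -> tuple[int, int]:
    """Aggregate heads and tails across all sources."""
    flips = "".join(docs.values())
    return flips.count("H"), flips.count("T")


def posterior_mean_heads(heads: int, tails: int) -> float:
    """Posterior mean of P(heads) under a uniform Beta(1, 1) prior."""
    return (heads + 1) / (heads + tails + 2)


def fair_coin_p_value(heads: int, total: int) -> float:
    """Exact two-sided binomial p-value under the hypothesis P(heads) = 0.5."""
    observed = comb(total, heads)
    # Sum the probability of every outcome no more likely than the observed one.
    extreme = sum(comb(total, k) for k in range(total + 1) if comb(total, k) <= observed)
    return extreme / 2**total


heads, tails = pool_counts(documents)
print(heads, tails)
print(posterior_mean_heads(heads, tails))
print(fair_coin_p_value(heads, heads + tails))
```

The point of the example is that no single source contains enough flips to conclude much; the evidence only becomes informative once it is aggregated.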
Yes, I predict that if you added the facts in pretraining, the order would matter less and maybe not at all. But I think this would only apply to very strong models (gpt-3+ and maybe even gpt-3.5-instruct-turbo+).
There are two pieces of evidence against this: the influence function results, which show the Reversal Curse for models better than GPT-3, and our results in Experiment 2 for GPT-3.5 and GPT-4.
Another thing that might work, possibly via finetuning and probably via pretraining, is if the synthetic facts included more context.
If the training set includes texts of the form "A is B. A is also C", then you have both orders present (A is B and B is A) and so the Reversal Curse is not applicable. We trained ada, which is 350M parameters. We trained Llama-1 "aggressively" (e.g. for many epochs and with a hyperparameter sweep). It's all in the paper.
>Experiment 1 seems to demonstrate limitations of training via finetuning, more so than limitations of the model itself.
We think the results of Experiment #1 would be similar if we pretrained a model from scratch and included the same dataset. Do you disagree? (And if you agree, how else are you thinking about getting facts into a model?)
The rest of the points are interesting and relate to thoughts we've had. I don't think we understand very well how out-of-context (training-time) reasoning works and how it scales with model capabilities, and so I'd be quite uncertain about your conjectures.
Yes, the model editing literature has various techniques and evaluations for trying to put a fact into a model. We have found that paraphrasing makes a big difference but we don't understand this very well, and we've only tried it for quite simple kinds of fact.
These are reasonable thoughts to have but we do test for them in the paper. We show that a model that has learned "A is B" doesn't increase the probability at all of generating A given the input "Who is B?". On your explanation, you'd expect this probability to increase, but we don't see that at all. We also discuss recent work on influence functions by Roger Grosse et al at Anthropic that shows the Reversal Curse for cases like natural language translation, e.g. "A is translated as B". Again this isn't strictly symmetric, but you'd expect "A is translated as B" to make "B is translated as A" more likely.
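As a rough sketch of the kind of check described above, the code below compares the log-probability a scorer assigns to the correct completion against the average over other candidates. Everything here is hypothetical: `reversal_gap`, the names, and the descriptions are made up, and `make_forward_only_scorer` is a hard-coded stand-in for a finetuned model (it is built to know only the "A is B" direction), so it illustrates the metric and is not evidence about real LLMs.

```python
import math
from statistics import mean
from typing import Callable

Scorer = Callable[[str, str], float]


def reversal_gap(logprob: Scorer, targets: dict[str, str], candidates: list[str]) -> float:
    """Mean of (log-prob of correct completion) minus (mean log-prob of the others)."""
    gaps = []
    for prompt, correct in targets.items():
        others = [c for c in candidates if c != correct]
        baseline = mean(logprob(prompt, c) for c in others)
        gaps.append(logprob(prompt, correct) - baseline)
    return mean(gaps)


def make_forward_only_scorer(facts: list[tuple[str, str]], vocab_size: int = 100) -> Scorer:
    """Toy stand-in for a model: memorizes "<name> is" -> description, nothing else."""
    memorized = {f"{name} is": description for name, description in facts}
    uniform = math.log(1.0 / vocab_size)

    def logprob(prompt: str, completion: str) -> float:
        if memorized.get(prompt) == completion:
            return math.log(0.9)  # learned direction: high probability
        return uniform  # anything else: no better than chance

    return logprob


# Made-up facts of the form "A is B".
facts = [
    ("Alice Example", "the composer of Tune X"),
    ("Bob Sample", "the author of Book Y"),
    ("Cara Placeholder", "the founder of Club Z"),
]
names = [name for name, _ in facts]
descriptions = [description for _, description in facts]
scorer = make_forward_only_scorer(facts)

forward = reversal_gap(scorer, {f"{n} is": d for n, d in facts}, descriptions)
reverse = reversal_gap(scorer, {f"Who is {d}?": n for n, d in facts}, names)
print(f"forward gap: {forward:.3f}, reverse gap: {reverse:.3f}")
```

A reverse gap near zero means the correct name is no more likely than a random one, which is the pattern the comment above describes; swapping in a scorer backed by a real model would be the actual experiment.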