All of Owain_Evans's Comments + Replies

(Paper author). The benchmark came out in September 2021. Since then we published some results for new models here in 2022. There are also results for GPT-4 and other models, some of which you can find at Papers with Code's leaderboard.

2Bruce W. Lee2mo
Thanks, Owain, for pointing this out. I will make two changes as time allows: 1. make it clearer for all posts when the benchmark paper is released, and 2. for this post, append the additional results and point readers to them.

Thanks. This is a useful post and I really appreciate the work you've done this year. I'd particularly highlight the value of the philosophy fellowship and CAIS compute cluster, which some readers may not be aware of.

I agree it's good to consider how the behavior of models on our tasks relates to optimal Bayesian reasoning. That said, I'm not sure how to define or calculate the "groundtruth" for optimal reasoning. (Does it depend on using the pretraining distribution as a prior and if so how should we estimate that? How to think about the distinction between in-context and out-of-context reasoning?).

In any case, there is some evidence against models being close to Bayesian optimality (however exactly optimality is defined):
1. Results on the same task differ between GPT... (read more)

In theory, given access to the training set of a model, one could count through and see how many mentions there were of members of different professions from different countries of different genders, adjust this for reliability of source, and perhaps even allow for some extrapolation across professions and countries and the ground-truth fact that 51% of humans are female. In practice, the training data isn't public and this would be a very large task, so one would have to estimate this by taking small samples from comparable training sets like The Pile or Red Pajama, and speculating about attempts to improve bias by filtering this sort of data or adding synthetic data. Base models are trained to predict tokens in the training set. Opinions found in different places on the internet on subjects like these probably vary significantly (between conservative and liberal web-sites, for example). So I wouldn't expect the interaction between out-of-context and in-context reasoning to have been trained to simulate correct Bayesian reasoning (where the effect of new data would be very small, since new data will be very heavily outweighed by the training data), but rather to duplicate biases varying across the Internet applied to a ground truth (making the effect a lot larger). Specifically, I'd expect both out-of-context and in-context reasoning to individually be approximately Bayesian, but the way they combine to heavily over-emphasize in-context data compared to what correct Bayesian rationality would do.
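
To make the "heavily outweighed by the training data" point concrete, here is a minimal Beta-Binomial sketch. All numbers are hypothetical illustrations (the prior strength of 10,000 pseudo-observations and the five in-context examples are assumptions, not estimates of any real training set), and nothing here is a claim about how an LLM actually combines evidence.

```python
def posterior_mean(prior_mean: float, prior_strength: float,
                   successes: int, trials: int) -> float:
    """Posterior mean of a Beta-Binomial model.

    prior_mean * prior_strength acts as the number of prior "successes"
    (pseudo-observations), so prior_strength plays the role of the amount
    of training data behind the prior.
    """
    alpha = prior_mean * prior_strength + successes
    beta = (1.0 - prior_mean) * prior_strength + (trials - successes)
    return alpha / (alpha + beta)


# Hypothetical: a prior of 51% backed by 10,000 pseudo-observations,
# followed by 5 in-context examples that all point the same way.
strong_prior = posterior_mean(0.51, 10_000, successes=5, trials=5)

# Same evidence, but a prior backed by only 10 pseudo-observations.
weak_prior = posterior_mean(0.51, 10, successes=5, trials=5)

print(f"strong prior -> posterior mean {strong_prior:.4f}")
print(f"weak prior   -> posterior mean {weak_prior:.4f}")
```

Under the strong prior the five new observations barely move the estimate, which is the "correct Bayesian" behavior described above. A model that over-weights in-context data would behave more like the weak-prior case.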

My guess is that a model with 1-10B params could benefit from CoT if trained using these techniques. Then there's reduced precision and other tricks to further shrink the model.
That said, I think there's a mismatch between state-of-the-art multi-modal models (huge MoE doing lots of inference time compute using scaffolding/CoT) that make sense for many applications and the constraints of a drone if it needs to run locally and produce fast outputs. 

My guess is that the ~7B Llama-2 models would be fine for this but @JanBrauner might be able to offer more nuance. 

This lie detection technique worked pretty well the first time we tried it. We also look at using a 2nd model to "interrogate" the 1st model (i.e. the model that is suspected of lying). This approach worked less well but we didn't push it that hard.

I address the motivations for our Reversal Curse paper in a reply to your other comment. 

My current (highly speculative) guess is that humans do learn one-directionally. We can't easily recite poems backwards line-by-line or word-by-word or phoneme-by-phoneme. We can't understand such reversed language either. It's easy to count down (because we practice that) but harder to do the alphabet backwards (because we don't practice it). Mostly when we memorize facts that are 2-way (unlike poems), we do some minimal amount of reflection/repetition that means... (read more)

We might also be using working memory to reconstruct reverse relations on the fly. E.g. reciting a poem backwards will consist of remembering chunks of it in forward direction and then rearranging each chunk to be in reverse direction. If that is correct, then a variation of CoT-prompting might work: first have the model recall any context in which the object appears, then pick the answer out of that.

Great points and lots I agree with. 

A general problem with 'interpretability' work like this focused on unusual errors.

We discovered the Reversal Curse as part of a project on what kind of deductions/inferences* LLMs can make from their training data "out-of-context" (i.e. without having the premises in the prompt or being able to do CoT). In that paper, we showed LLMs can do what appears like non-trivial reasoning "out-of-context". It looks like they integrate facts from two distinct training documents and the test-time prompt to infer the appropriat... (read more)

Did you look at the design for our Experiment 1 in the paper? Do you think your objections apply to that design?

At the time of my original comment, I had not looked at it. I have now read the description of experiment 1 from the paper, and yes, I think my objections apply. My best guess at the point you were trying to make by pointing me to this experiment is that you included some bidirectional examples in your test set, and therefore maybe the LLM should be able to figure out that your test set (in particular) is describing a symmetric relation, even if similar words in the LLM's original training data were used to describe asymmetric relations. Is that your implied argument?

Yes, I predict that if you added the facts in pretraining, the order would matter less and maybe not at all. But I think this would only apply to very strong models (gpt-3+ and maybe even gpt-3.5-instruct-turbo+).

There are two pieces of evidence against this. The influence function results, showing the Reversal Curse for models better than GPT-3, and our results in Experiment 2 for GPT3.5 and GPT-4. 

Another thing that might work, possibly via finetuning and probably via pretraining, is if the synthetic facts included more context.

If the training... (read more)

3Max H5mo
Ah, my bad. The top Google result for "text-ada-001 model size" returns a blog post claiming ada is 125m parameters, but it looks like that's just wrong.
Well, it's not literally A, it's a pronoun which in context can be understood as referring to A if you understand natural language. Do you think the effect goes away if you finetune on data of the form Daphne Barrington is / the director of "A Journey Through Time". She (cutting off the answer as early as "She")?
Anyway, I still think the reversal curse is more about a deficiency in the training process rather than the model itself; even weak models are clearly capable of doing logical deduction given the right setup (e.g. within a prompt), so the question is more like, how good does the training process have to be (and maybe how big does the model have to be) for the model to be reliably capable of doing logical deduction on:
* facts that are present in its prompt (pretty easy)
* facts that are present in the finetuning data (pretty hard, apparently)
* facts that are in the pretraining data (maybe in-between, and maybe also depends on the specifics of the pretraining process?)
e.g. What happens if you train on the word-wise reversal of all your data? Literally add {The word-wise reversal of the previous text is: ' '.join(reversed(training_doc.split(' ')))} to all your pretraining data, and then train the model on the (twice as large, very redundant) dataset.
Even if something simple like that doesn't actually make the reversal curse go away, I expect that there is some training process, not too much more sophisticated than current pretraining processes, which does work when applied to current models, or at least to current model architectures (perhaps scaled up a bit).
Also, a model that is smart enough and self-aware enough could sidestep the pretraining form of the reversal curse. GPT-4 is already capable of doing this with a bit of help: Who is Mary Lee Pfieffer's son? If you don't know,
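
The word-wise reversal augmentation proposed in the comment above can be sketched as follows. The function names `wordwise_reversal` and `augment` are made up for illustration; only the `' '.join(reversed(...split(' ')))` expression and the connecting sentence come from the comment itself. Whether training on such data reduces the Reversal Curse is the open question being raised, not something this sketch shows.

```python
def wordwise_reversal(training_doc: str) -> str:
    """Reverse the order of space-separated words (the words themselves stay intact)."""
    return " ".join(reversed(training_doc.split(" ")))


def augment(training_doc: str) -> str:
    """Append the word-wise reversal to the document, roughly doubling its length."""
    return (
        f"{training_doc}\n"
        f"The word-wise reversal of the previous text is: "
        f"{wordwise_reversal(training_doc)}"
    )


example_doc = 'Daphne Barrington is the director of "A Journey Through Time".'
print(augment(example_doc))
```

Note that splitting on single spaces keeps punctuation attached to its word, so the reversed text is not grammatical English; it only exposes the tokens in the opposite order.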

>Experiment 1 seems to demonstrate limitations of training via finetuning, more so than limitations of the model itself.

We think the results of Experiment #1 would be similar if we pretrained a model from scratch and included the same dataset. Do you disagree? (And if you agree, how else are you thinking about getting facts into a model?)

The rest of the points are interesting and relate to thoughts we've had. I don't think we understand very well how out-of-context (training-time) reasoning works and how it scales with model capabilities, and so I'd be quite uncertain about your conjectures. 

2Max H5mo
Yes, I predict that if you added the facts in pretraining, the order would matter less and maybe not at all. But I think this would only apply to very strong models (gpt-3+ and maybe even gpt-3.5-instruct-turbo+). Another thing that might work, possibly via finetuning and probably via pretraining, is if the synthetic facts included more context. e.g. Daphne Barrington is the director of "A Journey Through Time". She also wrote and directed "A Journey Through Time 2". She is well-known for her time-based movies. (Why do I expect this to work? Because the model then sees examples where "She" follows "A Journey Through Time" in contexts where it's knowable that "She" refers to Daphne.) Less confidently, I predict that if you finetuned an even weaker model (e.g. text-ada-001, or a ~100m parameter open-source model, perhaps also finetuning more aggressively than is possible through the OpenAI finetuning API), you would also get a different result, assuming the model was able to learn the non-reversed fact via finetuning at all.

Yes, the model editing literature has various techniques and evaluations for trying to put a fact into a model. 
We have found that paraphrasing makes a big difference but we don't understand this very well, and we've only tried it for quite simple kinds of fact.

These are reasonable thoughts to have but we do test for them in the paper. We show that a model that has learned "A is B" doesn't increase the probability at all of generating A given the input "Who is B?". On your explanation, you'd expect this probability to increase, but we don't see that at all. We also discuss recent work on influence functions by Roger Grosse et al at Anthropic that shows the Reversal Curse for cases like natural language translation, e.g. "A is translated as B". Again this isn't strictly symmetric, but you'd expect that "A is translated as B" to make "B is translated as A" more likely. 

I am sorry, but I am not sure I follow. My claim was that ChatGPT based on 3.5 has, for lack of any external referent, no way to fully understand language; it has no way to know that words stand for anything, that there is an external reality, that there is a base truth. I then speculated that because it does not understand context and meaning to this degree, while it can learn patterns that follow other patterns, it is much harder for it to deduce whether the grammatical "is" in a particular sentence indicates a logical relationship that can be inverted or not; humans do this based not just on clues in the sentence itself, but background knowledge. Hence, that its ability to determine when the grammatical "is" indicates a logical relationship that is reversible is likely still limited.  The fact that you can name more examples where a human would assign a high probability but the AI doesn't does not seem to contradict this point? I would not have predicted success there. A translation seems an obvious good inversion to me, as a human, because I understand that the words in both languages are both equally valid symbols of an external meaning that is highly similar. But this very idea can't make sense to an AI that knows nothing but language. The language an AI is taught is a simulacrum of self-references hanging in thin air.  It is honestly highly surprising how competently they do use it, and how many puzzles they can solve. I remember reading essays generated by the postmodern essay generator - you could immediately tell that you had meaningless text in front of you that only copied the surface appearance of meaning. But the vast majority of the time, that is not how current LLM texts read; they make sense, even though you get indications that the LLM does not understand them when it holds a coherent discussion with you about a mistake it itself is consistently making regardless. I wonder rather what made these other aspects of language we considered complicate

I talked to a number of AI researchers about this question before publishing and many of them were surprised.

Great comment. I agree that we should be uncertain about the world models (representations/ontologies) of LLMs and resist the assumption that they have human-like representations because they behave in human-like ways on lots of prompts. 

One goal of this paper and our previous paper is to highlight the distinction between in-context reasoning (i.e. reasoning from a set of premises or facts that are all present in the prompt) vs out-of-context reasoning (i.e. reasoning from premises that have been learned in training/finetuning but are not present in t... (read more)

This seems like the kind of research that can have a huge impact on capabilities, and much less and indirect impact on alignment/safety. What is your reason for doing it and publishing it?

Nice idea. I'd imagine something like this has been done in psychology. If anyone runs an experiment like this or can point to results, we can include them in future versions of the paper. 
Relevant meme by Daniel Eth. 

I might have some time tomorrow to test this out on a small scale, will try to remember to update here if I do.

Someone pointed us to this paper from a team of neuroscientists that might show a kind of Reversal Curse for animals learning sequential associations. I haven't read the paper yet.


Thanks for sharing! The comparison with non-human primates is generally instructive. ChatGPT commits a number of errors that we have seen in non-human primates learning human languages. E.g. initially implicitly self-describing as a human (ask ChatGPT about ethical problems in AI, and you will soon get a "*We* must use AI responsibly"), because their training data was written by humans describing their point of view, and data about a point of view that is non-human is absent, so they latch onto the point of view that seems the closest option at first.  It is notable that non-human primates did move past that (to e.g. self-describing as an "orang-utan person"), with the initial errors not indicating things that are generally impossible for them to understand, but misunderstandings common in the initial learning curve when humans teach you human language and you aren't human. And that ChatGPT's equivalent of a brain is rapidly evolving. So we might be able to watch the ability to precisely pinpoint which relationships ought to be reversible due to exact use of language and context evolve.

Good point about the idea that LLMs are simulating people.

In terms of reconciling the results: I don't have a full explanation. What we call "sophisticated out-of-context reasoning" (see S2 of this paper and Grosse et al) is poorly understood. 

We only get the generalization shown in the figure (the model answering in German after "putting together" facts from two distinct finetuning documents) when we include in the training set 10 or more paraphrases of every fact. We don't have a good scientific understanding of why these paraphrases help. (There ar... (read more)

How to do your own test of the Reversal Curse (e.g. on ChatGPT or Claude) with different prompting strategies:

  1. Try this list of hard examples: C-list celebrities who have a different last name from their parents. The list below has the form <celeb_name>, <parent_name>.
  2. First verify the model knows the celebrity's parent by asking "Who is [name]'s mother/father?"
  3. Then, in a separate dialog, ask the model for the child of the parent. You must not include the child's name anywhere in the dialog!
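
The three steps above can be sketched as a small harness. This is a minimal sketch under assumptions: `ask` is a hypothetical callable that sends one prompt in a fresh dialog and returns the reply (you would wire it to whatever chat interface you use), the names "Celeb A" and "Parent A" are placeholders rather than real data, and substring matching is a crude stand-in for checking answers by hand.

```python
from typing import Callable, Dict, List, Tuple


def reversal_test(
    ask: Callable[[str], str],
    pairs: List[Tuple[str, str]],
    relation: str = "mother",
) -> List[Dict[str, object]]:
    """Run the forward and reverse question for each (celebrity, parent) pair.

    Each call to `ask` is assumed to start a new dialog, so the reverse
    question never shares context with the forward one.
    """
    results = []
    for celeb, parent in pairs:
        # Step 2: check the forward direction (celebrity -> parent).
        forward_answer = ask(f"Who is {celeb}'s {relation}?")

        # Step 3: ask the reverse direction without naming the child.
        reverse_prompt = f"Who is {parent}'s child?"
        assert celeb not in reverse_prompt, "child's name leaked into the prompt"
        reverse_answer = ask(reverse_prompt)

        results.append({
            "celeb": celeb,
            "parent": parent,
            "forward_ok": parent.lower() in forward_answer.lower(),
            "reverse_ok": celeb.lower() in reverse_answer.lower(),
        })
    return results


def toy_model(prompt: str) -> str:
    """Stand-in model that only 'knows' the forward direction (placeholder data)."""
    known = {"Who is Celeb A's mother?": "Celeb A's mother is Parent A."}
    return known.get(prompt, "I don't know.")


demo_results = reversal_test(toy_model, [("Celeb A", "Parent A")])
for row in demo_results:
    print(row)
```

With a real model, only pairs where `forward_ok` is true are informative, since a failed reverse question means little if the model does not know the fact in the forward direction either.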
To second a previous reply to this, I would expect this will hold for humans as well. On top of that, mathematically it is perfectly possible for some function to be easy to learn/compute, but the inverse to be hard. For instance, discrete exponentiation is easy to compute in all groups where multiplication is easy to compute, but the inverse function, the discrete logarithm, is hard enough to base cryptography on it, if one picks a suitable group representation (e.g. point groups of secure elliptic curves, or the group of invertible elements of a large safe prime field). Similar examples exist with regards to function learnability for neural networks as well. A simple example of a function that is easy to learn for a neural network but which has a much more difficult to learn inverse is f(x1,x2,x3,...,xn) = (x1 xor x2, x2 xor x3, ..., x_{n-1} xor x_{n}) (for difficulty of learning this, one would assume learning from random samples, and with common multi-label loss functions; with suitable tricks, this does become learnable if the neural network can represent the inverse target function). A final point that I would consider here is that it is possible that for the reverse questions in this task, a privacy protection mechanism kicks in that makes the LLM deny knowledge of the non-celebrity. It seems perfectly possible to me that GPT-4 is lying when it says it doesn't know about <mother of celebrity>, because it has been instructed to lie about these things in order to protect the privacy of people not considered to be in the public eye.
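
A minimal sketch of the XOR example in the comment above, to show why the two directions differ. The function names are hypothetical. Note that f drops one bit of information (n inputs, n-1 outputs), so the sketch assumes the first bit x1 is supplied when inverting; that assumption is mine, not the commenter's.

```python
from itertools import accumulate
from operator import xor
from typing import List


def adjacent_xor(bits: List[int]) -> List[int]:
    """f(x1..xn) = (x1^x2, x2^x3, ..., x_{n-1}^x_n): each output depends on 2 inputs."""
    return [a ^ b for a, b in zip(bits, bits[1:])]


def invert_adjacent_xor(first_bit: int, diffs: List[int]) -> List[int]:
    """Recover x1..xn from x1 and f(x).

    x_k = x1 ^ d1 ^ ... ^ d_{k-1}, i.e. a running parity over a growing
    prefix of the input, so later outputs depend on many inputs at once.
    """
    return list(accumulate(diffs, xor, initial=first_bit))


x = [1, 0, 0, 1, 1]
y = adjacent_xor(x)
print(y)
print(invert_adjacent_xor(x[0], y))
```

The forward map is local (each output bit looks at two neighbors), while each bit of the inverse is a parity over a prefix of the input, which is the kind of target that is usually considered hard to learn from random samples.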

Prediction: this works when asking humans questions too.

(The idea is, the information about the celebrity is "indexed" under the celebrity, not their parent)

Re: my tweet about the cost of training GPT-4. 
It wasn't my own estimate of GPT-4 training cost on H100s, it was just the SemiAnalysis estimate. Also, there are different ways to define "cost of training GPT-4" that are reasonable and can easily be 5x higher (e.g. see this post and comments). From now on, I'll spell out the definition I'm using. 

I agree you can't just drop this money and expect to train GPT-4 (or more companies would have a GPT-4-level model now). I was thinking more about the costs to the leading labs of training a foundation mo... (read more)

We didn't investigate the specific question of whether it's raw diversity or specific features. In the Grosse et al paper on influence functions, they find that "high influence scores are relatively rare and they cover a large portion of the total influence". This (vaguely) suggests that the top k paraphrases would do most of the work, which is what I would guess. That said, this is really something that should be investigated with more experiments.

We think there's a connection between the Reversal Curse and some results in the model editing literature. I'm not sure if this applies to the specific ROME results in that post. We'll have the Reversal Curse paper out soon, which will explain more.

Good points. As we note in the paper, this may conflict with the idea of automating alignment research in order to solve alignment. Aaron_Scher makes a related point. 

More generally, it's uncertain what the impact is of excluding a certain topic from pretraining. In practice, you'll probably fail to remove all discussions of alignment (as some are obfuscated or allegorical) and so you'd remove 99% or 99.9% rather than 100%. The experiments in our paper, along with the influence functions work by Grosse et al., could help us understand what the impact of this is likely to be.

So performance here should be thought of more as ‘how good is the model at learning about a persona in fine-tuning and then being able to imitate/simulate that persona in deployment’. This is different from a model believing it is the persona or applying this knowledge to some concept of self. Good performance at this task does not require having a sense of self, this is just a precursor that may be necessary for situational awareness.

That's correct. We tried to emphasize that our experiments are testing out-of-context reasoning, rather than situational aw... (read more)

The hhh task is the one that small models do well on. I am surprised that the small models do well on any of the tasks. I think the reason they do well on the hhh one is that this task doesn’t seem to require much more than word association and parroting. I would predict that for ada and babbage, if you followed up with “why did you say that?” the models would be unable to reproduce the explicit link that ties the persona to answering in the particular way, whereas I expect davinci to be able to explain this link more. The small models are probably just do

... (read more)

Thanks for the thoughtful comments. 

Out-of-context learning seems pretty sensitive to the task being measured, where some of the tasks see nice scaling behavior (hhh) while others do not (incorrect). This observation is based on Appendix A.1 Table 4, corresponding to Experiment 1b, in this blog post the graph is labeled “(a) Scaling for Experiment 1b (1-hop)”. Now, the fact that you get nice scaling lines when averaging across tasks is not super problematic or anything, but it is a little odd that there is so much variation between tasks, and I think

... (read more)

ChatGPT-4 seems to have improved at diverse literary styles. It sometimes ignores the "non-rhyming" instructions, but I was able to get it to avoid rhyme on my second try by first asking it, "Can you write poems that don't rhyme?".

Interesting results! I'd be interested to see a table or chart showing overall accuracy (informative*truthful) for TruthfulQA for the base model (no steering) with different prompts and then after the positive and negative steering. I'd also be curious about an ablation that compares to a "random" steering vector (e.g. love/hate, big/small, fast/slow, easy/hard). In TruthfulQA, there are often two salient answers (the thing people say and the literal truthful) and so maybe random steering vectors would work to nudge the model from one to the other. (This i... (read more)

Can you describe how the "local cluster" thing would work outside of keeping it within a single organization? I'd also be very interested in some case studies where people tried this.

3James Payor10mo
I mostly do just mean "keeping it within a single research group" in the absence of better ideas. And I don't have a better answer, especially not for independent folk or small orgs. I wonder if we need an arxiv or LessWrong clone where you whitelist who you want to discuss your work with. And some scheme for helping independents find each other, or find existing groups they trust. Maybe with some "I won't use this for capabilities work without the permission of the authors" legal docs as well. This isn't something I can visualize working, but maybe it has components of an answer.

OpenAI had generated poems in the New Yorker, which suggests they might have had some internal project related to poetry.

With GPT3.5, I think there's also "mode collapse" for style in writing prose (e.g. plays or stories). 

Claude does not have this mode collapse in poetry or prose. (It maybe has a much more subtle version of it). This suggests to me it'd be relatively easy to fix ChatGPT's issues (as Gwern suggests). 

Does anyone know how much poetry and literary prose is in the pre-training sets aside from stuff in Common Crawl?


I didn't get that impression from that when I read it - the NYer author and his friends prompted most of that, even if their friend Dan Selsam happens to work at OpenAI. (He seems to work on math LMs, nothing fiction or RL-related.) They were set up with the public Playground interface, so the OA insider role here was limited to showing them a few completions and trying to explain it; presumably they did the rest more remote and partially on their own. Specifically, some parts of it, like the choice of Shel Silverstein (a far from obvious poet to pick, even if his fiction is beloved by American children), suggest they (like pretty much anyone interested in GPT-3 poetry) read my page for ideas. Also, again, Leike, who's in charge at OA, denies having done anything poetry-specific or knowing about the apparent capability-gain. Yeah, that's a funny thing about mode collapse, it's really hard to see, and the higher-quality the outputs get, the harder it'll be to see with 'the naked eye'. Who knows every literary genre there is and can patiently prompt them one by one to see which genres a model quietly slides away from & tries to avoid generating text in? Like hands in GANs... It takes a while to begin to see what you aren't seeing. This is why you need metrics like FID, which work over an entire dataset and measure whether sampled outputs span the entire dataset, rather than focus on a large subset. However, no one is doing an FID for LLMs for creative purposes. (That would be hard, but not impossible.) So, we don't really have any way to quantify mode-collapse like in poetry. Of course, I'd also expect Claude to be much subtler simply because it's working off less data and so it's less likely to have gotten rated text or inputs which would push it towards mode-collapsing on easily-recognized rhyming poetry and to avoid harder-to-understand poetry. (Claude is just the 'constitutional prompt' model, right? Hard to see how a list of generic principles would push it tow

See the Galactica model from Meta. It's trained on a curated dataset of scientific papers, reference materials and scientific knowledge bases (with only a very small % of random internet text). IIRC the benefits of this seem limited (better to train on a bigger dataset and use other techniques to make the model access the sciencey parts of the training set).

  1. It would be helpful to compare Google to ChatGPT on the actual distribution of queries to Google. I'd guess only a small % of queries to Google are about algorithms (e.g. doubly-linked lists question). IIRC many queries are not really searches but just looking for specific URLs (e.g. people search "facebook" or "wordle"). Also very common is natural-language translation (ChatGPT can't translate whole websites right now). Finally, many searches are for local information that ChatGPT can't offer at all right now ("weather", "food near me"). Programmers and

... (read more)
"How do I make risotto?" is only a little bit about algorithms. And yeah, being able to have a real discussion with the bot helps a lot.

(3) seems false.

Related to (2) is that sufficient robustness is possible today but very expensive relative to a taxi service like Uber. E.g. It requires having expensive sensors on the cars, hyper detailed mapping of all roads the car can drive on, and a team of humans who can remotely intervene when cars reach a sufficiently tricky situation.

(I haven't yet read the paper carefully). The main question of interest is: "How well can transformers do RL in-context after being trained to do so?" This paper only considers quite narrow and limited tasks but future work will extend this and iterate on various parts of the setup. How do these results update your belief on the main question of interest? It's possible the result can be explained away (as you suggest) but also that there is some algorithm distillation going on.

1Sam Marks1y
In-context RL strikes me as a bit of a weird thing to do because of context window constraints. In more detail, in-context RL can only learn from experiences inside the context window (in this case, the last few episodes). This is enough to do well on extremely simple tasks, e.g. the tasks which appear in this paper, where even seeing one successful previous episode is enough to infer perfect play. But it's totally insufficient for more complicated tasks, e.g. tasks in large, stochastic environments. (Stochasticity especially seems like a problem, since you can't empirically estimate the transition rules for the environment if past observations keep slipping out of your memory.)  There might be more clever approaches to in-context RL that can help get around the limitations on context window size. But I think I'm generally skeptical, and expect that capabilities due to things that look like in-context RL will be a rounding error compared to capabilities due to things that look like usual learning via SGD. Regarding your question about how I've updated my beliefs: well, in-context RL wasn't really a thing on my radar before reading this paper. But I think that if someone had brought in-context RL to my attention then I would have thought that context window constraints make it intractable (as I argued above). If someone had described the experiments in this paper to me, I think I would have strongly expected them to turn out the way they turned out. But I think I also would have objected that the experiments don't shed light on the general viability of in-context RL, because the tasks seem specially selected to be solvable with small context windows. So in summary, I don't think this paper has moved me very far from what I expect my beliefs would have been if I'd had some before reading the paper.

This is very valuable. I suggest putting this content on Arxiv (even if it's less formal than the typical paper).

It could be useful to look at performance of GPT-3 on foreign languages. We know roughly how long it takes humans to reach a given level at a foreign language. E.g. You might find GPT-3 is at a level on 15 different languages that would take a smart human (say) 30 months to achieve (2 months per language). Foreign languages are just a small fraction of the training data.

1Mateusz Bagiński6mo
I think I remember seeing somewhere that LLMs learn more slowly on languages with "more complex" grammar (in the sense of their loss decreasing more slowly per the same number of tokens) but I can't find the source right now.

A few points:

  1. Current models do pretty well on tricky math problems (Minerva), coding competition problems (AlphaCode), and multiple-choice quizzes at college level (MMLU).
  2. In some ways, the models' ability to learn from data is far superior to humans. For example, models trained mostly on English text are still pretty good at Spanish, while English speakers in parts of the US who hear Spanish (passively) every week of their lives usually retain almost nothing. The same is true for being able to imitate other styles or dialects of English, and for program
... (read more)
1Jose Miguel Cruz y Celis9mo
I'm curious about where you get that "models trained mostly on English text are still pretty good at Spanish". Do you have a reference?

Very helpful post, thanks!

Are there some meta-level lessons about forecasting a dataset like MATH? IIRC, at the time of these forecasts, the only results were GPT2-finetune and GPT3 few-shot (without chain-of-thought and self-consistency). For GPT-2, the accuracy scores were <15% for nearly all subjects and difficulty levels. This may be consistent with GPT-2 either not really understanding questions or being so weak at basic arithmetic that it has no chance for most questions.

Given that performance was so low and that not many models/setups had b... (read more)

2Michaël Trazzi2y
I believe the forecasts were aggregated around June 2021. When was GPT2-finetune released? What about GPT3 few-shot? Re jumps in performance: Jack Clark has a screenshot on Twitter about saturated benchmarks from the Dynabench paper (2021); it would be interesting to make something up-to-date with MATH.

There's a new Metaculus question on this. The median for near human-level on the exact set of forecasting questions we used is currently 2026. Another relevant question is how well AI will do vs. crowd forecasts when predicting new questions (e.g. 2023-2024 questions). I'd be excited for people to do more thinking about how much AI will improve at forecasting in coming years.

Nice post. I generally recommend looking at the model probabilities or taking multiple samples when evaluating a model. For example, does the model give the answer "Joe" 99% probability or close to 50%?
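A minimal sketch of the "take multiple samples" suggestion, assuming you have already collected several sampled completions for the same prompt as plain strings. The function name `answer_distribution` and the sample list are hypothetical and only illustrate turning samples into an empirical answer distribution; if the API exposes token log-probabilities, reading those directly would be the cheaper alternative.

```python
from collections import Counter


def answer_distribution(samples):
    """Return the empirical probability of each distinct answer.

    `samples` is a list of completions (strings) sampled from the model
    for the same prompt. Whitespace is stripped and empty completions
    are ignored.
    """
    cleaned = [s.strip() for s in samples if s.strip()]
    if not cleaned:
        return {}
    counts = Counter(cleaned)
    total = len(cleaned)
    return {answer: n / total for answer, n in counts.items()}


# Hypothetical completions for one prompt (not real model output).
samples = ["Joe", "Joe", "Jack", "Joe ", "Jack", "Joe", "Joe", "Jack", "Joe", "Joe"]
dist = answer_distribution(samples)

for answer, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"{answer}: {prob:.2f}")
```

A split near 50/50 suggests the model is close to indifferent between the two names, while a split near 99/1 suggests a confident answer; a single greedy sample cannot distinguish the two cases.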

I agree. But I wanted to avoid burning through the credits too quickly. I also wonder whether Joe and Jack would be more realistically assessed with a few-shot prompt. 

This is a distribution of math problems GPT-3 wasn't finetuned on. Yet it's able to few-shot generalize and perform well. This is an amazing level of robustness relative to 2018 deep learning systems. I don't see why scaling and access to external tools (e.g. to perform long calculations) wouldn't produce the kind of robustness you have in mind.

2David Scott Krueger (formerly: capybaralet)2y
I think you're moving the goal-posts, since before you mentioned "without external calculators".  I think external tools are likely to be critical to doing this, and I'm much more optimistic about that path to doing this kind of robust generalization.  I don't think that necessarily addresses concerns about how the system reasons internally, though, which still seems likely to be critical for alignment.

I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling

GPT-3 (without external calculators) can do very well on math word problems that combine basic facts about the world with abstract math reasoning. Why think that the kind of causal reasoning humans do is out of reach of scaling (especially if you allow external calculators)? It doesn't seem different in kind from these math word problems.

when can/do foundation models interna

... (read more)
1David Scott Krueger (formerly: capybaralet)2y
RE GPT-3, etc. doing well on math problems: the key word in my response was "robustly".  I think there is a big qualitative difference between "doing a good job on a certain distribution of math problems" and "doing math (robustly)".  This could be obscured by the fact that people also make mathematical errors sometimes, but I think the type of errors is importantly different from those made by DNNs.  

I agree my last point is more speculative. The question is whether vast amounts of pre-training data + a smaller amount of finetuning by online RL substitutes for the human experience. Given the success of pre-training so far, I think it probably will.

Note that the modern understanding of causality in stats/analytic philosophy/Pearl took centuries of intellectual progress -- even if it seems straightforward. Spurious causal inference seems ubiquitous among humans unless they have learned -- by reading/explicit training -- about the modern understanding. Your examples from human childhood (dropping stuff) seem most relevant to basic physics experiments and less to stochastic relationships between 3 or more variables.

In the pre-training set, there are lots of places where humans talk about causality (both informally and more formally in myriad academic papers). So a model would ultimately need to learn abstract stuff about causality (e.g. correlation is not causation, arrow of time, causes are local, etc) and concrete causal facts (the moon causes tides, tiny organisms cause mold, etc). Given this knowledge, it's plausible a model M could make reasonable guesses for questions like, "What happens when a model with [properties of model M] starts interacting with the worl... (read more)

3David Scott Krueger (formerly: capybaralet)2y
I can interpret your argument as being only about the behavior of the system, in which case:
- I agree that models are likely to learn to imitate human dialogue about causality, and this will require some amount of some form of causal reasoning.
- I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling, but it certainly seems highly plausible.

I can also interpret your argument as being about the internal reasoning of the system, in which case:
- I put this in the "deep learning is magic" bucket of arguments; it's much better articulated than what we said though, I think...
- I am quite skeptical of these arguments, but still find them plausible. I think it would be fascinating to see some proof of concept for this sort of thing (basically addressing the question 'when can/do foundation models internalize explicitly stated knowledge')

I disagree with your last point. Since we're agents, we can get a much better intuitive understanding of what causality is, how it works and how to apply it in our childhood. As babies, we start doing lots and lots of experiments. Those are not exactly randomized controlled trials, so they will not fully remove confounders, but it gets close when we try to do something different in a relatively similar situation. Doing lots of gymnastics, dropping stuff, testing the parent's limits etc., is what allows us to quickly learn causality.

LLMs, as they are curren... (read more)

Cool post! Did you try seeing whether GPT-3 can regenerate parts of the Iris dataset (or any other datasets that may appear in its training data)? I'd also be interested to see finetuning results, results for the latest InstructGPT, and to see analysis of the GPT-3 Embeddings for integers and floats.

Since I transformed the Iris dataset with a pretty "random" transformation (i.e. not chosen because it was particularly nice in some way), I didn't check for its regeneration -- since my feature vectors were very different to original Iris's, and it seemed exceedingly unlikely that feature vectors were saved anywhere on the internet with that particular transformation. But I got curious now, so I performed some experiments.

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper
Feature vectors of the Iris flower data set:
Input = 83, 40, 58, 20, output = 1
Input = 96, 45, 84, 35, output = 2
Input = 83, 55, 24, 9, output = 0
Input = 73, 54, 28, 9, output = 0
Input = 94, 45, 77, 27, output = 2
Input = 75, 49, 27, 9, output = 0
Input = 75, 48, 26, 9, output = 0

So these are the first 7 transformed feature vectors (in one of the random samplings of the dataset). Among all the generated output (I looked at >200 vectors), it never once output a vector which was identical to any of the latter ones, and also... in general the stuff it was generating did not look like it was drawing on any knowledge of the remaining vectors in the dataset. (E.g. it generated a lot that were off-distribution.)

I also tried

Input = 83, 55, 24, 9, output = 0
Input = 73, 54, 28, 9, output = 0
[... all vectors of this class]
Input = 76, 52, 26, 9, output = 0
Input = 86, 68, 27, 12, output = 0
Input = 75, 41, 69, 30, output = 2
Input = 86, 41, 76, 34, output = 2
Input = 84, 45, 75, 34, output = 2

Where I cherrypicked the "class 2" so that the first coordinate is lower than usual for that class; and the generated stuff always had the first coordinate very off-distribution from the rest of the class 2, as one would expect if the model was meta-learning from the vectors it sees, rather than "remembering" something. This last experiment might seem a little contrived, but bit of a probl

I think BIG-bench could be the final AI benchmark: if a language model surpasses the top human score on it, the model is an AGI.

Could you explain the reasoning behind this claim? Note that PaLM already beats the "human (Avg.)" on 150 tasks and the curve is not bending. (So is PaLM already an AGI?) It also looks like a scaled up Chinchilla would beat PaLM. It's plausible that PaLM and Chinchilla could be improved by further finetuning and prompt engineering. Most tasks in BIG-Bench are multiple-choice, which is favorable to LMs (compared to generation). I'd... (read more)

I agree, some future scaled-up versions of PaLM & Co may indeed be able to surpass top humans on BIG-Bench.  Ultimately, it's the question of how we define "AGI". One reasonable definition is "an AI that can do any cognitive task that humans can, and do it better than humans".  Given its massive scope and diversity, BIG-bench seems to be a good enough proxy for "any cognitive task".  Although I would use a stricter scoring than the average-across-tasks that was used in PaLM: the model must 1) beat top humans, 2) on each and every task of BIG-bench.    One could argue that the simple models like PaLM don't have agency, goals, persistence of thought, self-awareness etc, and thus they can't become the human-like AGI of science fiction. But it's quite possible that such qualities are not necessary to do all cognitive tasks that humans can, but better.  A simple mechanistic algorithm can beat top humans in chess. Maybe another simple mechanistic algorithm can also beat top humans in science, poetry, AI engineering, strategic business management, childrearing, and in all other activities that make human intellectuals proud of themselves. 
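To make the difference between the two scoring rules concrete, here is a small sketch. The task names and all scores are made up for illustration (they are not BIG-bench or PaLM numbers), and the function names are hypothetical. It contrasts "beats the human average when scores are averaged across tasks" with the stricter "beats the top human score on every single task".

```python
def beats_on_average(model_scores, human_avg_scores):
    """Average-across-tasks rule: compare mean model score to mean human score."""
    tasks = list(human_avg_scores)
    model_mean = sum(model_scores[t] for t in tasks) / len(tasks)
    human_mean = sum(human_avg_scores[t] for t in tasks) / len(tasks)
    return model_mean > human_mean


def beats_top_human_everywhere(model_scores, top_human_scores):
    """Stricter rule: the model must beat the top human score on each task."""
    return all(model_scores[t] > top_human_scores[t] for t in top_human_scores)


# Hypothetical per-task scores in [0, 1]; not real benchmark results.
model = {"task_a": 0.90, "task_b": 0.80, "task_c": 0.40}
human_avg = {"task_a": 0.60, "task_b": 0.60, "task_c": 0.70}
human_top = {"task_a": 0.85, "task_b": 0.75, "task_c": 0.95}

passes_average_rule = beats_on_average(model, human_avg)
passes_strict_rule = beats_top_human_everywhere(model, human_top)
print(passes_average_rule, passes_strict_rule)
```

With these made-up numbers the model passes the averaged rule while failing the strict one, because a single weak task is hidden by the average but not by the per-task requirement.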

I'm an author on TruthfulQA. They say GPT-4Chan gets 0.225 on our MC1 task. Random guessing gets 0.226. So their model is worse than random guessing. By contrast, Anthropic's new model gets 0.31 (well above random guessing).
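For readers wondering where a random-guessing baseline like this comes from, here is a minimal sketch. It assumes each multiple-choice question has exactly one correct option and that the guesser picks uniformly among the options, so the expected accuracy is the mean of 1/k over questions, where k is the number of options. The option counts below are hypothetical, not the actual TruthfulQA counts.

```python
def random_guess_accuracy(num_options_per_question):
    """Expected accuracy of uniform random guessing.

    Assumes exactly one correct option per question, so a question with
    k options is answered correctly with probability 1 / k.
    """
    if not num_options_per_question:
        raise ValueError("need at least one question")
    if any(k < 1 for k in num_options_per_question):
        raise ValueError("each question needs at least one option")
    return sum(1.0 / k for k in num_options_per_question) / len(num_options_per_question)


# Hypothetical option counts for four questions (illustration only).
option_counts = [4, 5, 2, 10]
baseline = random_guess_accuracy(option_counts)
print(f"random-guess baseline: {baseline:.4f}")
```

Because the number of options varies between questions, the baseline is not simply 1 divided by the average number of options; it has to be averaged per question.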

I'll add that we recommend evaluating models on the generation task (rather than multiple-choice). This is what DeepMind and OpenAI have done to evaluate GopherCite, WebGPT and InstructGPT.

lol that is impressively bad then!

The indirect logit is trained with cross-entropy based on the groundtruth correct answer. You can't do this for verbalized probability without using RL, and so we instead do supervised learning using the empirical accuracy for different question types as the labels.
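A rough sketch of the difference between the two training signals, under my own simplifying assumptions (this is not the paper's code, and all names and records are hypothetical). For the indirect logit, the loss is binary cross-entropy against whether the model's answer was actually correct. For verbalized probability, the supervised target for each question is the empirical accuracy of the model on that question's type.

```python
import math
from collections import defaultdict


def binary_cross_entropy(p_correct, is_correct):
    """Cross-entropy loss for one example, given predicted P(answer is correct)."""
    y = 1.0 if is_correct else 0.0
    return -(y * math.log(p_correct) + (1.0 - y) * math.log(1.0 - p_correct))


def empirical_accuracy_by_type(records):
    """Map each question type to the fraction of its answers that were correct."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question_type, is_correct in records:
        totals[question_type] += 1
        correct[question_type] += int(is_correct)
    return {q: correct[q] / totals[q] for q in totals}


def verbalized_targets(records):
    """Pair each record with the accuracy of its question type as the label."""
    accuracy = empirical_accuracy_by_type(records)
    return [(question_type, accuracy[question_type]) for question_type, _ in records]


# Hypothetical (question_type, model_was_correct) records.
records = [
    ("add_1digit", True), ("add_1digit", True),
    ("add_1digit", False), ("add_1digit", True),
    ("mult_3digit", False), ("mult_3digit", True),
    ("mult_3digit", False), ("mult_3digit", False),
]
accuracy = empirical_accuracy_by_type(records)
targets = verbalized_targets(records)
loss_at_half = binary_cross_entropy(0.5, True)
print(accuracy, loss_at_half)
```

The targets for verbalized probability are shared by every question of the same type, which is why this can be done with ordinary supervised learning rather than RL.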

We didn't try but I would guess that finetuning on simple math questions wouldn't help with Metaculus forecasting. The focus of our paper is more "express your own uncertainty using natural language" and less "get better at judgmental forecasting". (Though some of the ideas in the paper might be useful in the forecasting domain.)
