Owain_Evans

Can you describe how the "local cluster" thing would work outside of keeping it within a single organization? I'd also be very interested in some case studies where people tried this.

OpenAI-generated poems have appeared in the New Yorker, which suggests they might have had some internal project related to poetry.

With GPT-3.5, I think there's also "mode collapse" for style in writing prose (e.g. plays or stories).

Claude does not have this mode collapse in poetry or prose (though it may have a much more subtle version of it). This suggests to me it'd be relatively easy to fix ChatGPT's issues (as Gwern suggests).

Does anyone know how much poetry and literary prose is in the pre-training sets aside from stuff in Common Crawl?

See the Galactica model (https://arxiv.org/abs/2211.09085) from Meta. It's trained on a curated dataset of scientific papers, reference materials, and scientific knowledge bases (with only a very small percentage of random internet text). IIRC the benefits of this seem limited (it's better to train on a bigger dataset and use other techniques to make the model access the science-related parts of the training set).

  1. It would be helpful to compare Google to ChatGPT on the actual distribution of queries to Google. I'd guess only a small % of queries to Google are about algorithms (e.g. doubly-linked list questions). IIRC many queries are not really searches but just ways of looking up specific URLs (e.g. people search "facebook" or "wordle"). Also very common is natural-language translation (ChatGPT can't translate whole websites right now). Finally, many searches are for local information that ChatGPT can't offer at all right now ("weather", "food near me"). Programmers and other power users probably have different usage distributions than the average user.

  2. That said, I think less internet-savvy people could find web-enabled ChatGPT better for many queries, because having a back-and-forth conversation with ChatGPT about some issue (e.g. tech support, medical questions, how to cook something) is easier than doing a search, opening a few tabs with the results, skimming them, modifying the search if the results weren't great (e.g. adding "reddit"), collating information from multiple pages, etc.

(3) seems false.

Related to (2): sufficient robustness is possible today but very expensive relative to a taxi service like Uber. E.g. it requires expensive sensors on the cars, hyper-detailed mapping of all roads the car can drive on, and a team of humans who can remotely intervene when a car reaches a sufficiently tricky situation.

(I haven't yet read the paper carefully.) The main question of interest is: "How well can transformers do RL in-context after being trained to do so?" This paper only considers quite narrow and limited tasks, but future work will extend this and iterate on various parts of the setup. How do these results update your belief on the main question of interest? It's possible the result can be explained away (as you suggest), but also that there is some algorithm distillation going on.

This is very valuable. I suggest putting this content on Arxiv (even if it's less formal than the typical paper).

It could be useful to look at the performance of GPT-3 on foreign languages. We know roughly how long it takes humans to reach a given level in a foreign language. E.g. you might find GPT-3 is at a level in 15 different languages that would take a smart human (say) 30 months to achieve (2 months per language). Foreign languages are just a small fraction of the training data.

A few points:

  1. Current models do pretty well on tricky math problems (Minerva), coding competition problems (AlphaCode), and multiple-choice quizzes at college level (MMLU).
  2. In some ways, the models' ability to learn from data is far superior to humans'. For example, models trained mostly on English text are still pretty good at Spanish, while English speakers in parts of the US who hear Spanish (passively) every week of their lives usually retain almost nothing. The same is true for being able to imitate other styles or dialects of English, and for programming languages. (Humans after their early years can spend years hearing a foreign language every day and learn almost nothing! Most people need to make huge efforts to learn.)
  3. RNNs are much worse than transformers at in-context learning. It's not just a difference in generative text quality. See this study by DeepMind: https://twitter.com/FelixHill84/status/1524352818261499911

Very helpful post, thanks!

Are there some meta-level lessons about forecasting a dataset like MATH? IIRC, at the time of these forecasts, the only results were a GPT-2 finetune and GPT-3 few-shot (without chain-of-thought and self-consistency). For GPT-2, the accuracy scores were <15% for nearly all subjects and difficulty levels. This may be consistent with GPT-2 either not really understanding the questions or being so weak at basic arithmetic that it has no chance on most questions.

Given that performance was so low and that not many models/setups had been tried, there was reason to have a wider distribution on future results. I would still guess that human-expert-level scores (>95%) should have had very low probability, but even (say) a score of 80% should have had more than a 5% chance. (I realize this is post hoc -- I'm not claiming to have made explicit predictions like this.)

A good source of base rates/priors would be to look at how performance improves on benchmarks after the paper introducing the benchmark. One example that comes to mind is Lambada, where performance went from 7.3% in the initial paper to 49% within a year. It'd be cool for someone to plot data from a bunch of benchmarks. Papers with Code will be very helpful but has some missing data. (We might also expect jumpier performance for math-related tasks, because once a model can do 2-digit arithmetic or elementary algebra reliably, many problems are opened up.)
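As a rough sketch of the base-rate analysis suggested above: given a benchmark's score in the paper introducing it and its score some months later, you can compute the absolute improvement per month and the fraction of the remaining headroom that was closed. The Lambada figures below are the ones mentioned in the comment; the second entry is a hypothetical placeholder, not real Papers with Code data.

```python
def improvement_rate(initial_acc, later_acc, months):
    """Absolute accuracy (percentage points) gained per month."""
    return (later_acc - initial_acc) / months

def fraction_of_gap_closed(initial_acc, later_acc, ceiling=100.0):
    """Fraction of the headroom between the initial score and the
    ceiling (e.g. perfect or human-expert accuracy) that was closed."""
    return (later_acc - initial_acc) / (ceiling - initial_acc)

benchmarks = [
    # (name, initial accuracy %, later accuracy %, months elapsed)
    ("Lambada", 7.3, 49.0, 12),             # figures from the comment
    ("hypothetical-math", 10.0, 35.0, 18),  # placeholder entry
]

for name, first, later, months in benchmarks:
    rate = improvement_rate(first, later, months)
    closed = fraction_of_gap_closed(first, later)
    print(f"{name}: {rate:.2f} pts/month, {closed:.0%} of gap closed")
```

With enough benchmarks, the distribution of these per-month rates (or gap-closure fractions) would give a prior over how fast a new benchmark like MATH is likely to fall.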