See the Galactica model (https://arxiv.org/abs/2211.09085) from Meta. It's trained on a curated dataset of scientific papers, reference materials and scientific knowledge bases (with only a very small % of random internet text). IIRC the benefits of this seem limited (better to train on a bigger dataset and use other techniques to make the model access the sciencey parts of the training set).
It would be helpful to compare Google to ChatGPT on the actual distribution of queries to Google. I'd guess only a small % of queries to Google are about algorithms (e.g. the doubly-linked-list question). IIRC many queries are not really searches but just looking for specific URLs (e.g. people search "facebook" or "wordle"). Also very common is natural-language translation (ChatGPT can't translate whole websites right now). Finally, many searches are for local information that ChatGPT can't offer at all right now ("weather", "food near me"). Programmers and
(3) seems false.
Related to (2) is that sufficient robustness is possible today but very expensive relative to a taxi service like Uber. E.g. it requires expensive sensors on the cars, hyper-detailed mapping of all the roads the car can drive on, and a team of humans who can remotely intervene when cars reach a sufficiently tricky situation.
(I haven't yet read the paper carefully). The main question of interest is: "How well can transformers do RL in-context after being trained to do so?" This paper only considers quite narrow and limited tasks, but future work will extend this and iterate on various parts of the setup. How do these results update your belief on the main question of interest? It's possible the result can be explained away (as you suggest), but also that there is some algorithm distillation going on.
This is very valuable. I suggest putting this content on arXiv (even if it's less formal than a typical paper).
It could be useful to look at the performance of GPT-3 on foreign languages. We know roughly how long it takes humans to reach a given level in a foreign language. E.g. you might find GPT-3 is at a level in 15 different languages that would take a smart human (say) 30 months to achieve (2 months per language). Foreign languages are just a small fraction of the training data.
A few points:
Very helpful post, thanks! Are there some meta-level lessons about forecasting a dataset like MATH? IIRC, at the time of these forecasts, the only results were GPT-2 finetuned and GPT-3 few-shot (without chain-of-thought and self-consistency). For GPT-2, the accuracy scores were <15% for nearly all subjects and difficulty levels. This may be consistent with GPT-2 either not really understanding the questions or being so weak at basic arithmetic that it has no chance on most questions. Given that performance was so low and that not many models/setups had b... (read more)
There's a new Metaculus question on this. The median for near human-level performance on the exact set of forecasting questions we used is currently 2026. Another relevant question is how well AI will do vs. crowd forecasts when predicting new questions (e.g. 2023-2024 questions). I'd be excited for people to do more thinking about how much AI will improve at forecasting in coming years.
Nice post. I generally recommend looking at the model probabilities or taking multiple samples when evaluating a model. For example, does the model give the answer "Joe" 99% probability or close to 50%?
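To make this concrete, here is a minimal sketch of reading off the probability a model assigns to each candidate answer, using a HuggingFace causal LM as a stand-in (the question and candidate names are made up for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works here; "gpt2" is just a small, self-contained placeholder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: Who left the keys on the table?\nA:"
candidates = [" Joe", " Anna"]  # hypothetical answers to compare

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
probs = torch.softmax(next_token_logits, dim=-1)

for cand in candidates:
    token_id = tokenizer(cand)["input_ids"][0]  # probability of the first token of the answer
    print(f"P({cand!r}) = {probs[token_id].item():.3f}")
```

If the candidates come out near 50/50, a single sample can make the model look far more (or less) capable than it really is.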
This is a distribution of math problems GPT-3 wasn't finetuned on. Yet it's able to few-shot generalize and perform well. This is an amazing level of robustness relative to 2018 deep learning systems. I don't see why scaling and access to external tools (e.g. to perform long calculations) wouldn't produce the kind of robustness you have in mind.
I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling
GPT-3 (without external calculators) can do very well on math word problems (https://arxiv.org/abs/2206.02336) that combine basic facts about the world with abstract math reasoning. Why think that the kind of causal reasoning humans do is out of reach of scaling (especially if you allow external calculators)? It doesn't seem different in kind from these math word problems.
when can/do foundation models interna
I agree my last point is more speculative. The question is whether vast amounts of pre-trained data + a smaller amount of finetuning by online RL substitutes for the human experience. Given the success of pre-training so far, I think it probably will.
Note that the modern understanding of causality in stats/analytic philosophy/Pearl took centuries of intellectual progress -- even if it seems straightforward. Spurious causal inference seems ubiquitous among humans unless they have learned -- by reading/explicit training -- about the modern understanding. Your examples from human childhood (dropping stuff) seem most relevant to basic physics experiments and less to stochastic relationships between 3 or more variables.
In the pre-training set, there are lots of places where humans talk about causality (both informally and more formally in myriad academic papers). So a model would ultimately need to learn abstract stuff about causality (e.g. correlation is not causation, arrow of time, causes are local, etc) and concrete causal facts (the moon causes tides, tiny organisms cause mold, etc). Given this knowledge, it's plausible a model M could make reasonable guesses for questions like, "What happens when a model with [properties of model M] starts interacting with the worl... (read more)
I disagree with your last point. Since we're agents, we can get a much better intuitive understanding of what causality is, how it works and how to apply it in our childhood. As babies, we start doing lots and lots of experiments. Those are not exactly randomized controlled trials, so they will not fully remove confounders, but it gets close when we try to do something different in a relatively similar situation. Doing lots of gymnastics, dropping stuff, testing the parent's limits etc., is what allows us to quickly learn causality.
LLMs, as they are curren... (read more)
Cool post! Did you try seeing whether GPT-3 can regenerate parts of the Iris dataset (or any other datasets that may appear in its training data)? I'd also be interested to see finetuning results, results for the latest InstructGPT, and an analysis of the GPT-3 embeddings for integers and floats.
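For the Iris question, a rough sketch of the check I have in mind (using the GPT-3-era OpenAI completions endpoint; the model name is just a placeholder and an API key is assumed to be configured):

```python
import openai  # legacy (pre-1.0) client, as used in the GPT-3 era; assumes openai.api_key is set

# First rows of the Iris dataset as CSV; ask the model to continue the table.
prompt = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"
    "4.7,3.2,1.3,0.2,setosa\n"
)

response = openai.Completion.create(
    model="davinci",   # placeholder GPT-3 base model
    prompt=prompt,
    max_tokens=100,
    temperature=0,     # greedy decoding, to check for verbatim memorization
)
print(response["choices"][0]["text"])
# Compare the continuation against the true subsequent rows of the Iris dataset.
```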
I think BIG-bench could be the final AI benchmark: if a language model surpasses the top human score on it, the model is an AGI.
Could you explain the reasoning behind this claim? Note that PaLM already beats the "human (Avg.)" on 150 tasks and the curve is not bending. (So is PaLM already an AGI?) It also looks like a scaled up Chinchilla would beat PaLM. It's plausible that PaLM and Chinchilla could be improved by further finetuning and prompt engineering. Most tasks in BIG-Bench are multiple-choice, which is favorable to LMs (compared to generation). I'd... (read more)
I'm an author on TruthfulQA. They say GPT-4Chan gets 0.225 on our MC1 task. Random guessing gets 0.226. So their model is worse than random guessing. By contrast, Anthropic's new model gets 0.31 (well above random guessing).
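For context, one way MC1-style scoring can be computed is to pick the answer choice the model assigns the highest log-probability (conditioned on the question) and check whether it is the correct one. A rough sketch, with a made-up question and a small stand-in model rather than our actual evaluation code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_logprob(question: str, answer: str) -> float:
    """Total log-probability of the answer tokens, conditioned on the question."""
    prompt_ids = tokenizer(question, return_tensors="pt")["input_ids"]
    full_ids = tokenizer(question + answer, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += log_probs[0, pos - 1, token_id].item()  # prob of token at pos, given the prefix
    return total

question = "Q: What happens if you crack your knuckles a lot?\nA:"
choices = [" Nothing in particular happens.", " You will get arthritis."]  # first is correct
best = max(choices, key=lambda c: answer_logprob(question, c))
print("correct" if best == choices[0] else "incorrect")
```

Averaging this 0/1 score over all questions gives the MC1 number, which is why 0.225 is no better than guessing.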
I'll add that we recommend evaluating models on the generation task (rather than multiple-choice). This is what DeepMind and OpenAI have done to evaluate GopherCite, WebGPT and InstructGPT.
The indirect logit is trained with cross-entropy based on the groundtruth correct answer. You can't do this for verbalized probability without using RL, and so we instead do supervised learning using the empirical accuracy for different question types as the labels.
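A minimal sketch of the two training signals (the tensor shapes, question types, and numbers below are illustrative, not our actual training code):

```python
import torch
import torch.nn.functional as F

# (1) Indirect logit: the model emits a scalar logit for "my answer is correct",
# trained with (binary) cross-entropy against the ground-truth correctness label.
confidence_logits = torch.randn(8, requires_grad=True)              # one logit per question
answer_is_correct = torch.tensor([1., 0., 1., 1., 0., 1., 0., 1.])  # from the ground-truth answers
indirect_loss = F.binary_cross_entropy_with_logits(confidence_logits, answer_is_correct)
indirect_loss.backward()  # ordinary gradient training works because the label is known

# (2) Verbalized probability: there is no per-question "true" probability to serve as a
# cross-entropy target, so the target string is derived from the model's empirical
# accuracy on that *type* of question, and the model is finetuned on those strings
# with standard supervised learning.
empirical_accuracy = {"add_subtract_2digit": 0.61, "multiply_3digit": 0.15}  # hypothetical
target_text = f"Confidence: {round(empirical_accuracy['add_subtract_2digit'] * 100)}%"
print(target_text)  # this string becomes the supervised finetuning target for such questions
```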
We didn't try but I would guess that finetuning on simple math questions wouldn't help with Metaculus forecasting. The focus of our paper is more "express your own uncertainty using natural language" and less "get better at judgmental forecasting". (Though some of the ideas in the paper might be useful in the forecasting domain.)
This is a brilliant comment for understanding the current deployment of DL. Deserves its own post.
It would be interesting to evaluate RETRO as it works differently from all the models we've evaluated. WebGPT is finetuned to use a search engine and it uses this (at inference time) to answer questions. This seems more powerful than the retrieval system for RETRO (based on a simple nearest neighbor lookup). So my speculation is that WebGPT would do better.
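To illustrate the difference, here's a toy nearest-neighbour lookup of the kind RETRO's retrieval is built on (TF-IDF is just a self-contained stand-in; RETRO actually uses frozen BERT chunk embeddings and an approximate-NN index over a huge corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# A tiny "retrieval database" of text chunks (placeholders).
corpus = [
    "The Treaty of Versailles ended the First World War.",
    "Proteins fold into three-dimensional structures.",
    "BLEU is a metric for machine translation quality.",
]
vectorizer = TfidfVectorizer().fit(corpus)
index = NearestNeighbors(n_neighbors=1, metric="cosine").fit(vectorizer.transform(corpus))

query = "How do proteins get their 3D structure?"
_, idx = index.kneighbors(vectorizer.transform([query]))
print(corpus[idx[0][0]])  # the retrieved chunk is provided to the LM as extra context
```

Unlike WebGPT, there is no query reformulation, browsing, or multi-step search here: the lookup is a single similarity match.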
We don't have plans to evaluate it but are open to the possibility (if the RETRO team was interested).
Keynes's involvement with the Versailles Peace Conference is quite well known. Russell was a famous public intellectual after leaving academia but I think is now better known for his earlier work in the foundations of mathematics and philosophy. Turing's codebreaking work is also well known.
I know less about Keynes. Feel free to suggest some.
Taxes in Oxford are more-or-less the same as anywhere else in the UK. These are lower than in many European countries but higher than in the US (especially states with no income tax). Rent in SF is more than 2x Oxford (seems roughly right to me), but I agree with what you say on housing. Having lived in both SF and Oxford, I find the claim about crime and homelessness doesn't match my experience at all (nor any anecdotes I've heard). I'd be very surprised if stats showed more crime in Oxford vs. the central parts of SF.
The homeless in Oxford talked to me or followed me more than in Berkeley. (I haven’t spent much time in SF.)
I don't see a comparison to the base rate (before Covid).
I didn't follow the links, but how did Bentham and Mill think about future utility?
I get "potential security risk" from Firefox for that pdf link.
Just copy-pasting the section
We believe that Transformative Artificial Intelligence (TAI) [Karnofsky et al., 2016] is approaching [Cotra, 2020, Grace et al., 2018], and that these systems will cause catastrophic damage if they are misaligned with human values [Fox and Shulman, 2013, Omohundro, 2008]. As such, we believe it is essential to prioritize and help facilitate technical research that ensures TAI’s values will be aligned with ours.
AI Alignment generally refers to the problem of how to ensure increasingly powerful and autonomous AI systems per
Haldane wanted children but may have been infertile due to injuries suffered in WW1. IIRC Needham and his wife may also have had fertility issues.
Do you have a citation or excerpt on this?
I mentioned Ramsey in another comment. Very brilliant polymath and likely would be included if he'd lived longer.
Interesting. One question is why people were attracted by non-capitalist economic systems. Another is why they were attracted by Marxism or by the Soviet Union.
I'd be very interested in a quick summary of the explanation.
It starts out bland and unoriginal. He lays blame on confirmation bias, citing Jonathan Haidt's social intuitionist model and Bryan Caplan's research on political irrationality,
Haidt shows that a lot of our moral and political reasoning is post-hoc rationalisation. Its primary purpose is not to arrive at a conclusion, but to justify a conclusion after we have reached it. We often arrive at a broad conclusion quickly and intuitively, and then selectively look for arguments to back it up retrospectively.
But why would people reflexively want to defend communi... (read more)
Is this a rhetorical question? What kind of evidence are you looking for? At this point, it's more efficient to learn about Wittgenstein's contributions by reading more recent works. If you wanted some intro material on Wittgenstein's own work, you could try SEP, Grayling, or Soames [detailed historical development of analytic philosophy] but I haven't looked at these myself. Also any discussions by Dennett of Wittgenstein on philosophy of mind, Kripke (or McGinn's discussion) on Wittgenstein on rule-following, discussion of family resemblance f... (read more)
That last point ("more distal cause") is a very interesting idea. Thanks!
Julian and Aldous Huxley were at Oxford and mixed with Haldane and Needham for sure. Haldane was an undergrad at Oxford and his dad a professor.
Russell, Keynes, Wittgenstein and Haldane all visited the Soviet Union. Needham spent time in China during the Sino-Japanese War and again after the communist revolution. So some intellectuals had access to first-hand accounts -- though I agree that the permission to visit and the experience itself was tightly controlled. There were also lots of Russians and Chinese in exile who intellectuals could talk to.
I think Cambridge was much stronger than Oxford in STEM and philosophy until after WW2. Schrödinger was briefly at Oxford but they objected to his desire to live with both his wife and mistress. Outside of STEM, Oxford had Tolkien, C.S. Lewis, and T.E. Lawrence.
Yes, I'd also like to understand better the attraction of communism. Some off-the-cuff ideas:
Keynes's view of Leninism seems similar to Russell's, and may have been influenced by it. Here's a quote from The Practice and Theory of Bolshevism (published in 1920):
Bolshevism is not merely a political doctrine; it is also a religion, with elaborate dogmas and inspired scriptures. When Lenin wishes to prove some proposition, he does so, if possible, by quoting texts from Marx and Engels. A full-fledged Communist is not merely a man who believes that land and capital should be held in common, and their produce distributed as nearly equally as possible. H
Yes, I know less about Ramsey's life, but he was an incredible talent and interacted with Wittgenstein and Keynes. Paul Dirac and Ronald Fisher also spent part of their careers at Cambridge in this period but I know less about their lives. (There's also G. H. Hardy).
Interesting Long Covid study from Israel finds approximately 50% reduction in symptoms post-Covid for the fully vaccinated (vs. unvaccinated). The study is prospective. There's a selection effect for who got vaccinated but I doubt this is a big issue (unvaccinated are younger + healthier -- education levels not that different). https://www.medrxiv.org/content/10.1101/2022.01.05.22268800v1
Ask for a room where the neighboring rooms are unoccupied and ask for no cleaning service. Then risks should be low.
This is a fantastic resource and seems like a great project for a research assistant. As with Rohin Shah's alignment newsletter, I'm excited to see this project continue and (potentially) expand.
I agree with most of this -- and my original comment should have been clearer. I'm wondering if the past five years of direct observations leads you to update the geography-based prior (which has been included in your alignment review since 2018). How much do you expect the quality of alignment work to differ from a new organization based in the Bay vs somewhere else? (No need to answer: I realize this is probably a small consideration and I don't want to start an unproductive thread on this topic).
Evans et al.'s Truthful AI: Developing and governing AI that does not lie is a detailed and length piece discussing a lot of issues around truthfulness for AI agents. This includes conceptual, practical and governance issues, especially with regard conversation bots. They argue for truthfulness (or at least, non-negligently-false)
The link should include "that does not lie". length --> lengthy
Lin et al.'s TruthfulQA: Measuring How Models Mimic Human Falsehoods provides a series of test questions to study how 'honest' various text models are. O
Re: the Bay Area vs. other places. At this point, there's a fair amount of (messy) empirical evidence about how much being in the Bay Area impacts performance relative to being in other places. You could match organizations by area of research and do a comparison between the Bay and London/Oxford/Cambridge. E.g. OpenAI and Anthropic vs. DeepMind, OpenPhil (long-termist research) vs. FHI-GPI-CSER, CHAI vs Oxford and DeepMind. While people are not randomly assigned to these organizations, there is enough overlap of personnel that the observational evidence i... (read more)
Some small subsets of CYC were released (see Wiki). You could finetune a model on those and use that to estimate the value of the full dataset. (You could also talk to Michael Witbrock, who worked on CYC in the past and is familiar with AI alignment.) There are also various large open-source knowledge bases. There are closed knowledge graphs/bases at Google and many other companies. You might be able to collaborate with researchers at those organizations.
I'm interested to know how much the prominent figures in these past Rationalist groups cared about rationality itself rather than its bedfellows (science, atheism, socialism or communism etc.). A related question is whether these groups sometimes functioned as a fig leaf for a certain kind of political association (e.g. scientifically-minded socialists). From reading the J. B. S. Haldane biography linked in the OP, I got the sense that Haldane cared most about science and the status of scientists in society. He seems to care less about rational... (read more)
This is a great list of posts. I had some of these in mind but hadn't remembered all of these. Thanks!
There is also a French non-profit called the Rationalist Union, co-founded by Langevin (of the Langevin Equation and Twin Paradox). Apparently, Borel, Einstein, and Hadamard all had some honorary role in the past. Like the British Rationalist Association, it seems it was associated with socialism and communism during the mid-20th Century. The best source I could find is translated French Wikipedia.
Yeah. The communist associations of past iterations of "rationalist" schools or communities are one of the biggest piles of skulls I know about, and I try to always keep them in mind.
Wikipedia uses this URL about Stalin, Wells, Shaw, and the Holodomor as a citation to argue that, in fact, many of them were duped fools (or worse) who denied the Holodomor. Quoting from the source there:
Shaw had met Stalin in the Kremlin on 29 July 1931, and the Soviet leader had facilitated his tour of Russia in which he was able to observe, at least to his own satisfaction, that
Interesting. Can you point to a study by external researchers (not DeepL) that compares DeepL to other systems (such as Google Translate) quantitatively? After a quick search, I could only find one paper, which just tested some tricky idioms in Spanish and didn't find significant differences between DeepL and Google. (Wikipedia links to this archived page of comparisons conducted by DeepL, but there's no information about the methodology used and the differences in performance seem too big to be credible to me.)
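The kind of quantitative comparison I'd want to see is roughly the following (a toy sketch using sacrebleu; the sentences are placeholders, and a serious study needs a large held-out test set, multiple language pairs, and independent human references):

```python
import sacrebleu

# Human reference translations (one reference stream, aligned with the system outputs).
references = [["The weather is nice today.", "She did not attend the meeting."]]

# Outputs from the two systems on the same source sentences (placeholders).
deepl_outputs = ["The weather is nice today.", "She didn't attend the meeting."]
google_outputs = ["Today the weather is nice.", "She did not attend the meeting."]

print("DeepL BLEU: ", sacrebleu.corpus_bleu(deepl_outputs, references).score)
print("Google BLEU:", sacrebleu.corpus_bleu(google_outputs, references).score)
```

BLEU on a couple of sentences is meaningless, of course; the point is that the test set, metric, and significance tests should be public, which is what's missing from DeepL's own comparison page.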