All of Owain_Evans's Comments + Replies

See the Galactica model from Meta. It's trained on a curated dataset of scientific papers, reference materials, and scientific knowledge bases (with only a very small % of random internet text). IIRC the benefits of this seem limited (it's better to train on a bigger dataset and use other techniques to make the model access the sciencey parts of the training set).

  1. It would be helpful to compare Google to ChatGPT on the actual distribution of queries to Google. I'd guess only a small % of queries to Google are about algorithms (e.g. the doubly-linked-list question). IIRC many queries are not really searches but just looking for specific URLs (e.g. people search "facebook" or "wordle"). Also very common is natural-language translation (ChatGPT can't translate whole websites right now). Finally, many searches are for local information that ChatGPT can't offer at all right now ("weather", "food near me"). Programmers and

…
"How do I make risotto?" is only a little bit about algorithms. And yeah, being able to have a real discussion with the bot helps a lot.

(3) seems false.

Related to (2): sufficient robustness is possible today but very expensive relative to a taxi service like Uber. E.g. it requires having expensive sensors on the cars, hyper-detailed mapping of all roads the car can drive on, and a team of humans who can remotely intervene when cars reach a sufficiently tricky situation.

(I haven't yet read the paper carefully.) The main question of interest is: "How well can transformers do RL in-context after being trained to do so?" This paper only considers quite narrow and limited tasks, but future work will extend this and iterate on various parts of the setup. How do these results update your belief on the main question of interest? It's possible the result can be explained away (as you suggest), but also that there is some algorithm distillation going on.

Sam Marks (3mo):
In-context RL strikes me as a bit of a weird thing to do because of context window constraints. In more detail, in-context RL can only learn from experiences inside the context window (in this case, the last few episodes). This is enough to do well on extremely simple tasks, e.g. the tasks which appear in this paper, where even seeing one successful previous episode is enough to infer perfect play. But it's totally insufficient for more complicated tasks, e.g. tasks in large, stochastic environments. (Stochasticity especially seems like a problem, since you can't empirically estimate the transition rules for the environment if past observations keep slipping out of your memory.)

There might be more clever approaches to in-context RL that can help get around the limitations on context window size. But I think I'm generally skeptical, and expect that capabilities due to things that look like in-context RL will be a rounding error compared to capabilities due to things that look like usual learning via SGD.

Regarding your question about how I've updated my beliefs: well, in-context RL wasn't really a thing on my radar before reading this paper. But I think that if someone had brought in-context RL to my attention then I would have thought that context window constraints make it intractable (as I argued above). If someone had described the experiments in this paper to me, I think I would have strongly expected them to turn out the way they turned out. But I think I also would have objected that the experiments don't shed light on the general viability of in-context RL, because the tasks seem specially selected to be solvable with small context windows. So in summary, I don't think this paper has moved me very far from what I expect my beliefs would have been if I'd had some before reading the paper.

This is very valuable. I suggest putting this content on arXiv (even if it's less formal than the typical paper).

It could be useful to look at performance of GPT-3 on foreign languages. We know roughly how long it takes humans to reach a given level at a foreign language. E.g. You might find GPT-3 is at a level on 15 different languages that would take a smart human (say) 30 months to achieve (2 months per language). Foreign languages are just a small fraction of the training data.

A few points:

  1. Current models do pretty well on tricky math problems (Minerva), coding competition problems (AlphaCode), and multiple-choice quizzes at college level (MMLU).
  2. In some ways, the models' ability to learn from data is far superior to humans. For example, models trained mostly on English text are still pretty good at Spanish, while English speakers in parts of the US who hear Spanish (passively) every week of their lives usually retain almost nothing. The same is true for being able to imitate other styles or dialects of English, and for program
…

Very helpful post, thanks!

Are there some meta-level lessons about forecasting a dataset like MATH? IIRC, at the time of these forecasts, the only results were a GPT-2 finetune and GPT-3 few-shot (without chain-of-thought and self-consistency). For GPT-2, the accuracy scores were <15% for nearly all subjects and difficulty levels. This may be consistent with GPT-2 either not really understanding the questions or being so weak at basic arithmetic that it has no chance on most questions.

Given that performance was so low and that not many models/setups had b…

Michaël Trazzi (7mo):
I believe the forecasts were aggregated around June 2021. When was the GPT-2 finetune released? What about GPT-3 few-shot? Re jumps in performance: Jack Clark has a screenshot on Twitter about saturated benchmarks from the Dynabench paper (2021); it would be interesting to make something up-to-date with MATH.

There's a new Metaculus question on this. The median for near-human-level performance on the exact set of forecasting questions we used is currently 2026. Another relevant question is how well AI will do vs. crowd forecasts when predicting new questions (e.g. 2023-2024 questions). I'd be excited for people to do more thinking about how much AI will improve at forecasting in coming years.

Nice post. I generally recommend looking at the model probabilities or taking multiple samples when evaluating a model. For example, does the model give the answer "Joe" 99% probability or close to 50%?

I agree. But I wanted to avoid burning through the credits too quickly. I also wonder whether Joe and Jack would be more realistically assessed with a few-shot prompt.

This is a distribution of math problems GPT-3 wasn't finetuned on. Yet it's able to few-shot generalize and perform well. This is an amazing level of robustness relative to 2018 deep learning systems. I don't see why scaling and access to external tools (e.g. to perform long calculations) wouldn't produce the kind of robustness you have in mind.

David Scott Krueger (formerly: capybaralet) (7mo):
I think you're moving the goal-posts, since before you mentioned "without external calculators". I think external tools are likely to be critical to doing this, and I'm much more optimistic about that path to doing this kind of robust generalization. I don't think that necessarily addresses concerns about how the system reasons internally, though, which still seems likely to be critical for alignment.

I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling

GPT-3 (without external calculators) can do very well on math word problems that combine basic facts about the world with abstract math reasoning. Why think that the kind of causal reasoning humans do is out of reach of scaling (especially if you allow external calculators)? It doesn't seem different in kind from these math word problems.

when can/do foundation models interna

…
David Scott Krueger (formerly: capybaralet) (7mo):
RE GPT-3, etc. doing well on math problems: the key word in my response was "robustly". I think there is a big qualitative difference between "doing a good job on a certain distribution of math problems" and "doing math (robustly)". This could be obscured by the fact that people also make mathematical errors sometimes, but I think the type of errors is importantly different from those made by DNNs.

I agree my last point is more speculative. The question is whether vast amounts of pre-trained data + a smaller amount of finetuning by online RL substitutes for the human experience. Given the success of pre-training so far, I think it probably will.

Note that the modern understanding of causality in stats/analytic philosophy/Pearl took centuries of intellectual progress -- even if it seems straightforward. Spurious causal inference seems ubiquitous among humans unless they have learned -- by reading/explicit training -- about the modern understanding. Your examples from human childhood (dropping stuff) seem most relevant to basic physics experiments and less to stochastic relationships between 3 or more variables.

In the pre-training set, there are lots of places where humans talk about causality (both informally and more formally in myriad academic papers). So a model would ultimately need to learn abstract stuff about causality (e.g. correlation is not causation, arrow of time, causes are local, etc) and concrete causal facts (the moon causes tides, tiny organisms cause mold, etc). Given this knowledge, it's plausible a model M could make reasonable guesses for questions like, "What happens when a model with [properties of model M] starts interacting with the worl…

David Scott Krueger (formerly: capybaralet) (7mo):
I can interpret your argument as being only about the behavior of the system, in which case:
- I agree that models are likely to learn to imitate human dialogue about causality, and this will require some amount of some form of causal reasoning.
- I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling, but it certainly seems highly plausible.

I can also interpret your argument as being about the internal reasoning of the system, in which case:
- I put this in the "deep learning is magic" bucket of arguments; it's much better articulated than what we said though, I think...
- I am quite skeptical of these arguments, but still find them plausible.

I think it would be fascinating to see some proof of concept for this sort of thing (basically addressing the question 'when can/do foundation models internalize explicitly stated knowledge').

I disagree with your last point. Since we're agents, we can get a much better intuitive understanding of what causality is, how it works and how to apply it in our childhood. As babies, we start doing lots and lots of experiments. Those are not exactly randomized controlled trials, so they will not fully remove confounders, but it gets close when we try to do something different in a relatively similar situation. Doing lots of gymnastics, dropping stuff, testing the parent's limits etc., is what allows us to quickly learn causality.

LLMs, as they are curren…

Cool post! Did you try seeing whether GPT-3 can regenerate parts of the Iris dataset (or any other datasets that may appear in its training data)? I'd also be interested to see finetuning results, results for the latest InstructGPT, and to see analysis of the GPT-3 Embeddings for integers and floats.

Since I transformed the Iris dataset with a pretty "random" transformation (i.e. not chosen because it was particularly nice in some way), I didn't check for its regeneration -- since my feature vectors were very different to original Iris's, and it seemed exceedingly unlikely that feature vectors were saved anywhere on the internet with that particular transformation. But I got curious now, so I performed some experiments.

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper.

Feature vectors of the Iris flower data set:

Input = 83, 40, 58, 20, output = 1
Input = 96, 45, 84, 35, output = 2
Input = 83, 55, 24, 9, output = 0
Input = 73, 54, 28, 9, output = 0
Input = 94, 45, 77, 27, output = 2
Input = 75, 49, 27, 9, output = 0
Input = 75, 48, 26, 9, output = 0

So these are the first 7 transformed feature vectors (in one of the random samplings of the dataset). Among all the generated output (I looked at >200 vectors), it never once output a vector which was identical to any of the latter ones, and also... in general the stuff it was generating did not look like it was drawing on any knowledge of the remaining vectors in the dataset. (E.g. it generated a lot that were off-distribution.)

I also tried:

Input = 83, 55, 24, 9, output = 0
Input = 73, 54, 28, 9, output = 0
[... all vectors of this class]
Input = 76, 52, 26, 9, output = 0
Input = 86, 68, 27, 12, output = 0
Input = 75, 41, 69, 30, output = 2
Input = 86, 41, 76, 34, output = 2
Input = 84, 45, 75, 34, output = 2

where I cherrypicked the "class 2" examples so that the first coordinate is lower than usual for that class; and the generated stuff always had the first coordinate very off-distribution from the rest of class 2, as one would expect if the model was meta-learning from the vectors it sees, rather than "remembering" something. This last experiment might seem a little contrived, but bit of a probl
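For intuition, a transformation of the kind described might look like the sketch below. The actual transform used in the experiment isn't given, so the coefficients here are invented; the point is just that a fixed affine map per feature, with rounding to integers, produces vectors unlike anything in the original dataset.

```python
import random

# A few well-known rows of Fisher's Iris dataset (sepal/petal measurements in cm, class).
iris_rows = [
    ((5.1, 3.5, 1.4, 0.2), 0),  # setosa
    ((7.0, 3.2, 4.7, 1.4), 1),  # versicolor
    ((6.3, 3.3, 6.0, 2.5), 2),  # virginica
]

rng = random.Random(0)
# One fixed "random" affine map per feature, drawn once and reused for every row.
# These coefficients are hypothetical, not the ones used in the experiment.
coeffs = [(rng.uniform(5, 15), rng.uniform(-5, 5)) for _ in range(4)]

def transform(features):
    """Apply the fixed affine map to each feature and round to an integer."""
    return tuple(round(a * x + b) for (a, b), x in zip(coeffs, features))

for features, label in iris_rows:
    print(f"Input = {', '.join(map(str, transform(features)))}, output = {label}")
```

Because the map is fixed across rows, the class structure of the data survives the transformation, which is what lets the model do in-context classification on it.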
[comment deleted] (8mo)

I think BIG-bench could be the final AI benchmark: if a language model surpasses the top human score on it, the model is an AGI.

Could you explain the reasoning behind this claim? Note that PaLM already beats the "human (Avg.)" score on 150 tasks and the curve is not bending. (So is PaLM already an AGI?) It also looks like a scaled-up Chinchilla would beat PaLM. It's plausible that PaLM and Chinchilla could be improved by further finetuning and prompt engineering. Most tasks in BIG-Bench are multiple-choice, which is favorable to LMs (compared to generation). I'd…

I agree, some future scaled-up versions of PaLM & Co may indeed be able to surpass top humans on BIG-Bench. Ultimately, it's the question of how we define "AGI". One reasonable definition is "an AI that can do any cognitive task that humans can, and do it better than humans". Given its massive scope and diversity, BIG-bench seems to be a good enough proxy for "any cognitive task". Although I would use a stricter scoring than the average-across-tasks that was used in PaLM: the model must 1) beat top humans, 2) on each and every task of BIG-bench.

One could argue that the simple models like PaLM don't have agency, goals, persistence of thought, self-awareness etc, and thus they can't become the human-like AGI of science fiction. But it's quite possible that such qualities are not necessary to do all cognitive tasks that humans can, but better. A simple mechanistic algorithm can beat top humans in chess. Maybe another simple mechanistic algorithm can also beat top humans in science, poetry, AI engineering, strategic business management, childrearing, and in all other activities that make human intellectuals proud of themselves.

I'm an author on TruthfulQA. They say GPT-4Chan gets 0.225 on our MC1 task. Random guessing gets 0.226. So their model is worse than random guessing. By contrast, Anthropic's new model gets 0.31 (well above random guessing).
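For readers wondering why the random-guessing baseline is 0.226 rather than a round number like 0.25: on a one-correct-answer multiple-choice task, the expected accuracy of random guessing is the average of 1/(number of options) over questions, and TruthfulQA MC1 questions vary in how many answer choices they have. A minimal sketch (the per-question option counts below are made up for illustration):

```python
# Hypothetical per-question numbers of answer choices on an MC1-style task.
option_counts = [4, 5, 4, 6, 3, 4, 5, 4]

# Expected accuracy of a uniform random guesser: mean of 1/n over questions.
random_baseline = sum(1 / n for n in option_counts) / len(option_counts)
print(f"Expected random-guess accuracy: {random_baseline:.3f}")
```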

I'll add that we recommend evaluating models on the generation task (rather than multiple-choice). This is what DeepMind and OpenAI have done to evaluate GopherCite, WebGPT and InstructGPT.

lol that is impressively bad then!

The indirect logit is trained with cross-entropy based on the groundtruth correct answer. You can't do this for verbalized probability without using RL, and so we instead do supervised learning using the empirical accuracy for different question types as the labels.
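A toy sketch of the two training signals described above (no real model; the question-type names and numbers are invented for illustration). The indirect logit gets binary cross-entropy against the 0/1 ground truth, while verbalized probabilities get supervised labels built from empirical accuracy per question type:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Indirect logit: binary cross-entropy against ground-truth correctness (0 or 1).
def indirect_logit_loss(logit, was_correct):
    p = sigmoid(logit)
    return -(was_correct * math.log(p) + (1 - was_correct) * math.log(1 - p))

# Verbalized probability: build supervised targets from empirical accuracy per
# question type. (In the paper this label is then used for supervised finetuning
# on text; here we only show the label construction. Type names are hypothetical.)
outcomes_by_type = {"add-subtract": [1, 1, 0, 1], "multiply": [0, 1, 0, 0]}
verbalized_targets = {t: sum(o) / len(o) for t, o in outcomes_by_type.items()}

print(indirect_logit_loss(2.0, 1))  # low loss: confident and correct
print(verbalized_targets)           # {'add-subtract': 0.75, 'multiply': 0.25}
```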

We didn't try but I would guess that finetuning on simple math questions wouldn't help with Metaculus forecasting. The focus of our paper is more "express your own uncertainty using natural language" and less "get better at judgmental forecasting". (Though some of the ideas in the paper might be useful in the forecasting domain.)

This is a brilliant comment for understanding the current deployment of DL. Deserves its own post.

It would be interesting to evaluate RETRO as it works differently from all the models we've evaluated. WebGPT is finetuned to use a search engine and it uses this (at inference time) to answer questions. This seems more powerful than the retrieval system for RETRO (based on a simple nearest neighbor lookup). So my speculation is that WebGPT would do better.

We don't have plans to evaluate it but are open to the possibility (if the RETRO team was interested).

Keynes's involvement with the Versailles Peace Conference is quite well known. Russell was a famous public intellectual after leaving academia, but I think he is now better known for his earlier work in the foundations of mathematics and philosophy. Turing's codebreaking work is also well known.

I know less about Keynes. Feel free to suggest some.

Taxes in Oxford are more-or-less the same as anywhere else in the UK. These are lower than many European countries but higher than the US (especially states with no income tax). 

Rent in SF is more than 2x Oxford (seems roughly right to me) but I agree with what you say on housing. 

Having lived in SF and Oxford, the claim about crime and homelessness doesn't match my experience at all (nor any anecdotes I've heard). I'd be very surprised if stats showed more crime in Oxford vs the central parts of SF. 

The homeless in Oxford talked to me or followed me more than in Berkeley. (I haven’t spent much time in SF.)

I don't see a comparison to the base rate (before Covid).

And there are some results
The trial is under way

I didn't follow the links, but how did Bentham and Mill think about future utility?

Bentham apparently had a nonzero discount rate (fn6). (He used 5% but only as an example.) Mill thought about personal time preference (and was extremely annoyed by people's discounting there). I can't see anything about a social rate of discounting.

I get "potential security risk" from Firefox for that pdf link. 

Just copy-pasting the section:

We believe that Transformative Artificial Intelligence (TAI) [Karnofsky et al., 2016] is approaching [Cotra, 2020, Grace et al., 2018], and that these systems will cause catastrophic damage if they are misaligned with human values [Fox and Shulman, 2013, Omohundro, 2008]. As such, we believe it is essential to prioritize and help facilitate technical research that ensures TAI’s values will be aligned with ours. 

AI Alignment generally refers to the problem of how to ensure increasingly powerful and autonomous AI systems per

…

Haldane wanted children but may have been infertile due to injuries suffered in WW1. IIRC Needham and his wife may also have had fertility issues. 

Do you have a citation or excerpt on this?

MichaelBowlby (1y): Paragraphs 2 and 3 of the 'Life and works' section.
See here. I think Ramsey is also the first (quantitative) longtermist ever (zero discount rate).

I mentioned Ramsey in another comment. Very brilliant polymath, and he likely would be included if he'd lived longer.

Interesting. One question is why people were attracted by non-capitalist economic systems. Another is why they were attracted by Marxism or by the Soviet Union. 

A third question is why they were disenchanted with capitalism. But it's not difficult to answer. ... there was the General Strike and the Great Depression. Someone was bulverising their rejection of capitalism as self-serving, but the cracks in capitalism were obvious.
Matthew Barnett (1y):
The conceptual reason why anti-capitalists are attracted to Marxism or the Soviet Union is probably fairly simple. Marx provided a cogent critique of the capitalist economic system, disseminating one of the most widely-read political documents of the 19th and early 20th century. The broader socialist tradition was happy to adopt his philosophy, as they believed it provided a solid scientific foundation for socialism more generally, and a reason for optimism. The Soviet Union was the first major socialist experiment. There had previously been attempts, such as the Paris Commune, but these were short-lived and had minor effects. Therefore, it's no surprise why intellectuals favorable to socialism would have plenty to be excited about regarding the Soviet Union. That's when they finally got to see their philosophy in action.

I'd be very interested in a quick summary of the explanation. 

It starts out bland and unoriginal. He lays blame on confirmation bias, citing Jonathan Haidt's social intuitionist model and Bryan Caplan's research on political irrationality:

Haidt shows that a lot of our moral and political reasoning is post-hoc rationalisation. Its primary purpose is not to arrive at a conclusion, but to justify a conclusion after we have reached it. We often arrive at a broad conclusion quickly and intuitively, and then selectively look for arguments to back it up retrospectively.

But why would people reflexively want to defend communi…

Is this a rhetorical question? What kind of evidence are you looking for? 

At this point, it's more efficient to learn about Wittgenstein's contributions by reading more recent works. If you wanted some intro material on Wittgenstein's own work, you could try SEP, Grayling, or Soames [detailed historical development of analytic philosophy] but I haven't looked at these myself.  Also any discussions by Dennett of Wittgenstein on philosophy of mind, Kripke (or McGinn's discussion) on Wittgenstein on rule-following, discussion of family resemblance f…

Wittgenstein had so many ideas and is such a difficult thinker that I think one ought to read him before secondary sources. Also he's a wonderful writer.

That last point ("more distal cause") is a very interesting idea. Thanks!

Julian and Aldous Huxley were at Oxford and mixed with Haldane and Needham for sure. Haldane was an undergrad at Oxford and his dad a professor. 

Russell, Keynes, Wittgenstein and Haldane all visited the Soviet Union. Needham spent time in China during the Sino-Japanese War and again after the communist revolution. So some intellectuals had access to first-hand accounts -- though I agree that the permission to visit and the experience itself was tightly controlled. There were also lots of Russians and Chinese in exile who intellectuals could talk to. 

I think Cambridge was much stronger than Oxford in STEM and philosophy until after WW2. Schrödinger was briefly at Oxford, but they objected to his desire to live with both his wife and his mistress. Outside of STEM, there were Tolkien, C. S. Lewis, and T. E. Lawrence.

Yes, I'd also like to understand better the attraction of communism. Some off-the-cuff ideas:

  • It was harder to get good information about the Russian or Chinese communists during certain periods. (No Internet, fewer reliable journalists, less travel in each direction). 
  • Non-communist countries were much more violent than post-WW2. There was more homicide and more violence that involved the state (e.g. violence in prisons, colonial violence, civil wars, interstate wars). Maybe the Soviet Union up to 1935 didn't look radically different from non-Communist
…

Keynes's view of Leninism seems similar to Russell's, and may have been influenced by it. Here's a quote from The Practice and Theory of Bolshevism (published in 1920):

Bolshevism is not merely a political doctrine; it is also a religion, with elaborate dogmas and inspired scriptures. When Lenin wishes to prove some proposition, he does so, if possible, by quoting texts from Marx and Engels. A full-fledged Communist is not merely a man who believes that land and capital should be held in common, and their produce distributed as nearly equally as possible. H

…
Matthew Barnett (1y):
Interestingly, the oldest of the bunch, Bertrand Russell, had been writing criticisms of Marx as early as 1896. In his lecture "Marx and the Theoretical Basis of Social Democracy" he provides several arguments against Marxist economics, concluding,

Yes, I know less about Ramsey's life, but he was an incredible talent and interacted with Wittgenstein and Keynes. Paul Dirac and Ronald Fisher also spent part of their careers at Cambridge in this period but I know less about their lives. (There's also G. H. Hardy). 

He convinced Keynes that Bayesianism was correct!

Interesting Long Covid study from Israel finds approximately 50% reduction in symptoms post-Covid for the fully vaccinated (vs. unvaccinated). The study is prospective. There's a selection effect for who got vaccinated but I doubt this is a big issue (unvaccinated are younger + healthier -- education levels not that different).

Ask for a room where the neighboring rooms are unoccupied and ask for no cleaning service. Then risks should be low.

  1. There's going to be some significant error in estimating excess deaths, especially for developing countries.
  2. Because of the exponential spread of the virus, if people lock down ~4 days earlier they'll end up with half as many deaths. (In reality lockdowns aren't all-or-nothing, etc.) So big absolute differences in the number of deaths do not imply big differences in policy or culture between the UK and Sweden.
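The arithmetic behind point 2 can be checked back-of-envelope: with a roughly 4-day doubling time during uncontrolled spread, locking down 4 days earlier halves cumulative infections (and hence, roughly, deaths). All numbers below are illustrative only.

```python
DOUBLING_DAYS = 4  # assumed doubling time during uncontrolled spread

def infections(seed, days):
    """Infections after `days` of uncontrolled exponential growth."""
    return seed * 2 ** (days / DOUBLING_DAYS)

late = infections(100, days=28)   # lockdown on day 28
early = infections(100, days=24)  # lockdown 4 days (one doubling) earlier
print(late / early)  # → 2.0
```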

This is a fantastic resource and seems like a great project for a research assistant. As with Rohin Shah's alignment newsletter, I'm excited to see this project continue and (potentially) expand. 

Thanks, that's very kind of you!

I agree with most of this -- and my original comment should have been clearer. I'm wondering if the past five years of direct observations leads you to update the geography-based prior (which has been included in your alignment review since 2018). How much do you expect the quality of alignment work to differ from a new organization based in the Bay vs somewhere else? (No need to answer: I realize this is probably a small consideration and I don't want to start an unproductive thread on this topic.)

Evans et al.'s Truthful AI: Developing and governing AI that does not lie is a detailed and length piece discussing a lot of issues around truthfulness for AI agents. This includes conceptual, practical and governance issues, especially with regard conversation bots. They argue for truthfulness (or at least, non-negligently-false)

The link should include "that does not lie". 
length --> lengthy


Lin et al.'s TruthfulQA: Measuring How Models Mimic Human Falsehoods provides a series of test questions to study how 'honest' various text models are. O

…
Thanks, fixed in both copies.

Re: the Bay Area vs. other places. At this point, there's a fair amount of (messy) empirical evidence about how much being in the Bay Area impacts performance relative to being in other places. You could match organizations by area of research and do a comparison between the Bay and London/Oxford/Cambridge. E.g. OpenAI and Anthropic vs. DeepMind, OpenPhil (long-termist research) vs. FHI-GPI-CSER, CHAI vs Oxford and DeepMind. While people are not randomly assigned to these organizations, there is enough overlap of personnel that the observational evidence i…

Is your argument about personnel overlap that one could do some sort of mixed-effect regression, with location as the primary independent variable and controls for individual productivity? If so, I'm somewhat skeptical about the tractability: the sample size is not that big, the data seems messy, and I'm not sure it would necessarily capture the fundamental thing we care about. I'd be interested in the results if you wanted to give it a go though!

More importantly, I'm not sure this analysis would be that useful. Geography-based priors only really seem useful for factors we can't directly observe; for an organization like CHAI our direct observations will almost entirely screen off this prior. The prior is only really important for factors where direct measurement is difficult, and hence we can't update away from the prior, but for those we can't do the regression. (Though I guess we could do the regression on known firms/researchers and extrapolate to new unknown orgs/individuals.)

The way this plays out here is we've already spent the vast majority of the article examining the research productivity of the organizations; geography-based priors only matter insomuch as you think they can proxy for something else that is not captured in this. As befits this being a somewhat secondary factor, it's worth noting that I think (though I haven't explicitly checked) in the past I have supported Bay Area organisations more than non-Bay-Area ones.

Some small subsets of CYC were released (see Wiki). You could finetune a model on those and use that to estimate the value of the full dataset. (You could also talk to Michael Witbrock, who worked on CYC in the past and is familiar with AI alignment.) 

There are also various large open-source knowledge bases. There are closed knowledge graphs/bases at Google and many other companies. You might be able to collaborate with researchers at those organizations. 

Mateusz Bagiński (10mo):
My superficial understanding is that Cyc has two crucial advantages over all current knowledge bases / knowledge graphs:
1. It is much, much bigger.
2. Predicates can be of any arity (properties of one entity, relations between two entities, more complex, structured relationships between N entities for any N), whereas knowledge graphs can only represent binary relationships R(X,Y), like "X loves Y".

If I understand it correctly, then Cyc's knowledge base is a knowledge hypergraph. Maybe it doesn't eventually matter and you can squeeze any knowledge encoded into Cyc's KB into ordinary knowledge graphs without creating some edge-spaghetti hell.
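On the last point: there is a standard trick (reification, as in RDF) for squeezing an n-ary relation into binary edges by introducing an event node. A minimal sketch, with illustrative names (the predicate labels `instanceOf`/`argN` are conventions I'm assuming, not Cyc's actual vocabulary):

```python
def reify(relation, args, event_id):
    """Turn an n-ary fact R(a1, ..., an) into binary (subject, predicate, object) triples."""
    triples = [(event_id, "instanceOf", relation)]
    triples += [(event_id, f"arg{i}", a) for i, a in enumerate(args, start=1)]
    return triples

# The ternary fact give(Alice, Bob, book) becomes an event node with binary edges.
for t in reify("give", ["Alice", "Bob", "book"], "e1"):
    print(t)
# ('e1', 'instanceOf', 'give')
# ('e1', 'arg1', 'Alice')
# ('e1', 'arg2', 'Bob')
# ('e1', 'arg3', 'book')
```

Whether this loses anything in practice (query efficiency, inference) is a separate question, but representationally the hypergraph does fit into ordinary triples.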

I'm interested to know how much the prominent figures in these past Rationalist groups cared about rationality itself rather than its bedfellows (science, atheism, socialism or communism etc.). A related question is whether these groups sometimes functioned as a fig leaf for a certain kind of political association (e.g. scientifically-minded socialists).  

 From reading the J. B. S. Haldane biography linked in the OP, I got the sense that Haldane cared most about science and the status of scientists in society. He seems to care less about rational…

This is a great list of posts. I had some of these in mind but hadn't remembered all of these. Thanks!

There is also a French non-profit called the Rationalist Union, co-founded by Langevin (of the Langevin Equation and Twin Paradox). Apparently, Borel, Einstein, and Hadamard all had some honorary role in the past. Like the British Rationalist Association, it seems it was associated with socialism and communism during the mid-20th Century. The best source I could find is translated French Wikipedia. 

Yeah. The communist associations of past iterations of "rationalist" schools or communities are one of the biggest piles of skulls I know about and try to always keep in mind.

Wikipedia uses this URL about Stalin, Wells, Shaw, and the holodomor as a citation to argue that, in fact, many of them were fools who were duped into denying the holodomor, or worse. Quoting from the source there:

Shaw had met Stalin in the Kremlin on 29 July 1931, and the Soviet leader had facilitated his tour of Russia in which he was able to observe, at least to his own satisfaction, that

…

Interesting. Can you point to a study by external researchers (not DeepL) that compares DeepL to other systems (such as Google Translate) quantitatively? After a quick search, I could only find one paper, which just tested some tricky idioms in Spanish and didn't find significant differences between DeepL and Google. (Wikipedia links to this archived page of comparisons conducted by DeepL, but there's no information about the methodology used, and the differences in performance seem too big to be credible to me.)

The primary source of my quality assessment is my personal experience with both Google Translate and DeepL. I speak 3 languages, and often have to translate between them (2 of them are not my native languages, including English). As I understand, making such comparisons in a quantitative manner is tricky, as there are no standardized metrics, there are many dimensions of translation quality, and the quality strongly depends on the language pair and the input text.

Google Scholar lists a bunch of papers that compare Google Translate and DeepL. I checked a few, and they're all over the place. For example, one claims that Google is better, another claims that they score the same, and yet another claims that DeepL is better. My tentative conclusion: by quantitative metrics, DeepL is in the same league as Google Translate, and might be better by some metrics. Which is still an impressive achievement by DeepL, considering the fact that they have orders of magnitude less data, compute, and researchers than Google.