A new paper from Google, in which they get a language model to solve some (to me, terrifyingly impressive) tasks that require quantitative reasoning skills. The abstract reads as follows:

Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them.

I have not bothered to double-check that the answer on the left is accurate (solving/checking these sorts of math problems can be time-consuming for me), but assuming it is... should I be as freaked out as I think I should be?

Some of the results are quite relevant to forecasting AI progress. From @bneyshabur:

Recently @JacobSteinhardt (whose group created the MATH dataset) published the results of a project in which commissioned professional forecasters predicted (in 2021) that 15% accuracy on MATH would be achieved in 2022 and 50% by 2025. ... Steinhardt wrote:

"If I imagine an ML system getting more than half of these questions right, I would be pretty impressed."

from https://bounded-regret.ghost.io/ai-forecasting/ 

Minerva indeed gets more than half the MATH dataset questions right.

Some further information (excerpted from across the paper):

Minerva is based on the PaLM general language models that are further trained on... the arXiv preprint server and from web pages that we carefully process to minimise the loss of mathematical content.

The main novelty of this paper is a large training dataset that juxtaposes natural language with the correct use of formal mathematical language, such as equations and diagrams. 

...We evaluated Minerva 62B on the National Math Exam in Poland and found that it achieves a score of 57%, which happened to be the national average in 2021 (CKE, 2021, p. 23). The 540B model achieves 65%.

The prevailing errors of the 8B model were related to incorrect reasoning or calculations. Many of the calculation errors were relatively benign arithmetic mistakes. Solutions that were too short were relatively rare (in these cases, the model immediately produces an incorrect answer without any intermediate reasoning steps). Finally, in a few cases, the model hallucinates an equation or mathematical fact that is not real.


When examining model solutions, we find that memorization of intermediate facts, such as numerical values of square roots or trigonometric identities, is a crucial element of model solutions. Truly strong performance would combine recall of intermediate facts with genuine solution synthesis... Overall, we find little evidence that the model's performance can be attributed to memorization.

Limitations: First, we have no automatic way of verifying the correctness of the model’s answers. This is in contrast to formal approaches, for which automatic verification is intrinsic. Second, our model has no access to external tools such as a calculator or a Python interpreter. [!!!] It is therefore limited in its ability to perform quantitative reasoning tasks that require complicated numerical calculations. 

...The model’s performance is still well below human performance, and furthermore, we do not have an automatic way of verifying the correctness of its outputs. If these issues could be solved, we expect the impacts of this model to be broadly positive. A direct application could be an accessible and affordable math tutor which could help improve educational inequalities.


You can read the full paper here: https://storage.googleapis.com/minerva-paper/minerva_paper.pdf 


Expanding on the Jacob Steinhardt quote from August 2021,

Current performance on this dataset is quite low--6.9%--and I expected this task to be quite hard for ML models in the near future. However, forecasters predict more than 50% accuracy* by 2025! This was a big update for me...

If I imagine an ML system getting more than half of these questions right, I would be pretty impressed. If they got 80% right, I would be super-impressed. The forecasts themselves predict accelerating progress through 2025 (21% in 2023, then 31% in 2024 and 52% in 2025), so 80% by 2028 or so is consistent with the predicted trend. This still just seems wild to me and I'm really curious how the forecasters are reasoning about this...

Even while often expressing significant uncertainty, forecasters can make bold predictions. I'm still surprised that forecasters predicted 52% on MATH, when current accuracy is 7% (!). My estimate would have had high uncertainty, but I'm not sure the top end of my range would have included 50%. I assume the forecasters are right and not me, but I'm really curious how they got their numbers.

Google's model obtained 50.3% on MATH, years ahead of schedule.

What is expert level on competition math problems? Do undergrads regularly get half right?

EDIT: someone answered elsewhere in the comments. Looks like this model is still well behind an expert human.

I'll restate my prior prediction: chain of thought reasoning with large language models solves general intelligence. No further "deep" insights or paradigm changes are needed, only scale and relatively simple tweaks to improve the quality of the reasoning. 

One slightly counterintuitive thing about this paper is how little it improves on the GSM8K dataset, given that it does very well on relatively advanced test sets.

The Grade School Math 8K (GSM8K) dataset is a collection of problems suitable for middle-schoolers. It has problems like:

"Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"

"Randy has 60 mango trees on his farm. He also has 5 less than half as many coconut trees as mango trees. How many trees does Randy have in all on his farm?"
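For what it's worth, the arithmetic these examples require really is grade-school level; both answers can be checked in a couple of lines (my own working, not from the dataset):

```python
# Natalia: 48 clips in April, half as many (24) in May -> 72 total.
natalia = 48 + 48 // 2

# Randy: 60 mango trees, plus 5 less than half as many coconut trees
# (30 - 5 = 25) -> 85 trees in all.
randy = 60 + (60 // 2 - 5)

print(natalia, randy)  # 72 85
```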

Minerva improves the SOTA on this, but only moves it from 74.5% to 78.5%, which is not as big of a deal.

My innate / naive sense of how hard the MATH problems are would lead me to think you could get > 90% on GSM8K if you could get 50% on MATH. But obviously my gut sense is off.

I'd be really curious to know what's going on here.

The previous SOTA for MATH (https://arxiv.org/pdf/2009.03300.pdf) is a fine-tuned GPT-2 (1.5b params), whereas the previous SOTA for GSM8K (https://arxiv.org/pdf/2203.11171.pdf) is PaLM (540b params), using a similar "majority voting" method as Minerva (query each question ~40 times, take the most common answer).
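The majority-voting procedure is straightforward to sketch. Here `sample_answer` is a hypothetical stand-in for a single stochastic model query (not the actual Minerva or PaLM API), and answers are assumed to be normalized strings before counting:

```python
from collections import Counter

def majority_vote(question, sample_answer, k=40):
    """Query the model k times and return the most common final answer.

    sample_answer(question) is assumed to make one stochastic model call
    and return the extracted final answer as a normalized string.
    """
    answers = [sample_answer(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```

The intuition is that correct reasoning chains tend to converge on the same final answer, while incorrect chains scatter across many different wrong answers, so the mode of the samples is more reliable than any single sample.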

Here's something odd that I noticed in one of the examples in the blogpost (https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html).

The question is the one that in part reads "the variance of the first n natural numbers is 10". The model's output states, without any reasoning, that this variance is equal to (n^2 - 1)/12, which is correct. Since no reasoning was used, I think it's safe to assume that the model memorized this formula.
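The formula itself does check out; here is a quick numerical sanity check (my own, not from the paper):

```python
# Population variance of the first n natural numbers 1..n equals (n^2 - 1) / 12.
# Derivation: mean = (n+1)/2 and E[x^2] = (n+1)(2n+1)/6, so
# var = (n+1)(2n+1)/6 - ((n+1)/2)^2 = (n^2 - 1)/12.

def variance_first_n(n):
    xs = range(1, n + 1)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

for n in (2, 5, 11, 100):
    assert abs(variance_first_n(n) - (n * n - 1) / 12) < 1e-9

# In the blogpost question, variance = 10 gives n^2 - 1 = 120, i.e. n = 11.
assert variance_first_n(11) == 10.0
```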

This is not a formula that a random math student would be expected to have memorized. (Anecdotally, I have a mathematics degree and don't know it.) Because of that, I'd expect that a typical (human) solver would need to derive the formula on the spot. It also strikes me as the sort of knowledge that would be unlikely to matter outside a contest, exam, etc.

That all leads me to think that the model might be over-fitting somewhat to contest/exam/etc.-style questions. By that I mean that it might be memorizing facts that are useful when answering such questions but are not useful when doing math more broadly.

To be clear, there are other aspects of the model output, here and in other questions, that seem genuinely impressive in terms of reasoning ability. But the headline accuracy rate might be inflated by memorization.

The model’s performance is still well below human performance

At this point I have to ask what exactly is meant by this. The bigger model beats the average human performance on the national math exam in Poland. Sure, the people taking this exam are usually not adults, but for many it may be where they peak in their mathematical abilities, so I wouldn't be surprised if it beats average human performance in the US. It's all rather vague though; looking at the MATH dataset paper all I could find regarding human performance was the following:

Human-Level Performance. To provide a rough but informative comparison to human-level performance, we randomly sampled 20 problems from the MATH test set and gave them to humans. We artificially require that the participants have 1 hour to work on the problems and must perform calculations by hand. All participants are university students. One participant who does not like mathematics got 8/20 = 40% correct. A participant ambivalent toward mathematics got 13/20. Two participants who like mathematics got 14/20 and 15/20. A participant who got a perfect score on the AMC 10 exam and attended USAMO several times got 18/20. A three-time IMO gold medalist got 18/20 = 90%, though missed questions were exclusively due to small errors of arithmetic. Expert-level performance is theoretically 100% given enough time. Even 40% accuracy for a machine learning model would be impressive and would have ramifications for cheating on homework.

So, for solving undergraduate-level math problems, this model would be somewhere between university students who dislike mathematics and ones who are neutral towards it? Maybe. Would be nice to get more details here, I assume they didn't think much about human-level performance since the previous SOTA was clearly very far from it.

They test on the basic (Poziom podstawowy) Matura tier for testing on math problems.
In countries with Matura-based education, the basic tier math test is not usually taken by mathematically inclined students -- it is just the law that anyone going to a public university has to pass some sort of math exam beforehand. Students who want to study anything where mathematics skills are needed would take the higher tier (Poziom rozszerzony).
Can someone from Poland confirm this?

A quick estimate of the percentage of high-school students taking the Polish Matura exams is 50%-75%, though. If the number of students taking the higher tier is not too large, then average performance on the basic tier corresponds to essentially average human-level performance on this kind of test. 

Note that many students taking the basic math exam only want to pass and not necessarily perform well; and some of the bottom half of the 270k students are taking the exam for the second or third time after failing before.

Where can I access and play around with this model and/or its code?

Google doesn’t seem interested in serving large models until it has a rock solid solution to the “if you ask the model to say something horrible, it will oblige” problem.

I think that is the right call. Anecdotal bad outputs would probably go viral and create a media firestorm, with the stochastic-parrots Twitter crowd beating them over the head along the way. Not sure you can ever get it perfect, but they should probably get close before releasing it publicly.

At the same time, a good math-solving chatbot could be really useful for math-averse people, even with brittle performance. I’m not sure it’s worth the risk, but might be worth considering.

You'll also get people complaining that it'll help students cheat, because testing is more important than education to people involved in the education system.

I think that's unfair.

Students to whom learning is more important than test results won't cheat either way. Students to whom test results are more important than learning will cheat if it's easy and reluctantly fall back on actually learning the material if they have to. Educators who care whether their students learn will prefer the latter outcome.

(It is sometimes also true that educators care more about testing than teaching. But I don't think that's anything like the only reason why they will complain about things that make it very easy for students to cheat.)

Students also might reason (maybe correctly) that if AI is already better than most humans will ever be in their lifetime, why exactly are they spending all this time on things like symbolic manipulation and arithmetic by hand anyway?