A new paper from Google describes a language model that solves tasks requiring quantitative reasoning skills, some of which read to me as terrifyingly impressive. The abstract reads as follows:
Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them.
Some of the results are quite relevant to forecasting AI progress. From @bneyshabur:
Recently @JacobSteinhardt (whose group created MATH dataset) published the result of a project where commissioned professional forecasters predicted (at 2021) that 15% acc on MATH will be achieved in 2022 and 50% would be achieved by 2025. ....Steinhardt wrote:
"If I imagine an ML system getting more than half of these questions right, I would be pretty impressed."
Minerva indeed gets more than half of the MATH dataset questions right.
Some further information (excerpted from across the paper):
Minerva is based on the PaLM general language models that are further trained on... the arXiv preprint server and from web pages that we carefully process to minimise the loss of mathematical content.
The main novelty of this paper is a large training dataset that juxtaposes natural language with the correct use of formal mathematical language, such as equations and diagrams.
...We evaluated Minerva 62B on the National Math Exam in Poland and found that it achieves a score of 57%, which happened to be the national average in 2021 (CKE, 2021, p. 23). The 540B model achieves 65%.
...the prevailing errors of the 8B model were related to incorrect reasoning or calculations. Many of the calculation errors were relatively benign arithmetic mistakes. Solutions that were too short were relatively rare (in these cases, the model immediately produces an incorrect answer without any intermediate reasoning steps). Finally, in a few cases, the model hallucinates an equation or mathematical fact that is not real.
When examining model solutions, we find that memorization of intermediate facts, such as numerical values of square roots or trigonometric identities, are crucial elements of model solutions. Truly strong performance would combine recall of intermediate facts with genuine solution synthesis...Overall, we find little evidence that the model’s performance can be attributed to memorization.
Limitations: First, we have no automatic way of verifying the correctness of the model’s answers. This is in contrast to formal approaches, for which automatic verification is intrinsic. Second, our model has no access to external tools such as a calculator or a Python interpreter. [!!!] It is therefore limited in its ability to perform quantitative reasoning tasks that require complicated numerical calculations.
...The model’s performance is still well below human performance, and furthermore, we do not have an automatic way of verifying the correctness of its outputs. If these issues could be solved, we expect the impacts of this model to be broadly positive. A direct application could be an accessible and affordable math tutor which could help improve educational inequalities.
You can read the full paper here: https://storage.googleapis.com/minerva-paper/minerva_paper.pdf