This is a linkpost for https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html

Minerva

Google Research's new AI tackles natural language maths problems and handily outperforms the SOTA^{[1]}. It is a pre-trained PaLM^{[2]} finetuned on maths datasets (which use LaTeX) composed of maths webpages and Arxiv papers (38.5B tokens). The three models trained were as follows.

When generating answers, Minerva is given the same prompt of four example questions, each with a correct chain of reasoning and a consistent format for the final, correct answer. Then the actual question is appended. Minerva outputs a chain of reasoning and a corresponding answer a number of times, and the most common answer is chosen. Minerva is graded only on the final answer.
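The few-shot setup above can be sketched in code. This is a minimal illustration only: the exemplar template, the `build_prompt` helper, and the toy worked example are made up for clarity and are not the paper's exact prompt.

```python
# Illustrative 4-shot prompt format. The real exemplars are maths problems
# with chain-of-thought solutions; these stand-ins just show the structure.
EXEMPLAR = """Problem: {question}
Solution: {reasoning} Final Answer: The final answer is ${answer}$.
"""


def build_prompt(exemplars, new_question):
    """Concatenate worked examples, then the new problem, so the model
    continues with a chain of reasoning in the same format."""
    shots = "".join(EXEMPLAR.format(**e) for e in exemplars)
    return shots + f"Problem: {new_question}\nSolution:"


prompt = build_prompt(
    [{"question": "What is 2+2?", "reasoning": "2+2=4.", "answer": "4"}],
    "What is 3*5?",
)
print(prompt)
```

Because the prompt ends mid-"Solution:", the model is nudged to produce its own reasoning chain and a final answer in the same consistent format, which makes the final answer easy to extract and grade.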

This voting algorithm is called maj1@k. It saturates faster than pass@k (generate k answers; if any one of them is right, the question is graded as correct), but doesn't perform as well for large k. This is quite reasonable: majority voting keeps choosing the most common answer, with the estimate's error shrinking as k grows, whereas pass@k gives the model more independent tries at large k.
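To make the contrast concrete, here is a small simulation of the two grading schemes. Everything here is an assumption for illustration: the per-sample accuracy `p` and the toy distribution of wrong answers are invented, not the paper's numbers.

```python
import random
from collections import Counter


def grade(samples, correct):
    """Grade one problem from a list of sampled final answers.
    maj1@k: is the most common sampled answer correct?
    pass@k: is the correct answer among the samples at all?"""
    majority = Counter(samples).most_common(1)[0][0]
    return majority == correct, correct in samples


def sample_answers(rng, p, k, correct="42"):
    """Toy answer model: each sample is correct with probability p,
    otherwise a random wrong answer spread over several options."""
    return [correct if rng.random() < p else str(rng.randint(0, 9))
            for _ in range(k)]


rng = random.Random(0)
p, trials = 0.3, 2000
for k in (1, 4, 16, 64):
    maj = pas = 0
    for _ in range(trials):
        m, s = grade(sample_answers(rng, p, k), "42")
        maj += m
        pas += s
    print(f"k={k:3d}  maj1@k={maj / trials:.2f}  pass@k={pas / trials:.2f}")
```

Under these toy assumptions, pass@k climbs toward 1 quickly (one lucky sample suffices), while maj1@k improves more slowly but keeps benefiting from extra samples as the plurality estimate sharpens, matching the saturation behaviour described above.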

## Datasets

The datasets used are:

The datasets have questions which vary in difficulty. Predictably, the model performed worse on harder questions, with false positives increasing roughly linearly with question difficulty.

## Results

Now time for a surprise quiz! For the purposes of this quiz, assume we're talking about the most accurate Minerva model (540B parameters using maj1@k sampling, with k=64 for MATH and k=16 for MMLU), and we'll be averaging over results on subtopics^{[3]}. Note the prior SOTA is OpenAI's davinci-002, which obtained absolute (averaged) scores of about 20% and 49%.

And the answers are... no, yes, yes and no. Here's the raw data.

## Random Remarks

^{^}State of the art

^{^}Pathways Language Model, another AI developed by Google Research.

^{^}I'm assigning equal weights to the subtopics on MMLU because I'm too lazy to find out how many questions were on physics and maths in the dataset.