Scaling Laws

Scaling laws refer to the observed tendency of some machine learning architectures (notably transformers) to improve their performance according to a predictable power law as they are given more compute, data, or parameters (model size), assuming they are not bottlenecked on one of the other resources. This trend has been observed to be highly consistent over more than six orders of magnitude.

![Scaling laws graph](https://i.imgur.com/7lhHT8n.png)

Graph from "Scaling Laws for Neural Language Models"

More precisely, scaling laws describe the scaling behaviors of deep neural networks, i.e. how the evaluation metric of interest varies as one varies the amount of compute used for training (or inference), the number of model parameters, the training dataset size, the model input size, or the number of training steps. These behaviors are observed to follow variants of power laws.
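To make the power-law claim concrete: a relationship of the form L(x) = a * x^(-alpha) becomes a straight line in log-log space, since log L = log a - alpha * log x. The sketch below fits such a line with NumPy. The function name `fit_power_law`, the constants, and the data are all hypothetical and chosen for illustration; the data is synthetic and is not taken from any paper.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y ≈ a * x**(-alpha) by least squares on log-transformed values.

    Returns (a, alpha). Assumes all x and y are strictly positive.
    """
    # In log-log space the power law is linear: log y = log a - alpha * log x
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic, made-up data for illustration only (not real measurements).
compute = np.logspace(0, 6, num=13)        # spans six orders of magnitude
true_a, true_alpha = 5.0, 0.05             # hypothetical constants
loss = true_a * compute ** (-true_alpha)   # noise-free power law

a, alpha = fit_power_law(compute, loss)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

Because the synthetic data is noise-free, the fit should recover the constants used to generate it. With real training runs, the points would scatter around the line, and the fit would only be meaningful in the regime where no other resource is the bottleneck.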

External links

"Broken Neural Scaling Laws" paper

