*Thank you to Sisi Cheng (of the Working as Intended comic) for the excellent drawings.*

Suppose we have a gearbox. On one side is a crank, on the other side is a wheel which spins when the crank is turned. We want to predict the rotation of the wheel given the rotation of the crank, so we run a __Kaggle competition__.

We collect hundreds of thousands of data points on crank rotation and wheel rotation. 70% are used as training data, the other 30% set aside as test data and kept under lock and key in an old nuclear bunker. Hundreds of teams submit algorithms to predict wheel rotation from crank rotation. Several top teams combine their models into one gradient-boosted deep random neural support vector forest. The model achieves stunning precision and accuracy in predicting wheel rotation.

On the other hand, in a very literal sense, the model contains no __gears__. Is that a problem? If so, when and why would it be a problem?

## What is Missing?

When we say the model “__contains no gears__”, what does that mean, in a less literal and more generalizable sense?

Simplest answer: the deep random neural support vector forest model does not tell us what we expect to see if we open up the physical gearbox.

For instance, consider two gearboxes: one built from three gears, the other from five.

Both produce the same input-output behavior. Our model above, which treats the gearbox as a literal black box, does not tell us anything at all which would distinguish between these two cases. It only talks about input-output behavior, without making any predictions about what’s inside the gearbox (other than that the gearbox must be consistent with the input/output behavior).

That’s the key feature of gears-level models: they make falsifiable predictions about the internals of a system, separate from the externally-visible behavior. If a model could correctly predict all of a system’s externally-visible behavior, but still be falsified by looking inside the box, then that’s a gears-level model. Conversely, we cannot fully learn gears-level models by looking only at externally-visible input-output behavior - external behavior cannot, for example, distinguish between the 3- and 5-gear models above. A model which can be fully learned from system behavior, without any side information, is not a full gears-level model.
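To make the 3- versus 5-gear point concrete, here is a minimal sketch. It assumes both boxes are simple gear trains (each gear meshes with the next), and the tooth counts are hypothetical numbers picked so that the two boxes match. In a simple train the intermediate gears cancel out of the ratio, so only the first and last gears, plus the number of meshes, show up in the input-output behavior.

```python
from fractions import Fraction

def train_ratio(teeth):
    """Wheel turns per crank turn for a simple train of meshed gears.

    Intermediate (idler) gears cancel out: the magnitude is first/last teeth,
    and the direction flips once per mesh.
    """
    magnitude = Fraction(teeth[0], teeth[-1])
    meshes = len(teeth) - 1
    sign = -1 if meshes % 2 else 1
    return sign * magnitude

# Hypothetical tooth counts, chosen only for illustration.
three_gear_box = [40, 25, 20]
five_gear_box = [40, 30, 15, 35, 20]

for name, box in [("3-gear", three_gear_box), ("5-gear", five_gear_box)]:
    print(name, "ratio:", train_ratio(box), "| gears inside:", len(box))
```

Any amount of crank/wheel data gives the same ratio for both boxes. The only place the two models disagree is the "gears inside" column, which is exactly the part we can only check by opening the box.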

Why would this be useful, if what we really care about is the externally-visible behavior? Several reasons:

- First and foremost, if we are able to actually look inside the box, then that provides a huge amount of information about the behavior. If we can see the physical gears, then we can immediately make highly confident predictions about system behavior.
- More generally, any information about the internals of the system provides a “side channel” for testing gears-level models. If data about externally-visible behavior is limited, then the ability to leverage data about system internals can be valuable.
- It may be that all of our input data is only from within a certain range - e.g. we never tried cranking the box faster than a human could crank. If someone comes along and attaches a motor to the crank, then that’s going to generate input way outside the range of what our input/output model has ever seen - but if we know what the gears look like, then that won’t be a problem. In other words, knowing what the system internals look like lets us deal with distribution shifts.
- Finally, if someone changes something about the system, then a model trained only on input/output data will fail completely. For instance, maybe there’s a switch on top of the gearbox which disconnects the gears, and nobody has ever thrown it before. If we know what the inside of the box looks like, then that’s not a problem - we can look at what the switch does.
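The last two bullets can be sketched in a few lines. Everything here is an assumption for illustration: the gear ratio of 2, the hand-cranking range of 0 to 2 turns per second, and the use of a 1-nearest-neighbour lookup as a stand-in for "some flexible model fit to input/output data". (A straight-line fit would happen to extrapolate fine for this particular box, but only because a straight line coincidentally matches the gears.)

```python
def gears_model(crank_speed, ratio=2.0, engaged=True):
    """Prediction from known internals: a fixed gear ratio plus a disconnect switch."""
    return ratio * crank_speed if engaged else 0.0

# Training data only covers hand-cranking speeds (0 to 2 turns per second),
# always with the switch in its usual position.
train_x = [i * 0.1 for i in range(21)]
train_y = [gears_model(x) for x in train_x]

def black_box_model(crank_speed):
    """1-nearest-neighbour lookup over the training data (hypothetical stand-in)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - crank_speed))
    return train_y[nearest]

for speed in (1.5, 50.0):  # 50.0 = someone attached a motor
    print(f"crank={speed}: black box={black_box_model(speed):.1f}, "
          f"gears={gears_model(speed):.1f}")

# Someone throws the switch: the gears model can represent this, while the
# black box has no input for it at all.
print("switch thrown:", gears_model(1.5, engaged=False))
```

Inside the training range the two models agree. At motor speeds the lookup just repeats the largest value it has ever seen, and it has no way to even express the switch being thrown.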

All that said, if we have abundant data and aren’t worried about distribution shifts or system changes, non-gears models can still give us great predictive power. __Solomonoff induction__ is the idealized theoretical example: it gives asymptotically optimal predictions based on input-output behavior, without any visibility into the system internals.

## Application: Macroeconomic Models

One particularly well-known example of these ideas in action is the __Lucas Critique__, a famous 1976 paper by __Bob Lucas__ critiquing the use of simple statistical models for evaluation of economic policy decisions. Lucas’ paper gives several broad examples, but arguably the most remembered example is policy decisions based on the __Phillips curve__.

The Phillips curve is an empirical relationship between unemployment and inflation. Phillips examined almost a century of economic data, and showed a consistent negative correlation: when inflation was high, unemployment was low, and vice-versa. In other words, prices and wages rise faster at the peak of the business cycle (when unemployment is low) than at the trough (when unemployment is high).

The obvious mistake one might make, based on the Phillips curve, is to think that perpetual low unemployment can be achieved simply by creating perpetual inflation (e.g. by printing money). Lucas opens his critique by eviscerating this very idea:

> The inference that permanent inflation will therefore induce a permanent economic high is no doubt [...] ancient, yet it is only recently that this notion has undergone the mysterious transformation from obvious fallacy to cornerstone of the theory of economic policy.

Bear in mind that this was written in the mid-1970s - the era of “stagflation”, when both inflation and unemployment were high for several years. Stagflation was an empirical violation of the Phillips curve - the historical behavior of the system broke down when central banks changed their policies to pursue more inflation, and people changed their behavior to account for faster expected inflation in the future.

In short: a statistical model with no gears in it completely fell apart when one part of the system (the central bank) changed its behavior.

On the other hand, *before* stagflation was under way, multiple theorists (notably __Edmund Phelps__ and __Milton Friedman__, via very different approaches) published simple gears-level models of the Phillips curve which predicted that it would break down if currencies were *predictably* devalued - i.e. if people *expected* central banks to print more money. The key “gears” in these models were individual agents - the macroeconomic behavior (unemployment-inflation relationship) was explained in terms of the expectations and decisions of all the individual people.
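A toy version of that argument, in the spirit of an expectations-augmented Phillips curve: unemployment only drops when inflation is higher than people *expected*. This is a sketch under stated assumptions, not a reconstruction of Phelps's or Friedman's actual models. The natural rate, the slope, and the rule that agents expect last period's inflation are all made up for illustration.

```python
def unemployment(inflation, expected_inflation, natural_rate=5.0, slope=0.5):
    """Only *surprise* inflation moves unemployment away from its natural rate."""
    return natural_rate - slope * (inflation - expected_inflation)

def unemployment_path(inflation_path, adapt=True):
    """Unemployment over time; if `adapt`, agents expect last period's inflation."""
    expected = 0.0  # assumption: agents start out expecting stable prices
    path = []
    for inflation in inflation_path:
        path.append(unemployment(inflation, expected))
        if adapt:
            expected = inflation
    return path

perpetual_inflation = [10.0] * 4

# Gears-less extrapolation: treat the historical curve as fixed forever.
print("no adaptation:", unemployment_path(perpetual_inflation, adapt=False))
# Gears-level model: individual agents update their expectations.
print("with adaptation:", unemployment_path(perpetual_inflation))
```

With fixed expectations, perpetual inflation looks like a free lunch. Once the agents inside the model are allowed to notice the policy, the effect lasts exactly one period.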

This led to a paradigm shift in macroeconomics, beginning the era of “microfoundations”: macroeconomic models derivable from microeconomic models of the expectations and behavior of individual agents - in other words, gears-level models of the economy.

## Gears from Behavior?

In general, we cannot *fully* learn gears-level models by looking only at externally-visible input-output behavior. Our hypothetical 3- or 5-gear boxes are a case in point.

However, some kinds of models can at least deduce *something* about gears-level structure by looking at externally-visible behavior.

For example: given a gearbox with a crank and wheel, it’s entirely possible that the rotation of the wheel has hysteresis, a.k.a. memory - it depends not only on the crank’s rotation now, but also the crank’s rotation earlier. This would be the case if, for instance, the box contains a flywheel. If we look at the data and see that the wheel’s rotation has no dependence on the crank’s rotation at earlier times (after accounting for the crank’s current rotation), then we can conclude that the box probably does not contain any flywheels or other hysteretic components (or if it does, they’re small or decoupled from the wheel).

More generally, these sorts of conditional independence relationships fall under the umbrella of __probabilistic causal models__. By testing different causal models on externally-visible data, we can back out information about the internal cause-and-effect structure of the system. If we see that only the crank’s *current* rotation matters to the wheel, then that rules out internal components with memory.
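One simple way to run this kind of check is a regression: fit the wheel's rotation against both the current and the previous crank rotation, and see whether the coefficient on the previous rotation is distinguishable from zero. The sketch below does this on simulated data. The two boxes, their coefficients, and the noise level are all hypothetical, and the "memory" box is a crude linear stand-in for a flywheel, not a physical model of one. The check also assumes the dependence is linear.

```python
import random

def simulate_box(has_memory, n=5000, seed=0):
    """Generate (crank, wheel) series from a hypothetical box.

    The "memory" box adds a term from the previous crank position, a crude
    stand-in for a flywheel; the coefficients are made up for illustration.
    """
    rng = random.Random(seed)
    crank = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    wheel = []
    for t in range(n):
        previous = crank[t - 1] if t > 0 else 0.0
        lag_effect = 0.5 * previous if has_memory else 0.0
        wheel.append(2.0 * crank[t] + lag_effect + rng.gauss(0.0, 0.05))
    return crank, wheel

def lag_coefficient(crank, wheel):
    """Least-squares fit of wheel[t] = a*crank[t] + b*crank[t-1]; returns b."""
    now, before, out = crank[1:], crank[:-1], wheel[1:]
    s11 = sum(x * x for x in now)
    s22 = sum(x * x for x in before)
    s12 = sum(x * y for x, y in zip(now, before))
    s1y = sum(x * y for x, y in zip(now, out))
    s2y = sum(x * y for x, y in zip(before, out))
    det = s11 * s22 - s12 * s12  # solve the 2x2 normal equations directly
    return (s2y * s11 - s1y * s12) / det

b_memoryless = lag_coefficient(*simulate_box(has_memory=False))
b_memory = lag_coefficient(*simulate_box(has_memory=True))
print(f"lag coefficient, memoryless box:  {b_memoryless:+.3f}")
print(f"lag coefficient, box with memory: {b_memory:+.3f}")
```

A lag coefficient near zero is evidence against hysteretic components coupled to the wheel. It says nothing about how many gears are in there, which is the sense in which behavior yields *some* gears-level information but not all of it.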

Causal models are the largest class of statistical models I know of which yield information about internal gears. However, they’re not the only way to build gears-level models from behavior. If we have strong prior information, we can often use behavioral data to directly compare gears-level hypotheses.

## Application: Wolf’s Dice

Around the mid-19th century, Swiss astronomer Rudolf Wolf rolled a pair of dice 20,000 times, recording each outcome. The main result was that the dice were definitely not perfectly fair - there were small but statistically significant biases.

Now, we could easily look at Wolf’s data and use it to estimate the frequency with which each face of each die is rolled. But that’s not a gears-level model; it doesn’t say anything about the physical die.

In order to back out gears-level information from the data, we need to leverage our prior knowledge about dice and die-making. Jaynes did exactly this in __a 1979 paper__; the key pieces of prior information are:

- We know dice are roughly cube-shaped, and any difference in face frequencies should stem from asymmetry of the physical die. We know 3 is opposite 4, 2 is opposite 5, and 1 is opposite 6.
- We know dice have little pips showing the numbers on each face; different faces have different numbers of pips, which we’d expect to introduce a slight asymmetry.
- Imagining how the dice might have been manufactured, Jaynes guesses that the final cut would have been more difficult to make perfectly even than the earlier cuts - leaving one axis slightly shorter/longer than the other two.

Based on those asymmetries, we’d guess:

- One of the three face pairs (3, 4), (2, 5), (1, 6) has significantly different frequency from the others, corresponding to the last axis cut.
- The faces with fewer pips (3, 2, and especially 1) have slightly lower frequency than those with more pips (4, 5, and especially 6), since more pips means slightly less mass near that face.
- Other than that, the frequencies should be pretty even.

This is basically a guess-and-check process: we guess what asymmetry might be present based on our prior knowledge, work out how that would change the behavior, then use the data to check the model.
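Here is a minimal sketch of that guess-and-check loop for the first guess (one face pair differs from the rest). It uses simulated rolls, not Wolf's data, and it is not Jaynes' method: it just compares three one-parameter hypotheses, one per axis, by maximum likelihood. The parameterization (`d`, with pair faces at 1/6 + 2d and the other four at 1/6 - d) and the size of the simulated bias are assumptions for illustration.

```python
import math
import random
from collections import Counter

FACES = list(range(1, 7))
PAIRS = [(1, 6), (2, 5), (3, 4)]  # opposite faces share an axis

def axis_model_probs(pair, d):
    """Faces in `pair` get 1/6 + 2d each; the other four get 1/6 - d each."""
    return {f: (1 / 6 + 2 * d if f in pair else 1 / 6 - d) for f in FACES}

def log_likelihood(counts, probs):
    return sum(counts[f] * math.log(probs[f]) for f in FACES)

def fit_axis_model(counts, pair):
    """Maximum-likelihood d for a given odd pair, plus its log-likelihood."""
    n = sum(counts.values())
    pair_fraction = (counts[pair[0]] + counts[pair[1]]) / n
    d = (pair_fraction - 1 / 3) / 4  # closed-form MLE for this one-parameter model
    return d, log_likelihood(counts, axis_model_probs(pair, d))

# Simulated rolls, NOT Wolf's data: a made-up die whose (3, 4) pair is off.
true_probs = axis_model_probs((3, 4), -0.01)
rng = random.Random(0)
rolls = rng.choices(FACES, weights=[true_probs[f] for f in FACES], k=20000)
counts = Counter(rolls)

fair_ll = log_likelihood(counts, {f: 1 / 6 for f in FACES})
fits = {pair: fit_axis_model(counts, pair) for pair in PAIRS}
best_pair = max(PAIRS, key=lambda pair: fits[pair][1])

print(f"fair die: log-likelihood {fair_ll:.1f}")
for pair in PAIRS:
    d, ll = fits[pair]
    print(f"odd axis {pair}: d = {d:+.4f}, log-likelihood {ll:.1f}")
print("best-supported odd axis:", best_pair)
```

The three axis hypotheses have the same number of parameters, so their log-likelihoods can be compared directly. Comparing against the fair die properly would also need a penalty for the extra parameter, which this sketch leaves out.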

Jaynes tests out these models, and finds that (1) the white die’s 3-4 axis is slightly shorter than the other two, and (2) the pips indeed shift the center of mass slightly away from the center of the die. These two asymmetries together explain all of the bias seen in the data, so the die should be quite symmetric otherwise. I analyze the same problem in __this post__ (using slightly different methods from Jaynes) and reproduce the same result.

Because this is a gears-level model, we could in principle check the result using a “side channel”: if we could track down the dice Wolf used, then we could take out our calipers and measure the lengths of the 3-4, 2-5, and 1-6 axes. Our prediction is that the 2-5 and 1-6 axes would be close, but the 3-4 axis would be significantly shorter. Note that we still don’t have a *full* gears-level model - we don’t predict *how much* shorter the 3-4 axis is. We don’t have a way to back out all the dimensions of the die. But we certainly expect the difference between the 3-4 length and the 2-5 length to be much larger than the difference between the 2-5 length and the 1-6 length. Our model yields *some* information about gears-level structure.

## Takeaway

Statistics, machine learning, and adjacent fields tend to have a myopic focus on predicting future data.

Gears-level models cannot be fully learned by looking at externally-visible behavior data. That makes it hard to prove theorems about convergence of statistical methods, or write tests for machine learning algorithms, when the goal is to learn about a system’s internal gears. So, to a large extent, these fields have ignored gears-level learning and focused on predicting future data. Gears have snuck in only to the extent that they’re useful for predicting externally-visible behavior.

But sooner or later, any field dominated by a gears-less worldview will have its Lucas Critique.

It is possible to leverage probability to test gears-level models, and to back out at least some information about a system’s internal structure. It’s not easy. We need to restrict ourselves to certain classes of models (e.g. causal models) and/or leverage lots of prior knowledge (e.g. about dice). It looks less like black-box statistical/ML models, and more like science: think about what the physical system might look like, figure out how the data would differ between different possible physical systems, and then go test it. The main goal is not to predict future data, but to compare models.

That’s the kind of approach we need to build models which won’t fall apart every time central banks change their policies.