Theory and Data as Constraints

johnswentworth

There’s a widely-known legend about statistician Abraham Wald’s work on planes during WWII (the veracity of the legend is examined here). As the story goes, the military collected data on planes coming back from missions, marking the location of any bullet holes. They soon had statistics showing how many had been hit on the engine, the fuel system, etc. Based on this data, commanders wanted to add extra armor to reinforce the areas which were most often hit.

Wald, however, suggested adding extra armor to the places which were least often hit. His reasoning: planes hit in those areas were the planes which didn’t come back.
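To see why the returning planes give a misleading picture, here is a minimal simulation sketch in Python. Every number in it is made up for illustration (the section names, the per-hit loss probabilities, three hits per plane, hits landing uniformly at random); none of it comes from the actual wartime data.

```python
import random

# Hypothetical sections and loss probabilities, chosen only for illustration.
SECTIONS = ["engine", "fuel system", "fuselage", "wings"]
# Assumed probability that a single hit to each section downs the plane.
LOSS_PROB = {"engine": 0.6, "fuel system": 0.4, "fuselage": 0.05, "wings": 0.1}


def simulate(num_planes=10_000, hits_per_plane=3, seed=0):
    """Return (all_hits, surviving_hits): hit counts per section."""
    rng = random.Random(seed)
    all_hits = {s: 0 for s in SECTIONS}
    surviving_hits = {s: 0 for s in SECTIONS}
    for _ in range(num_planes):
        # Assumption: hits land uniformly at random across sections.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every individual hit.
        survived = all(rng.random() >= LOSS_PROB[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                # Only these hits are ever seen by the people counting holes.
                surviving_hits[s] += 1
    return all_hits, surviving_hits


all_hits, surviving_hits = simulate()
for s in SECTIONS:
    print(f"{s:12s} all planes: {all_hits[s]:6d}   returning planes: {surviving_hits[s]:6d}")
```

Under these assumptions, hits are spread roughly evenly across all planes, but engine hits are the rarest among the planes that return, precisely because they are the most dangerous.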

The moral of the tale: useful insights do not come from data alone. Background knowledge, interpretation and model-building - all of which we’ll bundle into the word “theory” - are necessary elements as well. In the framework of this sequence: theory and data are both constraints on the production of useful insights. Immediate question: how taut are each of those constraints? What’s the limiting factor?

Let’s think about the tautness of each constraint in the Wald legend. How expensive was the data, and how expensive was the theory? In this case, the data was presumably far more expensive: it required people on the ground examining hundreds of returning airplanes per mission, collecting all of that hand-written data, adding it all up by hand, and relaying it via telephone or radio or mail to the Statistical Research Group in New York City (where Wald worked). The theory, on the other hand, probably took Wald a day at most. Of course there would also be some overhead on both sides - Wald needed to write up the theory in a manner convincing to military commanders, and military commanders needed to set up the whole data-collection project - but there again, the data would have been far more expensive than the theory.

Conclusion: the data constraint was much more taut than the theory constraint. Theory was abundant relative to the scarcity of data.

… at least during WWII.

The Internet

More recently, you may have heard about a fancy new technology called “the internet” which makes it really, really cheap for people to share data. Given such a technology shift, we’d expect the data constraint to become slack much more often, leaving the theory constraint taut.

What does that look like? In academia, biology is a great example. Biologists today have massive piles of data - genomics, transcriptomics, proteomics, metabolomics, lots of omics, on a wide variety of organisms, tissues, cell types, etc. Yet our ability to turn all that data into engineered organisms or cures for disease remains underwhelming. I would bet that all the data needed in principle to, say, find a cure for Alzheimer’s is already available online - if only we knew how to effectively leverage it. We have all this data, but people don’t really understand how to use it. That’s what it looks like when the data constraint is slack and the theory constraint is taut.

Of course, biology isn’t the only field which looks like this. Economics is another - we have huge piles of data on prices, consumption, trade, taxes, and so forth, yet we have limited ability to turn it all into useful economic insight. We don’t even know what useful things to do with structured data like databases of prices and consumption, let alone unstructured data like the whole database of US federal regulations. Economists run studies on a small handful of datasets, or on very coarse aggregates, or sometimes just throw everything into a giant neural network to see what happens. We don’t yet have quantitative, gearsy models capable of absorbing and using a wide variety of data all at once.

Then there’s the entire field of data science. A decade after the internet took off, data science appeared more-or-less spontaneously as a response to companies with giant piles of data and no idea what to do with it all. That’s the sort of thing you expect to happen when a technology shift suddenly relaxes a previously-taut constraint: a complementary constraint becomes taut, and an industry appears to service it. There’s a reason the Wald legend is popular as an analogy for data science in general: it’s a perfect example of the value-add data scientists provide, the kind of “theory” which companies need in order to make their data useful.

Low-Hanging Fruit?

Our society has not had much time to adjust to the internet. We now live in a world where theory is scarce relative to data in far more places, but the world is still adjusting to this new reality. Where might we expect low-hanging fruit? What other constraints are likely to become taut/slack?

One general area to look for low-hanging fruit: comprehensive overviews of possibly-useful topics. Track down a majority of all the physical capital assets of public companies, or skim titles/abstracts from a few years of archives of scientific journals, or read the Wikipedia pages on every country on Earth. This sort of exercise would have been very expensive before the internet, and society hasn’t had a lot of time to experiment with it, so there’s likely to be low-hanging fruit in the area. Data is now very cheap, so consume a lot of it and see what happens.

Another angle would be to practice building gears-level models - especially with an eye toward integrating a wide variety of data sources into the model-building process. (See here and here for why we want gears.) Paper-Reading for Gears has some general tips on leveraging scientific papers toward this end. In terms of relevant general background knowledge, Pearl’s The Book of Why is probably a useful resource for building/testing (one type of) gearsy model statistically. But the most important tool here is the general habit of thinking in gears and asking the sort of questions which yield gears; I’d be interested to hear people’s suggestions for ways to learn those habits.

Finally, there’s one key constraint which I expect to become much more taut as theory becomes a more limiting factor: the inability to outsource expertise. In the Wald legend, military commanders wanted to add more armor to the places where returning planes had been shot most often. This seems so intuitively obvious that there isn’t any reason to consult an expert about it. The commanders didn’t have the background knowledge to realize they were making an important mistake, so they didn’t have any way to know that they needed an expert. Fortunately, in Wald’s case, consulting the expert was relatively cheap, so there wasn’t much reason not to do it. But what happens when consulting the expert is expensive? People won’t consult an expert unless there’s an obvious need for it - which means people will repeatedly be hit in the face by unknown unknowns.

This problem amplifies the value of comprehensive overviews and gears-level modelling skills. It’s not just that theory is scarce, it’s that theory is scarce and you can’t reliably outsource it. Money cannot reliably buy good theory, unless you already have some ability to recognize good theory. How can we recognize good theory, across a wide variety of applications, especially when there isn’t a clear objective metric for success? First, by studying how to build good theory in general - i.e. gears-level models. Second, by absorbing lots of background knowledge across lots of different areas, so we reduce unknown unknowns and gain more possible metrics by which to recognize experts.

> which means people will repeatedly be hit in the face by unknown unknowns.

Were you the one who made the point that when you don't understand something it doesn't look mysterious and suggestive, it looks random? So it's a wicked problem because you don't realize there's something you can do about it. I hadn't ever had the thought before that behavioral economics is the attempt to systematize blindspots. What might it look like to systematize the search strategy that returns blindspots? One strategy I've found is crossing the idea of sentence stem completion with Maslow-ish questions about important areas of life.

> Were you the one who made the point that when you don't understand something it doesn't look mysterious and suggestive, it looks random?

Yup, that's from my review of Design Principles of Biological Circuits.

> What might it look like to systematize the search strategy that returns blindspots?

A few years ago I wrote about one strategy for this, based on an example I ran into in the wild. We had some statistics on new user signups for an app; day-to-day variation in signup rate looked random. Assuming that each user decides whether to sign up independently of all the other users, the noise in total signup count $N$ should be ~ $\sqrt{N}$ (ignoring a constant factor). But the actual day-to-day variability was way larger than that - therefore there had to be some common factor influencing people. We had identified an unknown unknown. (Turned out, our servers didn't have enough capacity, and would sometimes get backed up. Whenever that happened, signups dropped very low. So we added servers, and signup rate improved.)
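A rough sketch of that check in Python, under made-up numbers. The function names `daily_signups` and `dispersion_ratio`, the visitor count, the signup probability, and the outage model are all hypothetical stand-ins, not the actual app's data or code. The idea is just to compare the observed day-to-day standard deviation against the $\sqrt{N}$ scale expected if users decide independently.

```python
import math
import random
import statistics


def daily_signups(num_days, visitors, signup_prob, outage_prob, rng):
    """Simulate daily signup counts; outage_prob > 0 adds a shared factor."""
    counts = []
    for _ in range(num_days):
        # Assumed outage model: on a bad day, every visitor's signup
        # probability drops together to a tenth of its usual value.
        p = signup_prob * 0.1 if rng.random() < outage_prob else signup_prob
        # Each visitor then decides independently with probability p.
        counts.append(sum(rng.random() < p for _ in range(visitors)))
    return counts


def dispersion_ratio(counts):
    """Observed std dev divided by sqrt(mean), the scale expected under independence."""
    return statistics.stdev(counts) / math.sqrt(statistics.mean(counts))


rng = random.Random(0)
independent = daily_signups(200, 2000, 0.05, outage_prob=0.0, rng=rng)
with_outages = daily_signups(200, 2000, 0.05, outage_prob=0.2, rng=rng)
print(f"independent users:   ratio = {dispersion_ratio(independent):.2f}")
print(f"with shared outages: ratio = {dispersion_ratio(with_outages):.2f}")
```

A ratio near 1 is what independent decisions predict (up to the constant factor mentioned above); a ratio well above 1 suggests some common factor is moving many users at once, which is the signal that there is something unexplained to go looking for.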

The link talks a bit about how to generalize that strategy, although it's still far from a universal technique.

Before reading this chain, I had an intuitive sense of the "bottlenecks" of production, but this chain allowed me to understand it much better. Thank you!

I liked the extension of your taut-slack constraints to the theory-data setting. I think you are correct that people are still working through that shift.

"Data is now very cheap, so consume a lot of it and see what happens." is a bit more problematic to me. There certainly is a lot of truth to the old saying, there is no seeing without looking. In one sense the data is cheap -- it is just there and in many ways not an economic good any longer.

However, the act of consuming the data is still costly for most of us. As romeo notes, when we are wandering through the fields of our unknown unknowns it looks very random (I also attributed that idea to you), so how do we get any patterns to emerge?

While part of the pattern recognition stems from some underlying theory, new patterns will be found as one starts organizing the data, and then the pattern can start to be understood by thinking about potential relationships that explain the connections.

There was an online tool someone here mentioned a year or so back. I've totally forgotten the name, but basically it was a better set of note cards for bits of information that could then be linked. You get a nice graph forming (searchable, I believe, on edges, not merely phrase/subject/category/word). If that were a collaborative tool (it might be), it could slacken the constraint on bringing up unseen patterns in the data (reducing that cost of consuming). The edges might be color-coded, with multiple edges allowed between nodes based on some categorization/classification of the relationship, and then filtered by color (though it might also be interesting to look at possible patterns in the defined edges too).

> However, the act of consuming the data is still costly for most of us. As romeo notes, when we are wandering through the fields of our unknown unknowns it looks very random (I also attributed that idea to you), so how do we get any patterns to emerge?
>
> While part of the pattern recognition stems from some underlying theory, new patterns will be found as one starts organizing the data, and then the pattern can start to be understood by thinking about potential relationships that explain the connections.

There used to be an exhibit at Epcot on "the pattern of progress" which I think pointed to the same thing you're pointing to here. There's a short video from it which I really like; it breaks "progress" down into a five-step pattern:

Seeing - i.e. obtaining data
Mapping - organizing the data and noticing patterns
Understanding - figuring out a gears-level model
Belief - using the model to make plans
Action - actually doing things based on the model

Breaking things into steps is always a bit cheesy, but I do think there's a valuable point in here: there's an intermediate step between seeing the data and building a gears-level model. I think that's what you're pointing to: there's a need to organize the data and slice it in various ways so you can notice patterns - i.e. mapping, in the colloquial sense of the word.

Does that sound right?

> There was an online tool someone here mentioned a year or so back. I've totally forgotten the name, but basically it was a better set of note cards for bits of information that could then be linked.

Possibly Roam?

Yes. At some level we need to have some type of theory to start moving the data into different piles which we can compare. But if we're theory constrained we don't see how to put any order on the data -- it's not even information at that point; it's that random noise.

But clearly we do find ways to break out of that circle.

When the constraint is the data, the intermediate constraints between data and theory are probably not as obvious, because the data is not as overwhelming.

Yes, Roam was it. Thanks!

> I would bet that all the data needed in principle to, say, find a cure for Alzheimer's is already available online - if only we knew how to effectively leverage it.

I agree. If "effectively leverage it" means a superintelligence with unlimited compute, then this is a somewhat weak statement. I would expect that a superintelligence given the human genome would figure out how to cure all diseases. I would expect it to be able to figure out a lot from any book on biology. I would expect it to be able to look at a few holiday photos and figure out the fundamental equations of reality, that evolution happened on a planet, that it was created by evolved intelligences with tech, etc. From this, it could design nanobots programmed to find humans and cure them. Even if it had no idea what humans look like, it could just program the nanobots to find the most intelligent life forms around.
