Epistemic status: I can’t be bothered to prove any of this, but I’m right. Written bluntly and without detailed examples or helpful diagrams, so that it may be written at all.

The linkages you use in predictive models are wrong. This is bad.

Sorry, what’s a “linkage”?

Linkage is the manner in which the predictors in a model affect the response variable. In linear regression, linkage is “additive” or “identity-linked”: an identity-linked model might imply that smoking two cigarettes a day decreases predicted lifespan by 2 years. Other models are “multiplicative” or “log-linked”: these might imply that smoking two cigarettes per day decreases predicted lifespan by 3% (i.e. multiplies it by a factor of 0.97). Yet other models are “logistic” or “logit-linked”, because they apply a logistic function as their final step when predicting.
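A minimal sketch of those three linkages, written as generalized linear models in statsmodels (the library and every coefficient below are illustrative choices of this example, nothing more):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 2)))  # intercept + two predictors

# Additive / identity-linked: prediction = X @ beta.
y_add = X @ [70.0, -1.0, 0.5] + rng.normal(size=500)
additive = sm.GLM(y_add, X, family=sm.families.Gaussian(sm.families.links.Identity())).fit()

# Multiplicative / log-linked: prediction = exp(X @ beta), so a one-unit
# change in a feature multiplies the prediction by exp(beta).
y_mul = np.exp(X @ [4.0, -0.03, 0.01]) * rng.lognormal(sigma=0.1, size=500)
multiplicative = sm.GLM(y_mul, X, family=sm.families.Gaussian(sm.families.links.Log())).fit()

# Logistic / logit-linked: prediction = 1 / (1 + exp(-X @ beta)).
y_bin = rng.binomial(1, 1 / (1 + np.exp(-(X @ [0.0, 1.0, -1.0]))))
logistic = sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()
```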

Your linkages are wrong, and that’s bad.

How can a linkage be “wrong”?

To give one example, prices should almost never be modelled additively. If you allow features to add to and subtract from your prediction, you risk predicting negative prices, which are impossible in most contexts. Moreover, it fundamentally doesn’t line up with reality: a good or service having a given quality is almost always better modelled as changing its price by x%, not $x[1].

Why is using the “wrong” linkage bad?

You think it’s bad because using the wrong linkage lowers model performance, or (equivalently) increases the amount of data you would need to reach a given level of model performance. I think it’s bad because using the wrong linkage incentivizes overcomplicated and unnecessarily unintelligible models.

Okay, so you’re saying I need to always make sure to use the right linkage . . .

No. I’m saying that all linkages are wrong as a matter of course. Almost any problem whose context sits left of Chemistry in that one XKCD comic (#435, “Purity”) is guaranteed to have a ‘natural’ linkage which doesn’t map to “additive” or “multiplicative” or “logistic”.

But what if I know my linkage is right?

Even if you had very good theoretical reasons to believe a linkage was appropriate, that linkage would still be wrong unless you were modelling using all available data. Say you’re predicting response Y from features X1 and X2, and you happen to know, beyond a shadow of a doubt, that Y has an additive linear relationship with its features. There’s no problem with using additive linkage, right?

Wrong, because your model doesn’t account for feature X3, which your dataset doesn’t contain but reality does. Since X1 and X2 correlate with[2] Y, and X3 correlates with Y, it follows that X3 is likely (all else being equal) to correlate with X1 and X2. When X1 and X2 take Y-maximizing values, this makes it either less or (more likely) more likely that X3 also takes a Y-maximizing value; under almost all circumstances, this renders the effects of X1 and X2 on Y nonlinear. So your model’s more extreme predictions will consistently be underestimates or consistently be overestimates.
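A minimal simulation of this argument, under illustrative assumptions: Y really is additive in X1, X2 and X3, but X3 is absent from the dataset and depends nonlinearly on the features we do have (the tanh dependence below is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
x3 = np.tanh(x1 + x2) + 0.3 * rng.normal(size=n)   # correlates with x1 and x2
y = x1 + x2 + 2.0 * x3 + 0.3 * rng.normal(size=n)  # Y is genuinely additive in X1..X3

model = LinearRegression().fit(np.column_stack([x1, x2]), y)  # X3 unavailable
pred = model.predict(np.column_stack([x1, x2]))

# Compare predictions with outcomes in the extreme tails of the prediction.
for name, mask in [("bottom 5%", pred < np.quantile(pred, 0.05)),
                   ("top 5%",    pred > np.quantile(pred, 0.95))]:
    print(f"{name}: mean prediction {pred[mask].mean():+.2f}, "
          f"mean actual {y[mask].mean():+.2f}")

# Both tails come out systematically too extreme: tanh saturates, so the
# 'correct' additive model overshoots at the top and undershoots at the bottom.
```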

In other words, “I have the right linkage” is not just a claim of perfect insight, but also omniscience. If there’s a single relevant feature absent, and that feature has any shared information with the features you do have, the ‘correct’ linkage isn’t.

So what would you propose I do differently?

The remedy is embarrassingly simple.

  1. Fit a model with the least wrong linkage you can find.
  2. Fit a second model on the same training data, using the first model’s prediction as its sole feature and the original response as its response.
  3. When predicting on an outsample, run the first model, then use the second model to adjust its output into a ‘true’ prediction.

Thus, a model which needs a corrective of the form “when you’re predicting $520, you need to predict $20 higher” can get that corrective without adding unnecessary complexity.
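A minimal sketch of the three steps, assuming a linear base model and an isotonic corrective; both model classes are illustrative stand-ins (the comments below mention segmented linear regression as one choice for the second model):

```python
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression

def fit_two_stage(X_train, y_train):
    # Step 1: fit the base model with the least wrong linkage available.
    first = LinearRegression().fit(X_train, y_train)
    # Step 2: fit a corrective on the SAME training data, whose only
    # feature is the first model's prediction and whose response is
    # the original response variable.
    first_pred = first.predict(X_train)
    second = IsotonicRegression(out_of_bounds="clip").fit(first_pred, y_train)
    return first, second

def predict_two_stage(first, second, X_new):
    # Step 3: on an outsample, pass the base prediction through the corrective.
    return second.predict(first.predict(X_new))
```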

There will be cases where this complexity-reducing approach adds more complexity than it reduces. There will be other cases where it doesn't. Be judicious.

Is that . . . all?

Yes.

  1. ^ If skipping dessert in one restaurant decreases the typical meal price from $200 to $170, skipping dessert in a restaurant where a meal with dessert costs $20 is more likely to reduce the price to $17 than to -$10.

  2. ^ I here use "correlates with" to mean "shares information with", not "has a positive Pearson's correlation coefficient with".

Comments (3)

I have no idea what you are talking about. Consider adding better examples, references, using more standard terminology.

To summarize my understanding of the post:

You have some data which include some input variables and an outcome. You want to predict the outcome based on the input variables if you get some new data. In order to do this you use a mathematical function (the "model") which takes the input and produces an output. This mathematical function is relatively simple, with a few parameters, and assumes a relatively simple relationship between each variable and the output, which you call a "linkage" and which is typically selected from a few common relationships like additive, multiplicative or logistic. So, you apply the model to the data and find the parameter values which make the least bad predictions of the output according to some loss function (least squares perhaps, or whatever). Then you use this model to predict the outcome of new data.

However, in your opinion, the fact that the model uses a selected one of these few common relationships between each variable and the outcome is a problem, since in reality it will be more complicated, and it incentivizes more complicated models (presumably, with more parameters).

So, you propose to take the output of the first function (first model) you got after optimizing its parameters, and then apply a second function (second model) to the output of the first function. This second model is also optimized to output as close as possible to the actual outcomes, presumably using the same loss function. You don't specify exactly how this second function can vary: whether it has one parameter, a few parameters, or many.

Interesting proposal; my comments: 

Let's call the first function (model) f and the second function (model) g.

1. Your approach of first optimizing f and then optimizing g, and then taking g ∘ f as your final model, has the obvious alternative of directly optimizing g ∘ f with all parameters of each function optimized together. ("Obvious" only in the sense of after already knowing about your approach.) Do you expect the two-stage process to overfit enough less to make up for the accuracy lost by not optimizing everything together?

2. Overall this seems like complexity laundering to me: your final model is g ∘ f even with the two-stage process, and is it really simpler than what you would otherwise have used for f?

3. If there are multiple input variables I'm not sure I would conceptualize this as correcting the linkage, since it's correcting the overall output and not specifically the relationship with any one input variable?

Thanks for putting in the time to make sense of my cryptic and didactic ranting.

> You don't specify exactly how this second function can vary: whether it has one parameter, a few parameters, or many.

Segmented linear regression usually does the trick. There's only one input, and I've never seen discontinuities be necessary when applying this method, so only a few segments (<10) are needed.

I didn't specify this because almost any regression algorithm would work and be interpretable, so readers can do whatever is most convenient to them.
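For concreteness, a sketch of a segmented (continuous piecewise-linear) corrective built from hinge features; the quantile knot placement and the segment count below are just one reasonable choice:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_segmented_corrective(first_pred, y, n_segments=5):
    # Knots at evenly spaced quantiles of the first model's predictions.
    knots = np.quantile(first_pred, np.linspace(0, 1, n_segments + 1)[1:-1])
    def hinges(p):
        # One slope column plus a hinge per knot: continuous, piecewise linear.
        return np.column_stack([p] + [np.maximum(0.0, p - k) for k in knots])
    corrective = LinearRegression().fit(hinges(first_pred), y)
    return lambda p: corrective.predict(hinges(p))
```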

> Your approach of first optimizing f and then optimizing g, and then taking g ∘ f as your final model has the obvious alternative of directly optimizing g ∘ f with all parameters of each function optimized together.

What I actually do is optimize f until returns diminish, then optimize f and g together. I suggested "f then g" instead of "f then f&g" because it achieves most of the same benefit and I thought most readers would find it easier to apply.

(I don't optimize f&g together from the outset because doing things that way ends up giving g a disproportionately large impact on predictions.)

> is it really simpler than what you would otherwise have used for f?

Sometimes. Sometimes it isn't. It depends how wrong the linkage is.

> If there are multiple input variables I'm not sure I would conceptualize this as correcting the linkage, since it's correcting the overall output and not specifically the relationship with any one input variable?

I would. When the linkage is wrong (like when you use an additive model on a multiplicative problem), models either systematically mis-estimate their extreme predictions or add unnecessary complexity in the form of interactions between features.

I often work in a regression modelling context where model interpretability is at a premium, and where the optimal linkage is almost but not quite multiplicative: that is, if you fit a simple multiplicative model, you'll be mostly right but your higher predictions will be systematically too low.

The conventional way to correct for this is to add lots of complex interactions between features: "when X12 and X34 and X55 all take their Y-maximizing values, increase Y a bit more than you would otherwise have done", repeated for various combinations of Xes. This 'works' but makes the model much less interpretable, and requires more data to do correctly.
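To put a number on that cost (the feature count here is an illustrative assumption, not from any real model):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

p = 60  # a moderately wide model
X = np.zeros((1, p))
expanded = PolynomialFeatures(degree=2, interaction_only=True,
                              include_bias=False).fit_transform(X)
print(expanded.shape[1])  # 60 main effects + 1770 pairwise interactions = 1830
# versus fewer than ~10 extra parameters for a segmented corrective on predictions
```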
