Epistemic status: I can’t be bothered to prove any of this, but I’m right. Written bluntly and without detailed examples or helpful diagrams, that it may be written at all.
The linkages you use in predictive models are wrong. This is bad.
Sorry, what’s a “linkage”?
Linkage is the manner in which the predictors in a model affect the response variable. In linear regression, linkage is “additive” or “identity-linked”: an identity-linked model might imply smoking two cigarettes a day decreases predicted lifespan by 2 years. Other models can be “multiplicative” or “log-linked”: these might imply smoking two cigarettes per day decreases predicted lifespan by 3% (i.e. multiplies it by a factor of 0.97). Yet other models are “logistic” or “logit-linked”, because they apply a logistic function as their final step when predicting.
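In case code is clearer than prose, here is the final prediction step under each of those three linkages (a minimal sketch; in GLM jargon these mappings are the inverse link functions):

```python
# The last step of prediction under each linkage: the model turns its
# features into a linear score eta, then maps eta to a prediction.
# (In GLM jargon these mappings are the inverse link functions.)
import numpy as np

def predict_additive(eta):        # "additive" / identity link
    return eta

def predict_multiplicative(eta):  # "multiplicative" / log link
    return np.exp(eta)

def predict_logistic(eta):        # "logistic" / logit link
    return 1.0 / (1.0 + np.exp(-eta))

eta = np.array([-2.0, 0.0, 2.0])    # example linear scores
print(predict_additive(eta))        # [-2.  0.  2.]
print(predict_multiplicative(eta))  # approx [0.135, 1.0, 7.389]
print(predict_logistic(eta))        # approx [0.119, 0.5, 0.881]
```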
Your linkages are wrong, and that’s bad.
How can a linkage be “wrong”?
To give one example, prices should almost never be modelled additively. If you allow features to add and subtract from your prediction, you run the risk of negative prices, which are impossible in most contexts. Moreover, it fundamentally doesn’t line up with reality: a good or service having a given quality is almost always better modelled as changing its price by x%, not $x[1].
Why is using the “wrong” linkage bad?
You think it’s bad because using the wrong linkage lowers model performance, or (equivalently) increases the amount of data you would need to reach a given level of model performance. I think it’s bad because using the wrong linkage incentivizes overcomplicated and unnecessarily unintelligible models.
Okay, so you’re saying I need to always make sure to use the right linkage . . .
No. I’m saying that all linkages are wrong as a matter of course. Almost any problem with a context left of Chemistry in that one XKCD comic (#435, “Purity”) is guaranteed to have a ‘natural’ linkage which doesn’t map to “additive” or “multiplicative” or “logistic”.
But what if I know my linkage is right?
Even if you had very good theoretical reasons to believe a linkage was appropriate, that linkage would still be wrong unless you were modelling using all available data. Say you’re predicting response Y from features X1 and X2. Say you happen to know, beyond a shadow of a doubt, that response Y has an additive linear relationship with its features. There’s no problem with using additive linkage, right?
Wrong, because your model doesn’t account for feature X3, which your dataset doesn't contain but reality does. And since X1 and X2 correlate with[2] Y, and X3 correlates with Y, it follows that X3 is likely (all else being equal) to correlate with X1 and X2. If X1 and X2 take Y-maximizing values, that makes it either less likely or (more probably) more likely that X3 also takes a Y-maximizing value; this renders the effects of X1 and X2 on Y nonlinear under almost all circumstances. So your model’s more extreme predictions will either consistently be underestimates or consistently be overestimates.
In other words, “I have the right linkage” is not just a claim of perfect insight, but also omniscience. If there’s a single relevant feature absent, and that feature has any shared information with the features you do have, the ‘correct’ linkage isn’t.
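To make this concrete, here’s a toy simulation of the kind of thing I mean. (The specific way X3 depends on X1 and X2 below is just one illustrative choice, not a claim about your data.)

```python
# Toy simulation: Y really is additive in X1, X2, X3, but X3 is omitted
# from the dataset and shares information (nonlinearly, as one
# illustrative choice) with X1 and X2.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = np.tanh(x1 + x2) + rng.normal(scale=0.3, size=n)  # shares information with X1, X2
y = x1 + x2 + x3 + rng.normal(scale=0.3, size=n)       # genuinely additive in all three

# Fit an additive model on the features we actually have.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta

# Compare predictions to outcomes at the extremes.
top = pred >= np.quantile(pred, 0.99)
bottom = pred <= np.quantile(pred, 0.01)
print("top 1%: mean(prediction - outcome) =", (pred[top] - y[top]).mean())
print("bottom 1%: mean(prediction - outcome) =", (pred[bottom] - y[bottom]).mean())
# The extreme predictions come out systematically biased, even though the
# additive linkage was exactly right for the full feature set.
```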
So what would you propose I do differently?
The remedy is embarrassingly simple.
- Fit a model with the least wrong linkage you can find.
- Fit a second model on the same training data, using the first model’s prediction as the sole predictive feature, and keeping the response variable.
- When predicting on an outsample, use the second model to adjust the first model’s prediction and get a ‘true’ prediction.
Thus, a model which needs a corrective of the form “when you’re predicting $520, you need to predict $20 higher” can get that corrective without adding unnecessary complexity.
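A minimal sketch of that recipe, assuming scikit-learn, a log-linked GLM as the ‘least wrong’ first model, and isotonic regression standing in for the second model (any simple one-input regressor would do):

```python
# Sketch of the three-step recipe above. The choice of scikit-learn, a
# log-linked Tweedie GLM for step 1, and isotonic regression for step 2
# are assumptions for illustration, not requirements.
import numpy as np
from sklearn.linear_model import TweedieRegressor
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 2))
# A price-like response: mostly multiplicative, with a mild interaction
# the first model can't capture.
y = np.exp(2.0 + 0.4 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * X[:, 0] * X[:, 1]
           + rng.normal(scale=0.1, size=n))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: fit a model with the least wrong linkage you can find.
first = TweedieRegressor(power=1.5, link="log", alpha=1e-4, max_iter=1000)
first.fit(X_train, y_train)

# Step 2: fit a second model on the same training data, using the first
# model's prediction as the sole feature.
second = IsotonicRegression(out_of_bounds="clip")
second.fit(first.predict(X_train), y_train)

# Step 3: on the outsample, run the first model's prediction through the
# second model to get the corrected ("true") prediction.
corrected = second.predict(first.predict(X_test))
```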
There will be cases where this complexity-reducing approach adds more complexity than it reduces. There will be other cases where it doesn't. Be judicious.
Is that . . . all?
Yes.
[1] If skipping dessert in one restaurant decreases typical meal price from $200 to $170, skipping dessert in a restaurant where a meal with dessert costs $20 is more likely to reduce the price to $17 than to -$10.
[2] I here use "correlates with" to mean "shares information with", not "has a positive Pearson's correlation coefficient with".
I have no idea what you are talking about. Consider adding better examples and references, and using more standard terminology.
To summarize my understanding of the post: fit a model with the least wrong linkage available, then fit a second model on the first model's predictions to correct for whatever that linkage still gets wrong.
Interesting proposal; my comments:
Let's call the first function (model) f and the second function (model) g.
1. Your approach of first optimizing f and then optimizing g, and then taking g ∘ f as your final model has the obvious alternative of directly optimizing g ∘ f with all parameters of both functions optimized together. ("Obvious" only in the sense of being obvious once one already knows about your approach.) Do you expect the two-stage process to do enough less overfitting to make up for the loss of accuracy from not optimizing everything together?
2. Overall this seems like complexity laundering to me: your final model is g ∘ f even with the two-stage process, and is it really simpler than what you would otherwise have used for f?
3. If there are multiple input variables I'm not sure I would conceptualize this as correcting the linkage, since it's correcting the overall output and not specifically the relationship with any one input variable?
Thanks for putting in the time to make sense of my cryptic and didactic ranting.
Segmented linear regression usually does the trick. There's only one input, and I've never seen discontinuities be necessary when applying this method, so only a few segments (<10) are needed.
I didn't specify this because almost any regression algorithm would work and be interpretable, so readers can do whatever is most convenient to them.
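For concreteness, here's a sketch of the kind of segmented fit I mean, using a hinge basis at fixed quantile knots (the knot count and placement here are arbitrary illustrative choices):

```python
# Continuous segmented (piecewise-linear) regression of y on a single
# input p, via least squares on a hinge basis at fixed quantile knots.
import numpy as np

def fit_segmented(p, y, n_knots=5):
    knots = np.quantile(p, np.linspace(0, 1, n_knots + 2)[1:-1])
    basis = np.column_stack([np.ones_like(p), p]
                            + [np.maximum(p - k, 0.0) for k in knots])
    coefs, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return knots, coefs

def predict_segmented(p, knots, coefs):
    basis = np.column_stack([np.ones_like(p), p]
                            + [np.maximum(p - k, 0.0) for k in knots])
    return basis @ coefs

# Usage: knots, coefs = fit_segmented(first_model_predictions, y_train)
```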
What I actually do is optimize f until returns diminish, then optimize f and g together. I suggested "f then g" instead of "f then f&g" because it achieves most of the same benefit and I thought most readers would find it easier to apply.
(I don't optimize f&g together from the outset because doing things that way ends up giving g a disproportionately large impact on predictions.)
Sometimes. Sometimes it isn't. It depends how wrong the linkage is.
I would. When the linkage is wrong - like when you use an additive model on a multiplicative problem - models either systematically mis-estimate their extreme predictions or add unnecessary complexity in the form of interactions between features.
I often work in a regression modelling context where model interpretability is at a premium, and where the optimal linkage is almost but not quite multiplicative: that is, if you fit a simple multiplicative model, you'll be mostly right but your higher predictions will be systematically too low.
The conventional way to correct for this is to add lots of complex interactions between features: "when X12 and X34 and X55 all take their Y-maximizing values, increase Y a bit more than you would otherwise have done", repeated for various combinations of Xes. This 'works' but makes the model much less interpretable, and requires more data to do correctly.
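For contrast, the interaction-heavy route looks something like this (scikit-learn assumed; the point is how many extra coefficients it creates):

```python
# Sketch of the interaction-heavy alternative: a log-linked GLM on all
# pairwise interactions (real pricing models often go further).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import TweedieRegressor

interaction_model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    TweedieRegressor(power=1.5, link="log", alpha=1e-4, max_iter=1000),
)
# With p features this adds p * (p - 1) / 2 interaction coefficients,
# each of which needs data to estimate and a human to interpret.
```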