In the literature I have often seen conflicting definitions for terms like “transfer learning” and “meta-learning”. E.g. here’s a stackexchange question about this.

Here’s the general framework encapsulating this idea.

Suppose our universe of interest is given by some joint distribution $p(X, Y, \dots)$. This joint distribution is itself not known to us (we want to learn it, after all); instead we have some distribution $\mu$ on the space of possible models. Specific tasks of interest to us like $p(X)$, $p(Y \mid X)$ etc. are derived as marginals and conditionals of the joint distribution, and are therefore “random variables” derived from $\mu$ which may be correlated with one another.

(Note how this is different from the usual “learning from data” setting, in which we have a “prior” $\mu$ which we condition on observed data. In this usual setting, $X$ is a random vector that includes random variables for the observations $x_1, \dots, x_n$, and we can simply calculate the conditionals $p(\,\cdot \mid x_1, \dots, x_n)$.)

One way to think about this is to say there’s some other random variable $\Theta$ (not included in $X, Y, \dots$) such that $p$ depends on $\Theta$, i.e. $p(X, Y \mid \Theta)$ is a thing, and $\mu(\Theta)$ is what you know. Then when you “learn” the marginal $p(X)$, you are also getting information on $\Theta$, which gives you information on other distributions like $p(Y)$ or $p(Y \mid X)$, which are also marginalized over $\Theta$.
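A minimal toy sketch of this (the specific model is invented for illustration): a latent $\Theta$ sets both the mean of $X$ and the slope of $Y$ on $X$, so estimating the marginal $p(X)$ also tells you something about the conditional $p(Y \mid X)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Theta is a latent that the joint distribution p(X, Y | Theta) depends on.
# Toy assumption: X ~ N(Theta, 1) and Y = Theta * X + noise, so the same
# latent parameterizes both the marginal p(X) and the conditional p(Y | X).
theta = rng.normal()                            # one draw from mu(Theta)
x = rng.normal(theta, 1.0, size=1000)           # samples from p(X)
y = theta * x + rng.normal(0, 0.1, size=1000)   # samples from p(Y | X)

# Learning the marginal p(X) (its mean estimates Theta) therefore carries
# information about the conditional p(Y | X) (whose slope is also Theta):
theta_hat_from_x = x.mean()                     # estimate of Theta from p(X) alone
slope_hat = (x * y).sum() / (x * x).sum()       # least-squares slope of p(Y | X)
print(theta_hat_from_x, slope_hat)              # both estimate the same Theta
```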

Meta-learning / generalization

Meta-learning in a general sense refers to learning this meta-distribution $\mu$ from data. What this means is that we now see $\mu$ as a distribution that can be sampled from, not just a belief distribution; i.e. we see $p$ as a random variable that can be sampled.

In the classic meta-learning example of learning random sinusoids (see e.g.): the $\mu$ to be learned assigns high probability only to sinusoidal $p$s. Equivalently, $\Theta$ is a random variable representing the generative model (e.g. the amplitude and phase of the sinusoid), and $p(Y \mid X, \Theta)$ is known to be straightforward functional application, so $Y = f_\Theta(X)$.
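A sketch of that task distribution (the amplitude/phase ranges follow the common sinusoid-regression setup, but the details here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(rng):
    """Sample one task p ~ mu: a random sinusoid y = A * sin(x + phi).

    Theta = (A, phi) is the latent identifying the generative model;
    p(Y | X, Theta) is deterministic functional application.
    """
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, np.pi)
    def p(x):
        return amplitude * np.sin(x + phase)
    return p

# Sampling from mu twice gives two different tasks (two different p(Y | X)):
task_a, task_b = sample_task(rng), sample_task(rng)
xs = np.linspace(-5, 5, 10)
print(task_a(xs))  # few-shot targets for task A
print(task_b(xs))  # few-shot targets for task B
```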

Example: General transfer learning.

Here, learning $p(Y_1 \mid X_1)$ helps you learn $p(Y_2 \mid X_2)$.

Note that this doesn’t mean $p(Y_1 \mid X_1)$ and $p(Y_2 \mid X_2)$ are the same or similar in any way. (Similarity can be taken as the special case where $\Theta$ is exactly some common distribution $q$, and each $p(Y_i \mid X_i)$ is just $q$ composed with some known transformation.) That special case is what you see in most standard transfer learning applications, such as fine-tuning and style transfer.

Style transfer looks like this: $p(Y_1 \mid X_1)$ and $p(Y_2 \mid X_2)$ represent two different “styles” (e.g. “formal voice” and “pirate voice”), and $\Theta$ represents exactly some common distribution $q$, so that the functional dependence of $p(Y_1 \mid X_1)$ and $p(Y_2 \mid X_2)$ on their inputs $X_i$ is completely determined by it.

In fine-tuning, the correlation between $p(Y_1 \mid X_1)$ and $p(Y_2 \mid X_2)$ is instead governed by a mediating variable, a latent $\Theta$.
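A toy sketch of this mediated correlation (everything here is made up for illustration: $\Theta$ is a shared latent weight vector, and each task’s true weights are noisy copies of it). Transfer helps because an estimate learned on task 1 lands near task 2’s solution:

```python
import numpy as np

rng = np.random.default_rng(3)

# Shared latent Theta (a hypothetical weight vector) mediates the two tasks:
theta = rng.normal(size=5)
w1 = theta + 0.1 * rng.normal(size=5)   # task-1 solution, near Theta
w2 = theta + 0.1 * rng.normal(size=5)   # task-2 solution, also near Theta

# "Fine-tuning": weights learned on task 1 are a far better starting point
# for task 2 than a fresh random initialization, since both sit near Theta.
random_init = rng.normal(size=5)
print(np.linalg.norm(w1 - w2))           # small gap: transfer helps
print(np.linalg.norm(random_init - w2))  # typically a much larger gap
```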

Example: Semi-supervised learning.

Semi-supervised learning is possible when $p(X)$ and $p(Y \mid X)$ are correlated as random variables when sampled from $\mu$.

Equivalently, if there is a random variable $\Theta$ such that $p(X)$ and $p(Y \mid X)$ both depend on it, i.e. $I(p(X); \Theta) > 0$ and $I(p(Y \mid X); \Theta \mid p(X)) > 0$ (where $I(\cdot\,;\cdot)$ indicates mutual information and $I(\cdot\,;\cdot \mid \cdot)$ indicates conditional mutual information).

I’ve written about this with more exposition here, but this is only possible if $X$ and $Y$ have a common cause $\Theta$, or if $Y$ causes $X$ (in which case you could simply let $\Theta$ be $Y$). Semi-supervised learning is not possible when the only causal relationship between $X$ and $Y$ is $X \to Y$; this is known as the principle of independent causal mechanisms.
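A toy Monte Carlo illustration of the two regimes (all distributions here are invented for the example): when a common latent $\Theta$ drives both the parameter of $p(X)$ and the parameter of $p(Y \mid X)$, the two are correlated across tasks, so unlabeled data helps; when the mechanisms are independent, it doesn’t:

```python
import numpy as np

rng = np.random.default_rng(2)
n_tasks = 5000

# Case 1 (common cause): a latent Theta controls both the parameter of
# p(X) (here, the mean of X) and the parameter of p(Y | X) (here, a slope).
theta = rng.normal(size=n_tasks)
mean_x = theta + 0.2 * rng.normal(size=n_tasks)
slope = theta + 0.2 * rng.normal(size=n_tasks)
corr_common = np.corrcoef(mean_x, slope)[0, 1]

# Case 2 (independent mechanisms, X -> Y only): the parameters of p(X)
# and p(Y | X) are drawn independently across tasks.
mean_x_ind = rng.normal(size=n_tasks)
slope_ind = rng.normal(size=n_tasks)
corr_ind = np.corrcoef(mean_x_ind, slope_ind)[0, 1]

print(corr_common)  # high: learning p(X) from unlabeled data informs p(Y|X)
print(corr_ind)     # near zero: unlabeled data tells you nothing about p(Y|X)
```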

Comments

I have no idea what to make of the random stray downvotes

I agree, some explanation would be welcome.

I didn't vote either way, because I do not understand the article, but I am also not confident enough to blame it on you, so I abstain from voting.

I suspect the answer is something like: the explanation is not very clear, or possibly wrong.