Variations On Tree Reconstruction

adamShimi

Methodology unearths and explores regularities, structural properties of the world that enable our methods to (sometimes) snatch victory from the jaws of ever-present computational intractability.

The key intuition is that if different fields share the same regularity, then we can apply the methods from one to the others.

Yet in practice, domain-specific regularities play a significant role too, and they might alter some of the methods’ portability.

Take the Genealogical Regularity as a case study: it says that there are entities related genealogically, meaning that each entity comes from a previous entity, with some changes (for example mutations in evolution).

Wherever we find the Genealogical Regularity, we can apply some form of Comparative Method: by carefully comparing the similarities and the differences between different entities, we can infer their genealogical relations, and maybe even reconstruct a lost parent entity from which they all spring.

I expect most readers will pattern match to phylogenetics, and its famous evolution trees where the entities are whole species.

But in the spirit of comparing exploitations of the same regularity across fields, I want to introduce two other success stories of the Comparative Method: historical linguistics (which gave us the name Comparative Method) and textual criticism.

I’ve already discussed historical linguistics in a previous post: the reconstruction of past languages, and of the family relationships between languages, from comparing existing languages and traces of dead languages. But I want to slow down and spend a few paragraphs on textual criticism, if only to pay my respects to the lives and the ingenuity poured into saving and salvaging old lore from the abyss of history.

When you read a nicely printed edition of an old text, say Meditations by Marcus Aurelius, where does the text come from? In this case the work is almost 2000 years old, and was likely written on wax tablets. Because such material is perishable, the original has most likely disintegrated.^[1] And even if it had not, the original manuscript would most likely have been lost in some invasion or sacking.

And yet we can read Marcus’ words, because they were copied by hand, over and over, across the centuries (until we reach the printing press). But as anyone who has had to copy even a small text by hand can report, it is very easy to make copying mistakes. And given the often brutal speed at which copies had to be made, and the sheer number of copies, we know for sure that almost all surviving manuscripts will differ, as they indeed do.

The problem which textual criticism must solve is thus: what is the “right” version of the text? Is it any specific manuscript, or something that must be reconstructed by combining different manuscripts?

This is where the Genealogical Regularity and the Comparative Method come into play: textual criticism infers the stemma, or the genealogical tree of manuscripts, and uses it to reconstruct as much as possible of the archetype, the parent manuscript from which all the extant manuscripts come.^[2]

Looping back to the initial topic of the post, phylogenetics, historical linguistics, and textual criticism all exploit the Genealogical Regularity through the Comparative Method. Yet the implementation of the method is not the same for all of these fields, and the differences are interesting.

For example, whereas phylogenetics works well with powerful statistical methods (computational phylogenetics), textual criticism and to an extent historical linguistics have not gotten the same results. Why is that so?

The reason is that these three fields differ in regularities about how common alternative mechanisms for change are, including:

(Convergent evolution) The independent evolution of similar properties in the entities.
- The memey example is carcinization, the convergent evolution of crab-like features in crustaceans
(Borrowing) Change coming from horizontal interactions with other entities, distinct from inheritance.
- The classic example is the borrowing of words between languages in contact in historical linguistics, which then affects which methods are available for implementing the Comparative Method.

In both cases, the change is not coming from either mutation or inheritance, and thus is merely noise in the reconstruction of the genealogical relations.

Phylogenetics is the most regular here, because by and large most species cannot borrow DNA and features from others^[3], and convergent evolution is usually quite rare, much rarer than shared ancestry. So almost all similarities and differences in phylogenetics will be due to genealogical processes, allowing automated statistical inference: the aggregation of a lot of relevant partial information points to the right answer, or at least moves the needle toward it.

But in textual criticism for example, this is completely reversed: by default copyists make many convergent errors (typos, losing a line or an expression…), and so most of the variations provide only noise for textual criticism.

Since then, the distinction between variants (which are very numerous, polygenetic, and irrelevant for genealogy) and significant errors (which, as a rule, are, may derive from earlier copies and are thus useful for the construction of the stemma) is made in just about all manuals […]

[…]

However, singling out the innumerable non-significant variants (what van Mulken calls “noise”) and deciding to disregard them in manuscript classification are two indispensable steps for the construction of a reliable stemma.

(Paolo Trovato, Everything You Always Wanted To Know About Lachmann’s Method, p.110,115)

I don’t have a good grasp of where historical linguistics stands here, but given the susceptibility of genealogical relations to the choice of corresponding words, I would expect it is closer to textual criticism than phylogenetics.

To summarize, despite sharing a high-level regularity, statistical methods do not transfer well from phylogenetics to textual criticism because of a missing lower-level regularity: that vertically inherited changes are the vast majority of changes.
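As a toy illustration of why this missing regularity matters, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the four witnesses, the labels "S1"/"S2" for significant errors and "v1" to "v5" for trivial variants, and the crude "count shared readings" similarity. It is not how stemmata are actually built; it only shows that naive aggregation follows the noise, while filtering the noise first recovers the intended families.

```python
from itertools import combinations

# Hypothetical toy data: each witness (manuscript) maps to the set of
# readings where it departs from the others.
# "S*" = significant errors, assumed inherited from a shared ancestor.
# "v*" = trivial variants, assumed made independently by several copyists.
witnesses = {
    "A": {"S1", "v1", "v2", "v3"},
    "B": {"S1", "v4", "v5"},
    "C": {"S2", "v1", "v2", "v3"},
    "D": {"S2", "v4", "v5"},
}

def shared(a, b, keep=lambda reading: True):
    """Count the readings two witnesses share, after an optional filter."""
    return sum(1 for r in witnesses[a] & witnesses[b] if keep(r))

def closest_pairs(keep=lambda reading: True):
    """Return the pairs of witnesses that share the most (kept) readings."""
    scores = {
        pair: shared(pair[0], pair[1], keep)
        for pair in combinations(sorted(witnesses), 2)
    }
    best = max(scores.values())
    return sorted(pair for pair, score in scores.items() if score == best)

# Counting every variant groups A with C, purely because of convergent noise.
naive = closest_pairs()

# Keeping only significant errors recovers the intended families {A, B} and {C, D}.
filtered = closest_pairs(keep=lambda r: r.startswith("S"))

print(naive)     # [('A', 'C')]
print(filtered)  # [('A', 'B'), ('C', 'D')]
```

The hard part in practice is of course the filter itself: deciding which readings count as significant is a matter of philological judgment, which this sketch simply takes as given.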

Another difference in applicability of methods is what can be done with the reconstructed tree.

Here the best case scenario is actually textual criticism: what we are approximating is an actual manuscript which existed. It might not be the original, but it was a real artifact, probably closer to the original than anything we have access to.

Whereas in historical linguistics, reconstructed proto-languages are a sort of compressed abstraction of a language:

The most helpful metaphor to explain this is the ‘constellation’ analogy. Constellations of stars in the night sky, such as The Plough or Orion, make sense to the observer as points on a sphere of a fixed radius around the earth. We see the constellations as two-dimensional, dot-to-dot pictures, on a curved plane. But in fact, the stars are not all equidistant from the earth: some lie much further away than others. Constellations are an illusion and have no existence in reality. In the same way, the asterisk-heavy ‘star-spangled grammar’ of reconstructed PIE may unite reconstructions which go back to different stages of the language. Some reconstructed forms may be much older than others, and the reconstruction of a datable lexical item for PIE does not mean that the spoken IE parent language must be as old (or as young) as the lexical form.

(James Clackson, Indo-European Linguistics: An Introduction, p.16)

What about phylogenetics? Well, here the attempts at reconstructing shared ancestors strike me as far more limited. And there is no analogous approach to placing dead languages or known-but-lost manuscripts as nodes in the trees with phylogenetics, because fossils are treated as merely additional leaves of the tree that stopped evolving when they died.

Although a tree implies the existence of certain ancestors, and even implies that those ancestors had certain combinations of traits […], tree thinking is primarily concerned with understanding the evolutionary connections among tips. While ancestors must have existed, we never need to directly interact with ancestors to reconstruct or utilize trees.

[…]

That is to say fossils are best viewed as tips of the tree that have a shorter branch (in units of time) connecting them to the (inferred) ancestors. They are treated as living forms that have undergone no evolution in the millions of years since they were entombed in rock.

(David A. Baum and Stacey D. Smith, Tree Thinking, p.41)
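The "fossils as tips with a shorter branch" picture can be made concrete with a small data structure. The sketch below is only an illustration under my own assumptions: the `Node` class, the taxon names, and the branch lengths (in arbitrary time units) are all hypothetical, not taken from Baum and Smith.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    name: str
    branch_length: float = 0.0  # time elapsed on the branch leading to this node
    children: List["Node"] = field(default_factory=list)

def tip_depths(node: Node, depth: float = 0.0) -> Dict[str, float]:
    """Map each tip (leaf) to its total time distance from the root."""
    depth += node.branch_length
    if not node.children:
        return {node.name: depth}
    depths: Dict[str, float] = {}
    for child in node.children:
        depths.update(tip_depths(child, depth))
    return depths

# Hypothetical tree: the inferred ancestor is an internal node, never a sample.
tree = Node("root", 0.0, [
    Node("ancestor", 30.0, [
        Node("extant_a", 70.0),
        Node("fossil_x", 20.0),  # a tip like any other, just on a shorter branch
    ]),
    Node("extant_b", 100.0),
])

depths = tip_depths(tree)
present = max(depths.values())
# Tips that stop short of the present are the fossils.
fossils = [name for name, d in depths.items() if d < present]

print(depths)   # {'extant_a': 100.0, 'fossil_x': 50.0, 'extant_b': 100.0}
print(fossils)  # ['fossil_x']
```

Note that nothing in this structure lets a sampled specimen sit at an internal node, which is exactly the contrast with dead languages or known-but-lost manuscripts placed inside the tree.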

Why this difference? My guess is a mix of two sub-regularities:

(Age) Manuscripts and languages are much younger than species, and so the traces are fresher, and it is easier to place the reconstructed artefacts by comparing them to existing traces.
(Constraints) Manuscripts, even when they vary, are actually very close to each other, leaving far fewer things to decide on/infer than a full language, and god forbid a full species.

There are many other such examples of variations in methods explained by subtler and more implementation-level disparities in regularities.

To mention just one more, textual criticism can make far more precise inferences about how a change happened and why, because we have a much better understanding of the process of copying, and of the changes in dialect and education and all that goes into the mistake-making process.

For example, a notion of difficulty of the variants can be used (with care) to infer which variant is more likely to be original:

If […] in a variation place one reading is more difficult and the others easier, it is more likely that the lectio difficilior is the original one, and any attempts to make the passage more easy understandable, whether intentional (glosses) or not (unconscious banalizations), are secondary readings.
This theory needs to be clarified with a couple of examples. The difficulty can be of various kinds, e.g., lexical, syntactical, or conceptual. As late as the fifteenth century, some Italian writers still refer to lovers by the learned compound word filocapti, from the Greek phìlos [friend] and the Latin capere [to capture]. The word was frequently used in medieval Latin […]. If in the course of the transmission of the work a copyist were to introduce the phrase preso d’amore [captured by love] or such, usually this be assumed to be a typical lexical banalization, and the Greek-Latin compound would be the lectio difficilior.

(Paolo Trovato, Everything You Always Wanted To Know About Lachmann’s Method, p.118)
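To make the lexical case of this heuristic tangible, here is a deliberately crude sketch. It assumes that rarity in some reference corpus can stand in for lexical difficulty, which is my simplification and not Trovato's method; the frequency figures are invented for the example, and syntactic or conceptual difficulty is ignored entirely.

```python
# Hypothetical relative frequencies (occurrences per million words) in an
# imagined reference corpus; the numbers are invented for illustration only.
corpus_frequency = {
    "filocapti": 0.5,
    "preso d'amore": 40.0,
}

def lectio_difficilior(readings, frequency):
    """Return the rarest reading, using rarity as a rough proxy for difficulty."""
    return min(readings, key=lambda reading: frequency[reading])

candidate = lectio_difficilior(["preso d'amore", "filocapti"], corpus_frequency)
print(candidate)  # filocapti
```

The "with care" in the text matters here: a rare reading can just as well be a copyist's blunder, so a score like this could at most rank candidates for a human editor to weigh.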

There are things of this kind in historical linguistics, though less subtle and complex, such as analogy. But nothing that I know of in phylogenetics.

In a way, the fact that all this variation and these subregularities impact methods so much is a blow to my apologia for methodology. For if we need to map so many subtleties to know whether methods transfer, is there any hope for practical applications of methodology?

I don’t know. But from an intellectual curiosity perspective, I feel the same way as when, during my PhD, one of my clever hypotheses was proven wrong by a more intricate phenomenon: excited, for the world proved more interesting than I had guessed.

^[1] In some rare cases, even when the wax is gone, the writing can be retrieved. (Incidentally, this kind of stuff is one reason I really want to dig into and write about the black magic of epigraphy one day).
^[2] This parent manuscript can be, and is likely, different from the actual original. But it is the best that can be reconstructed given the available manuscripts.
^[3] Much more common in microorganisms.

Just throwing this out there as an idle thought when reading this post: I wonder if the reason linguistic phylogenetics and biological phylogenetics differ is that the phylogeny and the biological traits of the underlying species are often correlated with each other. I don't know if that's true for linguistics, and indeed I'm not sure what traits you are capable of identifying in a text or language that would affect the phylogenetic reconstruction itself.

I wonder if the reason linguistic phylogenetics and biological phylogenetics differ is that the phylogeny and the biological traits of the underlying species are often correlated with each other

I'm not sure I understand exactly what you mean?

My first guess was that you meant that different biological traits could be correlated with each other; that's definitely something you can also see in textual criticism and historical linguistics.

But then you seem to talk about correlation between biological traits and phylogeny, which sounds different, but I can't generate a meaning which is not trivial (correlation because traits cause phylogeny in a sense)

Ah I see I have been a little sloppy with my language - mea culpa

The extent to which traits and phylogenies are correlated is an open research question, see this wikipedia article. But aspects of biology that are unique to phylogenies such as diversification interact with traits in complex ways. The SSE models are a good introduction to the methodology of this (background here).

I don't know how to attach these ideas to linguistics because I can't think of a good concrete example.

As an aside, you also say that

For example, a notion of difficulty of the variants can be used (with care) to infer which variant is more likely to be original

There are things of this kind in historical linguistics, though less subtle and complex, such as analogy. But nothing that I know of in phylogenetics.

But this is very well understood in phylogenetics. This is the basis of codon models and the "maximum likelihood school" of phylogenetic modelling. You can see this by looking at IQ-TREE (a modern phylogenetic inference tool): https://iqtree.github.io/doc/Substitution-Models

But this is very well understood in phylogenetics. This is the basis of codon models and the "maximum likelihood school" of phylogenetic modelling. You can see this by looking at IQ-TREE (a modern phylogenetic inference tool): https://iqtree.github.io/doc/Substitution-Models

Good point! I guess I was mostly thinking of changes at the trait level, but you are right that now even for species the gene level is used, and that there are much more precise models of mutations at the gene level.

I haven't explored this in enough detail to figure out if there is anything as subtle as the diffraction process of textual criticism, where the hard reading is not only replaced by simpler ones but lost completely, yet can sometimes be inferred back from the meter of the poem, the rhymes, and other constraints.

The extent to which traits and phylogenies are correlated is an open research question, see this wikipedia article. But aspects of biology that are unique to phylogenies such as diversification interact with traits in complex ways. The SSE models are a good introduction to the methodology of this (background here). I don't know how to attach these ideas to linguistics because I can't think of a good concrete example.

Ah, I understand now.

Then I would say the similar claim in historical linguistics would be that languages which are more closely related genealogically would be more mutually intelligible.

For example, as a French speaker I can partially understand Italian much more than I can understand German (speaking neither).

The claim is weaker/broken in cases where one of the two languages has extensive borrowing from another unrelated language (for Romance, Spanish is an example because of its many borrowings from Arabic)

As for textual criticism, the analogy would be that the more closely manuscripts are related, the more they "feel the same". But honestly this would be much more subtle, given that different manuscripts are much closer to each other than different species or languages.
