There is no tree of evolution

Published on Sun 05 August 2012. Filed under . Tags .

We often see diagrams called ‘trees’ showing how different beings or things evolve from others. These are used to describe families, species, languages, programs and other entities. Most of them share two problems: they aren’t trees as in graph theory (while we reason about them as trees) and they don’t have discrete, unchanging nodes.

Directed acyclic graphs of evolution

Let’s assume for a while that there are immutable objects with derivative objects being created in atomic ways. These objects are nodes (or vertices), there are edges connecting them from the base one to the derived one.

The ‘tree’ will have nodes with multiple outward edges. However, in many real cases it will also have multiple inward edges. It isn’t a tree, the graph theoretic name for it is a directed acyclic graph, or a DAG. (We won’t get cycles by just adding immutable nodes having immutable lists of predecessor nodes, which seems all what we can do with unidirectional time and really atomic objects.)

Probably the most famous DAGs called ‘trees’ are ‘family trees’. The larger and more complex ones aren’t trees, there is a newer example in The Art of Computer Programming by Donald E. Knuth, Volume 1 Third Edition, Figure 18(a) on page 310 (the family DAGs of Eldar or Edain in the books of J.R.R. Tolkien are more interesting, although improbable without elven lifespans, cases). Some have ancestors aligned on layers depending on the distance from a descendant, these might have multiple nodes on different layers for the same person. It’s an obvious consequence of every person having approximately two parents and there not being exponentially more people several generations ago. As explained in the TAOCP (page 311), these graphs are trees if a node represents ‘a person in the role of mother or father of so-and-so’ (I haven’t seen this definition anywhere else).

There are more software-related examples of graphs not being trees (leaving other natural examples for the next section, since they don’t have obviously atomic nodes). Dependencies graphs of classes, modules and software packages (being specific sets of code, so they are atomic as considered in this article) clearly aren’t trees in general (and common) case. (Dependencies inside a package are often cyclic and this isn’t usually a problem. It is more difficult in case of dependencies between packages, compilers are a well-known case of this.) I believe the assumption that software packages dependency graphs are nearly always trees, optimizing the design of software using such graphs for this case and adding workarounds for a package occurring in multiple parts of the ‘tree’ leads to a much more complex solution than designing for a DAG.

A significantly different tree-vs-DAG understanding issue is that software doesn’t evolve in trees. A common free software ideology point of view is that the user gets a software package, modifies it using only their own ideas and publishes it. This isn’t true, a common and important case is deriving a new package (or technically a version of the package, both are equally nodes here) from multiple other packages. Nearly any use of shared libraries from multiple projects is a case of this. This leads to different software freedoms than the tree view (the Free Software Definition mentions this way of modifying programs, although accepts licenses restricting what licenses the original programs can be under; I see no solution for this better than using only GPL-compatible licenses for all cultural works).

(There is also an interesting, although unrelated to trees, issue of maps being nonplanar graphs. I don’t remember seeing an example of a map having its chromatic number greater than four, although there are many cases requiring four colours. Many countries in these cases are non-contiguous, so a graph of a map containing them doesn’t have to be planar.)

Are there atomic nodes?

In the above examples it was clear what objects were nodes. Each edge referred to complete atomic nodes (e.g. there are parents of a human, not of their specific organs, I don’t know any other cases than aren’t only medically significant unlike the probable uses of family ‘trees’). It isn’t in all common uses of ‘trees’. If the node refers to a mutable union of immutable objects (or a mutable object), the graph can easily have cycles, multiple edges between the same nodes (it even is’t usually called a graph), or just be too unclear to be useful.

A problem of such nodes is that we often don’t know clearly when we have the same node.

Probably the most well-known example is a phylogenetic tree showing a common origin of species (this isn’t the only evolution the title of this article refers to). As described in that article, the ‘ideal’ approximated by such trees isn’t a tree, due to interesting issues like horizontal gene transfer which clearly makes it not a tree. Another issue is that there is no clear separation between species.

The same problem occurs in the evolution of languages. There is no clear separation between languages (except for the political ones). If, in parallel to the definition of a species as a set of animals being able to have common descendants, we define a language as a set of mutually-intelligible sentences, there is no point of separating languages (for each pair of communicators there is at most one language, there is no observable way of finding which of these languages are ‘the same’; this might be a reason for other definitions being widely used).

There is a similar case for software evolution. There is a great graph of GNU/Linux distribution derivation. Since the distributions change, it has many vertical lines when a distribution e.g. changes its base distribution. Somehow this case avoids having the nodes as unclear as species or languages.

Let’s not call the next graph a ‘tree’, unless the objects being modeled are clearly separate, atomic and form trees, not more complex graphs.