Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Sound Colless-like balance indices for multifurcating trees

  • Arnau Mir ,

    Contributed equally to this work with: Arnau Mir, Lucía Rotger, Francesc Rosselló

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Writing – review & editing

    Affiliations Dept. of Mathematics and Computer Science, University of the Balearic Islands, E-07122 Palma, Spain, Balearic Islands Health Research Institute (IdISBa), E-07010 Palma, Spain

  • Lucía Rotger ,

    Contributed equally to this work with: Arnau Mir, Lucía Rotger, Francesc Rosselló

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Writing – review & editing

    Affiliation Dept. of Mathematics and Computing, University of La Rioja, E-26004 Logroño, Spain

  • Francesc Rosselló

    Contributed equally to this work with: Arnau Mir, Lucía Rotger, Francesc Rosselló

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Writing – original draft, Writing – review & editing

    cesc.rossello@uib.es

    Affiliations Dept. of Mathematics and Computer Science, University of the Balearic Islands, E-07122 Palma, Spain, Balearic Islands Health Research Institute (IdISBa), E-07010 Palma, Spain

Abstract

The Colless index is one of the most popular and natural balance indices for bifurcating phylogenetic trees, but it makes no sense for multifurcating trees. In this paper we propose a family of Colless-like balance indices that generalize the Colless index to multifurcating phylogenetic trees. Each is determined by the choice of a dissimilarity D and a weight function . A balance index is sound when the most balanced phylogenetic trees according to it are exactly the fully symmetric ones. Unfortunately, not every Colless-like balance index is sound in this sense. We prove then that taking f(n) = ln(n + e) or f(n) = en as weight functions, the resulting index is sound for every dissimilarity D. Next, for each one of these two functions f and for three popular dissimilarities D (the variance, the standard deviation, and the mean deviation from the median), we find the most unbalanced phylogenetic trees according to with any given number n of leaves. The results show that the growth pace of the function f influences the notion of “balance” measured by the indices it defines. Finally, we introduce our R package “CollessLike,” which, among other functionalities, allows the computation of Colless-like indices of trees and their comparison to their distribution under Chen-Ford-Winkel’s α-γ-model for multifurcating phylogenetic trees. As an application, we show that the trees in TreeBASE do not seem to follow either the uniform model for multifurcating trees or the α-γ-model, for any values of α and γ.

Introduction

Since the early 1970s, the shapes of phylogenetic trees have been used to test hypothesis about the evolutive forces underlying their assembly [1]. The most used topological feature of phylogenetic trees in this regard is their symmetry, which captures the symmetry of the evolutionary histories described by them. The symmetry of a tree is usually measured through its balance (see [2], pp. 559–560), the tendency of the children of any given node to have the same number of descendant leaves. Several balance indices have been proposed so far to quantify the balance of a phylogenetic tree. The two most popular ones are the Colless index [3], whose definition we recall below and that only works for bifurcating trees, and the Sackin index [46], which is defined as the sum of the depths of the leaves in the tree and can be used on multifurcating trees. Other balance indices for bifurcating trees introduced so far include the variance of the depths of the leaves [4, 5], the sum of the reciprocals of the orders of the rooted subtrees [6], and the number of cherries [7]. As for balance indices for multifurcating trees, two recent additions are the total cophenetic index [8] and the quartet index [9]; for more proposals, see the section “Measures of overall asymmetry” in Felsenstein’s book [2] (pp. 562–563). This abundance of balance indices is partly motivated by Shao and Sokal’s advice on using more than one such index to quantify tree balance: see [6], p. 1990.

The Colless index C(T) of a bifurcating phylogenetic tree T is defined as the sum of the balance values of its internal nodes, where by the balance value of an internal node we mean the absolute value of the difference between the number of descendant leaves of its pair of children. In this way, the Colless index of a bifurcating tree measures the average balance value of its internal nodes, and therefore it quantifies in a very intuitive way its balance. In particular, C(T) = 0 if, and only if, T is a fully symmetric bifurcating tree with 2m leaves, for some m.

Unfortunately, the Colless index can only be used as defined on bifurcating trees. A natural generalization to multifurcating trees would be to define the balance value of a node as some measure of the dissimilarity of the numbers of descendant leaves of its children, like for instance their standard deviation, and then to add up all these balance values. But this definition has a drawback: this sum can be 0 on non-symmetric multifurcating trees, and hence the resulting index need not capture the symmetry of a tree in a sound way. For an example of this misbehavior, consider the tree depicted in Fig 1: each one of its nodes has all its children with the same number of descendant leaves and therefore the balance value of each node is 0 independently on the dissimilarity used to define it, but the tree is not symmetric. Replacing the number of descendant leaves by the number of descendant nodes, which in a bifurcating tree is simply twice the number of descendant leaves minus 1, does not overcome this drawback: again, all children of each node in the tree depicted in Fig 1 have the same number of descendant nodes.

thumbnail
Fig 1. Each node in this asymmetric tree has all its children with the same number of descendant leaves as well as with the same number of descendant nodes.

https://doi.org/10.1371/journal.pone.0203401.g001

In this paper we solve this problem by taking a suitable function and then replacing in this schema the number of descendant leaves or the number of descendant nodes of a node by the f-size of the subtree rooted at the node, defined as the sum of the images under f of the out-degrees of the nodes in the subtree. Then, we define the balance value (relative to such a function f and a dissimilarity D) of an internal node in a phylogenetic tree as the value of D applied to the f-sizes of the subtrees rooted at the children of the node. Finally, we define the Colless-like index of a phylogenetic tree as the sum of the balance values relative to f and D of its internal nodes.

The advantage of such a general definition is that there exist functions f such that, for every dissimilarity D, the resulting index satisfies that if, and only if, T is fully symmetric, in the sense that, for every internal node v, the subtrees rooted at the children of v have all the same shape. Two such functions turn out to be f(n) = ln(n + e) and f(n) = en.

The different growth pace of these two functions make them quantify the trees’ balance in different ways. We show it by finding the trees that are maximally unbalanced according to , that is, the trees with largest value, when f is one of these two functions and D is the variance, the standard deviation, and the mean deviation from the median. We show that the choice of the dissimilarity D does not cause any major difference in the maximally unbalanced trees relative to for a fixed f, but that changing the function f implies completely different maximally unbalanced trees.

We have written an R package called CollessLike, available at the CRAN, that, among other functionalities, computes Colless-like indices and simulates their distribution under the α-γ-model for multifurcating trees [10]. We have used the functions in this package to perform two experiments on the TreeBASE phylogenetic database [11]. First, we have compared the behavior of the Colless-like index obtained by taking f(n) = ln(n + e) and as dissimilarity D the mean deviation from the median, MDM, with two other balance indices for multifurcating trees: the Sackin index and the total cophenetic index. Next, we have used this Colless-like index to contrast the goodness of fit of the trees in TreeBASE to the uniform distribution for multifurcating trees and to the α-γ-model.

Materials

Notations and conventions

Throughout this paper, by a tree we always mean a rooted, finite tree without out-degree 1 nodes. As usual, we consider such a tree to be a directed graph, with its arcs pointing away from the root. Given a tree T, we shall denote its sets of nodes, of internal (that is, non-leaf) nodes, and of arcs by V(T), Vint(T), and E(T), respectively, and the out-degree of a node vV(T) by deg(v). A tree T is bifurcating when deg(v) = 2 for every vVint(T). Whenever we want to emphasize the fact that a tree need not be bifurcating, we shall call it multifurcating. The depth of a node in a tree T is the length (i.e., the number of arcs) of the directed path from the root to it, and the depth of T is the largest depth of any of its leaves. We shall always make the abuse of language of saying that two isomorphic trees are equal, and hence we shall always identify any tree with its isomorphism class. We shall denote by the set of (isomorphism classes of) trees with n leaves, and by the union .

A phylogenetic tree on a (non-empty, finite) set X of labels is a tree with its leaves bijectively labelled in the set X. We shall always identify every leaf in a phylogenetic tree T on X with its label, and in particular we shall denote its set of leaves by X. Two phylogenetic trees T1, T2 on X are isomorphic when there exists an isomorphism of directed graphs between them that preserves the labelling of the leaves. We shall also make always the abuse of language of considering two isomorphic phylogenetic trees as equal. Given a set of labels X, we shall denote by the set of (isomorphism classes of) phylogenetic trees on X, and we shall denote by , for every n ≥ 1, the set . Notice that if |X| = n, then any bijection X ↔ {1, 2, …, n} induces a bijection . Moreover, if |X| = n, there is a forgetful mapping that sends every phylogenetic tree to the corresponding unlabeled tree, which we shall call its shape.

No closed formula is known for the numbers or . Felsenstein gives in Chapter 3 in [2] an easy recurrence to compute and describes how to obtain such a recurrence for ; an explicit algorithm to compute the latter is provided in [12]. These numbers and form sequences A000311 and A000669, respectively, in Sloane’s On-Line Encyclopedia of Integer Sequences [13], where more information about them can be found.

A comb is a bifurcating phylogenetic tree with all its internal nodes having a leaf child: see Fig 2. We shall generically denote every comb in , as well as their shape in , by Kn. A star is a phylogenetic tree of depth 1: see Fig 3. For consistency with later notations, we shall denote the star in , and its shape in , by FSn.

Let T1, …, Tk be phylogenetic trees on pairwise disjoint sets of labels X1, …, Xk, respectively. The phylogenetic tree T1⋆ ⋯ ⋆Tk on X1 ∪ ⋯ ∪ Xk is obtained by adding to the disjoint union of T1, …, Tk a new node r and new arcs from r to the root of each Ti. In this way, the trees T1, …, Tk become the subtrees of T1⋆ ⋯ ⋆Tk rooted at the children of its root r; cf. Fig 4. A similar construction produces a tree T1⋆ ⋯ ⋆Tk from a set of (unlabeled) trees T1, …, Tk.

Given a node v in a tree T, we shall denote by Tv the subtree of T rooted at v and by κv its number of descendant leaves, that is, the number of leaves of Tv. An internal node v of a tree T is symmetric when, if v1, …, vk are its children, the trees are isomorphic. A tree T is fully symmetric when all its internal nodes are symmetric, and a phylogenetic tree is fully symmetric when its shape is so.

Given a number n of leaves, there may exist several fully symmetric trees with n leaves. For instance, there are three fully symmetric trees with 6 leaves, depicted in Fig 5. In fact, every fully symmetric tree with n leaves is characterized by an ordered factorization n1nk of n, with n1, …, nk ≥ 2. More specifically, for every k ≥ 1 and with n1, …, nk ≥ 2, let be the tree defined, up to isomorphism, recursively as follows:

  • is the star with n1 leaves.
  • If k ≥ 2, is a tree whose root has n1 children, and the subtrees at each one of these children are (isomorphic to) .
thumbnail
Fig 5. Three fully symmetric trees with 6 leaves: From left to right, FS6, FS2,3 and FS3,2.

https://doi.org/10.1371/journal.pone.0203401.g005

Every is fully symmetric, and every fully symmetric tree is isomorphic to some . Therefore, for every n, the number of fully symmetric trees with n leaves is equal to the number H(n) of ordered factorizations of n (sequence A074206 in Sloane’s On-Line Encyclopedia of Integer Sequences [13]).

The Colless index

The Colless index C(T) of a bifurcating tree T with n leaves is defined as follows [3]: if, for every vVint(T), we denote by v1 and v2 its two children and by and their respective numbers of descendant leaves, then

The Colless index of a phylogenetic tree is simply defined as the Colless index of its shape.

It is well-known that the maximum Colless index on the set of bifurcating trees with n leaves is reached at the comb Kn, and it is (see, for instance, [14]). In fact, this maximum is only reached at the comb. Since we have not been able to find an explicit reference for this last result in the literature and we shall make use of it later, we provide a proof here.

Lemma 1. For every bifurcating tree T with n leaves, if TKn, then C(T) < C(Kn).

Proof. Let T a bifurcating tree with n leaves different from the comb Kn. Let x be an internal node of smallest depth in it without any leaf child, and let T1T2 and T3T4 be the subtrees rooted at its children (see Fig 6); for every i = 1, 2, 3, 4, let ti be the number of leaves of Ti. Assume, without any loss of generality, that t1t2 and t1 + t2t3 + t4. Let then T′ be the tree obtained by pruning T2 from T and regrafting it to the other arc starting in x (see again Fig 6).

Then C(T′) > C(T). Indeed, the only nodes whose children change their numbers of descendant leaves from T to T′ are (cf. Fig 6): the node x; the parent y of the roots of T1 and T2 in T, which is removed in T′; and the parent z of the root of T2 in T′, which does not exist in T. Therefore,

So, this procedure takes a bifurcating tree with n leaves TKn and produces a new bifurcating tree T′ with the same number n of leaves and strictly larger Colless index. Since the number of bifurcating trees with n leaves is finite, the Colless index cannot increase indefinitely, which means that if we iterate this procedure, we must eventually stop at a comb Kn. And since the Colless index strictly increases at each iteration, we conclude that if TKn, then C(T) < C(Kn).

Methods

Colless-like indices

Let be a function that sends each natural number to a positive real number. The f-size of a tree is defined as If , for some set of labels X, then δf(T) is defined as δf(πX(T)).

Therefore, δf(T) is the sum of the degrees of all nodes in T, with these degrees weighted by means of the function f. Examples of f-sizes include:

  • The number of leaves, κ, which is obtained by taking f(0) = 1 and f(n) = 0 if n > 0.
  • The order (the number of nodes), τ, which corresponds to f(n) = 1 for every .
  • The usual size (the number of arcs), θ, which corresponds to f(n) = n for every .

Notice that δf satisfies the following recursion:

Table A in S2 File gives the abstract values of δf(T) for every with n = 2, 3, 4, 5.

Example 2. If T is a bifurcating tree with n leaves, and hence with n − 1 internal nodes, all of them of out-degree 2, then

Example 3. For every fully symmetric tree ,

Now let be the set of all non-empty finite-length sequences of real numbers. A dissimilarity on is any mapping satisfying the following conditions: for every ,

  • D(x1, …, xk) = D(xσ(1), …, xσ(k)), for every permutation σ of {1, …, k};
  • D(x1, …, xk) = 0 if, and only if, x1 = ⋯ = xk.

The dissimilarities that we shall explicitly use in this paper are the mean deviation from the median, the (sample) variance, and the (sample) standard deviation,

Let D be a dissimilarity on , a function, and δf the corresponding f-size, and let . For every internal node v in T, with children v1, …, vk, the (D, f)-balance value of v is

So, balD,f(v) measures, through D, the spread of the f-sizes of the subtrees rooted at the children of v. In particular, balD, f(v) = 0 if, and only if, .

Definition 4. Let D be a dissimilarity on and a function. For every , its Colless-like index relative to D and f, , is the sum of the (D, f)-balance values of the internal nodes of T:

If , for some set of labels X, then is defined as .

Example 5. If we take D = MDM and f the constant mapping 1, so that δf = τ, the usual order of a tree, then where, for every vVint(T), v1, …, vdeg(v) denote its children and their numbers of descendant nodes.

Notice that gets larger as the f-sizes of the subtrees rooted at siblings get more different, and therefore it behaves as a balance index for trees, in the same way as, for instance, the Colless index for bifurcating trees: the smaller the value of , the more balanced is T relative to the f-size δf.

It is clear that satisfies the following recursion: Therefore these Colless-like indices are recursive tree shape statistics in the sense of [15], relative to the f-size δf. Table A in S2 File also gives the abstract values of , for D = MDM, var, and sd, and for every with n = 2, 3, 4, 5.

The next result shows that, if we take D = MDM or D = sd, then any index restricted to only bifurcating trees defines, up to a constant factor, the usual Colless index.

Proposition 6. Let T be a bifurcating tree with n leaves and any function. Then,

Proof. Notice that, for every , and . We shall prove the statement for MDM; the proof for sd is identical, replacing the 2 in the denominator by . For every internal node v in a bifurcating tree T, if v1 and v2 denote its children, and therefore as we claimed.

If we define the quadratic Colless index of a bifurcating tree T as (where, for every vVint(T), v1, v2 denote its children), then, given that , a similar argument proves the following result.

Proposition 7. Let T be a bifurcating tree with n leaves and any function. Then,

As for the cost of computing Colless-like indices, we have the following result.

Proposition 8. If the cost of computing D(x1, …, xk) is in O(k) and the cost of computing each f(k) is at most in O(k), then, for every , the cost of computing is in O(n).

Proof. Assume that every f(k) is computed in time at most O(k). For every k ≥ 2, let mk be the number of internal nodes in T of out-degree k. Since the sizes δf(v) are additive, in the sense that if v has children v1, …, vk, then , we can compute the whole vector (δf(v))vV(T) in time O(n + ∑k≥2 mkk) = O(n) by traversing the tree in post-order.

Assume now that D(x1, …, xk) can be computed in time O(k). Then, for every internal node v of out-degree k, can be computed in time O(k), by simply reading the k f-sizes of its children (which are already computed) and applying D to them. This shows that the whole vector (balD, f(v))vV(T) can be computed again in time O(∑k≥2 mkk) = O(n). Finally, we compute by adding up the entries of (balD, f(v))vV(T), which still can be done in time O(n).

The dissimilarities mentioned previously in this subsection can be computed in a number of sums and multiplications that is linear in the length of the input vector, and the specific functions f that we shall consider in the next subsection, basically exponentials and logarithms, can be approximated to any desired precision in constant time by using addition and look-up tables [16].

Sound Colless-like indices

It is clear that for every dissimilarity D, for every function and for every fully symmetric tree , because balD, f(v) = 0 for every . We shall say that a Colless-like index is sound when the converse implication is true.

Definition 9. A Colless-like index is sound when, for every , if, and only if, T is fully symmetric.

In other words, is sound when, according to it, the most balanced trees are exactly the fully symmetric trees.

The Colless index C and its quadratic version C(2) are sound for bifurcating trees. Unfortunately, this is not always so for Colless-like indices for multifurcating trees. It is not so even for direct generalizations of C or C(2). For instance, , and , where κ denotes the number of leaves, are not sound; neither are , and , where τ denotes the number of nodes; and they are not sound even when replacing τ by θ, the usual size, which is simply τ − 1. For example, the tree T in Fig 1 is not fully symmetric, but .

The following lemma shows that the soundness of does not depend on D, but only on f.

Lemma 10. is sound if, and only if, δf(T1) ≠ δf(T2) for every pair of different fully symmetric trees T1, T2.

Proof. For the “only if” implication: if there exist two different (i.e., non isomorphic) fully symmetric trees T1, T2 such that δf(T1) = δf(T2), then the tree T = T1T2 is not fully symmetric, but

Conversely, assume that, for every pair of fully symmetric trees T1, T2, if δf(T1) = δf(T2) then T1 = T2. We shall prove by complete induction on n that if T is a tree with n leaves such that , then T is fully symmetric. If T has only one leaf, it is clearly fully symmetric. Now, assume that n > 1 and hence that T has depth at least 1. Let T1, …, Tk, k ≥ 2, be its subtrees rooted at the children of its root, so that T = T1⋆ ⋯ ⋆Tk. Then, implies, on the one hand, that , and hence, by induction, that T1, …, Tk are fully symmetric, and, on the other hand, that D(δf(T1), …, δf(Tk)) = 0, and hence that δf(T1) = ⋯ = δf(Tk), which, by assumption, implies that T1 = ⋯ = Tk: in summary, T is fully symmetric.

The following problem now arises:

Problem. To find functions such that is sound.

Unfortunately, many natural functions f do not define sound Colless-like indices, as the following examples show.

Example 11. If f(n) = an2 + bn + c, for any a, b, c, then is not sound, because, for example, δf(FS2,2,2,7) = δf(FS14,4) = 420a + 70b + 71c.

Example 12. If f(n) = nd, for any d ≥ 0, then is not sound. Indeed, for every d ≥ 3 (the case when d ≤ 2 is a particular case of the last example), take

  • k = 2d + 1 and l = 2;
  • for i = 1, …, k − 1;
  • nk = 2;
  • ;
  • ; notice that this exponent is an integer number, because k is odd and therefore d divides (d − 1)k + 1.

Then and hence, on the one hand, and, on the other hand, Therefore, .

Of course, for any given d there may existsmallercounterexamples: for instance, and .

Example 13. If f(n) = loga(n) (for some a > 1) when n > 0, and f(0) = 0, then is not sound: for instance, δf(FS2,2) = δf(FS8) = loga(8). In a similar way, if f(n) = loga(n + 1) (for some a > 1), then is not sound, either: for instance, δf(FS2,3,3) = δf(FS5,7) = loga(196608).

On the positive side, we shall show now two functions that define sound indices. The following lemmas will be useful to prove it.

Lemma 14. For every k, l ≥ 1 and n1, n2, …, nk, m1, m2, …, ml ≥ 2, if , n1n2nk = m1m2ml, and nk = ml, then .

Proof. If n1nk = m1ml and nk = ml, then n1nk−1 = m1ml−1. If, moreover, , that is, then and hence as claimed.

Lemma 15. If n1, …, nk ≥ 2, then

Proof. By induction on k. If k = 1, the statement says that 1 < n1, which is true by assumption. Assume now that the statement is true for any n1, …, nk ≥ 2, and let nk+1 ≥ 2. Then,

Proposition 16. If f(n) = en, then is sound.

Proof. Assume that there exist two non-isomorphic fully symmetric trees and such that that is, such that Assume that l is the smallest depth of a fully symmetric tree with en-size equal to the en-size of another fully symmetric tree non-isomorphic to it.

Since e is transcendental, the equality above implies the equality of polynomials in If l = 1, the right-hand side polynomial is simply and then the equality of polynomials implies that k = 1 and n1 = m1, which contradicts the assumption that . Now assume that l ≥ 2. This equality of polynomials implies the equality of their independent terms: n1nk = m1ml. On the other hand, the non-zeroth power of x with the largest coefficient in the left-hand side polynomial is (because all coefficients are non-negative, and, by Lemma 15, n1nk−1 alone is larger than the sum n1nk−2 + ⋯ + n1 + 1 of all other coefficients of non-zeroth powers of x) and, by the same reason, the non-zeroth power of x with the largest coefficient in the right-hand side polynomial is . The equality of polynomials implies then that nk = ml and hence, by Lemma 14, that , against the assumption on l. We reach thus a contradiction that implies that there does not exist any pair of non-isomorphic fully symmetric trees with the same en-size. By Lemma 10, this implies that is sound.

The same argument shows that is sound for every exponential function f(n) = rn with base r a transcendental real number. However, if r is not transcendental, then need not be sound. For instance, and .

Proposition 17. If f(n) = ln(n + e), then is sound.

Proof. The argument is similar to that of the previous proof. Let f(n) = ln(n + e) and assume that there exist two non-isomorphic fully symmetric trees and such that , that is, such that Assume that l is the smallest depth of a fully symmetric tree with f-size equal to the f-size of a fully symmetric tree non-isomorphic to it.

Applying the exponential function to both sides of the equality above, we obtain

Since e is transcendental, this implies the equality of polynomials in which, since n1, …, nk, m1, …, ml ≥ 2, on its turn implies the equalities If l = 1, the right-hand side polynomial in the second equality is simply x + m1 and then this equality of polynomials implies that k = 1 and n1 = m1, which contradicts the assumption that . Now assume that l ≥ 2. From the first equality we know that n1nk = m1ml. But, the root of the left-hand side polynomial in the second equality with largest multiplicity is −nk (because, by Lemma 15, n1nk−1 alone is greater than the degree of ) and, similarly, the root of the right-hand side polynomial in the second equality with largest multiplicity is −ml. Therefore, the equality of both polynomials implies that nk = ml and hence, by Lemma 14, , against the assumption on l. As in the previous proof, this contradiction implies that is sound.

The same argument proves that, for every transcendental number r > 1, the function f(n) = logr(n + r) defines sound indices . However, if r is not transcendental, then such a need not be sound. For instance, .

Summarizing, each one of the functions f(n) = ln(n + e) and f(n) = en defines, for every dissimilarity D, a Colless-like index that reaches its minimum value on each , 0, at exactly the fully symmetric trees.

Results

Maximally unbalanced trees

The next results give the maximum values of on when D = MDM, var or sd and f(n) = ln(n+ e) or f(n) = en. These maxima define the range of each on , and, dividing by them, we can define normalized Colless-like indices that can be used to compare the balance of trees with different numbers of leaves.

We begin with the function f(n) = ln(n + e), which is covered by the following theorem.

Theorem 18. Let f be a function such that 0 < f(k) < f(k − 1) + f(2), for every k ≥ 3. Then, for every n ≥ 2, the indices , and reach their maximum values on exactly at the comb Kn. These maximum values are, respectively,

The proof of this theorem is very long, and we devote to it the first three sections in S1 File, one section for each dissimilarity.

It is straightforward to check that the function f(n) = ln(n + e) satisfies the hypothesis of Theorem 18 (for the inequality f(k) ≤ f(k − 1) + f(2), notice that ln(k + e) ≤ ln(k + e − 1) + ln(2) if, and only if, k + e ≤ 2(k + e − 1), and this last inequality is true for every ). Therefore, , , and take their maximum values on at the comb Kn. In other words, the combs are the most unbalanced trees according to these indices. Table B in S2 File gives the values of , , and on , for n = 2, 3, 4, 5, and the positions of the different trees in each according to the increasing order of the corresponding index.

And for f(n) = en, we have the following result. We have also moved its proof to the S1 File.

Theorem 19. For every n ≥ 2:

  1. (a). If n ≠ 4, then both and reach their maximum on exactly at the tree FS1FSn−1 (see Fig 7), and these maximum values are
  2. (b). Both and reach their maximum on exactly at the comb K4, and these maximum values are
  3. (c). always reaches its maximum on exactly at the tree FS1FSn−1, and the maximum value is

So, according to , , and , the trees of the form FS1FSn−1 are the most unbalanced (except for n = 4 and D = MDM or sd, in which case the most unbalanced tree is the comb). Table B in S2 File also gives the values of these indices on , for n = 2, 3, 4, 5, and the positions of the different trees in each according to the increasing order of the corresponding index.

The R package “CollessLike”

We have written an R package called CollessLike, available at the CRAN (https://cran.r-project.org/web/packages/CollessLike/index.html), that computes the Colless-like indices and their normalized version, as well as several other balance indices, and simulates the distribution of these indices on under the α-γ-model [10]. This package contains functions that:

  • Compute the following balance indices for multifurcating trees: the Sackin index S [46], the total cophenetic index Φ [8], and the Colles-like index for several predefined dissimilarities D and functions f as well as for any user-defined ones.
    Our functions also compute the normalized versions (obtained by subtracting their minimum value and dividing by the width of their range, so that they take values in [0, 1]) of S, Φ and the Colless-like indices for which we have computed the range in Theorems 18 and 19. Recall from the aforementioned references that, for every n ≥ 2:
    • the range of S on is S(FSn) = n to
    • the range of Φ on is Φ(FSn) = 0 to

    Therefore, for every , the normalized Sackin and total cophenetic index are, respectively, while, for instance, the normalized version of is
  • Given two natural numbers n and N, produce a random sample of N values of a balance index S, Φ, or on trees in generated following an α-γ-model: the parameters N, n, α, γ (with 0 ≤ γα ≤ 1) can be set by the user.
    Due to the computational cost of this function, we have stored the values of S, Φ, and (denoted henceforth simply by ) on the random samples of N = 5000 trees in each (for every n = 3, …, 50 and for every α, γ ∈ {0, 0.1, 0.2, …, 0.9, 1} with γα) generated in the study reported in the next subsection. In this way, if the user is interested in this range of numbers of leaves and this range of parameters, he or she can study the distribution of the corresponding balance index efficiently and quickly.
  • Given a tree , estimate the percentile qT,n,α,γ of its balance index S, Φ, or with respect to the distribution of this index on under some α-γ-model. If n, α, γ are among those mentioned in the previous item, for the sake of efficiency this function uses the database of computed indices to simulate the distribution of the balance index on under this α-γ-model.

For instance, the unlabeled tree in Fig 8 is the shape of a phylogenetic tree randomly generated under the α-γ-model with α = 0.7 and γ = 0.4 (using set.seed(1000) for reproducibility). The values of its balance indices are given in the figure’s caption.

thumbnail
Fig 8. A tree with 8 leaves randomly generated under the α-γ-model with α = 0.7 and γ = 0.4.

Its indices are , S(T) = 18, and Φ(T) = 14, and its normalized indices are , Snorm(T) = 0.3704, and Φnorm(T) = 0.25.

https://doi.org/10.1371/journal.pone.0203401.g008

Fig 9 displays the estimation of the density function of the balance indices , S, and Φ under the α-γ-model with α = 0.7 and γ = 0.4 on , obtained using the 5000 random trees gathered in our database. Moreover, the estimated percentiles of the balance indices of the tree of Fig 8 are also shown in the figure.

thumbnail
Fig 9. The estimated density function of the distribution of , S and Φ on under the α-γ-model with α = 0.7 and γ = 0.4.

The percentiles of the tree in Fig 8 are also represented.

https://doi.org/10.1371/journal.pone.0203401.g009

Fig 10 shows a percentile plot of , S, and Φ under the α-γ-model for α = 0.7 and γ = 0.4 on . The percentiles of the tree of Fig 8 are given by the area to the left of the vertical lines.

thumbnail
Fig 10. Percentile plot of the distribution of , S and Φ on under the α-γ-model with α = 0.7 and γ = 0.4.

The percentiles of the tree of Fig 8 are also highlighted.

https://doi.org/10.1371/journal.pone.0203401.g010

A special case of the α-γ-model, corresponding to the case α = γ, is Ford’s α-model for bifurcating phylogenetic trees [17]. This model includes as special cases the Yule, or Equal-Rate Markov, model [18, 19] and the uniform, or Proportional to Distinguishable Arrangements, model [20, 21]. So, this package allows also to study this model. For example, the unlabeled tree in Fig 11 has been generated (with set.seed(1000)) using n = 8 and α = γ = 0.5, which corresponds to the uniform model. The figure also depicts the estimation of the density functions and of the percentile plots of , S, and Φ on under this model, as well as the percentile values of the tree.

thumbnail
Fig 11. A bifurcating tree randomly generated under the uniform model, the estimated density function of the distribution of the three balance indices on under the uniform model, and their percentile plot.

https://doi.org/10.1371/journal.pone.0203401.g011

Experimental results on TreeBASE

To assess the performance of , which we abbreviate by , we downloaded (December 13-14, 2015) all phylogenetic trees in the TreeBASE database [11] using the function search_treebase() of the R package treebase [22]. We obtained 13,008 trees, from which 80 had format problems that prevented R from reading them, so we restricted ourselves to the remaining 12,928 trees. To simplify the language, we shall still refer to this slightly smaller subset of phylogenetic trees as “all trees in TreeBASE”. Only 4,814 among these 12,928 trees in TreeBASE are bifurcating.

Then, for every phylogenetic tree T in this set, we have computed its Colless-like index , its Sackin index S(T), and its total cophenetic index Φ(T). We have compared the results in the ways we show next (all analysis have been performed with R [23]).

Behavior as functions of the number of leaves.

For every number of leaves n, we have computed the mean and the variance of , S and Φ on all trees with n leaves in TreeBASE. Then, we have computed the regression of these values as a function of n.

For the means, the best fits have been:

  • Colless-like index: , with a coefficient of determination of R2 = 0.9869 and a p-value for the exponent p < 2 ⋅ 10−16.
  • Sackin index: , with a coefficient of determination of R2 = 0.9953 and a p-value for the exponent p < 2 ⋅ 10−16.
  • Total cophenetic index: , with a coefficient of determination of R2 = 0.9945 and a p-value for the exponent p < 2 ⋅ 10−16.

Fig 12 depicts these mean values of (left), S (center), and Φ (right) as functions of n.

thumbnail
Fig 12. Growth of the mean value of (left), S (center), and Φ (right) in TreeBASE, as functions of the trees’ numbers of leaves n.

https://doi.org/10.1371/journal.pone.0203401.g012

Thus, S and have similar mean growth rates, while Φ has a mean growth rate one order higher in magnitude. This difference vanishes if we normalize the indices by their range width, which is O(n2) for and S, and O(n3) for Φ: As for the behavior of the variances, the best fits are the following:

  • Colless index: , with a coefficient of determination of R2 = 0.962 and a p-value for the exponent p < 2 ⋅ 10−16.
  • Sackin index: Var(S) ≈ 0.03182 ⋅ n3.22441, with a coefficient of determination of R2 = 0.9575 and a p-value for the exponent p < 2 ⋅ 10−16.
  • Total cophenetic index: Var(Φ) ≈ 0.0041 ⋅ n5.2075, with a coefficient of determination of R2 = 0.9812 and a p-value for the exponent p < 2 ⋅ 10−16.

The results are in the same line as before, with the variances of and S having similar growth rates, and the variance of Φ having a growth rate two orders of magnitude higher. This difference vanishes again when we normalize the indices: So, in summary, has, on TreeBASE and relative to the range of values, a slightly larger mean growth rate and a slightly smaller variance growth rate than the other two indices.

Numbers of ties.

The number of ties (that is, of pairs of different trees with the same index value) of a balance index is an interesting measure of quality, because the smaller its frequency of ties, the bigger its ability to rank the balance of any pair of different trees. Although, in our opinion, this ability need not always be an advantage: for instance, neither Φ nor S take the same, minimum, value on all different fully symmetric trees with the same numbers of leaves (for example, S(FS6) = 6 but S(FS2,3) = S(FS3,2) = 12; and Φ(FS6) = 0, but Φ(FS3,2) = 3 and Φ(FS2,3) = 6; cf. Fig 5), while applied to any fully symmetric tree is always 0. In this case, we believe that these ties are fair.

Anyway, for every number of leaves n and for every one of all three indices under scrutiny, we have computed the numbers of pairs of trees with n leaves in TreeBASE having the same value of the corresponding index (in the case of , up to 16 decimal digits). Fig 13 plots the frequencies of ties of , S and Φ as functions of n. As it can be seen in this graphic, and Φ have a similar number of ties, and consistently less ties than S.

thumbnail
Fig 13. Numbers of ties of , S, and Φ in TreeBASE, as functions of the trees’ numbers of leaves n.

https://doi.org/10.1371/journal.pone.0203401.g013

Spearman’s rank correlation.

In order to measure whether all three indices sort the trees according to their balance in the same way or not, we have computed the Spearman’s rank correlation coefficient [24] of the indices on all trees in TreeBASE, as well as grouping them by their number of leaves n.

The global Spearman’s rank correlation coefficient of and S is 0.9765, and that of and Φ is 0.9619. The graphics in Fig 14 plot these coefficients as functions of n. As it can be seen, Spearman’s rank correlation coefficient for and S grows with n, approaching to 1, while the coefficient for and Φ shows a decreasing tendency with n.

thumbnail
Fig 14. Spearman’s rank correlation coefficient of and S (left) and of and Φ (right) in TreeBASE, as functions of the trees’ numbers of leaves n.

https://doi.org/10.1371/journal.pone.0203401.g014

Does TreeBASE fit the uniform model or the alpha-gamma model?

In this subsection, we test whether the distribution of the Colless-like index of the phylogenetic trees in TreeBASE agrees with its theoretical distribution under either the uniform model for multifurcating phylogenetic trees [25] or the α-γ-model [10] for some parameters α, γ. To do it, we use the normalized version of , which can be used simultaneously on trees with different numbers of leaves.

To estimate the theoretical distribution of this index under the two aforementioned theoretical models, for every n = 3, …, 50 we have generated, on the one hand, 10,000 random phylogenetic trees in under the uniform model using the algorithm described in [25], and, on the other hand, 5000 random phylogenetic trees in under the α-γ-model for every pair of parameters (α, γ)∈{0, 0.1, 0.2, …, 0.9, 1}2 with γα. We have computed the value of on all these trees, and we have used the distribution of these values as an estimation of the corresponding theoretical distribution. To test whether the distribution of the normalized Colless-like index on TreeBASE (or on some subset of it: see below) fits one of these theoretical distributions, we have performed two non-parametric statistical tests on the observed set of indices of TreeBASE and the corresponding simulated set of indices: Pearson’s chi-squared test and the Kolmogorov-Smirnov test, using bootstrapping techniques in the latter to avoid problems with ties.

As a first approach, we have performed these tests on the whole set of trees in TreeBASE. The p-values obtained in all tests, be it for the uniform model or for any considered pair (α, γ), have turned out to be negligible. Then, we conclude confidently that the distribution of the normalized Colless-like index on TreeBASE does not fit either the uniform model or any α-γ-model when we round α, γ to one decimal place. For instance, Fig 15 displays the distribution of on TreeBASE and its estimated theoretical distribution under the uniform model. As it can be seen, these distributions are quite different, which confirms the conclusion of the statistical test.

thumbnail
Fig 15. The distribution of on all trees in TreeBASE (black line) and its estimated theoretical distribution under the uniform model (red line).

https://doi.org/10.1371/journal.pone.0203401.g015

Fig 16 displays the distribution of for all trees in TreeBASE and its estimated theoretical distribution under the α-γ-model for the pair of parameters α, γ that gave the largest p-values in the goodness of fit tests, which are α = 0.7 and γ = 0.4. Although graphically both distributions are quite similar, the p-values of the Pearson chi-squared test and of the Kolmogorov-Smirnov test are virtually zero. One might think that the high “peaks” of the theoretical distribution near 0 and 1 could have influenced the outcome of these statistical tests. For this reason, we have repeated them without taking into account these “extreme” values, and the results have been the same.

thumbnail
Fig 16. The distribution of on all trees in TreeBASE (black line) and its estimated theoretical distribution under the α-γ-model with α = 0.7 and γ = 0.4 (blue line).

https://doi.org/10.1371/journal.pone.0203401.g016

Since TreeBASE gathers phylogenetic trees of different types and from different sources, we have also considered subsets of it defined by means of attributes. More specifically, besides the whole TreeBASE as explained above, we have also considered the following subsets of it:

  • All trees in TreeBASE up to repetitions: we have removed 513 repeated trees (which represent about a 4% of the total).
  • All trees with their kind attribute equal to “Species”. This kind attribute can take three values: “Barcode tree”, “Gene Tree” and “Species Tree”.
  • All trees with their kind attribute equal to “Species” and their type attribute equal to “Consensus”. This type attribute can take two values: “Consensus” and “Single”.
  • All trees with their kind attribute equal to “Species” and their type attribute equal to “Single”.

We have repeated the study explained above for these four subsets of TreeBASE, comparing the distribution of the normalized Colless-like indices of their trees with the estimated theoretical distributions by means of goodness-of-fit tests, and the results have been the same, that is, all p-values have also turned out to be negligible. Our conclusion is, then, that neither the whole TreeBASE nor any of these four subsets of it seem to fit either the uniform model or some α-γ-model.

Conclusions

In this paper we have introduced a family of Colless-like balance indices , which depend on a dissimilarity D and a function , that generalize the Colless index to multifurcating phylogenetic trees. We have proved that every combination of a dissimilarity D and a function either f(n) = ln(n + e) or f(n) = en, defines a Colless-like index that is sound in the sense that the maximally balanced trees according to it are exactly the fully symmetric ones. But, the growth of the function f determines strongly which are the most unbalanced trees according to , and hence it has influence on the very notion of “balance” measured by the index.

In our opinion, choosing ln(n + e) instead of en seems a more sensible decision, because, on the one hand, the most unbalanced trees according to the former are the expected ones—the combs—and, on the other hand, we have encountered several hard numeric problems when working with the extremely large figures that appear when using en-sizes on trees with internal nodes of high degree. With respect to the choice of the dissimilarity D, MDM and sd define indices that are proportional to the Colless index when applied to bifurcating trees. From these two options, we recommend to use MDM because it only involves linear operations, and hence it has less numerical precision problems than sd, that uses a square root of a sum of squares. This is the reason we have stuck to in the numerical experiments reported in the Results section.

To end this paper, we would like to call the reader’s attention on the problem posed in the subsection “Sound Colless-like indices:” to find functions f such that is sound. Our conjecture is that there is no function taking values in the set of natural numbers that satisfies this property.

Supporting information

S1 File. Proofs of Theorems 18 and 19.

The file provides the detailed proofs of Theorems 18 and 19.

https://doi.org/10.1371/journal.pone.0203401.s001

(PDF)

S2 File. Tables.

The file provides two tables, quoted in the main text, with the values of several Colless-like indices on for n = 2, 3, 4, 5.

https://doi.org/10.1371/journal.pone.0203401.s002

(PDF)

Acknowledgments

This research has been partially supported by the Obra Social la Caixa through the “Programa Pont La Caixa per a grups de recerca de la UIB” and by the Spanish Ministry of Economy and Competitiveness and European Regional Development Fund through project DPI2015-67082-P (MINECO/FEDER). There was no additional external funding received for this study. We thank K. Bartoszek and J. Miró-Juliá for several useful suggestions and A. Saldaña Plomer for making available to us his Java script that generates random phylogenetic trees with a fixed number of leaves with uniform distribution.

References

  1. 1. Mooers A, Heard S. Inferring evolutionary process from phylogenetic tree shape. Q Rev Biol 1997; 72: 31–54.
  2. 2. Felsenstein J. Inferring Phylogenies. Sinauer Associates Inc.; 2004.
  3. 3. Colless D. Review of “Phylogenetics: the theory and practice of phylogenetic systematics”. Sys Zool 1982; 31: 100–104.
  4. 4. Kirkpatrick M, Slatkin M. Searching for evolutionary patterns in the shape of a phylogenetic tree. Evolution; 1993; 47: 1171–1181.
  5. 5. Sackin M. “Good” and “bad” phenograms. Sys Zool 1972; 21: 225–226.
  6. 6. Shao K, Sokal R. Tree balance. Sys Zool 1990; 39: 226–276.
  7. 7. McKenzie A. Distributions of cherries for two models of trees. Math Biosci 2000; 164: 81–92. pmid:10704639
  8. 8. Mir A, Rosselló F, Rotger L. A new balance index for phylogenetic trees. Math Biosci 2013; 241: 125–136. pmid:23142312
  9. 9. Coronado T, Mir A, Rosselló F, Valiente G. A balance index for phylogenetic trees based on quartets. J Math Biol, forthcoming 2018. Available from: arXiv:1803.01651.
  10. 10. Chen B, Ford D, Winkel M. A new family of Markov branching trees: the alpha-gamma model. Electron J Probab 2009; 14: 400–430.
  11. 11. Sanderson M, Donoghue M, Piel W, Eriksson T. TreeBASE: a prototype database of phylogenetic analyses and an interactive tool for browsing the phylogeny of life. Am J Botany 1994; 81: 183. Available from: https://treebase.org.
  12. 12. Xiang Y, Zhu Z, Li Y. Enumerating unlabeled and root labeled trees for causal model acquisition. In: Advances in Artificial Intelligence: Springer; 2009; 158–170.
  13. 13. The On-Line Encyclopedia of Integer Sequences; 2010. Available from: http://oeis.org/.
  14. 14. Rogers J. Response of Colless’s tree imbalance to number of terminal taxa. Sys Biol 1993; 42: 102–105.
  15. 15. Matsen F. Optimization Over a Class of Tree Shape Statistics. IEEE/ACM Trans Comput Biol Bioinform 2007; 4: 506–512. pmid:17666770
  16. 16. Yan X, Tang T, Deng Y, Du J, Yang X. Evaluation of transcendental functions on Imagine architecture. In: International Conference on Parallel Processing 2007: IEEE Press; 2007; 53–53.
  17. 17. Ford D. Probabilities on cladograms: Introduction to the alpha model. 2005. Preprint. Available from: arXiv:math/0511246.
  18. 18. Harding E. The probabilities of rooted tree-shapes generated by random bifurcation. Adv Appl Prob 1971; 3: 44–77.
  19. 19. Yule G. A mathematical theory of evolution based on the conclusions of Dr J. C. Willis. Phil Trans Royal Soc (London) Series B 1924; 213: 21–87.
  20. 20. Cavalli-Sforza L, Edwards A. Phylogenetic analysis. Models and estimation procedures. Am J Hum Genet 1967; 19: 233–257. pmid:6026583
  21. 21. Pinelis I. Evolutionary models of phylogenetic trees. Roy Soc Lond Proc Ser Biol Sci 2003; 270: 1425–1431.
  22. 22. Boettiger C, Temple Lang D. Treebase: an R package for discovery, access and manipulation of online phylogenies. Methods Ecol Evol 2012; 3: 1060–1066.
  23. 23. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing; 2008.
  24. 24. Spearman C. The proof and measurement of association between two things. Am J Psychol 1904; 15: 72–101.
  25. 25. Oden N, Shao K. An algorithm to equiprobably generate all directed trees with k labeled terminal nodes and unlabeled interior nodes. Bull Math Biol 1984; 46: 379–387 pmid:6547358