Article

Nearest Embedded and Embedding Self-Nested Trees

Laboratoire Reproduction et Développement des Plantes, Univ Lyon, ENS de Lyon, UCB Lyon 1, CNRS, INRA, Inria, F-69342 Lyon, France
Submission received: 25 June 2019 / Revised: 23 August 2019 / Accepted: 27 August 2019 / Published: 29 August 2019
(This article belongs to the Special Issue Data Compression Algorithms and their Applications)

Abstract

Self-nested trees present a systematic form of redundancy in their subtrees and thus achieve optimal compression rates by directed acyclic graph (DAG) compression. A method for quantifying the degree of self-similarity of plants through self-nested trees was introduced by Godin and Ferraro in 2010. The procedure consists of computing a self-nested approximation, called the nearest embedding self-nested tree, that both embeds the plant and is the closest to it. In this paper, we propose a new algorithm that computes the nearest embedding self-nested tree with a smaller overall complexity, and that also computes the nearest embedded self-nested tree. We show from simulations that the latter is mostly the closest to the initial data, which suggests that this better approximation should be used as a privileged measure of the degree of self-similarity of plants.

1. Introduction

Trees form a wide family of combinatorial objects that offers many application fields, e.g., plant modeling and XML file analysis. Modern databases are huge and thus stored in compressed form. Compression methods take advantage of repeated substructures appearing in the tree. As explained in [1], one often considers the following two types of repeated substructures: subtree repeat (used in DAG compression [2,3,4,5]) and tree pattern repeat (exploited in tree grammars [6,7] and top tree compression [1]). We restrict ourselves to DAG compression of unordered rooted trees, which consists of building a Directed Acyclic Graph (DAG) that represents a tree without the redundancy of its identical subtrees (see Figure 1). Two different algorithms exist for computing the DAG reduction of a tree $\tau$ [5] (2.2 Computing Tree Reduction), which share the same time-complexity in $O(\#V(\tau)^2 \times D(\tau) \times \log(D(\tau)))$, where $V(\tau)$ denotes the set of vertices of $\tau$ and $D(\tau)$ its outdegree.
Trees that are the most compressed by DAG compression present the highest level of redundancy in their subtrees: all the subtrees of a given height are isomorphic. In this case, the DAG related to a tree $\tau$ is linear, i.e., there exists a path going through all vertices, with exactly $H(\tau) + 1$ vertices, $H(\tau)$ denoting the height of $\tau$, which is the minimal number of vertices among trees of this height (see $\tau_3$ in Figure 1). This family of trees has been introduced in [8] as the first interesting class of trees for which the subtree isomorphism problem is in $\mathrm{NC}^2$. It has been known first under the name of nested trees [8] and then of self-nested trees [5] to insist on their recursive structure and their proximity to the notion of structural self-similarity.
The authors of [5] are interested in capturing the self-similarity of plants through self-nested trees. They propose to construct a self-nested tree that minimizes the distance of the original tree to the set of self-nested trees that embed the initial tree. The distance to this Nearest Embedding Self-nested Tree (NEST) is then used to quantify the self-nestedness of the tree and thus its structural self-similarity (see $\tau$ and $\mathrm{NEST}(\tau)$ in Figure 2). The main result of [5] (Theorem 1 and E. NEST Algorithm) is an algorithm that computes the NEST of a tree $\tau$ from its DAG reduction in $O(H(\tau)^2 \times D(\tau))$.
The goal of the present article is three-fold. We aim at proposing a new and more explicit algorithm that computes the NEST of a tree $\tau$ with the same time-complexity $O(H(\tau)^2 \times D(\tau))$ as in [5] but that takes as input the height profile of $\tau$ and not its DAG reduction. We establish that the height profile of a tree $\tau$ can be computed in $O(\#V(\tau) \times D(\tau))$, reducing the overall complexity by a linear factor. Based on this work, we also provide an algorithm in $O(H(\tau)^2)$ that computes the Nearest embedded Self-nested Tree (NeST) of a tree $\tau$ (see $\tau$ and $\mathrm{NeST}(\tau)$ in Figure 2). Finally, we show from numerical simulations that the distance of a tree $\tau$ to its NeST is much lower than the distance to its NEST. The NeST is most of the time a better approximation of a tree than the NEST and thus should be preferred to quantify the degree of self-nestedness of plants.
The paper is organized as follows. The structures of interest in this paper, namely unordered trees, DAG compression and self-nested trees, are defined in Section 2. Section 3 is dedicated to the definition and the study of the height profile of a tree. The approximation algorithms are presented in Section 4. We give a new insight on the definitions of the NEST and of the NeST in Section 4.1. Our NEST algorithm is presented in Section 4.2, while the NeST algorithm is given in Section 4.3. Section 5 is devoted to simulations. We state that the NeST is mostly a better approximation of a tree than the NEST in Section 5.1. An application to a real rice panicle is presented in Section 5.2. A summary of the paper and concluding remarks can be found in Section 6. All the figures and numerical experiments presented in the article have been made with the Python library treex [9].

2. Preliminaries

2.1. Unordered Rooted Trees

A rooted tree $\tau$ is a connected graph containing no cycle, i.e., without chain from any vertex $v$ to itself, and such that there exists a unique vertex $R(\tau)$, called the root, which has no parent, and any vertex different from the root has exactly one parent. The leaves of $\tau$ are all the vertices without children. The set of vertices of $\tau$ is denoted by $V(\tau)$. The height of a vertex $v$ may be recursively defined as $H(v) = 0$ if $v$ is a leaf of $\tau$ and
$$H(v) = 1 + \max_{w \in C_\tau(v)} H(w)$$
otherwise, $C_\tau(v)$ denoting the set of children of $v$ in $\tau$. The height of the tree $\tau$ is defined as the height of its root, $H(\tau) = H(R(\tau))$. The outdegree $D(\tau)$ of $\tau$ is the maximal branching factor that can be found in $\tau$, i.e.,
$$D(\tau) = \max_{v \in \tau} \#C_\tau(v).$$
A subtree $\tau[v]$ rooted in $v$ is a particular connected subgraph of $\tau$. Precisely, $\tau[v] = (V[v], E[v])$ where $V[v]$ is the set of the descendants of $v$ in $\tau$ and $E[v]$ is defined as
$$E[v] = \{ (\xi, \xi') \in E(\tau) : \xi \in V[v],\ \xi' \in V[v] \},$$
with $E(\tau)$ the set of edges of $\tau$.
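The definitions above can be illustrated by a minimal sketch. It assumes an encoding that is not part of the paper: a rooted tree is a nested Python tuple whose entries are its child subtrees, a leaf being the empty tuple. The names `height`, `outdegree` and `size` are hypothetical helpers.

```python
# Assumed encoding (not from the paper): a tree is a tuple of child subtrees,
# and a leaf is the empty tuple ().

def height(tree):
    """H: 0 for a leaf, otherwise 1 + the largest height among the children."""
    return 0 if not tree else 1 + max(height(child) for child in tree)

def outdegree(tree):
    """D: largest number of children found at any vertex of the tree."""
    return max([len(tree)] + [outdegree(child) for child in tree])

def size(tree):
    """#V: number of vertices of the tree."""
    return 1 + sum(size(child) for child in tree)

# A root with two children: a leaf and a vertex carrying three leaves.
example = ((), ((), (), ()))
```

With this encoding, the subtree rooted at a vertex is simply the tuple stored at that position.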
In the sequel, we consider unordered rooted trees for which the order among the sibling vertices of any vertex is not significant. A precise characterization is obtained from the additional definition of isomorphic trees. Let $\tau$ and $\theta$ be two rooted trees. A one-to-one correspondence $\varphi : V(\tau) \to V(\theta)$ is called a tree isomorphism if, for any edge $(v, w) \in E(\tau)$, $(\varphi(v), \varphi(w)) \in E(\theta)$. Structures $\tau_1$ and $\tau_2$ are called isomorphic trees whenever there exists a tree isomorphism between them. One can determine if two $n$-vertex trees are isomorphic in $O(n)$ [10] (Example 3.2 and Theorem 3.3). The existence of a tree isomorphism defines an equivalence relation on the set of rooted trees. The class of unordered rooted trees is the set of equivalence classes for this relation, i.e., the quotient set of rooted trees by the existence of a tree isomorphism.
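One simple way to test isomorphism of unordered rooted trees is to compare canonical forms. The sketch below sorts the children recursively; it is not the linear-time procedure of [10] (sorting adds a logarithmic factor at least), and it assumes the nested-tuple encoding in which a leaf is `()`. The function names are illustrative.

```python
# Assumed encoding (not from the paper): a tree is a tuple of child subtrees.

def canonical_form(tree):
    """Return a representative of the isomorphism class of an unordered tree.

    Children are put in canonical form first, then sorted, so that two trees
    differing only by the order of siblings get the same representative.
    """
    return tuple(sorted(canonical_form(child) for child in tree))

def are_isomorphic(tree_a, tree_b):
    """Two unordered rooted trees are isomorphic iff their canonical forms agree."""
    return canonical_form(tree_a) == canonical_form(tree_b)

left = ((), ((),))      # a leaf, then a vertex with one leaf
right = (((),), ())     # same children in the other order
other = ((), (), ())    # three leaves under the root
```

Here `are_isomorphic(left, right)` holds while `are_isomorphic(left, other)` does not, since `other` has height 1.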

2.2. DAG Compression

Now we consider the equivalence relation “existence of a tree isomorphism” on the set of the subtrees of a tree $\tau$. We consider the quotient graph $Q(\tau) = (V, E)$ obtained from $\tau$ using this equivalence relation. $V$ is the set of equivalence classes on the subtrees of $\tau$, while $E$ is a set of pairs of equivalence classes $(C_1, C_2)$ such that $R(C_2) \in C_\tau(R(C_1))$ up to an isomorphism. The graph $Q(\tau)$ is a DAG [5] (Proposition 1), that is, a connected directed graph without path from any vertex $v$ to itself.
Let ( C 1 , C 2 ) be an edge of the DAG Q ( τ ) . We define N ( C 1 , C 2 ) as the number of occurrences of a tree of C 2 just below the root of any tree of C 1 . The tree reduction R ( τ ) is defined as the quotient graph Q ( τ ) augmented with labels N ( C 1 , C 2 ) on its edges [5] (Definition 3 (Reduction of a tree)). Intuitively, the graph R ( τ ) represents the original tree τ without its structural redundancies (see Figure 1).
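The construction of the reduction can be sketched as follows. This is a naive illustration based on canonical forms, not one of the two algorithms of [5], and it makes no claim about their complexity. It assumes the nested-tuple encoding of trees (a leaf is `()`); `dag_reduction` and its return format are hypothetical.

```python
# Assumed encoding (not from the paper): a tree is a tuple of child subtrees.

def dag_reduction(tree):
    """Return (class_of, edges) describing the reduction of the tree.

    class_of maps the canonical form of each subtree class to an integer id.
    edges maps (id of C1, id of C2) to the label N(C1, C2), i.e., the number
    of subtrees of class C2 just below the root of any tree of class C1.
    """
    class_of = {}
    edges = {}

    def visit(node):
        child_forms = sorted(visit(child) for child in node)
        form = tuple(child_forms)
        if form not in class_of:
            # First time this isomorphism class is met: create its DAG vertex
            # and count its children class by class.
            class_of[form] = len(class_of)
            for child_form in child_forms:
                key = (class_of[form], class_of[child_form])
                edges[key] = edges.get(key, 0) + 1
        return form

    visit(tree)
    return class_of, edges

# Example tree built for illustration: three identical subtrees of height 2.
h2 = ((), ((),))
tree = (h2, h2, h2)
classes, labels = dag_reduction(tree)
```

On this example the reduction has four vertices (leaf, height 1, height 2, root) although the tree has thirteen.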

2.3. Self-Nested Trees

A tree $\tau$ is called self-nested [5] (III. Self-nested trees) if for any pair of vertices $v$ and $w$, either the subtrees $\tau[v]$ and $\tau[w]$ are isomorphic, or one is (isomorphic to) a subtree of the other. This characterization of self-nested trees is equivalent to the following statement: for any pair of vertices $v$ and $w$ such that $H(v) = H(w)$, the subtrees $\tau[v]$ and $\tau[w]$ are isomorphic, i.e., all the subtrees of the same height are isomorphic.
Linear DAGs are DAGs containing at least one path that goes through all their vertices. They are closely connected with self-nested trees by virtue of the following result.
Proposition 1
(Godin and Ferraro [5]). A tree τ is self-nested if and only if its reduction R ( τ ) is a linear DAG.
This result proves that self-nested trees achieve optimal compression rates among trees of the same height whatever their number of nodes (compare τ 3 with τ 1 and τ 2 in Figure 1). Indeed, R ( τ ) has at least H ( τ ) + 1 nodes and the inequality is saturated if and only if τ is self-nested.

3. Height Profile of the Tree Structure

3.1. Definition and Complexity

This section is devoted to the definition of the height profile $\rho_\tau$ of a tree $\tau$ and to the presentation of an algorithm to calculate it. In the sequel, we assume that the tree $\tau$ is always traversed in the same order, say depth-first search to fix ideas. In particular, when vectors are indexed by nodes of $\tau$ sharing the same property, the order of their entries matters and should always be the same.
Given a vertex $v \in V(\tau)$,
$$\gamma_h(v) = \#\{ v' \in C_\tau(v) : H(\tau[v']) = h \}$$
is the number of subtrees of height $h$ directly under $v$. Now, we consider the vector
$$\rho_\tau(h_1, h_2) = \big( \gamma_{h_2}(v) : v \in V(\tau),\ H(\tau[v]) = h_1 \big)$$
made of the concatenation of the integers $\gamma_{h_2}(v)$ over subtrees $\tau[v]$ of height $h_1$ ordered in depth-first search. Consequently, $\rho_\tau$ is an array made of vectors with varying lengths.
Let $A_1$ and $A_2$ be two arrays for which each entry is a vector. We say that $A_1$ and $A_2$ are equivalent if, for any line $i$, there exists a permutation $\sigma_i$ such that for any column $j$,
$$A_1(i, j) = \sigma_i(A_2(i, j)).$$
In particular, $i$ being fixed, all the vectors $A_1(i, j)$ and $A_2(i, j)$ must have the same length. This condition defines an equivalence relation. The height profile of $\tau$ is the array $\rho_\tau$ as an element of the quotient space of arrays of vectors under this equivalence relation. In other words, the vectors $\rho_\tau(h_1, h_2)$, $0 \le h_2 < h_1$ and $h_1$ fixed, must be ordered in the same way but the choice of the order is not significant. Finally, it should already be remarked that $\rho_\tau(h_1, h_2) = \emptyset$ when $h_2 \ge h_1$ or $h_1 > H(\tau)$. Consequently, the height profile can be reduced to the triangular array
$$\rho_\tau = \big( \rho_\tau(h_1, h_2) \big)_{0 \le h_2 < h_1 \le H(\tau)}.$$
The map $\rho_\tau$ provides the distribution of subtrees of height $h_2$ just below the root of subtrees of height $h_1$ for all couples $(h_1, h_2)$, which typically represents the height profile of $\tau$. For clarity’s sake, we give the values of $\rho_{\tau_k}$ for the trees $\tau_k$ of Figure 1, coefficient $(i, j)$ of the matrix being $\rho_{\tau_k}(i, j-1)$,
$$\rho_{\tau_1} = \rho_{\tau_2} = \begin{pmatrix} (1,1,2) & & \\ (0,1,1) & (1,1,1) & \\ (0) & (0) & (3) \end{pmatrix} \quad \text{and} \quad \rho_{\tau_3} = \begin{pmatrix} (1,1,1) & & \\ (1,1,1) & (1,1,1) & \\ (0) & (0) & (3) \end{pmatrix}. \tag{1}$$
It should be noticed that the height profile does not contain all the topology of the tree since trees $\tau_1$ and $\tau_2$ of Figure 1 are different but share the same height profile (1). However, the height of a tree $\tau$ can be recovered from its height profile through the relation $H(\tau) = \dim(\rho_\tau)$, the dimension of $\rho_\tau$ being defined by
$$\dim(\rho_\tau) = \min\{ n \ge 0 : \forall\, i \ge 0,\ \rho_\tau(n+1, i) = \emptyset \}.$$
Proposition 2.
ρ τ can be computed in O ( # V ( τ ) × D ( τ ) ) -time.
Proof. 
First, attribute to each node $v \in V(\tau)$ the height of the subtree $\tau[v]$ with complexity $O(\#V(\tau))$. Next, traverse the tree in depth-first search in $O(\#V(\tau))$ and calculate for each vertex $v$ the vector $(\gamma_h(v))_{0 \le h < H(\tau[v])}$ in $\#C_\tau(v) \le D(\tau)$ operations. Finally, append this vector to $\rho_\tau(H(\tau[v]), \cdot)$ component by component. □
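The steps of this proof can be sketched in Python. The sketch assumes a nested-tuple encoding of trees (a leaf is `()`), merges the two traversals into a single recursion, and stores the profile in a dictionary keyed by $(h_1, h_2)$; these are illustration choices, not the treex implementation, and no complexity claim is attached to this sketch. The example tree is built so that its profile reproduces the array given for $\tau_1$; whether it coincides with the tree drawn in Figure 1 is an assumption.

```python
# Assumed encoding (not from the paper): a tree is a tuple of child subtrees.

def height_profile(tree):
    """Return {(h1, h2): [gamma_h2(v) for each vertex v of height h1]}.

    Vertices of a given height are listed in depth-first order, so all the
    vectors of a given line h1 are ordered in the same way.
    """
    profile = {}

    def visit(node):
        child_heights = [visit(child) for child in node]
        h1 = 1 + max(child_heights) if child_heights else 0
        # counts[h2] is gamma_h2(node) for 0 <= h2 < h1
        counts = [0] * h1
        for h in child_heights:
            counts[h] += 1
        for h2, gamma in enumerate(counts):
            profile.setdefault((h1, h2), []).append(gamma)
        return h1

    visit(tree)
    return profile

# Hypothetical tree whose profile matches the array given for tau_1.
a = (((),),)            # one child of height 1, no leaf child
b = ((), ((),))         # one leaf child, one child of height 1 with one leaf
c = ((), ((), ()))      # one leaf child, one child of height 1 with two leaves
tau1 = (a, b, c)
```

Since vertices of equal height are never ancestors of one another, visiting them in preorder or postorder yields the same relative order.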

3.2. Relation with Self-Nested Trees

Self-nested trees are characterized by their height profile considering the following result.
Proposition 3.
$\tau$ is self-nested if and only if, for any $0 \le h_2 < h_1 \le H(\tau)$, all the components of the vector $\rho_\tau(h_1, h_2)$ are the same (for instance see the profile (1) of the tree $\tau_3$ presented in Figure 1). In addition, a self-nested tree $\tau$ can be reconstructed from $\rho_\tau$ (see Algorithm 1).
Proof. 
If $\tau$ is self-nested, the $N_{h_1}$ subtrees of height $h_1$ appearing in $\tau$ are isomorphic and thus have the same number $n_{h_1, h_2}$ of subtrees of height $h_2$ just below their root. Consequently,
$$\rho_\tau(h_1, h_2) = \underbrace{(n_{h_1, h_2}, \ldots, n_{h_1, h_2})}_{N_{h_1}}.$$
The converse may be established considering the following lemma, whose proof presents no difficulty. □
Lemma 1.
If all the subtrees of height $0 \le h < H$ appearing in a tree $\tau$ are isomorphic, and if all the subtrees of height $H$ have the same number of subtrees of height $0 \le h < H$ just below their root, then all the subtrees of height $H$ appearing in $\tau$ are isomorphic.
All the subtrees of height 1 in $\tau$ are isomorphic because all the components of $\rho_\tau(1, 0)$ are the same. The expected result is shown by induction on the height thanks to the previous lemma, whose assumptions are satisfied since $\rho_\tau$ only contains vectors for which all the entries are equal. The previous reasoning also provides a way (presented in Algorithm 1) to build a unique (self-nested) tree $T$ from the height profile $\rho_\tau$. In addition, it is easy to see that $\tau$ and $T$ are isomorphic.
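The characterization of Proposition 3 translates into a short test: compute the height profile and check that every vector is constant. The sketch below is self-contained and assumes a nested-tuple encoding of trees (a leaf is `()`); the function names and the two example trees are illustrative.

```python
# Assumed encoding (not from the paper): a tree is a tuple of child subtrees.

def height_profile(tree):
    """Return {(h1, h2): list of the numbers of height-h2 children of the
    vertices of height h1}, vertices being listed in depth-first order."""
    profile = {}

    def visit(node):
        child_heights = [visit(child) for child in node]
        h1 = 1 + max(child_heights) if child_heights else 0
        for h2 in range(h1):
            profile.setdefault((h1, h2), []).append(child_heights.count(h2))
        return h1

    visit(tree)
    return profile

def is_self_nested(tree):
    """Proposition 3: self-nested iff each vector of the profile is constant."""
    return all(len(set(vector)) == 1 for vector in height_profile(tree).values())

h2 = ((), ((),))
nested = (h2, h2, h2)                                  # identical subtrees
not_nested = ((((),),), ((), ((),)), ((), ((), ())))   # subtrees differ
```

A tree reduced to its root has an empty profile and is reported as self-nested, which agrees with the definition.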
To present the algorithm of reconstruction of a self-nested tree from its height profile, we need to define the restriction of a height profile to some height. Let $p$ be a height profile. The restriction $p|_h$ of $p$ to height $h \ge 0$ is the array defined by
$$\forall\, 1 \le h_1 \le h,\ \forall\, h_2 \ge 0,\quad p|_h(h_1, h_2) = p(h_1, h_2), \qquad \forall\, h_1 > h,\ \forall\, h_2 \ge 0,\quad p|_h(h_1, h_2) = \emptyset.$$
Consequently, $\dim(p|_h) = \min(\dim(p), h)$. A particular case is $p|_0$ for which each entry is the empty set and thus $\dim(p|_0) = 0$. It should also be remarked that there may exist no tree $\tau$ such that $p|_h$ is the height profile of $\tau$.
Algorithm 1: Construction of a self-nested tree from its height profile.
As we can see in the proof of Proposition 3 or in Algorithm 1, the lengths of the vectors $\rho_\tau(h_1, h_2)$ are not significant to reconstruct a self-nested tree $\tau$. Consequently, since all the components of $\rho_\tau(h_1, h_2)$ are the same, we can identify the height profile of a self-nested tree with the integer-valued array $[\rho_\tau(h_1, h_2)_1]$.
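The reconstruction of a self-nested tree from such an integer-valued array can be sketched bottom-up, one height at a time. This is a sketch written from the reasoning of Proposition 3, not a transcription of Algorithm 1. It assumes that the profile is a dictionary mapping $(h_1, h_2)$ to an integer and that trees are nested tuples with `()` as leaf; `build_self_nested` is a hypothetical name.

```python
# Assumed encodings (not from the paper): profile[(h1, h2)] is an integer and
# a tree is a tuple of child subtrees.

def build_self_nested(profile, height):
    """Build the self-nested tree of the given height from its integer profile.

    profile[(h1, h2)] is the number of subtrees of height h2 just below the
    root of any subtree of height h1, for 0 <= h2 < h1 <= height.
    """
    subtree = [()]  # subtree[h] is the unique subtree of height h; () is a leaf
    for h1 in range(1, height + 1):
        children = []
        for h2 in range(h1):
            children.extend([subtree[h2]] * profile.get((h1, h2), 0))
        subtree.append(tuple(children))
    return subtree[height]

# Integer profile read from the array given for tau_3.
profile_tau3 = {(1, 0): 1, (2, 0): 1, (2, 1): 1, (3, 0): 0, (3, 1): 0, (3, 2): 3}
tau3 = build_self_nested(profile_tau3, 3)
```

The sketch does not validate its input: a meaningful profile needs at least one subtree of height $h_1 - 1$ under each root of height $h_1$.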
Proposition 4.
The number of nodes of a self-nested tree $\tau$ can be computed from $\rho_\tau$ in $O(H(\tau)^2)$.
Proof. 
By induction on the height, one has $\#V(\tau) = N(H(\tau))$, where the sequence $N$ is defined by $N(0) = 1$ (number of nodes of a tree reduced to a root) and,
$$\forall\, 1 \le H \le H(\tau),\quad N(H) = 1 + \sum_{h=0}^{H-1} \rho_\tau(H, h)\, N(h). \tag{2}$$
The number of operations required to compute $N(H(\tau))$ is of order $O(H(\tau)^2)$. □
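The recursion of this proof is short enough to be written directly. The sketch assumes that the integer-valued profile of a self-nested tree is a dictionary mapping $(H, h)$ to an integer; `count_nodes` is a hypothetical name.

```python
# Assumed encoding (not from the paper): profile[(H, h)] is the number of
# subtrees of height h just below the root of any subtree of height H.

def count_nodes(profile, height):
    """Number of nodes of the self-nested tree with the given integer profile."""
    n = [1]  # N(0) = 1: a tree reduced to its root
    for big_h in range(1, height + 1):
        # N(H) = 1 + sum over h < H of profile(H, h) * N(h)
        n.append(1 + sum(profile.get((big_h, h), 0) * n[h] for h in range(big_h)))
    return n[height]

# Integer profile read from the array given for tau_3.
profile_tau3 = {(1, 0): 1, (2, 0): 1, (2, 1): 1, (3, 0): 0, (3, 1): 0, (3, 2): 3}
```

For this profile the recursion gives $N(1) = 2$, $N(2) = 4$ and $N(3) = 13$; the double loop over $(H, h)$ is what yields the quadratic cost in the height.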
The authors of [5] (Proposition 6) calculate the number of nodes of a tree (self-nested or not) from its DAG reduction by a formula very similar to (2), which achieves the same complexity on self-nested trees. As mentioned before, a tree cannot be recovered from its height profile in general, thus we cannot expect such a result from the height profile of an arbitrary tree.

4. Approximation Algorithms

4.1. Definitions

4.1.1. Editing Operations

We shall define the NEST and the NeST of a tree τ . As in [5] (Equation (5)), we ask these approximations to be consistent with Zhang’s edit distance between unordered trees [11] denoted D Z in this paper. Thus, as in [11] (2.2 Editing Operations), we consider the following two types of editing operations: adding a node and deleting a node. Deleting a node w means making the children of w become the children of the parent v of w and then removing w (see Figure 3). Adding w as a child of v will make w the parent of a subset of the current children of v (see Figure 4).

4.1.2. Constrained Editing Operations

Zhang’s edit distance is defined from the above editing operations and from constrained mappings between trees [11] (3.1 Constrained Edit Distance Mappings). A constrained mapping between two trees $\tau$ and $\theta$ is a mapping [11] (2.3.2 Editing Distance Mappings), i.e., a one-to-one correspondence $\varphi$ from a subset of $V(\tau)$ into a subset of $V(\theta)$ preserving the ancestor order, with an additional condition on the Least Common Ancestors (LCAs) [11] (condition (2) p. 208): if, for $1 \le i \le 3$, $v_i \in V(\tau)$ and $w_i = \varphi(v_i) \in V(\theta)$, then $\mathrm{LCA}(v_1, v_2)$ is a proper ancestor of $v_3$ if and only if $\mathrm{LCA}(w_1, w_2)$ is a proper ancestor of $w_3$.
Let $\theta$ be a tree that approximates $\tau$ obtained by inserting nodes in $\tau$ only and consider the induced mapping $M_{\tau \to \theta}$ that associates nodes of $\tau$ with themselves in $\theta$. We want the approximation process to be consistent with Zhang’s edit distance $D_Z$, i.e., we want the mapping $M_{\tau \to \theta}$ to be a constrained mapping in the sense of Zhang, which in particular implies $D_Z(\theta, \tau) = \#V(\theta) - \#V(\tau)$. We shall prove that this requirement excludes some inserting operations in our context.
Indeed, the mapping $M_{\tau \to \theta}$ involved in the inserting operation of Figure 4 is partially displayed in Figure 5, nodes $v_i$ of $\tau$ being associated with nodes $w_i$ of $\theta$. The LCA of $v_1$ and $v_2$ in $\tau$ is a proper ancestor of $v_3$. However, the LCA of $w_1$ and $w_2$ in $\theta$ is not a proper ancestor of $w_3$. Consequently, this mapping is not a constrained mapping as defined by Zhang. A necessary and sufficient condition for $M_{\tau \to \theta}$ to be a constrained mapping is given in Lemma 2.
Lemma 2.
Let $\tau$ be a tree and $v \in V(\tau)$. Let $\theta$ be the tree obtained from $\tau$ by adding a node $w$ as a child of $v$ making the nodes of the subset $C \subseteq C_\tau(v)$ children of $w$. The mapping $M_{\tau \to \theta}$ induced by this inserting operation is a constrained mapping in the sense of Zhang if and only if $C = \emptyset$, $\#C = 1$ or $\#C = \#C_\tau(v)$.
Proof. 
The proof is obvious if v has one or two children. Thus, we assume that v has at least three children c 1 , c 2 and c 3 . In τ , the LCA of c 1 and c 2 is v and v is an ancestor of c 3 . Adding w as the parent of c 1 and c 2 makes it the LCA of these two nodes, but not an ancestor of c 3 in θ . The additional condition on the LCAs is then not satisfied. This problem appears only when making w the parent of at least two children and of not all the children of v. □
Consequently, we restrict ourselves to the following inserting operations which are the only ones that ensure that the associated mapping satisfies Zhang’s condition: adding w as a child of v will make w (i) a leaf, (ii) the parent of one current child of v, or (iii) the parent of all the current children of v. However, it should be noticed that (iii) can always be expressed as (ii) (see Figure 6). Finally, we only consider the inserting operations that make the new child of v the parent of zero or one current child of v. For obvious reasons of symmetry, the allowed deleting operations are the complement of inserting operations, i.e., one can delete an internal node if and only if it has a unique child, which also ensures that the induced mapping is constrained in the sense of Zhang.

4.1.3. Preserving the Height of the Pre-Existing Nodes

In [5] (Definition 9 and Figure 6), the NEST of a tree τ is obtained by successive partial linearizations of the (non-linear) DAG of τ which consist of merging all the nodes at the same height of the DAG. A consequence is that the height of any pre-existing node of τ is not changed by the inserting operations. For the sake of consistency with [5], we only consider inserting and deleting operations that preserve the height of all the pre-existing nodes of τ .
The next two results deal with inserting operations that preserve the height of the pre-existing nodes.
Lemma 3.
Let $\tau$ be a tree, $v \in V(\tau)$ and $c \in C_\tau(v)$. Let $\theta$ be the tree obtained from $\tau$ by adding the internal node $w$ as a child of $v$ making $w$ the parent of $c$. Then,
$$\forall\, u \in V(\tau),\ H(\theta[u]) = H(\tau[u]) \iff H(\tau[c]) + 1 < H(\tau[v]).$$
Proof. 
Adding w may only increase the height of v and the one of its ancestors in τ . If the height of v is not changed by adding w, the height of its ancestors will not be modified. The height of v remains unchanged if and only if the height of w in θ , i.e., H ( τ [ c ] ) + 1 , is strictly less than the height of τ [ v ] . □
Lemma 4.
Let $\tau$ be a tree and $v \in V(\tau)$. Let $\theta$ be the tree obtained from $\tau$ by adding a tree $t$ as a child of $v$. Then,
$$\forall\, u \in V(\tau),\ H(\theta[u]) = H(\tau[u]) \iff H(t) + 1 \le H(\tau[v]).$$
Proof. 
Adding a subtree $t$ under $v$ may only increase the height of $v$ and the one of its ancestors in $\tau$. If the height of $v$ is not changed by adding $t$, the height of its ancestors will not be modified. Adding $t$ will make the height of $v$ increase if and only if $H(t)$ is strictly greater than the height of the highest child of $v$. □
A particular case of Lemma 4 is the insertion of leaves in a tree. Considering the above result, a leaf can be added under $v$ if and only if $H(\tau[v]) \ge 1$, i.e., $v$ is not a leaf. The results below concern deleting operations that preserve the height of the remaining nodes of $\tau$.
Lemma 5.
Let $\tau$ be a tree, $v \in V(\tau)$, $w \in C_\tau(v)$ and $C_\tau(w) = \{c\}$. Let $\theta$ be the tree obtained from $\tau$ by deleting the internal node $w$ making its unique child $c$ a child of $v$. Then,
$$\forall\, u \in V(\theta),\ H(\theta[u]) = H(\tau[u]) \iff \exists\, w' \in C_\tau(v) \setminus \{w\},\ H(\tau[w']) + 1 = H(\tau[v]).$$
Proof. 
Deleting $w$ may only decrease the height of $v$ and the one of its ancestors in $\tau$. If the height of $v$ is not changed by deleting $w$, the height of its ancestors will not be modified. The height of $v$ remains unchanged if and only if it has a child different from $w$ of height $H(\tau[v]) - 1$. □
Lemma 6.
Let $\tau$ be a tree, $v \in V(\tau)$, $c \in C_\tau(v)$. Let $\theta$ be the tree obtained from $\tau$ by deleting the subtree $\tau[c]$. Then,
$$\forall\, u \in V(\theta),\ H(\theta[u]) = H(\tau[u]) \iff \exists\, c' \in C_\tau(v) \setminus \{c\},\ H(\tau[c']) + 1 = H(\tau[v]).$$
Proof. 
The proof follows the same reasoning as in the previous result. □

4.1.4. NEST and NeST

In view of the foregoing, we consider the set of inserting and deleting operations that fulfill the below requirements.
Adding operations (see Figure 7)
  • Internal nodes (AI): adding $w$ as a child of $v$ making $w$ the parent of the child $c$ of $v$ can be done only if $H(\tau[c]) + 1 < H(\tau[v])$.
  • Subtrees (AS): adding $t$ as a child of $v$ can be done only if $H(t) + 1 \le H(\tau[v])$.
Deleting operations (see Figure 8)
  • Internal nodes (DI): deleting $v \in C_\tau(u)$ (making the unique child $w$ of $v$ a child of $u$) can be done only if there exists $v' \in C_\tau(u)$, $v' \ne v$, such that $H(\tau[v']) \ge H(\tau[v])$.
  • Subtrees (DS): deleting the subtree $\tau[w]$, $w \in C_\tau(v)$, of $\tau$ can be done only if there exists $w' \in C_\tau(v)$, $w' \ne w$, such that $H(\tau[w']) + 1 = H(\tau[v])$.
Proposition 5.
The editing operations AI and AS (DI and DS, respectively) are the only inserting (deleting, respectively) operations that ensure that (i) the induced mapping is a constrained mapping and that (ii) the height of all the pre-existing nodes is unchanged.
Proof. 
This result is a direct corollary of Lemmas 2–6. □
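The four admissibility conditions only involve heights, so they can be written as small predicates. The sketch below states them in the form of Lemmas 3 to 6; the function names and the choice of passing heights as plain integers are assumptions made for illustration.

```python
def can_add_internal_node(height_v, height_c):
    """AI: insert w between v and its child c (Lemma 3)."""
    return height_c + 1 < height_v

def can_add_subtree(height_v, height_t):
    """AS: add the tree t as a new child of v (Lemma 4)."""
    return height_t + 1 <= height_v

def keeps_parent_height(height_parent, other_child_heights):
    """Condition shared by Lemmas 5 and 6: among the children that remain,
    one still has height exactly height_parent - 1."""
    return any(h + 1 == height_parent for h in other_child_heights)

def can_delete_internal_node(height_parent, sibling_heights, n_children):
    """DI: delete a child of the parent; the deleted node must have a
    unique child, which then becomes a child of the parent (Lemma 5)."""
    return n_children == 1 and keeps_parent_height(height_parent, sibling_heights)

def can_delete_subtree(height_parent, sibling_heights):
    """DS: delete a whole subtree hanging under the parent (Lemma 6)."""
    return keeps_parent_height(height_parent, sibling_heights)
```

For instance, a leaf (height 0) can be added under any vertex of height at least 1, and the only child of a vertex can never be removed by DS since no sibling is left to keep the height of its parent.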
The NEST (the NeST, respectively) of a tree τ is the self-nested tree obtained by the set of inserting operations AI and AS (of deleting operations DI and DS, respectively) of minimal cost, the cost of inserting a subtree being its number of nodes. Existence and uniqueness of the NEST are not obvious at this stage. The NeST exists because the (self-nested) tree composed of a unique root can be easily obtained by deleting operations from any tree, but its uniqueness is not evident.

4.2. NEST Algorithm

To present our NEST algorithm in a concise form in Algorithm 2, we need to define the following operations involving two vectors $u$ and $v$ of the same size $n$ and a real number $\gamma$,
$$u + v = (u_1 + v_1, \ldots, u_n + v_n), \qquad u + \gamma = (u_1 + \gamma, \ldots, u_n + \gamma), \qquad u \vee \gamma = (\max(u_1, \gamma), \ldots, \max(u_n, \gamma)).$$
In other words, these operations must be understood component by component. In addition, in a condition, a relation between a vector $u$ and the scalar $0$ (for instance $u = 0$) means that the relation holds for every component $u_i$, $1 \le i \le n$. Finally, for $1 \le i \le j \le n$, $u_{i \ldots j}$ denotes the vector $(u_i, \ldots, u_j)$ of length $j - i + 1$. This notation will also be used in Algorithm 3 for calculating the NeST. It should be noticed that an illustrative example that can help the reader to follow the progress of the algorithm is provided in Section 6.
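These componentwise conventions are easy to mirror in code. The sketch below uses plain Python lists; the helper names are hypothetical, and the sub-vector helper keeps the 1-based, inclusive indexing of the text.

```python
def vec_add(u, v):
    """Componentwise sum of two vectors of the same size."""
    assert len(u) == len(v)
    return [a + b for a, b in zip(u, v)]

def scalar_add(u, gamma):
    """Add the scalar gamma to every component of u."""
    return [a + gamma for a in u]

def vec_max(u, gamma):
    """Componentwise maximum of u with the scalar gamma."""
    return [max(a, gamma) for a in u]

def all_zero(u):
    """A condition on a vector holds when it holds for every component."""
    return all(a == 0 for a in u)

def sub_vector(u, i, j):
    """Components u_i, ..., u_j with 1 <= i <= j <= n (length j - i + 1)."""
    return u[i - 1:j]
```

Taking the maximum with 0 clamps negative components, which is how a remaining number of subtrees can be kept non-negative.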
Algorithm 2: Construction of the nearest embedding self-nested tree.
The relation between the above algorithm and the NEST of a tree is provided in the following result, which states in particular the existence of the NEST.
Proposition 6.
For any tree $\tau$, Algorithm 2 returns the unique NEST of $\tau$ in $O(H(\tau)^2 \times D(\tau))$.
Proof. 
By definition of the NEST, the height of all the pre-existing nodes of $\tau$ cannot be modified. Thus, the number of nodes of height $h-1$ under a node of height $h$ can only increase by inserting subtrees in the structure. Then we have
$$\rho_{\mathrm{NEST}(\tau)}(h, h-1) \ge \max \rho_\tau(h, h-1). \tag{3}$$
Let $v$ be a vertex of height $h$ in $\tau$. We recall that $\gamma_i(v)$ denotes the number of subtrees of height $i$ under $v$. Our objective is to understand the consequences for $\gamma_i(v)$ of inserting operations to obtain $\rho_{\mathrm{NEST}(\tau)}(h, h-1)$ subtrees of height $h-1$ under $v$. To this aim, we shall define a sequence $\gamma_i^{(h-1, j)}(v)$ starting from $\gamma_i^{(h-1, 0)}(v) = \gamma_i(v)$ that corresponds to the modified versions of $\tau$. The first exponent $h-1$ means that this sequence concerns editing operations used to get the right number of subtrees of height $h-1$ under $v$.
Let $\Delta_{h-1}^{(0)}(v) = \rho_{\mathrm{NEST}(\tau)}(h, h-1) - \gamma_{h-1}^{(h-1, 0)}(v)$ be the number of subtrees of height $h-1$ that must be added under $v$ to obtain the height profile of the NEST under $v$, i.e.,
$$\gamma_{h-1}^{(h-1, 1)}(v) = \rho_{\mathrm{NEST}(\tau)}(h, h-1).$$
Implicitly, it means that $\gamma_i^{(h-1, 1)}(v) = \gamma_i^{(h-1, 0)}(v)$ for $i \ne h-1$. The subtrees of height $h-1$ that we must add are isomorphic, self-nested and embed all the subtrees of height $h-2$ appearing in $\tau$ by definition of the NEST. In particular, they can be obtained by the allowed inserting operations from the subtrees of height $h-2$ under $v$, by first adding an internal node to increase their height to $h-1$. In addition, it is less costly in terms of editing operations to construct the subtrees of height $h-1$ from the subtrees of height $h-2$ available under $v$ than to directly add these subtrees under $v$. If all the subtrees of height $h-2$ under $v$ must be reconstructed later, it will be possible to insert them and the total cost will be the same as by directly adding the subtrees of height $h-1$ under $v$. Consequently, all the available subtrees of height $h-2$ are used to construct subtrees of height $h-1$ under $v$ and it remains
$$\Delta_{h-1}^{(1)}(v) = \big( \Delta_{h-1}^{(0)}(v) - \gamma_{h-2}^{(h-1, 1)}(v) \big) \vee 0$$
subtrees of height $h-1$ to be built under $v$. Furthermore, in the new version of $\tau$, we have
$$\gamma_{h-2}^{(h-1, 2)}(v) = \gamma_{h-2}^{(h-1, 1)}(v) - \Delta_{h-1}^{(1)}(v).$$
The $\Delta_{h-1}^{(1)}(v)$ subtrees of height $h-1$ can be constructed from subtrees of height $h-3$ (with a larger cost than from subtrees of height $h-2$), and so on. To this aim, we define the sequence of the modified versions of $\tau$ by, for $0 \le j \le h-2$,
$$\Delta_{h-1}^{(j+1)}(v) = \big( \Delta_{h-1}^{(j)}(v) - \gamma_{h-1-(j+1)}^{(h-1, j+1)}(v) \big) \vee 0, \qquad \gamma_{h-(j+2)}^{(h-1, j+2)}(v) = \gamma_{h-(j+2)}^{(h-1, j+1)}(v) - \Delta_{h-1}^{(j+1)}(v).$$
At the final step $j = h-2$, the $\Delta_{h-1}^{(0)}(v)$ subtrees of height $h-1$ have been constructed from all the available subtrees appearing under $v$, starting from subtrees of height $h-2$, then $h-3$, etc., and then have been added if necessary.
From now on, the number of subtrees of height $h-2$ under $v$ will not decrease. Indeed, it would mean that an internal node has been added between $v$ and the root of a subtree of height $h-2$. This would increase by one unit the number of subtrees of height $h-1$ in subtrees of height $h$, whose cost is (strictly) larger than adding a subtree of height $h-2$ in all the subtrees of height $h$. Consequently, we obtain
$$\rho_{\mathrm{NEST}(\tau)}(h, h-2) \ge \max_{\{ v \in V(\tau) \,:\, H(\tau[v]) = h \}} \gamma_{h-2}^{(h-1, h)}(v).$$
We can reproduce the above reasoning to construct under $v$ subtrees of height $h-i$, $i$ from 2 to $h-1$, from subtrees with a smaller height, which defines a sequence $\gamma_i^{(h-i, j)}$ of modified versions of $\tau$, whose size is $h-i+1$, and we get the following inequality,
$$\forall\, 2 \le i \le h,\quad \rho_{\mathrm{NEST}(\tau)}(h, h-i) \ge \max_{\{ v \in V(\tau) \,:\, H(\tau[v]) = h \}} \gamma_{h-i}^{(h-i+1, h-i+2)}(v). \tag{4}$$
The tree returned by Algorithm 2 is self-nested and its height profile saturates the inequalities (3) and (4) for all the possible values of $h$ and $i$ by construction. In addition, we have shown that this tree can be obtained from $\tau$ by the allowed inserting operations. Since increasing by one unit the height profile at $(h_1, h_2)$ has a (strictly) positive cost, this tree is thus the (unique) NEST of $\tau$. As seen previously, the number of iterations of the while loop at line 7 is the number of subtrees of height $h_2 < h_1$ available to construct a tree of height $h_1$, i.e., the degree of $\tau$ in the worst case, which yields the stated complexity. □

4.3. NeST Algorithm

This section is devoted to the presentation of the calculation of the NeST in Algorithm 3. An illustrative example that can help the reader to follow the progress of the algorithm is provided in Section 6.
Algorithm 3: Construction of the nearest embedded self-nested tree.
Proposition 7.
For any tree τ, Algorithm 3 returns the unique NeST of τ in O ( H ( τ ) 2 ) .
Proof. 
The proof follows the same reasoning as the proof of Proposition 6. First, one may remark that
$$\rho_{\mathrm{NeST}(\tau)}(h, h-1) \le \min \rho_\tau(h, h-1), \tag{5}$$
because the number of subtrees of height $h-1$ under a node $v$ of height $h$ can only decrease under the allowed deleting operations. Let $v$ be a node of height $h$ in $\tau$ and $\gamma_i(v)$ the number of subtrees of height $i$ under $v$. If a subtree of height $h-i$ under $v$ that must be deleted is not self-nested, one can first modify it to get a self-nested tree and then remove it with the same overall cost. Thus, we can assume without loss of generality that all the subtrees under $v$ are self-nested. The quantity $\Delta_{h-1}(v) = \gamma_{h-1}(v) - \rho_{\mathrm{NeST}(\tau)}(h, h-1)$ denotes the number of subtrees of height $h-1$ that have to be removed from $v$. Let $\gamma_i^{(j)}(v)$ be the sequence of the modifications made to obtain $\rho_{\mathrm{NeST}(\tau)}(h, h-1)$ subtrees of height $h-1$ under $v$, with $\gamma_i^{(0)}(v) = \gamma_i(v)$. Instead of deleting a subtree of height $h-1$, it is always less costly to decrease its height by one unit by deleting its root. However, this is possible only if this internal node has only one child, i.e., if $\rho_\tau(h-1, h-2) = 1$ and $\rho_\tau(h-1, i) = 0$ for $0 \le i < h-2$. If this new tree of height $h-2$ must be deleted in the sequel, this will be done with the same global cost as directly deleting the subtree of height $h-1$. Consequently,
$$\gamma_{h-1}^{(1)}(v) = \rho_{\mathrm{NeST}(\tau)}(h, h-1), \qquad \gamma_{h-2}^{(1)}(v) = \gamma_{h-2}^{(0)}(v) + \Delta_{h-1}(v)\, \mathbb{1}_{\{\rho_\tau(h-1, h-2) = 1,\ \forall\, 3 \le i \le h,\ \rho_\tau(h-1, h-i) = 0\}}.$$
From now on, the number of subtrees of height $h-2$ under $v$ will thus not increase and we obtain
$$\rho_{\mathrm{NeST}(\tau)}(h, h-2) \le \min_{\{v \in V(\tau)\,:\,H(\tau[v]) = h\}} \gamma_{h-2}^{(1)}(v).$$
There are $\Delta_{h-2}(v) = \gamma_{h-2}^{(1)}(v) - \rho_{\mathrm{NeST}(\tau)}(h, h-2)$ subtrees of height $h-2$ to be deleted under $v$. We can repeat the previous reasoning and delete the root of subtrees of height $h-2$ if possible rather than delete the whole structure, and so on for any height. Thus, the sequence $\gamma_i^{(j)}$ is defined from
$$\begin{aligned}
\Delta_{h-1-i}(v) &= \gamma_{h-1-i}^{(i)}(v) - \rho_{\mathrm{NeST}(\tau)}(h, h-1-i),\\
\gamma_{h-1-i}^{(i+1)}(v) &= \rho_{\mathrm{NeST}(\tau)}(h, h-1-i),\\
\gamma_{h-2-i}^{(i+1)}(v) &= \gamma_{h-2-i}^{(i)}(v) + \Delta_{h-1-i}(v)\, \mathbb{1}_{\{\rho_\tau(h-1, h-2) = 1,\ \forall\, i+2 \le j \le h,\ \rho_\tau(h-i, h-j) = 0\}},
\end{aligned}$$
and we have
$$\forall\, 0 \le i \le h-2, \quad \rho_{\mathrm{NeST}(\tau)}(h, h-2-i) \le \min_{\{v \in V(\tau)\,:\,H(\tau[v]) = h\}} \gamma_{h-2-i}^{(i+1)}(v). \tag{6}$$
The tree returned by Algorithm 3 saturates the inequalities (5) and (6) for all possible values of $h$ and $i$. Decreasing the height profile at $(h_1, h_2)$ by one unit has a (strictly) positive cost. Thus, this tree is the (unique) NeST of $\tau$. The time-complexity is given by the size of the height profile array. □
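The starting point of the proof, namely the bound on the number of subtrees of height $h-1$ that can be kept under a node of height $h$, can be illustrated numerically. The following Python snippet is a minimal sketch, not Algorithm 3 itself: it assumes a hypothetical encoding of unordered trees as nested tuples of children (a leaf is `()`), and the function names are chosen for illustration only and do not belong to the treex library.

```python
def height(tree):
    """Height of a tree encoded as a nested tuple of children (a leaf is ())."""
    return 0 if not tree else 1 + max(height(child) for child in tree)


def subtrees(tree):
    """Yield the tree itself and every subtree rooted at one of its descendants."""
    yield tree
    for child in tree:
        yield from subtrees(child)


def gamma(node, i):
    """Number of direct subtrees of height i under the root of `node`."""
    return sum(1 for child in node if height(child) == i)


def first_bound(tree, h):
    """Minimum, over the nodes v of height h, of the number of children of height h - 1."""
    return min(gamma(v, h - 1) for v in subtrees(tree) if height(v) == h)


# Hypothetical example: the root has two children of height 1 (with 2 and 1 leaves) and one leaf.
tau = (((), ()), ((),), ())
print(first_bound(tau, 1))  # 1: one node of height 1 has a single leaf
print(first_bound(tau, 2))  # 2: the root has two children of height 1
```

In this toy example, at most one leaf can be kept under each node of height 1 when only deletions are allowed, which is the kind of constraint the height profile of the NeST has to satisfy.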

5. Numerical Illustration

5.1. Random Trees

The aim of this section is to illustrate the behavior of the NEST and of the NeST on a set of simulated random trees regarding both the quality of the approximation and the computation time. We have simulated 3000 random trees of size 10, 20, 30, 40, 50, 75, 100, 150, 200, and 250. For each tree, we have calculated the NEST and the NeST. The number of nodes of these approximations is displayed in Figure 9. We can observe that the number of nodes of the NEST is very large relative to the size of the initial tree: approximately one thousand nodes on average for a tree of 150 nodes, which is to say an approximation error of 750 vertices. Remarkably, the NEST has never been a better approximation than the NeST on the set of simulated trees.
The computation time required to compute the NEST or the NeST of one tree on a 2.8 GHz Intel Core i7 has also been estimated on the set of simulated trees and is presented in Figure 10. As predicted by the theoretical complexities given in Propositions 6 and 7, the NeST algorithm requires less computation time than the NEST. Consequently, the NeST provides a much better and faster approximation of the initial data than the NEST.

5.2. Structural Analysis of a Rice Panicle

Following [5], we propose to quantify the degree of self-nestedness of a tree $\tau$ by the following indicator based on the calculation of $\mathrm{NEST}(\tau)$,
$$\delta_{\mathrm{NEST}}(\tau) = 1 - \frac{D_Z(\mathrm{NEST}(\tau), \tau)}{\#V(\tau)} = \frac{2\,\#V(\tau) - \#V(\mathrm{NEST}(\tau))}{\#V(\tau)}, \tag{7}$$
where $D_Z$ stands for Zhang's edit distance [11]. In [5] (Equation (6)), the degree of self-nestedness of a plant is defined as in (7) but normalizing by the number of nodes of the NEST and not the size of the initial data, which prevents the indicator from being negative. In the present paper, we prefer normalizing by the number of nodes of $\tau$ to obtain the following comparable self-nestedness measure based on the calculation of $\mathrm{NeST}(\tau)$,
$$\delta_{\mathrm{NeST}}(\tau) = 1 - \frac{D_Z(\mathrm{NeST}(\tau), \tau)}{\#V(\tau)} = \frac{\#V(\mathrm{NeST}(\tau))}{\#V(\tau)}.$$
The main advantage of this normalization is that if the NEST and the NeST offer equally good approximations, i.e., $D_Z(\mathrm{NEST}(\tau), \tau) = D_Z(\mathrm{NeST}(\tau), \tau)$, then the degree of self-nestedness does not depend on the chosen approximation scheme, $\delta_{\mathrm{NEST}}(\tau) = \delta_{\mathrm{NeST}}(\tau)$.
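As a worked illustration of the two indicators, the short Python sketch below evaluates them from node counts only, using the right-hand sides of the two formulas above. The function names are hypothetical, and the input values are the node counts reported in the caption of Figure 2 (30 nodes for $\tau$, 37 for its NEST and 24 for its NeST).

```python
def delta_nest(n_tau, n_nest):
    """Indicator based on the NEST: (2 #V(tau) - #V(NEST(tau))) / #V(tau)."""
    return (2 * n_tau - n_nest) / n_tau


def delta_nest_embedded(n_tau, n_embedded):
    """Indicator based on the NeST: #V(NeST(tau)) / #V(tau)."""
    return n_embedded / n_tau


# Node counts taken from the caption of Figure 2.
n_tau, n_nest, n_embedded = 30, 37, 24
print(round(delta_nest(n_tau, n_nest), 4))               # 0.7667
print(round(delta_nest_embedded(n_tau, n_embedded), 4))  # 0.8
```

On this example, the NeST is at distance 6 from $\tau$ whereas the NEST is at distance 7, so the indicator based on the NeST is slightly larger.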
We propose to investigate the degree of structural self-similarity of the topological structure of the rice panicle studied in [5] (4.2 Analysis of a Real Plant) through these self-nested approximations. The rice panicle $V_1$ is made of a main axis bearing a main inflorescence $P_1$ and lateral systems $V_i$, $2 \le i \le 5$, each composed of inflorescences $P_j$, $2 \le j \le 8$ (see Figure 11). We have computed the indicators of self-nestedness $\delta_{\mathrm{NEST}}^{0}$ and $\delta_{\mathrm{NeST}}$ for each substructure composing the whole panicle (see Figure 12). The numerical values and the shape of these indicators are similar. However, $\delta_{\mathrm{NeST}}$ is always greater than $\delta_{\mathrm{NEST}}$, in particular for the largest structures $V_i$. Based on a better approximation procedure as highlighted in the previous section, the NeST better captures the self-nestedness of the rice panicle.

6. Summary and Concluding Remarks

Self-nested trees are unordered rooted trees that are the most compressed by DAG compression. Since DAG compression takes advantage of subtree repetitions, they present the highest level of redundancy in their subtrees. In this paper, we have developed a new algorithm for computing the Nearest Embedding Self-nested Tree (NEST) of a tree $\tau$ in $O(H(\tau)^2 \times D(\tau))$, as well as the first algorithm for determining its Nearest embedded Self-nested Tree (NeST) with time-complexity $O(H(\tau)^2)$.
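The link between subtree repetition and DAG compression can be illustrated by counting the classes of isomorphic subtrees, which are the vertices of the DAG reduction (see Figure 1). The Python sketch below is a simple recursive labeling written for illustration; it is not one of the reduction algorithms of [5], and the nested-tuple encoding of trees (a leaf is `()`) as well as the names `class_id` and `dag_size` are assumptions.

```python
def class_id(tree, classes):
    """Return an integer identifying the isomorphism class of an unordered tree.

    Two subtrees get the same identifier if and only if the sorted lists
    of the identifiers of their children coincide.
    """
    key = tuple(sorted(class_id(child, classes) for child in tree))
    return classes.setdefault(key, len(classes))


def dag_size(tree):
    """Number of vertices of the DAG reduction, i.e., of distinct subtree classes."""
    classes = {}
    class_id(tree, classes)
    return len(classes)


# Hypothetical examples: both trees have 6 nodes and height 2.
self_nested = (((), ()), ((), ()), ())   # subtrees of the same height are isomorphic
not_self_nested = (((), ()), ((),), ())  # two non-isomorphic subtrees of height 1
print(dag_size(self_nested))      # 3, i.e., height + 1
print(dag_size(not_self_nested))  # 4
```

In the first example, the DAG has one vertex per height, which is the smallest possible number of vertices for a tree of height 2.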
To this end, we have introduced the notion of height profile of a tree. Roughly speaking, the height profile is a triangular array whose component $(h_1, h_2)$, with $h_2 < h_1$, is the list of the numbers of direct subtrees of height $h_2$ in subtrees of height $h_1$, where a subtree is said to be direct if it is attached to the root. We have shown in Proposition 3 that self-nested trees are characterized by their height profile. While the first NEST algorithm [5] was based on editing the DAG related to the tree to be compressed, the two approximation algorithms developed in the present paper take as input the height profile of any tree $\tau$, which can be computed in $O(\#V(\tau) \times D(\tau))$-time (see Proposition 2), and modify it from top to bottom and from right to left, to return the self-nested height profile of the expected estimate (see Algorithms 2 and 3). Figure 13 and Figure 14 illustrate the progress of the algorithms on a simple example. They should be examined in relation to the corresponding algorithms. We would like to emphasize that our paper also states the uniqueness of the NEST and of the NeST, and studies the link with edit operations admitted in Zhang's distance.
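The informal description of the height profile given above can be turned into a few lines of Python. This is a sketch under stated assumptions: trees are encoded as nested tuples of children (a leaf is `()`), the profile is stored as a dictionary indexed by $(h_1, h_2)$, and the function names are hypothetical rather than taken from treex. The second function only checks a property that holds in any self-nested tree, where subtrees of the same height are isomorphic and therefore have the same numbers of direct subtrees of each height.

```python
from collections import defaultdict


def height_profile(tree):
    """Map each pair (h1, h2), h2 < h1, to the list of the numbers of direct
    subtrees of height h2, with one entry per subtree of height h1."""
    profile = defaultdict(list)

    def visit(node):
        # Heights of the direct subtrees, computed bottom-up.
        heights = [visit(child) for child in node]
        if not heights:
            return 0
        h1 = 1 + max(heights)
        for h2 in range(h1):
            profile[(h1, h2)].append(heights.count(h2))
        return h1

    visit(tree)
    return dict(profile)


def has_constant_lists(profile):
    """True if every list of the profile contains a single repeated value."""
    return all(len(set(counts)) == 1 for counts in profile.values())


# Hypothetical example: the two subtrees of height 1 have 2 and 1 leaves.
tau = (((), ()), ((),), ())
print(height_profile(tau))  # {(1, 0): [2, 1], (2, 0): [1], (2, 1): [2]}
print(has_constant_lists(height_profile(tau)))  # False
```

Here the component $(1, 0)$ is the only non-constant list, so it is the only place where an approximation algorithm working on the height profile would have to intervene.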
Remarkably, estimations performed on a dataset of random trees establish that the NeST is a more accurate approximation of the initial tree than the NEST. This observation could be investigated from a theoretical perspective. In addition, we have shown that the NeST better captures the degree of structural self-similarity of a rice panicle than the NEST.
The algorithms developed in this paper are available in the latest version of the Python library treex [9].

Funding

This research received no external funding.

Acknowledgments

The author would like to show his gratitude to two anonymous reviewers for their relevant comments on a first version of the manuscript.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Bille, P.; Gørtz, I.L.; Landau, G.M.; Weimann, O. Tree compression with top trees. Inf. Comput. 2015, 243, 166–177.
  2. Bousquet-Mélou, M.; Lohrey, M.; Maneth, S.; Noeth, E. XML Compression via Directed Acyclic Graphs. Theory Comput. Syst. 2014, 57, 1322–1371.
  3. Buneman, P.; Grohe, M.; Koch, C. Path Queries on Compressed XML. In Proceedings of the 29th International Conference on Very Large Data Bases, VLDB'03, Berlin, Germany, 9–12 September 2003; Volume 29, pp. 141–152.
  4. Frick, M.; Grohe, M.; Koch, C. Query evaluation on compressed trees. In Proceedings of the 18th Annual IEEE Symposium of Logic in Computer Science, Ottawa, ON, Canada, 22–25 June 2003; pp. 188–197.
  5. Godin, C.; Ferraro, P. Quantifying the degree of self-nestedness of trees. Application to the structural analysis of plants. IEEE Trans. Comput. Biol. Bioinform. 2010, 7, 688–703.
  6. Busatto, G.; Lohrey, M.; Maneth, S. Efficient Memory Representation of XML Document Trees. Inf. Syst. 2008, 33, 456–474.
  7. Lohrey, M.; Maneth, S. The Complexity of Tree Automata and XPath on Grammar-compressed Trees. Theor. Comput. Sci. 2006, 363, 196–210.
  8. Greenlaw, R. Subtree Isomorphism is in DLOG for Nested Trees. Int. J. Found. Comput. Sci. 1996, 7, 161–167.
  9. Azaïs, R.; Cerutti, G.; Gemmerlé, D.; Ingels, F. treex: A Python package for manipulating rooted trees. J. Open Source Softw. 2019, 4, 1351.
  10. Aho, A.V.; Hopcroft, J.E.; Ullman, J.D. The Design and Analysis of Computer Algorithms, 1st ed.; Addison-Wesley Longman Publishing Co., Inc.: Boston, MA, USA, 1974.
  11. Zhang, K. A constrained edit distance between unordered labeled trees. Algorithmica 1996, 15, 205–222.
Figure 1. Trees and their DAG (Directed Acyclic Graph) reduction. In the tree, roots of isomorphic subtrees are colored identically. In the DAG, vertices are equivalence classes colored according to the class of isomorphic subtrees that they represent.
Figure 2. A tree $\tau$ (middle) with 30 nodes and its approximations $\mathrm{NeST}(\tau)$ (left) with 24 nodes and $\mathrm{NEST}(\tau)$ (right) with 37 nodes.
Figure 3. Deleting a node.
Figure 4. Inserting a node.
Figure 5. The tree $\theta$ is obtained from $\tau$ by inserting an internal node. The associated mapping does not satisfy the conditions imposed by Zhang [11] because the LCA (Least Common Ancestor) of $v_1$ and $v_2$ is a proper ancestor of $v_3$ whereas the LCA of $w_1$ and $w_2$ is not a proper ancestor of $w_3$.
Figure 6. Adding a node as new child of w making all the current children of w children of this new node (top) provides the same topology as adding a new node between v and its child w (bottom).
Figure 7. Allowed (✓) and forbidden (✗) inserting operations to construct the NEST of a tree.
Figure 8. Allowed (✓) and forbidden (✗) deleting operations to construct the NeST of a tree.
Figure 9. Number of nodes of the NEST (left) and of the NeST (right) estimated from 3000 random trees: average (full lines) and first and third quartiles (dashed lines).
Figure 10. Average running time required to compute the NEST (dashed line) or the NeST (full line) estimated from 3000 simulated trees.
Figure 11. The rice panicle is composed of a main axis and lateral systems $V_i$, each made of one or several inflorescences $P_j$.
Figure 12. Degree of self-nestedness measured by $\delta_{\mathrm{NEST}}^{0}$ (dashed lines) and $\delta_{\mathrm{NeST}}$ (full lines) of the different substructures appearing in the rice panicle.
Figure 13. Progress of Algorithm 2 to compute the NEST of the left tree from its height profile. Only the second line must be edited to get the correct output. Edits of the height profile are associated with the addition of vertices, shown in red. The output tree is self-nested and has been constructed by adding a minimal number of nodes to the initial tree.
Figure 14. Progress of Algorithm 3 to compute the NeST of the left tree from its height profile. Only the second line must be edited to get the correct output. Edits of the height profile are associated with the deletion of vertices, drawn with dashed lines. The output tree is self-nested and has been constructed by removing a minimal number of nodes from the initial tree.

Share and Cite

MDPI and ACS Style

Azaïs, R. Nearest Embedded and Embedding Self-Nested Trees. Algorithms 2019, 12, 180. https://0-doi-org.brum.beds.ac.uk/10.3390/a12090180
