Nearest Embedded and Embedding Self-Nested Trees

Azaïs, Romain

doi:10.3390/a12090180


by

Romain Azaïs

Laboratoire Reproduction et Développement des Plantes, Univ Lyon, ENS de Lyon, UCB Lyon 1, CNRS, INRA, Inria, F-69342 Lyon, France

Algorithms 2019, 12(9), 180; https://0-doi-org.brum.beds.ac.uk/10.3390/a12090180

Submission received: 25 June 2019 / Revised: 23 August 2019 / Accepted: 27 August 2019 / Published: 29 August 2019

(This article belongs to the Special Issue Data Compression Algorithms and their Applications)


Abstract

Self-nested trees present a systematic form of redundancy in their subtrees and thus achieve optimal compression rates by directed acyclic graph (DAG) compression. A method for quantifying the degree of self-similarity of plants through self-nested trees was introduced by Godin and Ferraro in 2010. The procedure consists of computing a self-nested approximation, called the nearest embedding self-nested tree, that both embeds the plant and is the closest to it. In this paper, we propose a new algorithm that computes the nearest embedding self-nested tree with a smaller overall complexity, as well as the nearest embedded self-nested tree. We show from simulations that the latter is mostly the closest to the initial data, which suggests that this better approximation should be used as the preferred measure of the degree of self-similarity of plants.

Keywords:

unordered trees; self-nested trees; approximation of trees; structural self-similarity

1. Introduction

Trees form a wide family of combinatorial objects with many fields of application, e.g., plant modeling and XML file analysis. Modern databases are huge and thus stored in compressed form. Compression methods take advantage of repeated substructures appearing in the tree. As explained in [1], one often considers the following two types of repeated substructures: subtree repeats (used in DAG compression [2,3,4,5]) and tree pattern repeats (exploited in tree grammars [6,7] and top tree compression [1]). We restrict ourselves to DAG compression of unordered rooted trees, which consists of building a Directed Acyclic Graph (DAG) that represents a tree without the redundancy of its identical subtrees (see Figure 1). Two different algorithms exist for computing the DAG reduction of a tree $τ$ [5] (2.2 Computing Tree Reduction), which share the same time-complexity in $O(\# V(τ)^{2} \times D(τ) \times \log(D(τ)))$, where $V(τ)$ denotes the set of vertices of $τ$ and $D(τ)$ its outdegree.

Trees that are the most compressed by DAG compression present the highest level of redundancy in their subtrees: all the subtrees of a given height are isomorphic. In this case, the DAG related to a tree $τ$ is linear, i.e., there exists a path going through all vertices, with exactly $H(τ) + 1$ vertices, $H(τ)$ denoting the height of $τ$, which is the minimal number of DAG vertices among trees of this height (see $τ_{3}$ in Figure 1). This family of trees was introduced in [8] as the first interesting class of trees for which the subtree isomorphism problem is in NC$^{2}$. It has been known under the name of nested trees [8] and later self-nested trees [5], to insist on their recursive structure and their proximity to the notion of structural self-similarity.

The authors of [5] are interested in capturing the self-similarity of plants through self-nested trees. They propose to construct a self-nested tree that minimizes the distance of the original tree to the set of self-nested trees that embed the initial tree. The distance to this Nearest Embedding Self-nested Tree (NEST) is then used to quantify the self-nestedness of the tree and thus its structural self-similarity (see $τ$ and $\mathrm{NEST}(τ)$ in Figure 2). The main result of [5] (Theorem 1 and E. NEST Algorithm) is an algorithm that computes the NEST of a tree $τ$ from its DAG reduction in $O(H(τ)^{2} \times D(τ))$.

The goal of the present article is three-fold. We aim at proposing a new and more explicit algorithm that computes the NEST of a tree $τ$ with the same time-complexity $O(H(τ)^{2} \times D(τ))$ as in [5] but that takes as input the height profile of $τ$ and not its DAG reduction. We establish that the height profile of a tree $τ$ can be computed in $O(\# V(τ) \times D(τ))$, reducing the overall complexity by a linear factor. Based on this work, we also provide an algorithm in $O(H(τ)^{2})$ that computes the Nearest embedded Self-nested Tree (NeST) of a tree $τ$ (see $τ$ and $\mathrm{NeST}(τ)$ in Figure 2). Finally, we show from numerical simulations that the distance of a tree $τ$ to its NeST is much lower than the distance to its NEST. The NeST is most of the time a better approximation of a tree than the NEST and thus should be preferred to quantify the degree of self-nestedness of plants.

The paper is organized as follows. The structures of interest in this paper, namely unordered trees, DAG compression and self-nested trees, are defined in Section 2. Section 3 is dedicated to the definition and the study of the height profile of a tree. The approximation algorithms are presented in Section 4. We give a new insight on the definitions of the NEST and of the NeST in Section 4.1. Our NEST algorithm is presented in Section 4.2, while the NeST algorithm is given in Section 4.3. Section 5 is devoted to simulations. We state that the NeST is mostly a better approximation of a tree than the NEST in Section 5.1. An application to a real rice panicle is presented in Section 5.2. A summary of the paper and concluding remarks can be found in Section 6. All the figures and numerical experiments presented in the article have been made with the Python library treex [9].

2. Preliminaries

2.1. Unordered Rooted Trees

A rooted tree $τ$ is a connected graph containing no cycle, i.e., without chain from any vertex $v$ to itself, and such that there exists a unique vertex $R(τ)$, called the root, which has no parent, and any vertex different from the root has exactly one parent. The leaves of $τ$ are all the vertices without children. The set of vertices of $τ$ is denoted by $V(τ)$. The height of a vertex $v$ may be recursively defined as $H(v) = 0$ if $v$ is a leaf of $τ$ and

$$H(v) = 1 + \max_{w \in C_{τ}(v)} H(w)$$

otherwise, $C_{τ}(v)$ denoting the set of children of $v$ in $τ$. The height of the tree $τ$ is defined as the height of its root, $H(τ) = H(R(τ))$. The outdegree $D(τ)$ of $τ$ is the maximal branching factor that can be found in $τ$, i.e.,

$$D(τ) = \max_{v \in τ} \# C_{τ}(v).$$
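The recursive definitions of height and outdegree translate directly into code. The following is a minimal sketch, assuming a tree is encoded as a nested Python list of its children (a leaf is the empty list); this encoding, the function names and the example tree are illustrative assumptions and are not taken from the treex library.

```python
def height(tree):
    """Height H of a tree encoded as a nested list of children (a leaf is [])."""
    return 0 if not tree else 1 + max(height(child) for child in tree)


def outdegree(tree):
    """Maximal number of children over all vertices of the tree, i.e. D."""
    return max([len(tree)] + [outdegree(child) for child in tree])


def size(tree):
    """Number of vertices of the tree, i.e. #V."""
    return 1 + sum(size(child) for child in tree)


# Hypothetical example: a root with three children of height 2.
example = [[[[]]], [[], [[]]], [[], [[], []]]]
print(height(example), outdegree(example), size(example))
```

The nested-list encoding ignores vertex labels and keeps an arbitrary order among siblings, which is enough for the structural quantities used in the sequel.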

A subtree $τ[v]$ rooted in $v$ is a particular connected subgraph of $τ$. Precisely, $τ[v] = (V[v], E[v])$ where $V[v]$ is the set of the descendants of $v$ in $τ$ and $E[v]$ is defined as

$$E[v] = \{(ξ, ξ') \in E(τ) : ξ \in V[v],\ ξ' \in V[v]\},$$

with $E(τ)$ the set of edges of $τ$.

In the sequel, we consider unordered rooted trees for which the order among the sibling vertices of any vertex is not significant. A precise characterization is obtained from the additional definition of isomorphic trees. Let $τ$ and $θ$ be two rooted trees. A one-to-one correspondence $φ : V(τ) \to V(θ)$ is called a tree isomorphism if, for any edge $(v, w) \in E(τ)$, $(φ(v), φ(w)) \in E(θ)$. Trees $τ_{1}$ and $τ_{2}$ are called isomorphic whenever there exists a tree isomorphism between them. One can determine if two $n$-vertex trees are isomorphic in $O(n)$ [10] (Example 3.2 and Theorem 3.3). The existence of a tree isomorphism defines an equivalence relation on the set of rooted trees. The class of unordered rooted trees is the set of equivalence classes for this relation, i.e., the quotient set of rooted trees by the existence of a tree isomorphism.
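A simple way to test isomorphism of unordered rooted trees is to compare canonical forms. The sketch below assumes the nested-list encoding (a leaf is `[]`) and sorts the canonical forms of the children recursively; it is a didactic variant and not the linear-time algorithm of [10], since the sorting step adds extra cost. The function names are hypothetical.

```python
def canonical_form(tree):
    """Canonical form of an unordered rooted tree given as a nested list.

    Children are replaced by their canonical forms and then sorted, so two
    trees get the same form exactly when they differ only by sibling order.
    """
    return tuple(sorted(canonical_form(child) for child in tree))


def are_isomorphic(tree_a, tree_b):
    """True if the two unordered rooted trees are isomorphic."""
    return canonical_form(tree_a) == canonical_form(tree_b)


# The same tree written with two different sibling orders.
print(are_isomorphic([[], [[]]], [[[]], []]))
```

Each equivalence class of the relation above is then represented by a single hashable value, which is convenient for grouping isomorphic subtrees.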

2.2. DAG Compression

Now we consider the equivalence relation “existence of a tree isomorphism” on the set of the subtrees of a tree $τ$. We consider the quotient graph $Q(τ) = (V, E)$ obtained from $τ$ using this equivalence relation. $V$ is the set of equivalence classes on the subtrees of $τ$, while $E$ is the set of pairs of equivalence classes $(C_{1}, C_{2})$ such that $R(C_{2}) \in C_{τ}(R(C_{1}))$ up to an isomorphism. The graph $Q(τ)$ is a DAG [5] (Proposition 1), i.e., a connected directed graph without path from any vertex $v$ to itself.

Let $(C_{1}, C_{2})$ be an edge of the DAG $Q(τ)$. We define $N(C_{1}, C_{2})$ as the number of occurrences of a tree of $C_{2}$ just below the root of any tree of $C_{1}$. The tree reduction $R(τ)$ is defined as the quotient graph $Q(τ)$ augmented with labels $N(C_{1}, C_{2})$ on its edges [5] (Definition 3 (Reduction of a tree)). Intuitively, the graph $R(τ)$ represents the original tree $τ$ without its structural redundancies (see Figure 1).
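The construction of the labeled quotient graph can be sketched as follows. This is an illustrative bottom-up procedure under the nested-list encoding (a leaf is `[]`), not one of the two algorithms of [5]: each subtree is identified by the sorted tuple of the class identifiers of its children, and the labels $N(C_{1}, C_{2})$ are stored as multiplicities. The function name and the returned data layout are assumptions made for the example.

```python
from collections import Counter


def dag_reduction(tree):
    """Return (root_class, edges) for a tree given as a nested list.

    edges maps each class id C1 to a Counter {C2: N(C1, C2)}, i.e. the number
    of subtrees of class C2 just below the root of a tree of class C1.
    """
    class_ids = {}  # sorted tuple of child class ids -> class id
    edges = {}      # class id -> Counter of child class ids

    def visit(subtree):
        child_classes = sorted(visit(child) for child in subtree)
        key = tuple(child_classes)
        if key not in class_ids:
            class_ids[key] = len(class_ids)
            edges[class_ids[key]] = Counter(child_classes)
        return class_ids[key]

    return visit(tree), edges


# Hypothetical tree in which all subtrees of a given height are isomorphic.
nested = [[[], [[]]] for _ in range(3)]
root, edges = dag_reduction(nested)
print(len(edges), dict(edges[root]))
```

For this example the reduction has one vertex per height, which is the linear shape discussed in the next section.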

2.3. Self-Nested Trees

A tree $τ$ is called self-nested [5] (III. Self-nested trees) if for any pair of vertices $v$ and $w$, either the subtrees $τ[v]$ and $τ[w]$ are isomorphic, or one is (isomorphic to) a subtree of the other. This characterization of self-nested trees is equivalent to the following statement: for any pair of vertices $v$ and $w$ such that $H(v) = H(w)$, the subtrees $τ[v]$ and $τ[w]$ are isomorphic, i.e., all the subtrees of the same height are isomorphic.

Linear DAGs are DAGs containing at least one path that goes through all their vertices. They are closely connected with self-nested trees by virtue of the following result.

Proposition 1 (Godin and Ferraro [5]).

A tree $τ$ is self-nested if and only if its reduction $R(τ)$ is a linear DAG.

This result proves that self-nested trees achieve optimal compression rates among trees of the same height whatever their number of nodes (compare $τ_{3}$ with $τ_{1}$ and $τ_{2}$ in Figure 1). Indeed, $R(τ)$ has at least $H(τ) + 1$ nodes and the inequality is saturated if and only if $τ$ is self-nested.
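The characterization "all the subtrees of the same height are isomorphic" can be checked directly. The sketch below assumes the nested-list encoding (a leaf is `[]`) and groups the canonical forms of all subtrees by height; the tree is self-nested when each height carries a single isomorphism class. The function name and both example trees are hypothetical.

```python
def is_self_nested(tree):
    """True if all subtrees of the same height are isomorphic."""
    forms_by_height = {}  # height -> set of canonical forms seen at this height

    def visit(subtree):
        children = [visit(child) for child in subtree]
        h = 0 if not children else 1 + max(ch for ch, _ in children)
        form = tuple(sorted(cf for _, cf in children))
        forms_by_height.setdefault(h, set()).add(form)
        return h, form

    visit(tree)
    return all(len(forms) == 1 for forms in forms_by_height.values())


nested = [[[], [[]]] for _ in range(3)]
not_nested = [[[[]]], [[], [[]]], [[], [[], []]]]
print(is_self_nested(nested), is_self_nested(not_nested))
```

Counting the sets in `forms_by_height` gives the number of vertices of the reduction, so the test amounts to checking that this number equals the height plus one.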

3. Height Profile of the Tree Structure

3.1. Definition and Complexity

This section is devoted to the definition of the height profile $ρ_{τ}$ of a tree $τ$ and to the presentation of an algorithm to calculate it. In the sequel, we assume that the tree $τ$ is always traversed in the same order, say depth-first search, to fix ideas. In particular, when vectors are indexed by nodes of $τ$ sharing the same property, the order of the vector is important and should always be the same.

Given a vertex $v \in V(τ)$,

$$γ_{h}(v) = \# \{v' \in C_{τ}(v) : H(τ[v']) = h\}$$

is the number of subtrees of height $h$ directly under $v$. Now, we consider the vector

$$ρ_{τ}(h_{1}, h_{2}) = (γ_{h_{2}}(v) : v \in V(τ),\ H(τ[v]) = h_{1})$$

made of the concatenation of the integers $γ_{h_{2}}(v)$ over subtrees $τ[v]$ of height $h_{1}$ ordered in depth-first search. Consequently, $ρ_{τ}$ is an array made of vectors with varying lengths.

Let $A_{1}$ and $A_{2}$ be two arrays for which each entry is a vector. We say that $A_{1}$ and $A_{2}$ are equivalent if, for any row $i$, there exists a permutation $σ_{i}$ such that for any column $j$,

$$A_{1}(i, j) = σ_{i}(A_{2}(i, j)).$$

In particular, $i$ being fixed, all the vectors $A_{1}(i, j)$ and $A_{2}(i, j)$ must have the same length. This condition defines an equivalence relation. The height profile of $τ$ is the array $ρ_{τ}$ as an element of the quotient space of arrays of vectors under this equivalence relation. In other words, the vectors $ρ_{τ}(h_{1}, h_{2})$, $0 \leq h_{2} < h_{1}$ and $h_{1}$ fixed, must be ordered in the same way but the choice of the order is not significant. Finally, it should already be remarked that $ρ_{τ}(h_{1}, h_{2}) = \emptyset$ when $h_{2} \geq h_{1}$ or $h_{1} > H(τ)$. Consequently, the height profile can be reduced to the triangular array

$$ρ_{τ} = [ρ_{τ}(h_{1}, h_{2})]_{0 \leq h_{2} < h_{1} \leq H(τ)}.$$

The map $ρ_{τ}$ provides the distribution of subtrees of height $h_{2}$ just below the root of subtrees of height $h_{1}$ for all pairs $(h_{1}, h_{2})$, which typically represents the height profile of $τ$. For clarity’s sake, we give the values of $ρ_{τ_{k}}$ for the trees $τ_{k}$ of Figure 1, coefficient $(i, j)$ of the matrix being $ρ_{τ_{k}}(i, j - 1)$,

$$ρ_{τ_{1}} = ρ_{τ_{2}} = \begin{bmatrix} (1, 1, 2) & \emptyset & \emptyset \\ (0, 1, 1) & (1, 1, 1) & \emptyset \\ (0) & (0) & (3) \end{bmatrix} \quad \text{and} \quad ρ_{τ_{3}} = \begin{bmatrix} (1, 1, 1) & \emptyset & \emptyset \\ (1, 1, 1) & (1, 1, 1) & \emptyset \\ (0) & (0) & (3) \end{bmatrix}. \tag{1}$$

It should be noticed that the height profile does not contain all the topology of the tree since trees $τ_{1}$ and $τ_{2}$ of Figure 1 are different but share the same height profile (1). However, the height of a tree $τ$ can be recovered from its height profile through the relation $H(τ) = \dim(ρ_{τ})$, the dimension of $ρ_{τ}$ being defined by

$$\dim(ρ_{τ}) = \min \{n \geq 0 : \forall\, i \geq 0,\ ρ_{τ}(n + 1, i) = \emptyset\}.$$

Proposition 2.

$ρ_{τ}$ can be computed in $O(\# V(τ) \times D(τ))$-time.

Proof.

First, attribute to each node $v \in V(τ)$ the height of the subtree $τ[v]$ with complexity $O(\# V(τ))$. Next, traverse the tree in depth-first search in $O(\# V(τ))$ and calculate for each vertex $v$ the vector $(γ_{h}(v))_{0 \leq h < H(τ[v])}$ in $\# C_{τ}(v) \leq D(τ)$ operations. Finally, append this vector to $ρ_{τ}(H(τ[v]), \cdot)$ component by component. □
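The procedure of the proof can be sketched in a single depth-first traversal. The code below is a minimal illustration, assuming the nested-list encoding (a leaf is `[]`) and storing the profile as a dictionary keyed by $(h_{1}, h_{2})$; it makes no claim about matching the stated complexity exactly. The example tree is a hypothetical tree chosen to reproduce the first profile of (1); it is not claimed to be $τ_{1}$ or $τ_{2}$ of Figure 1.

```python
from collections import Counter


def height_profile(tree):
    """Height profile as a dict {(h1, h2): [gamma_h2(v) for v of height h1]}.

    Values are appended when a vertex is left by the traversal. Two vertices
    of the same height are never ancestor and descendant of each other, so
    this yields the same relative order as the order of first visit.
    """
    profile = {}

    def visit(subtree):
        child_heights = [visit(child) for child in subtree]
        h1 = 0 if not subtree else 1 + max(child_heights)
        counts = Counter(child_heights)  # gamma_h(v) for every height h
        for h2 in range(h1):
            profile.setdefault((h1, h2), []).append(counts[h2])
        return h1

    visit(tree)
    return profile


# Hypothetical tree consistent with the first profile displayed in (1).
example = [[[[]]], [[], [[]]], [[], [[], []]]]
profile = height_profile(example)
for key in sorted(profile):
    print(key, profile[key])
```

Entries $(h_{1}, h_{2})$ with $h_{2} \geq h_{1}$ are simply absent from the dictionary, which plays the role of the empty set in the triangular array.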

3.2. Relation with Self-Nested Trees

Self-nested trees are characterized by their height profile considering the following result.

Proposition 3.

$τ$ is self-nested if and only if, for any $0 \leq h_{2} < h_{1} \leq H(τ)$, all the components of the vector $ρ_{τ}(h_{1}, h_{2})$ are the same (for instance see the profile (1) of the tree $τ_{3}$ presented in Figure 1). In addition, a self-nested tree $τ$ can be reconstructed from $ρ_{τ}$ (see Algorithm 1).

Proof.

If $τ$ is self-nested, the $N_{h_{1}}$ subtrees of height $h_{1}$ appearing in $τ$ are isomorphic and thus have the same number $n_{h_{1}, h_{2}}$ of subtrees of height $h_{2}$ just below their root. Consequently,

$$ρ_{τ}(h_{1}, h_{2}) = \underbrace{(n_{h_{1}, h_{2}}, \dots, n_{h_{1}, h_{2}})}_{N_{h_{1}}}.$$

The converse may be established considering the following lemma, whose proof presents no difficulty. □

Lemma 1.

If, for each $0 \leq h < H$, all the subtrees of height $h$ appearing in a tree $τ$ are isomorphic, and if all the subtrees of height $H$ have the same number of subtrees of height $h$ just below their root, then all the subtrees of height $H$ appearing in $τ$ are isomorphic.

All the subtrees of height 1 in $τ$ are isomorphic because all the components of $ρ_{τ}(1, 0)$ are the same. The expected result is shown by induction on the height thanks to the previous lemma, whose assumptions are satisfied since $ρ_{τ}$ always contains vectors for which all the entries are equal. The previous reasoning also provides a way (presented in Algorithm 1) to build a unique (self-nested) tree $T$ from the height profile $ρ_{τ}$. In addition, it is easy to see that $τ$ and $T$ are isomorphic.

To present the algorithm of reconstruction of a self-nested tree from its height profile, we need to define the restriction of a height profile to some height. Let $p$ be a height profile. The restriction $p_{|h}$ of $p$ to height $h \geq 0$ is the array defined by

$$\left\{\begin{array}{lll} \forall\, 1 \leq h_{1} \leq h, & \forall\, h_{2} \geq 0, & p_{|h}(h_{1}, h_{2}) = p(h_{1}, h_{2}), \\ \forall\, h_{1} > h, & \forall\, h_{2} \geq 0, & p_{|h}(h_{1}, h_{2}) = \emptyset. \end{array}\right.$$

Consequently, $\dim(p_{|h}) = \min(\dim(p), h)$. A peculiar case is $p_{|0}$ for which each entry is the empty set and thus $\dim(p_{|0}) = 0$. It should also be remarked that there may exist no tree $τ$ such that $p_{|h}$ is the height profile of $τ$.

Algorithm 1: Construction of a self-nested tree from its height profile.

As we can see in the proof of Proposition 3 or in Algorithm 1, the lengths of the vectors $ρ_{τ}(h_{1}, h_{2})$ are not significant to reconstruct a self-nested tree $τ$. Consequently, since all the components of $ρ_{τ}(h_{1}, h_{2})$ are the same, we can identify the height profile of a self-nested tree with the integer-valued array $[ρ_{τ}(h_{1}, h_{2})_{1}]$ of their first components.
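The reconstruction of a self-nested tree from such an integer-valued profile can be illustrated by building the subtrees height by height. This is a sketch of the idea only, not a transcription of Algorithm 1; it assumes the nested-list encoding (a leaf is `[]`) and a profile stored as a dictionary `{(h1, h2): count}`, both of which are choices made for the example.

```python
import copy


def build_self_nested(profile, total_height):
    """Build the self-nested tree whose integer height profile is `profile`.

    profile[(h1, h2)] is the number of subtrees of height h2 just below the
    root of any subtree of height h1 (missing entries count as 0).
    """
    subtree_of_height = {0: []}  # the unique subtree of height 0 is a leaf
    for h1 in range(1, total_height + 1):
        if profile.get((h1, h1 - 1), 0) < 1:
            raise ValueError(f"a subtree of height {h1} needs a child of height {h1 - 1}")
        children = []
        for h2 in range(h1):
            for _ in range(profile.get((h1, h2), 0)):
                # Deep copies keep the result a plain tree without shared nodes.
                children.append(copy.deepcopy(subtree_of_height[h2]))
        subtree_of_height[h1] = children
    return subtree_of_height[total_height]


# Integer version of the second profile displayed in (1).
profile_3 = {(1, 0): 1, (2, 0): 1, (2, 1): 1, (3, 0): 0, (3, 1): 0, (3, 2): 3}
print(build_self_nested(profile_3, 3))
```

Dropping the deep copies would return a structure that shares its repeated subtrees, which is essentially the linear DAG of the tree.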

Proposition 4.

The number of nodes of a self-nested tree $τ$ can be computed from $ρ_{τ}$ in $O(H(τ)^{2})$.

Proof.

By induction on the height, one has $\# V(τ) = N(H(τ))$, where the sequence $N$ is defined by $N(0) = 1$ (number of nodes of a tree reduced to a root) and,

$$\forall\, 1 \leq H \leq H(τ), \quad N(H) = 1 + \sum_{h = 0}^{H - 1} ρ_{τ}(H, h)\, N(h). \tag{2}$$

The number of operations required to compute $N(H(τ))$ is of order $O(H(τ)^{2})$. □
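Recursion (2) is short enough to be written out explicitly. The sketch below assumes the integer-valued profile of a self-nested tree is stored as a dictionary `{(h1, h2): count}`; the function name and this storage format are illustrative assumptions.

```python
def count_nodes(profile, total_height):
    """Number of nodes of a self-nested tree, computed with recursion (2).

    profile[(H, h)] is the number of subtrees of height h just below the root
    of a subtree of height H (missing entries count as 0).
    """
    n = [1]  # N(0) = 1: a tree reduced to its root
    for big_h in range(1, total_height + 1):
        n.append(1 + sum(profile.get((big_h, h), 0) * n[h] for h in range(big_h)))
    return n[total_height]


# Integer version of the second profile displayed in (1).
profile_3 = {(1, 0): 1, (2, 0): 1, (2, 1): 1, (3, 0): 0, (3, 1): 0, (3, 2): 3}
print(count_nodes(profile_3, 3))
```

For this profile the successive values are $N(0) = 1$, $N(1) = 2$, $N(2) = 4$ and $N(3) = 1 + 3 \times 4 = 13$; the double loop over $(H, h)$ is where the quadratic dependence on the height comes from.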

The authors of [5] (Proposition 6) calculate the number of nodes of a tree (self-nested or not) from its DAG reduction by a formula very similar to (2), which achieves the same complexity on self-nested trees. As mentioned before, a tree cannot be recovered from its height profile in general, thus we cannot expect such a result from the height profile of any tree.

4. Approximation Algorithms

4.1. Definitions

4.1.1. Editing Operations

We shall define the NEST and the NeST of a tree $τ$. As in [5] (Equation (5)), we ask these approximations to be consistent with Zhang’s edit distance between unordered trees [11], denoted $D_{Z}$ in this paper. Thus, as in [11] (2.2 Editing Operations), we consider the following two types of editing operations: adding a node and deleting a node. Deleting a node $w$ means making the children of $w$ become the children of the parent $v$ of $w$ and then removing $w$ (see Figure 3). Adding $w$ as a child of $v$ will make $w$ the parent of a subset of the current children of $v$ (see Figure 4).

4.1.2. Constrained Editing Operations

Zhang’s edit distance is defined from the above editing operations and from constrained mappings between trees [11] (3.1 Constrained Edit Distance Mappings). A constrained mapping between two trees $τ$ and $θ$ is a mapping [11] (2.3.2 Editing Distance Mappings), i.e., a one-to-one correspondence $φ$ from a subset of $V(τ)$ into a subset of $V(θ)$ preserving the ancestor order, with an additional condition on the Least Common Ancestors (LCAs) [11] (condition (2) p. 208): if, for $1 \leq i \leq 3$, $v_{i} \in V(τ)$ and $w_{i} = φ(v_{i}) \in V(θ)$, then $LCA(v_{1}, v_{2})$ is a proper ancestor of $v_{3}$ if and only if $LCA(w_{1}, w_{2})$ is a proper ancestor of $w_{3}$.

Let $θ$ be a tree that approximates $τ$ obtained by inserting nodes in $τ$ only and consider the induced mapping $M_{τ \to θ}$ that associates nodes of $τ$ with themselves in $θ$. We want the approximation process to be consistent with Zhang’s edit distance $D_{Z}$, i.e., we want the mapping $M_{τ \to θ}$ to be a constrained mapping in the sense of Zhang, which in particular implies $D_{Z}(θ, τ) = \# V(θ) - \# V(τ)$. We shall prove that this requirement excludes some inserting operations in our context.

Indeed, the mapping $M_{τ \to θ}$ involved in the inserting operation of Figure 4 is partially displayed in Figure 5, nodes $v_{i}$ of $τ$ being associated with nodes $w_{i}$ of $θ$. The LCA of $v_{1}$ and $v_{2}$ in $τ$ is a proper ancestor of $v_{3}$. However, the LCA of $w_{1}$ and $w_{2}$ in $θ$ is not a proper ancestor of $w_{3}$. Consequently, this mapping is not a constrained mapping as defined by Zhang. A necessary and sufficient condition for $M_{τ \to θ}$ to be a constrained mapping is given in Lemma 2.

Lemma 2.

Let $τ$ be a tree and $v \in V(τ)$. Let $θ$ be the tree obtained from $τ$ by adding a node $w$ as a child of $v$ making the nodes of the subset $C \subset C_{τ}(v)$ children of $w$. The mapping $M_{τ \to θ}$ induced by this inserting operation is a constrained mapping in the sense of Zhang if and only if $C = \emptyset$, $\# C = 1$ or $\# C = \# C_{τ}(v)$.

Proof.

The proof is obvious if $v$ has one or two children. Thus, we assume that $v$ has at least three children $c_{1}$, $c_{2}$ and $c_{3}$. In $τ$, the LCA of $c_{1}$ and $c_{2}$ is $v$ and $v$ is an ancestor of $c_{3}$. Adding $w$ as the parent of $c_{1}$ and $c_{2}$ makes it the LCA of these two nodes, but not an ancestor of $c_{3}$ in $θ$. The additional condition on the LCAs is then not satisfied. This problem appears only when making $w$ the parent of at least two, but not all, of the children of $v$. □

Consequently, we restrict ourselves to the following inserting operations which are the only ones that ensure that the associated mapping satisfies Zhang’s condition: adding w as a child of v will make w (i) a leaf, (ii) the parent of one current child of v, or (iii) the parent of all the current children of v. However, it should be noticed that (iii) can always be expressed as (ii) (see Figure 6). Finally, we only consider the inserting operations that make the new child of v the parent of zero or one current child of v. For obvious reasons of symmetry, the allowed deleting operations are the complement of inserting operations, i.e., one can delete an internal node if and only if it has a unique child, which also ensures that the induced mapping is constrained in the sense of Zhang.

4.1.3. Preserving the Height of the Pre-Existing Nodes

In [5] (Definition 9 and Figure 6), the NEST of a tree $τ$ is obtained by successive partial linearizations of the (non-linear) DAG of $τ$, which consist of merging all the nodes at the same height of the DAG. A consequence is that the height of any pre-existing node of $τ$ is not changed by the inserting operations. For the sake of consistency with [5], we only consider inserting and deleting operations that preserve the height of all the pre-existing nodes of $τ$.

The next two results deal with inserting operations that preserve the height of the pre-existing nodes.

Lemma 3.

Let $τ$ be a tree, $v \in V(τ)$ and $c \in C_{τ}(v)$. Let $θ$ be the tree obtained from $τ$ by adding the internal node $w$ as a child of $v$ making $w$ the parent of $c$. Then,

$$\forall\, u \in V(τ),\ H(θ[u]) = H(τ[u]) \iff H(τ[c]) + 1 < H(τ[v]).$$

Proof.

Adding $w$ may only increase the height of $v$ and that of its ancestors in $τ$. If the height of $v$ is not changed by adding $w$, the height of its ancestors will not be modified. The height of $v$ remains unchanged if and only if the height of $w$ in $θ$, i.e., $H(τ[c]) + 1$, is strictly less than the height of $τ[v]$. □

Lemma 4.

Let $τ$ be a tree and $v \in V(τ)$. Let $θ$ be the tree obtained from $τ$ by adding a tree $t$ as a child of $v$. Then,

$$\forall\, u \in V(τ),\ H(θ[u]) = H(τ[u]) \iff H(t) + 1 \leq H(τ[v]).$$

Proof.

Adding a subtree $t$ under $v$ may only increase the height of $v$ and that of its ancestors in $τ$. If the height of $v$ is not changed by adding $t$, the height of its ancestors will not be modified. Adding $t$ will make the height of $v$ increase if and only if $H(t)$ is strictly greater than the height of the highest child of $v$. □

A particular case of Lemma 4 is the insertion of leaves in a tree. Considering the above result, a leaf can be added under $v$ if and only if $H(τ[v]) \geq 1$, i.e., $v$ is not a leaf. The results below concern deleting operations that preserve the height of the remaining nodes of $τ$.

Lemma 5.

Let $τ$ be a tree, $v \in V(τ)$, $w \in C_{τ}(v)$ and $C_{τ}(w) = \{c\}$. Let $θ$ be the tree obtained from $τ$ by deleting the internal node $w$ making its unique child $c$ a child of $v$. Then,

$$\forall\, u \in V(θ),\ H(θ[u]) = H(τ[u]) \iff \exists\, w' \in C_{τ}(v) \setminus \{w\},\ H(τ[w']) + 1 = H(τ[v]).$$

Proof.

Deleting $w$ may only decrease the height of $v$ and that of its ancestors in $τ$. If the height of $v$ is not changed by deleting $w$, the height of its ancestors will not be modified. The height of $v$ remains unchanged if and only if it has a child different from $w$ of height $H(τ[v]) - 1$. □

Lemma 6.

Let $τ$ be a tree, $v \in V(τ)$, $c \in C_{τ}(v)$. Let $θ$ be the tree obtained from $τ$ by deleting the subtree $τ[c]$. Then,

$$\forall\, u \in V(θ),\ H(θ[u]) = H(τ[u]) \iff \exists\, c' \in C_{τ}(v) \setminus \{c\},\ H(τ[c']) + 1 = H(τ[v]).$$

Proof.

The proof follows the same reasoning as in the previous result. □

4.1.4. NEST and NeST

In view of the foregoing, we consider the set of inserting and deleting operations that fulfill the requirements below.

Adding operations (see Figure 7)

Internal nodes (AI): adding w as a child of v making w the parent of the child c of v can be done only if $H (τ [c]) + 1 < H (τ [v])$ .
Subtrees (AS): adding t as a child of v can be done only if $H (t) + 1 \leq H (τ [v])$ .

Deleting operations (see Figure 8)

Internal nodes (DI): deleting $v \in C_{τ} (u)$ (making the unique child w of v a child of u) can be done only if there exists $v^{'} \in C_{τ} (u)$ , $v \neq v^{'}$ , such that $H (τ [v^{'}]) \geq H (τ [v])$ .
Subtrees (DS): deleting the subtree $τ [w]$ , $w \in C_{τ} (v)$ , of $τ$ can be done if there exists $w^{'} \in C_{τ} (v)$ , $w^{'} \neq w$ , such that $H (τ [w^{'}]) + 1 = H (τ [v])$ .
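The four conditions above only involve heights, so they can be written as small predicates. The following sketch is illustrative: the function names and argument conventions are assumptions, and the unique-child requirement in the DI predicate comes from the restriction on deleting operations stated in Section 4.1.2.

```python
def can_add_internal_node(height_c, height_v):
    """AI: insert w between v and its child c, allowed only if H(c) + 1 < H(v)."""
    return height_c + 1 < height_v


def can_add_subtree(height_t, height_v):
    """AS: add the tree t as a new child of v, allowed only if H(t) + 1 <= H(v)."""
    return height_t + 1 <= height_v


def can_delete_internal_node(height_v, n_children_of_v, sibling_heights):
    """DI: delete v and attach its unique child to the parent u of v.

    sibling_heights lists the heights of the other children of u.
    """
    return n_children_of_v == 1 and any(h >= height_v for h in sibling_heights)


def can_delete_subtree(height_v, sibling_heights):
    """DS: delete the subtree rooted at a child w of v.

    sibling_heights lists the heights of the other children of v.
    """
    return any(h + 1 == height_v for h in sibling_heights)


# A leaf child (height 0) under a vertex of height 2 can be pushed down one
# level, whereas a child of height 1 cannot.
print(can_add_internal_node(0, 2), can_add_internal_node(1, 2))
```

Each predicate expresses that the height of the parent vertex, and hence that of all its ancestors, is left unchanged by the operation.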

Proposition 5.

The editing operations AI and AS (DI and DS, respectively) are the only inserting (deleting, respectively) operations that ensure that (i) the induced mapping is a constrained mapping and that (ii) the height of all the pre-existing nodes is unchanged.

Proof.

This result is a direct corollary of Lemmas 2–6. □

The NEST (the NeST, respectively) of a tree $τ$ is the self-nested tree obtained by the set of inserting operations AI and AS (of deleting operations DI and DS, respectively) of minimal cost, the cost of inserting a subtree being its number of nodes. Existence and uniqueness of the NEST are not obvious at this stage. The NeST exists because the (self-nested) tree composed of a unique root can be easily obtained by deleting operations from any tree, but its uniqueness is not evident.

4.2. NEST Algorithm

To present our NEST algorithm in a concise form in Algorithm 2, we need to define the following operations involving two vectors $u$ and $v$ of the same size $n$ and a real number $γ$,

$$\left\{\begin{array}{lcl} u + v & = & (u_{1} + v_{1}, \dots, u_{n} + v_{n}), \\ u + γ & = & (u_{1} + γ, \dots, u_{n} + γ), \\ u \lor γ & = & (\max(u_{1}, γ), \dots, \max(u_{n}, γ)). \end{array}\right.$$

In other words, these operations must be understood component by component. In addition, in a condition, $u = 0$ ($u \neq 0$, respectively) means that for all $1 \leq i \leq n$, $u_{i} = 0$ ($u_{i} \neq 0$, respectively). Finally, for $1 \leq i \leq j \leq n$, $u_{i \dots j}$ denotes the vector $(u_{i}, \dots, u_{j})$ of length $j - i + 1$. This notation will also be used in Algorithm 3 for calculating the NeST. It should be noticed that an illustrative example that can help the reader to follow the progress of the algorithm is provided in Section 6.
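These conventions are easy to misread, in particular because $u \neq 0$ is not the negation of $u = 0$ (a vector with both zero and non-zero entries satisfies neither). A minimal sketch with plain Python lists follows; the helper names are hypothetical and chosen only for this illustration.

```python
def add(u, x):
    """u + v for a vector v of the same length, or u + gamma for a number gamma."""
    if isinstance(x, (list, tuple)):
        if len(x) != len(u):
            raise ValueError("vectors must have the same length")
        return [a + b for a, b in zip(u, x)]
    return [a + x for a in u]


def vmax(u, gamma):
    """u v gamma: component-wise maximum with the number gamma."""
    return [max(a, gamma) for a in u]


def is_zero(u):
    """Condition u = 0: every component is zero."""
    return all(a == 0 for a in u)


def is_nonzero(u):
    """Condition u != 0: every component is non-zero."""
    return all(a != 0 for a in u)


def sub(u, i, j):
    """u_{i...j} with 1-based inclusive indices, of length j - i + 1."""
    return list(u[i - 1:j])


u = [0, 3, 1]
print(add(u, [1, 1, 1]), add(u, 2), vmax(add(u, -1), 0))
print(is_zero(u), is_nonzero(u), sub(u, 2, 3))
```

With `u = [0, 3, 1]`, both `is_zero(u)` and `is_nonzero(u)` are false, which is the case to keep in mind when reading the conditions of Algorithms 2 and 3.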

Algorithm 2: Construction of the nearest embedding self-nested tree.

The relation between the above algorithm and the NEST of a tree is provided in the following result, which states in particular the existence of the NEST.

Proposition 6.

For any tree $τ$, Algorithm 2 returns the unique NEST of $τ$ in $O(H(τ)^{2} \times D(τ))$.

Proof.

By definition of the NEST, the height of all the pre-existing nodes of $τ$ cannot be modified. Thus, the number of nodes of height $h - 1$ under a node of height $h$ can only increase by inserting subtrees in the structure. Then we have

$$ρ_{\mathrm{NEST}(τ)}(h, h - 1) \geq \max ρ_{τ}(h, h - 1). \tag{3}$$

Let $v$ be a vertex of height $h$ in $τ$. We recall that $γ_{i}(v)$ denotes the number of subtrees of height $i$ under $v$. Our objective is to understand the consequences for $γ_{i}(v)$ of inserting operations to obtain $ρ_{\mathrm{NEST}(τ)}(h, h - 1)$ subtrees of height $h - 1$ under $v$. To this aim, we shall define a sequence $γ_{i}^{(h - 1, j)}(v)$ starting from $γ_{i}^{(h - 1, 0)}(v) = γ_{i}(v)$ that corresponds to the modified versions of $τ$. The first superscript $h - 1$ means that this sequence concerns editing operations used to get the right number of subtrees of height $h - 1$ under $v$. □

Let $Δ_{h - 1}^{(0)}(v) = ρ_{\mathrm{NEST}(τ)}(h, h - 1) - γ_{h - 1}^{(h - 1, 0)}(v)$ be the number of subtrees of height $h - 1$ that must be added under $v$ to obtain the height profile of the NEST under $v$, i.e.,

$$γ_{h - 1}^{(h - 1, 1)}(v) = ρ_{\mathrm{NEST}(τ)}(h, h - 1).$$

Implicitly, it means that $γ_{i}^{(h - 1, 1)}(v) = γ_{i}^{(h - 1, 0)}(v)$ for $i \neq h - 1$. The subtrees of height $h - 1$ that we must add are isomorphic, self-nested and embed all the subtrees of height $h - 2$ appearing in $τ$ by definition of the NEST. In particular, they can be obtained by the allowed inserting operations from the subtrees of height $h - 2$ under $v$, by first adding an internal node to increase their height to $h - 1$. In addition, it is less costly in terms of editing operations to construct the subtrees of height $h - 1$ from the subtrees of height $h - 2$ available under $v$ than to directly add these subtrees under $v$. If all the subtrees of height $h - 2$ under $v$ must be reconstructed later, it will be possible to insert them and the total cost will be the same as by directly adding the subtrees of height $h - 1$ under $v$. Consequently, all the available subtrees of height $h - 2$ are used to construct subtrees of height $h - 1$ under $v$ and there remain

$$Δ_{h - 1}^{(1)}(v) = (Δ_{h - 1}^{(0)}(v) - γ_{h - 2}^{(h - 1, 1)}(v)) \lor 0$$

subtrees of height $h - 1$ to be built under $v$. Furthermore, in the new version of $τ$, we have

$$γ_{h - 2}^{(h - 1, 2)}(v) = γ_{h - 2}^{(h - 1, 1)}(v) - Δ_{h - 1}^{(1)}(v).$$

The $Δ_{h - 1}^{(1)}(v)$ subtrees of height $h - 1$ can be constructed from subtrees of height $h - 3$ (with a larger cost than from subtrees of height $h - 2$), and so on. To this aim, we define the sequence of the modified versions of $τ$ by, for $0 \leq j \leq h - 2$,

$$\left\{\begin{array}{lcl} Δ_{h - 1}^{(j + 1)}(v) & = & (Δ_{h - 1}^{(j)}(v) - γ_{h - 1 - (j + 1)}^{(h - 1, j + 1)}(v)) \lor 0, \\ γ_{h - (j + 2)}^{(h - 1, j + 2)}(v) & = & γ_{h - (j + 2)}^{(h - 1, j + 1)}(v) - Δ_{h - 1}^{(j + 1)}(v). \end{array}\right.$$

At the final step $j = h - 2$, the $Δ_{h - 1}^{(0)}(v)$ subtrees of height $h - 1$ have been constructed from all the available subtrees appearing under $v$, starting from subtrees of height $h - 2$, then $h - 3$, etc., and then have been added if necessary.

From now on, the number of subtrees of height $h - 2$ under $v$ will not decrease. Indeed, it would mean that an internal node has been added between $v$ and the root of a subtree of height $h - 2$. This would have the consequence of increasing by one unit the number of subtrees of height $h - 1$ in subtrees of height $h$, whose cost is (strictly) larger than adding a subtree of height $h - 2$ in all the subtrees of height $h$. Consequently, we obtain

$$ρ_{\mathrm{NEST}(τ)}(h, h - 2) \geq \max_{\{v \in V(τ) : H(τ[v]) = h\}} γ_{h - 2}^{(h - 1, h)}(v).$$

We can reproduce the above reasoning to construct under $v$ subtrees of height $h - i$, $i$ from 2 to $h - 1$, from subtrees with a smaller height, which defines a sequence $γ_{i}^{(h - i, j)}$ of modified versions of $τ$, whose size is $h - i + 1$, and we get the following inequality,

$$\forall\, 2 \leq i \leq h, \quad ρ_{\mathrm{NEST}(τ)}(h, h - i) \geq \max_{\{v \in V(τ) : H(τ[v]) = h\}} γ_{h - i}^{(h - i + 1, h - i + 2)}(v). \tag{4}$$

The tree returned by Algorithm 2 is self-nested and its height profile saturates the inequalities (3) and (4) for all the possible values of $h$ and $i$ by construction. In addition, we have shown that this tree can be obtained from $τ$ by the allowed inserting operations. Since increasing by one unit the height profile at $(h_{1}, h_{2})$ has a (strictly) positive cost, this tree is thus the (unique) NEST of $τ$. As seen previously, the number of iterations of the while loop at line 7 is the number of subtrees of height $h_{2} < h_{1}$ available to construct a tree of height $h_{1}$, i.e., the outdegree $D(τ)$ of $τ$ in the worst case, which yields the stated complexity.

4.3. NeST Algorithm

This section is devoted to the presentation of the calculation of the NeST in Algorithm 3. An illustrative example that can help the reader to follow the progress of the algorithm is provided in Section 6.

Algorithm 3: Construction of the nearest embedded self-nested tree.

Proposition 7.

For any tree τ, Algorithm 3 returns the unique NeST of τ in O(H(τ)^{2})-time.

Proof.

The proof follows the same reasoning as the proof of Proposition 6. First, one may remark that

ρ_{NeST(τ)}(h, h - 1) \leq \min ρ_{τ}(h, h - 1),    (5)

because the number of subtrees of height h - 1 under a node v of height h can only decrease by the allowed deleting operations. Let v be a node of height h in τ and γ_{i}(v) the number of subtrees of height i under v. If a subtree of height h - i under v that must be deleted is not self-nested, one can first modify it to get a self-nested tree and then remove it with the same overall cost. Thus, we can assume without loss of generality that all the subtrees under v are self-nested.

Let Δ_{h - 1}(v) = γ_{h - 1}(v) - ρ_{NeST(τ)}(h, h - 1) denote the number of subtrees of height h - 1 that have to be removed from v. Let γ_{i}^{(j)}(v) be the sequence of the modifications to obtain ρ_{NeST(τ)}(h, h - 1) subtrees of height h - 1 under v, with γ_{i}^{(0)}(v) = γ_{i}(v). Instead of deleting a subtree of height h - 1, it is always less costly to decrease its height by one unit by deleting its root. However, this is possible only if this internal node has only one child, i.e., if ρ_{τ}(h - 1, h - 2) = 1 and ρ_{τ}(h - 1, i) = 0 for 0 \leq i < h - 2. If this new tree of height h - 2 must be deleted in the sequel, it will be done with the same global cost as by directly deleting the subtree of height h - 1. Consequently,

\left\{ \begin{array}{lcl}
γ_{h - 1}^{(1)}(v) & = & ρ_{NeST(τ)}(h, h - 1), \\
γ_{h - 2}^{(1)}(v) & = & γ_{h - 2}^{(0)}(v) + Δ_{h - 1}(v) \, \mathbb{1}_{\{ρ_{τ}(h - 1, h - 2) = 1, \ \forall 3 \leq i \leq h, \ ρ_{τ}(h - 1, h - i) = 0\}}.
\end{array} \right.

From now on, the number of subtrees of height h - 2 under v will thus not increase and we obtain

ρ_{NeST(τ)}(h, h - 2) \leq \min_{\{v \in V(τ) : H(τ[v]) = h\}} γ_{h - 2}^{(1)}(v).

There are Δ_{h - 2}(v) = γ_{h - 2}^{(1)}(v) - ρ_{NeST(τ)}(h, h - 2) subtrees of height h - 2 to be deleted under v. We can repeat the previous reasoning and delete the root of subtrees of height h - 2 if possible rather than delete the whole structure, and so on for any height. Thus, the sequence γ_{i}^{(j)} is defined from

\left\{ \begin{array}{lcl}
Δ_{h - 1 - i}(v) & = & γ_{h - 1 - i}^{(i)}(v) - ρ_{NeST(τ)}(h, h - 1 - i), \\
γ_{h - 1 - i}^{(i + 1)}(v) & = & ρ_{NeST(τ)}(h, h - 1 - i), \\
γ_{h - 2 - i}^{(i + 1)}(v) & = & γ_{h - 2 - i}^{(i)}(v) + Δ_{h - 1 - i}(v) \, \mathbb{1}_{\{ρ_{τ}(h - 1 - i, h - 2 - i) = 1, \ \forall i + 3 \leq j \leq h, \ ρ_{τ}(h - 1 - i, h - j) = 0\}},
\end{array} \right.

and we have

\forall 0 \leq i \leq h - 2, ρ_{NeST(τ)}(h, h - 2 - i) \leq \min_{\{v \in V(τ) : H(τ[v]) = h\}} γ_{h - 2 - i}^{(i + 1)}(v).    (6)

The tree returned by Algorithm 3 saturates the inequalities (5) and (6) for all the possible values of h and i. Decreasing the height profile by one unit at (h_{1}, h_{2}) has a (strictly) positive cost. Thus, this tree is the (unique) NeST of τ. The time-complexity is given by the size of the height profile array. □
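The recursion on γ and Δ used in the proof can be traced on small inputs. The following Python sketch is not Algorithm 3 itself: it only handles one height h, assumes that the counts γ_{i}(v) are known for every node v of height h, and assumes that a flag tells, for each height i, whether the root of a subtree of height i has exactly one child. The names `saturate_row`, `gammas` and `single_child` are hypothetical and chosen for illustration only.

```python
def saturate_row(gammas, single_child):
    """Sketch of the recursion on gamma and Delta for one height h.

    gammas[k][i]    -- number of subtrees of height i under the k-th node
                       of height h (i = 0, ..., h - 1).
    single_child[i] -- True if the root of a subtree of height i has exactly
                       one child, so that deleting this root yields a
                       subtree of height i - 1 (assumed input).
    Returns rho, where rho[i] is the largest value allowed by the
    inequalities (5) and (6) at position (h, i).
    """
    h = len(single_child)
    work = [list(g) for g in gammas]  # copies: the inputs are not modified
    rho = [0] * h
    for i in range(h - 1, -1, -1):    # from height h - 1 down to height 0
        rho[i] = min(g[i] for g in work)
        for g in work:
            delta = g[i] - rho[i]     # subtrees of height i to remove
            g[i] = rho[i]
            if i >= 1 and single_child[i]:
                # Deleting the root is cheaper than deleting the subtree:
                # each removed subtree becomes a subtree of height i - 1.
                g[i - 1] += delta
    return rho


# Two nodes of height 3; subtrees of height 2 have a single-child root.
print(saturate_row([[0, 0, 2], [0, 1, 1]], [False, False, True]))  # [0, 1, 1]
```

In this example, the extra subtree of height 2 under the first node is turned into a subtree of height 1 instead of being deleted, so one subtree of height 1 survives under both nodes. Without the single-child condition, the same input gives `[0, 0, 1]`.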

5. Numerical Illustration

5.1. Random Trees

The aim of this section is to illustrate the behavior of the NEST and of the NeST on a set of simulated random trees regarding both the quality of the approximation and the computation time. We have simulated 3000 random trees of size 10, 20, 30, 40, 50, 75, 100, 150, 200, and 250. For each tree, we have calculated the NEST and the NeST. The number of nodes of these approximations is displayed in Figure 9. We can observe that the number of nodes of the NEST is very large relative to the size of the initial tree: approximately one thousand nodes on average for a tree of 150 nodes, which is to say an approximation error of about 850 vertices. Remarkably, the NEST has never been a better approximation than the NeST on the set of simulated trees.

The computation time required to compute the NEST or the NeST of one tree on a 2.8 GHz Intel Core i7 has also been estimated on the set of simulated trees and is presented in Figure 10. As predicted by the theoretical complexities given in Propositions 6 and 7, the NeST algorithm requires less computation time than the NEST. Consequently, the NeST provides a much better and faster approximation of the initial data than the NEST.

5.2. Structural Analysis of a Rice Panicle

Following [5], we propose to quantify the degree of self-nestedness of a tree τ by the following indicator based on the calculation of NEST(τ),

δ_{NEST}(τ) = 1 - \frac{D_{Z}(NEST(τ), τ)}{\# V(τ)} = \frac{2 \# V(τ) - \# V(NEST(τ))}{\# V(τ)},    (7)

where D_{Z} stands for Zhang’s edit distance [11]. In [5] (Equation (6)), the degree of self-nestedness of a plant is defined as in (7) but normalizing by the number of nodes of the NEST and not the size of the initial data, which prevents the indicator from being negative. In the present paper, we prefer normalizing by the number of nodes of τ to obtain the following comparable self-nestedness measure based on the calculation of NeST(τ),

δ_{NeST}(τ) = 1 - \frac{D_{Z}(NeST(τ), τ)}{\# V(τ)} = \frac{\# V(NeST(τ))}{\# V(τ)}.

The main advantage of this normalization is that if the NEST and the NeST offer equally good approximations, i.e., D_{Z}(NEST(τ), τ) = D_{Z}(NeST(τ), τ), then the degree of self-nestedness does not depend on the chosen approximation scheme, δ_{NEST}(τ) = δ_{NeST}(τ).
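As a small worked example, the two indicators can be evaluated from node counts alone, since the edit distance to an embedding (respectively embedded) approximation reduces to the difference of the numbers of nodes, as in (7). The Python sketch below uses hypothetical function names (`delta_embedding`, `delta_embedded`) and the node counts reported in Figure 2; exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def delta_embedding(n_tree, n_nest):
    """Indicator (7) based on the NEST, which has n_nest >= n_tree nodes."""
    return Fraction(2 * n_tree - n_nest, n_tree)


def delta_embedded(n_tree, n_nest):
    """Indicator based on the NeST, which has n_nest <= n_tree nodes."""
    return Fraction(n_nest, n_tree)


# Tree of Figure 2: 30 nodes, NEST with 37 nodes, NeST with 24 nodes.
print(delta_embedding(30, 37))  # 23/30
print(delta_embedded(30, 24))   # 4/5

# The NEST-based indicator can be negative for large approximations,
# hence the truncation at 0 used when plotting it.
print(max(delta_embedding(150, 1000), 0))  # 0
```

For the tree of Figure 2, the NeST-based indicator (4/5) is slightly larger than the NEST-based one (23/30), because the NeST is closer to the tree (6 deleted nodes against 7 added nodes).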

We propose to investigate the degree of structural self-similarity of the topological structure of the rice panicle studied in [5] (Section 4.2, Analysis of a Real Plant) through these self-nested approximations. The rice panicle V_{1} is made of a main axis bearing a main inflorescence P_{1} and lateral systems V_{i}, 2 \leq i \leq 5, each composed of inflorescences P_{j}, 2 \leq j \leq 8 (see Figure 11). We have computed the indicators of self-nestedness δ_{NEST} \lor 0 (i.e., the maximum of δ_{NEST} and 0) and δ_{NeST} for each substructure composing the whole panicle (see Figure 12). The numerical values and the shape of these indicators are similar. However, δ_{NeST} is always greater than δ_{NEST}, in particular for the largest structures V_{i}. Because it relies on a better approximation procedure, as highlighted in the previous section, the NeST better captures the self-nestedness of the rice panicle.

6. Summary and Concluding Remarks

Self-nested trees are unordered rooted trees that are the most compressed by DAG compression. Since DAG compression takes advantage of subtree repetitions, they present the highest level of redundancy in their subtrees. In this paper, we have developed a new algorithm for computing the Nearest Embedding Self-nested Tree (NEST) of a tree τ in O(H(τ)^{2} \times D(τ)), as well as the first algorithm for determining its Nearest embedded Self-nested Tree (NeST) with time-complexity O(H(τ)^{2}).

To this end, we have introduced the notion of height profile of a tree. Roughly speaking, the height profile is a triangular array whose component (h_{1}, h_{2}), with h_{2} < h_{1}, is the list of the numbers of direct subtrees of height h_{2} in subtrees of height h_{1}, where a subtree is said to be direct if it is attached to the root. We have shown in Proposition 3 that self-nested trees are characterized by their height profile. While the first NEST algorithm [5] was based on editing the DAG related to the tree to be compressed, the two approximation algorithms developed in the present paper take as input the height profile of any tree τ, which can be computed in O(\# V(τ) \times D(τ))-time (see Proposition 2), and modify it from top to bottom and from right to left, to return the self-nested height profile of the expected estimate (see Algorithms 2 and 3). Figure 13 and Figure 14 illustrate the progress of the algorithms on a simple example. They should be examined in relation to the corresponding algorithms. We would like to emphasize that our paper also states the uniqueness of the NEST and of the NeST, and studies the link with edit operations admitted in Zhang’s distance.
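The notion of height profile recalled above can be illustrated with a short Python sketch. It does not rely on the treex library: a tree is encoded here, as an assumption made for the example, by nested lists in which each node is the list of its children (a leaf is `[]`). The functions `height_profile` and `is_self_nested` are hypothetical names, and the self-nestedness test is one reading of the characterization by the height profile: all the subtrees of a given height have the same number of direct subtrees of each smaller height.

```python
from collections import defaultdict


def height_profile(tree):
    """Return {(h1, h2): counts} with h2 < h1, where counts lists, for each
    subtree of height h1, its number of direct subtrees of height h2
    (zeros included)."""
    profile = defaultdict(list)

    def visit(node):
        # Heights of the direct subtrees, computed bottom-up.
        child_heights = [visit(child) for child in node]
        h = 1 + max(child_heights) if child_heights else 0
        for h2 in range(h):
            profile[(h, h2)].append(child_heights.count(h2))
        return h

    visit(tree)
    return dict(profile)


def is_self_nested(tree):
    """True if every component of the height profile is a constant list."""
    return all(len(set(c)) == 1 for c in height_profile(tree).values())


# Root with three children: a node with one leaf, a node with two leaves,
# and a leaf.
tree = [[[]], [[], []], []]
print(height_profile(tree))  # {(1, 0): [1, 2], (2, 0): [1], (2, 1): [2]}
print(is_self_nested(tree))  # False
```

In this example, the two subtrees of height 1 do not have the same number of leaves, so the component (1, 0) is not constant and the tree is not self-nested. Adding one leaf to the first child, or deleting one leaf from the second, makes this component constant.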

Remarkably, estimations performed on a dataset of random trees establish that the NeST is a more accurate approximation of the initial tree than the NEST. This observation could be investigated from a theoretical perspective. In addition, we have shown that the NeST better captures the degree of structural self-similarity of a rice panicle than the NEST.

The algorithms developed in this paper are available in the latest version of the Python library treex [9].

Funding

This research received no external funding.

Acknowledgments

The author would like to show his gratitude to two anonymous reviewers for their relevant comments on a first version of the manuscript.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Bille, P.; Gørtz, I.L.; Landau, G.M.; Weimann, O. Tree compression with top trees. Inf. Comput. 2015, 243, 166–177.
2. Bousquet-Mélou, M.; Lohrey, M.; Maneth, S.; Noeth, E. XML Compression via Directed Acyclic Graphs. Theory Comput. Syst. 2014, 57, 1322–1371.
3. Buneman, P.; Grohe, M.; Koch, C. Path Queries on Compressed XML. In Proceedings of the 29th International Conference on Very Large Data Bases, VLDB’03, Berlin, Germany, 9–12 September 2003; Volume 29, pp. 141–152.
4. Frick, M.; Grohe, M.; Koch, C. Query evaluation on compressed trees. In Proceedings of the 18th Annual IEEE Symposium of Logic in Computer Science, Ottawa, ON, Canada, 22–25 June 2003; pp. 188–197.
5. Godin, C.; Ferraro, P. Quantifying the degree of self-nestedness of trees. Application to the structural analysis of plants. IEEE Trans. Comput. Biol. Bioinform. 2010, 7, 688–703.
6. Busatto, G.; Lohrey, M.; Maneth, S. Efficient Memory Representation of XML Document Trees. Inf. Syst. 2008, 33, 456–474.
7. Lohrey, M.; Maneth, S. The Complexity of Tree Automata and XPath on Grammar-compressed Trees. Theor. Comput. Sci. 2006, 363, 196–210.
8. Greenlaw, R. Subtree Isomorphism is in DLOG for Nested Trees. Int. J. Found. Comput. Sci. 1996, 7, 161–167.
9. Azaïs, R.; Cerutti, G.; Gemmerlé, D.; Ingels, F. treex: A Python package for manipulating rooted trees. J. Open Source Softw. 2019, 4, 1351.
10. Aho, A.V.; Hopcroft, J.E.; Ullman, J.D. The Design and Analysis of Computer Algorithms, 1st ed.; Addison-Wesley Longman Publishing Co., Inc.: Boston, MA, USA, 1974.
11. Zhang, K. A constrained edit distance between unordered labeled trees. Algorithmica 1996, 15, 205–222.

Figure 1. Trees and their DAG (Directed Acyclic Graph) reduction. In the tree, roots of isomorphic subtrees are colored identically. In the DAG, vertices are equivalence classes colored according to the class of isomorphic subtrees that they represent.

Figure 2. A tree τ (middle) with 30 nodes and its approximations NeST(τ) (left) with 24 nodes and NEST(τ) (right) with 37 nodes.

Figure 3. Deleting a node.

Figure 4. Inserting a node.

Figure 5. The tree θ is obtained from τ by inserting an internal node. The associated mapping does not satisfy the conditions imposed by Zhang [11] because the LCA (Least Common Ancestor) of v_{1} and v_{2} is a proper ancestor of v_{3} whereas the LCA of w_{1} and w_{2} is not a proper ancestor of w_{3}.

Figure 6. Adding a node as new child of w making all the current children of w children of this new node (top) provides the same topology as adding a new node between v and its child w (bottom).

Figure 7. Allowed (✓) and forbidden (✗) inserting operations to construct the NEST of a tree.

Figure 8. Allowed (✓) and forbidden (✗) deleting operations to construct the NeST of a tree.

Figure 9. Number of nodes of the NEST (left) and of the NeST (right) estimated from 3000 random trees: average (full lines) and first and third quartiles (dashed lines).

Figure 10. Average running time required to compute the NEST (dashed line) or the NeST (full line) estimated from 3000 simulated trees.

Figure 11. The rice panicle is composed of a main axis and lateral systems V_{i}, each made of one or several inflorescences P_{j}.

Figure 12. Degree of self-nestedness measured by δ_{NEST} \lor 0 (dashed lines) and δ_{NeST} (full lines) of the different substructures appearing in the rice panicle.

Figure 13. Progress of Algorithm 2 to compute the NEST of the left tree from its height profile. Only the second line must be edited to get the correct output. Edits to the height profile are associated with the addition of vertices, shown in red. The output tree is self-nested and has been constructed by adding a minimal number of nodes to the initial tree.

Figure 14. Progress of Algorithm 3 to compute the NeST of the left tree from its height profile. Only the second line must be edited to get the correct output. Edits to the height profile are associated with the deletion of vertices, shown in dashed lines. The output tree is self-nested and has been constructed by removing a minimal number of nodes from the initial tree.

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
