Article

Meta-Tree Random Forest: Probabilistic Data-Generative Model and Bayes Optimal Prediction

1 Department of Pure and Applied Mathematics, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
2 Faculty of Informatics, Gunma University, 4-2, Maebashi, Gunma 371-8510, Japan
3 Center for Data Science, Waseda University, 1-6-1 Nishiwaseda, Shinjuku-ku, Tokyo 169-8050, Japan
* Author to whom correspondence should be addressed.
Current address: Digital Printing Business Operations, Canon Inc., Tokyo 146-8501, Japan.
Submission received: 19 May 2021 / Revised: 5 June 2021 / Accepted: 8 June 2021 / Published: 18 June 2021

Abstract

This paper deals with the problem of predicting a new targeting variable corresponding to a new explanatory variable, given a training dataset. To predict the targeting variable, we consider a model tree, which is used to represent the conditional probabilistic structure of a targeting variable given an explanatory variable, and we discuss statistical optimality of prediction based on the Bayes decision theory. The optimal prediction based on the Bayes decision theory is given by weighting all the model trees in the model tree candidate set, where the model tree candidate set is a set of model trees in which the true model tree is assumed to be included. Because the number of model trees in the model tree candidate set increases exponentially with the maximum depth of the model trees, so does the computational complexity of weighting them. To solve this issue, we introduce the notion of a meta-tree and propose an algorithm called MTRF (Meta-Tree Random Forest), which uses multiple meta-trees. Theoretical and experimental analyses of MTRF show its superiority to previous decision-tree-based algorithms.

1. Introduction

Various studies in pattern recognition deal with the problem of predicting a targeting variable $y_{n+1}$ corresponding to an explanatory variable $\boldsymbol{x}_{n+1}$, given pairs of explanatory and targeting variables $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^n$. In many of them, the targeting variable $y_{n+1}$ is predicted with a tree $T$. One way to use a tree $T$ is to represent a function $y_{n+1} = f(\boldsymbol{x}_{n+1}; T)$ and predict $y_{n+1}$ through it. A tree $T$ used to represent the function $y_{n+1} = f(\boldsymbol{x}_{n+1}; T)$ is called a decision tree in the literature. In this paper, however, we call this tree a function tree, because we distinguish a tree used to represent a function from a tree used to represent a data-generative model (the latter is explained in the next paragraph). Among previous studies, algorithms that use a single function tree include, e.g., CART [1]. Algorithms that use multiple function trees have also been studied; e.g., Random Forest [2] constructs multiple function trees from $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^n$ and aggregates them to predict $y_{n+1}$ from $\boldsymbol{x}_{n+1}$. There are various extensions of Random Forest; e.g., Generalized Random Forest [3] generalizes the splitting rule of a function tree. Boosting constructs function trees sequentially from $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^n$ and combines them to predict $y_{n+1}$ from $\boldsymbol{x}_{n+1}$; examples include the gradient boosting methods of Gradient Boost [4] and XGBoost [5]. Further, previous studies such as Alternating Decision Forest [6] and Boosted Random Forest [7] combine the ideas of Random Forest and Boosting. Moreover, combinations of function trees and neural networks have also been discussed. In Neural Decision Forest [8], inner nodes are replaced by randomized multi-layer perceptrons. In Deep Neural Decision Forest [9], they are replaced by deep convolutional neural networks (CNNs). In Adaptive Neural Trees [10], not only are the nodes replaced by CNNs, but the edges are also replaced by nonlinear functions. In Deep Forest [11], Random Forests are combined in a cascade structure like deep CNNs. A function tree is also used to interpret a deep neural network as an input-output function in Reference [12]. In all of these algorithms, a function tree represents the function from $\boldsymbol{x}_{n+1}$ to $y_{n+1}$. Although this function is trained on the data $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^n$ under a certain criterion, statistical optimality of the prediction of $y_{n+1}$ is not necessarily discussed theoretically, because these algorithms usually do not assume a probabilistic data-generative structure of $\boldsymbol{x}$ and $y$. In contrast, as we will describe later in detail, we assume a probabilistic data-generative structure of $\boldsymbol{x}$ and $y$ and consider a statistically optimal prediction of $y_{n+1}$. This is the crucial difference between our study and the related works.
Another way to use a tree $T$ to predict the targeting variable $y_{n+1}$ is to represent a data-generative model $p(y \mid \boldsymbol{x}, T)$, i.e., a conditional probabilistic structure of $y$ given $\boldsymbol{x}$. A tree $T$ used to represent the data-generative model $p(y \mid \boldsymbol{x}, T)$ is called a model tree throughout this paper. Few studies have assumed such a model tree; one that has is Reference [13]. Because Reference [13] assumed that a targeting variable $y_{n+1}$ is generated according to $p(y \mid \boldsymbol{x}_{n+1}, T)$, statistical optimality of the prediction of $y_{n+1}$ can be discussed theoretically. Specifically, based on the Bayes decision theory [14], Reference [13] proposed the optimal prediction of $y_{n+1}$.
In order to explain the optimal prediction based on the Bayes decision theory in more detail, we introduce the term model tree candidate set. The model tree candidate set is a set of model trees in which the true model tree is assumed to be included (when the model tree $T$ is used to represent the true data-generative model $p(y \mid \boldsymbol{x}, T)$, we call it the true model tree). One example of a model tree candidate set is the set of model trees whose depth is at most $d \in \mathbb{N}$. The optimal prediction based on the Bayes decision theory is given by weighting all the model trees in the model tree candidate set, and the optimal weight of each model tree $T$ is given by its posterior probability $p(T \mid \{(\boldsymbol{x}_i, y_i)\}_{i=1}^n)$.
Because the number of model trees in the model tree candidate set increases exponentially with the maximum depth of the model trees, so does the computational complexity of weighting them. One way to reduce the computational complexity is to restrict the model tree candidate set. To represent a restricted model tree candidate set, Reference [13] proposed the notion of a meta-tree. The concept of a meta-tree was originally used for data compression in information theory (see, e.g., Reference [15]); recently, it was also used for image compression [16]. As shown in Figure 1, a model tree candidate set is composed of the model trees represented by the meta-tree.
A meta-tree is also used for the prediction of y n + 1 in the algorithm proposed by Reference [13]. In summary, a meta-tree has the following two roles: (i) it represents a model tree candidate set; and (ii) it is used for the prediction algorithm. The characteristics of a meta-tree are as follows:
  • If the true model tree is in a model tree candidate set represented by a meta-tree, the statistically optimal prediction—optimal prediction based on the Bayes decision theory—can be calculated.
  • The number of model trees in a model tree candidate set represented by a meta-tree increases exponentially according to the depth of the meta-tree.
  • The computational cost in learning processes of a single meta-tree has the same order as that of a single function tree.
Under the assumption that the true model tree is in the restricted model tree candidate set represented by a meta-tree, Reference [13] proposed the optimal prediction based on the Bayes decision theory.
As we have described above, if the true model tree is actually included in the model tree candidate set, the optimal prediction based on the Bayes decision theory can be calculated. Hence, it is desirable to construct a model tree candidate set that includes as many model trees as possible within the allowed computational cost. Motivated by this fact, this paper extends the model tree candidate set of Reference [13]. Instead of considering a single meta-tree as in Reference [13], we consider multiple meta-trees. By using the model tree candidate set represented by multiple meta-trees, we predict $y_{n+1}$ based on the Bayes decision theory. We call the proposed algorithm MTRF (Meta-Tree Random Forest). The characteristics of MTRF are as follows:
  • If the true model tree is in a model tree candidate set represented by any of the meta-trees of MTRF, the statistically optimal prediction—the optimal prediction based on the Bayes decision theory—can be calculated.
  • The number of model trees in the model tree candidate set represented by multiple meta-trees is a constant factor larger than the number contained in a single meta-tree.
  • The computational cost of the learning process for the meta-trees of MTRF has the same order as that for the multiple function trees of Random Forest.
Regarding the meta-tree in Reference [13] or the multiple meta-trees of MTRF, how to construct the meta-tree or meta-trees is a problem. One way to solve this issue is to use function tree-based algorithms; for example, a meta-tree in Reference [13] can be constructed by regarding a function tree built by CART as a meta-tree, and the multiple meta-trees of MTRF can be constructed by regarding the function trees built by Random Forest as meta-trees. In this way, by regarding a function tree as a meta-tree, our proposed method can be applied to any practical application in which a function tree-based algorithm is used—insurance claim tasks (e.g., Reference [5]), letter classification (e.g., Reference [6]), semantic segmentation (e.g., Reference [8]), face recognition (e.g., Reference [11]), music classification (e.g., Reference [11]), etc.—and we can improve every previous function tree-based algorithm. This is because we can perform the prediction of $y_{n+1}$ by using more model trees than the previous algorithms. A more detailed discussion is given in Sections 3 and 4.
The rest of this paper is organized as follows. In Section 2, we state the preliminaries of our study. Section 3 explains our proposed method—MTRF. In Section 4, we revisit the previous studies, such as CART [1] and Random Forest [2], and compare them with MTRF. Section 5 shows the results of experiments. Section 6 concludes this paper.

2. Preliminaries

2.1. Model Tree

Let $K$ denote the dimension of the explanatory variables. The space of feature values is denoted by $\mathcal{X} := \{0, 1, \ldots, |\mathcal{X}| - 1\}$, and the explanatory variable is denoted by $\boldsymbol{x} := (x_1, \ldots, x_K) \in \mathcal{X}^K$. In addition, the space of label values is denoted by $\mathcal{Y} := \{0, 1, \ldots, |\mathcal{Y}| - 1\}$, and the targeting variable is denoted by $y \in \mathcal{Y}$.
In the field of pattern recognition, a probabilistic structure of $\boldsymbol{x}$ and $y$ is not necessarily assumed; in particular, it is not assumed in most function tree-based algorithms. In contrast, we consider a triplet $(T, \boldsymbol{k}, \boldsymbol{\theta})$ defined in Definition 1 below and assume the probabilistic structure $p(y \mid \boldsymbol{x}, \boldsymbol{\theta}, T, \boldsymbol{k})$ as in Definition 2. We call $(T, \boldsymbol{k})$ a model tree (in Section 1, we called $T$ a model tree; more precisely, however, a model tree is the pair $(T, \boldsymbol{k})$) and $\boldsymbol{\theta}$ a tree parameter.
Definition 1.
The notation $T$ denotes an $|\mathcal{X}|$-ary regular tree (here, "regular" means that every inner node has $|\mathcal{X}|$ child nodes) whose depth is at most $D_\mathrm{max}$, and $\mathcal{T}$ denotes the set of such $T$. Let $s$ denote a node of a tree and $\mathcal{S}$ the set of all nodes. The set of inner nodes of $T$ is denoted by $\mathcal{I}(T)$, and the set of leaf nodes of $T$ is denoted by $\mathcal{L}(T)$. Let $k_s \in \{1, 2, \ldots, K\}$ be the component of the feature assign vector at node $s$; the explanatory variable $x_{k_s} \in \mathcal{X}$ is assigned to the inner node $s \in \mathcal{I}(T)$. Let $\boldsymbol{k} := (k_s)_{s \in \mathcal{S}}$ denote a feature assign vector, and let $\mathcal{K}$ denote the set of $\boldsymbol{k}$. Let $\boldsymbol{\theta}_s := (\theta_{0|s}, \theta_{1|s}, \ldots, \theta_{|\mathcal{Y}|-1|s}) \in (0,1)^{|\mathcal{Y}|}$, where $\sum_{y=0}^{|\mathcal{Y}|-1} \theta_{y|s} = 1$, be the parameter assigned to $s \in \mathcal{S}$. We define $\boldsymbol{\theta} := (\boldsymbol{\theta}_s)_{s \in \mathcal{S}}$, and $\boldsymbol{\Theta}$ is the set of $\boldsymbol{\theta}$.
Definition 2.
Because the value of the explanatory variable $\boldsymbol{x}$ corresponds to a path from the root node to a leaf node of the tree $T$ whose feature assign vector is $\boldsymbol{k}$, let $s(\boldsymbol{x}) \in \mathcal{L}(T)$ denote the corresponding leaf node. Then, for given $\boldsymbol{x}$, $\boldsymbol{\theta}$, $T$, and $\boldsymbol{k}$, the label variable $y$ is generated according to
$$p(y \mid \boldsymbol{x}, \boldsymbol{\theta}, T, \boldsymbol{k}) := \theta_{y \mid s(\boldsymbol{x})}. \quad (1)$$
Figure 2a shows two examples of $(T, \boldsymbol{k}, \boldsymbol{\theta})$, and Figure 2b illustrates $p(y \mid \boldsymbol{x}, \boldsymbol{\theta}, T, \boldsymbol{k})$.
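As an illustration of Definitions 1 and 2, the following Python sketch stores a model tree as nested nodes and samples $y$ from $p(y \mid \boldsymbol{x}, \boldsymbol{\theta}, T, \boldsymbol{k}) = \theta_{y \mid s(\boldsymbol{x})}$ by following the path determined by $\boldsymbol{x}$. All class and function names here are our own illustrative choices; the paper defines only the mathematical objects.

```python
import numpy as np

class Node:
    """A node of a model tree (T, k) with tree parameter theta (Definition 1)."""
    def __init__(self, k_s=None, theta_s=None, children=None):
        self.k_s = k_s            # index of the explanatory variable assigned to this inner node
        self.theta_s = theta_s    # categorical parameter over Y, used when this node is a leaf
        self.children = children  # list of |X| child nodes, or None for a leaf

    def leaf_of(self, x):
        """Return the leaf s(x) reached by following the path determined by x."""
        if self.children is None:
            return self
        return self.children[x[self.k_s]].leaf_of(x)

def sample_y(root, x, rng):
    """Draw y ~ p(y | x, theta, T, k) = theta_{y|s(x)} (Definition 2)."""
    leaf = root.leaf_of(x)
    return rng.choice(len(leaf.theta_s), p=leaf.theta_s)

# A depth-1 binary model tree whose root splits on x_{k_s} with k_s = 2.
rng = np.random.default_rng(0)
root = Node(k_s=2, children=[Node(theta_s=[0.9, 0.1]), Node(theta_s=[0.2, 0.8])])
x = np.array([0, 1, 1, 0])         # K = 4 binary explanatory variables
print(sample_y(root, x, rng))      # a label y in {0, 1}
```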

2.2. Problem Setup

In this subsection, we introduce our problem setup. Let $\boldsymbol{x}_i := (x_{i1}, \ldots, x_{iK}) \in \mathcal{X}^K$ ($i = 1, \ldots, n$) denote the value of the explanatory variable of the $i$-th data point and $\boldsymbol{x}^n := (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$. We assume that $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$ are i.i.d. random variables drawn from a probability distribution. Let $y_i \in \mathcal{Y}$ ($i = 1, \ldots, n$) denote the targeting variable of the $i$-th data point and $y^n := (y_1, \ldots, y_n)$. We assume that $y_1, \ldots, y_n$ are generated according to $p(y \mid \boldsymbol{x}, \boldsymbol{\theta}, T, \boldsymbol{k})$ defined in Section 2.1. We assume that the true model tree $(T, \boldsymbol{k})$ and true tree parameter $\boldsymbol{\theta}$ are unknown, but that a model tree candidate set $\mathcal{M} \subseteq \mathcal{T} \times \mathcal{K}$ is known. Note that, as explained in Section 1, the model tree candidate set $\mathcal{M}$ is a set of model trees $(T, \boldsymbol{k})$ in which the true model tree is assumed to be included. Further, we assume that the class of parametrized distributions $\{p(y \mid \boldsymbol{x}, \boldsymbol{\theta}, T, \boldsymbol{k}) : \boldsymbol{\theta} \in \boldsymbol{\Theta}, (T, \boldsymbol{k}) \in \mathcal{M}\}$ is known.
The purpose of the prediction problem is to predict the unknown targeting variable $y_{n+1}$ corresponding to an explanatory variable $\boldsymbol{x}_{n+1}$ by using the given data $\boldsymbol{x}^n$, $y^n$, and $\boldsymbol{x}_{n+1}$.

2.3. Optimal Prediction Based on the Bayes Decision Theory

We introduce a statistically optimal prediction under the problem setup in Section 2.2. From the viewpoint of the Bayes decision theory [14], we derive the optimal prediction. To this end, we assume priors $p(\boldsymbol{\theta})$, $p(T)$, and $p(\boldsymbol{k})$.
First, we define a decision function as
$$\delta : \mathcal{X}^{Kn} \times \mathcal{Y}^n \times \mathcal{X}^K \to \mathcal{Y}; \quad (\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}) \mapsto \delta(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}). \quad (2)$$
Second, we define a 0–1 loss as follows:
$$l(y_{n+1}, \delta(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1})) := \begin{cases} 0 & (y_{n+1} = \delta(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1})), \\ 1 & (y_{n+1} \neq \delta(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1})). \end{cases} \quad (3)$$
Third, using the 0–1 loss, we define the loss function, which is the expectation of the 0–1 loss taken over the new targeting variable $y_{n+1}$:
$$L(\delta(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}), \boldsymbol{\theta}, T, \boldsymbol{k}) := \sum_{y_{n+1} \in \mathcal{Y}} p(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{\theta}, T, \boldsymbol{k}) \, l(y_{n+1}, \delta(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1})). \quad (4)$$
Next, given the loss function, we define the risk function, which is the expectation of the loss function taken over the training data $\boldsymbol{x}^n$, $y^n$, and $\boldsymbol{x}_{n+1}$:
$$R(\delta(\cdot,\cdot,\cdot), \boldsymbol{\theta}, T, \boldsymbol{k}) := \sum_{\boldsymbol{x}^n \in \mathcal{X}^{Kn}} \sum_{y^n \in \mathcal{Y}^n} \sum_{\boldsymbol{x}_{n+1} \in \mathcal{X}^K} p(\boldsymbol{x}_{n+1}) \, p(\boldsymbol{x}^n) \, p(y^n \mid \boldsymbol{x}^n, \boldsymbol{\theta}, T, \boldsymbol{k}) \, L(\delta(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}), \boldsymbol{\theta}, T, \boldsymbol{k}). \quad (5)$$
Finally, the Bayes risk function, which is the expectation of the risk function taken over $\boldsymbol{\theta}$, $T$, and $\boldsymbol{k}$, is defined as
$$\mathrm{BR}(\delta(\cdot,\cdot,\cdot)) := \sum_{(T,\boldsymbol{k}) \in \mathcal{M}} p(\boldsymbol{k}) \, p(T) \int_{\boldsymbol{\Theta}} p(\boldsymbol{\theta}) \, R(\delta(\cdot,\cdot,\cdot), \boldsymbol{\theta}, T, \boldsymbol{k}) \, d\boldsymbol{\theta}. \quad (6)$$
Then, we have the following theorem:
Theorem 1.
Under the setup in Section 2.2, the decision function $\delta^*(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1})$ that minimizes the Bayes risk function is given by
$$\delta^*(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}) = \mathop{\arg\max}_{y_{n+1}} \sum_{(T,\boldsymbol{k}) \in \mathcal{M}} p(\boldsymbol{k} \mid \boldsymbol{x}^n, y^n) \, p(T \mid \boldsymbol{x}^n, y^n, \boldsymbol{k}) \int_{\boldsymbol{\Theta}} p(\boldsymbol{\theta} \mid \boldsymbol{x}^n, y^n, T, \boldsymbol{k}) \, p(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{\theta}, T, \boldsymbol{k}) \, d\boldsymbol{\theta}. \quad (7)$$
The proof of Theorem 1 is in Appendix A. We call $\delta^*(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1})$ the Bayes decision.

2.4. Previous Study

In this subsection, we introduce how to calculate the Bayes decision following Reference [13]. Because the proof in Reference [13] lacks some explanation, we give a more rigorous discussion here; this is one of the contributions of this paper.
As we have explained in Section 1, Reference [13] introduced the concept of a “meta-tree” to represent a restricted model tree candidate set. For $T$ and $\boldsymbol{k}$, a meta-tree, denoted by $M_{T,\boldsymbol{k}}$, represents the model tree candidate set composed of the model trees $(T', \boldsymbol{k}')$ such that $T'$ is a sub-tree of $T$ and $\boldsymbol{k}' = \boldsymbol{k}$. An example of $M_{T,\boldsymbol{k}}$ is shown in Figure 3, where $T$ is a complete tree with $D_\mathrm{max} = 2$ and $\boldsymbol{k} = (3, 4, 2)$. In this example, the meta-tree $M_{T,\boldsymbol{k}}$ represents a model tree candidate set composed of four model trees. A meta-tree is also used for the prediction of $y_{n+1}$ in the algorithm proposed by Reference [13]. In short, a meta-tree has two roles: first, it represents a model tree candidate set, and, second, it is used in the prediction algorithm.
Let $\mathcal{T}_{M_{T,\boldsymbol{k}}}$ denote the set of trees $T$ represented by a meta-tree $M_{T,\boldsymbol{k}}$, and let $\mathcal{M} = \mathcal{T}_{M_{T,\boldsymbol{k}}} \times \{\boldsymbol{k}\}$. Under the assumption that the true model tree is included in $\mathcal{M}$, the Bayes decision in Theorem 1 is expressed as
$$\delta^*(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}) = \mathop{\arg\max}_{y_{n+1}} \sum_{T \in \mathcal{T}_{M_{T,\boldsymbol{k}}}} p(T \mid \boldsymbol{x}^n, y^n, \boldsymbol{k}) \int_{\boldsymbol{\Theta}} p(\boldsymbol{\theta} \mid \boldsymbol{x}^n, y^n, T, \boldsymbol{k}) \, p(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{\theta}, T, \boldsymbol{k}) \, d\boldsymbol{\theta}. \quad (8)$$
Now, (8) can be decomposed into two parts:
$$1.\quad q(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, T, \boldsymbol{k}) := \int_{\boldsymbol{\Theta}} p(\boldsymbol{\theta} \mid \boldsymbol{x}^n, y^n, T, \boldsymbol{k}) \, p(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{\theta}, T, \boldsymbol{k}) \, d\boldsymbol{\theta}. \quad (9)$$
$$2.\quad \tilde{q}(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, \boldsymbol{k}) := \sum_{T \in \mathcal{T}_{M_{T,\boldsymbol{k}}}} p(T \mid \boldsymbol{x}^n, y^n, \boldsymbol{k}) \, q(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, T, \boldsymbol{k}). \quad (10)$$
Reference [13] exactly calculates (9) and (10) under some assumptions. We describe them in Section 2.4.1 and Section 2.4.2, respectively.

2.4.1. Expectation over the Parameters

In this subsection, we explain how to calculate (9). To calculate (9) analytically, a conjugate prior of $\boldsymbol{\theta}_s$ given $T$ and $\boldsymbol{k}$ is assumed.
Assumption 1.
We assume $p(\boldsymbol{\theta}_s \mid T, \boldsymbol{k}) = \mathrm{Dirichlet}(\boldsymbol{\theta}_s \mid \boldsymbol{\alpha}_s)$ for $s \in \mathcal{L}(T)$, where $\boldsymbol{\alpha}_s$ denotes the hyper-parameter of the Dirichlet distribution.
Under this assumption, $q(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, T, \boldsymbol{k})$ has a closed-form expression, and (9) can be calculated exactly. In addition, $q(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, T, \boldsymbol{k})$ has an important property, which we describe in Lemma 1 below as a preparation for Section 2.4.2.
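For concreteness, the closed form implied by Assumption 1 is the standard posterior predictive of the Dirichlet–categorical model; with the count notation $n_{y \mid s} := |\{i \le n : s(\boldsymbol{x}_i) = s, \, y_i = y\}|$ (our own notation, not the paper's), it reads
$$q(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, T, \boldsymbol{k}) = \frac{\alpha_{y_{n+1} \mid s(\boldsymbol{x}_{n+1})} + n_{y_{n+1} \mid s(\boldsymbol{x}_{n+1})}}{\sum_{y \in \mathcal{Y}} \left( \alpha_{y \mid s(\boldsymbol{x}_{n+1})} + n_{y \mid s(\boldsymbol{x}_{n+1})} \right)},$$
where $\alpha_{y \mid s}$ denotes the $y$-th component of $\boldsymbol{\alpha}_s$.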
Lemma 1.
For any $T, T' \in \mathcal{T}$, $\boldsymbol{x}^n \in \mathcal{X}^{Kn}$, $y^n \in \mathcal{Y}^n$, and $\boldsymbol{x}_{n+1} \in \mathcal{X}^K$, if $s(\boldsymbol{x}_{n+1}) \in \mathcal{L}(T) \cap \mathcal{L}(T')$, then
$$q(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, T, \boldsymbol{k}) = q(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, T', \boldsymbol{k}). \quad (11)$$
The proof of Lemma 1 is in Appendix B. From this lemma, both sides of (11) are denoted by $q_{s(\boldsymbol{x}_{n+1})}(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, \boldsymbol{k})$, because they depend not on $T$ or $T'$ but only on $s(\boldsymbol{x}_{n+1})$.

2.4.2. Summation over All Model Trees Represented by a Meta-Tree

In this subsection, we explain how to calculate (10). The exact calculation of (10) requires a summation over all model trees represented by a single meta-tree. However, the computational complexity of this calculation is huge because the number of model trees increases exponentially with the depth of the meta-tree. Reference [13] solved this problem. The advantages of the method proposed in Reference [13] are as follows:
  • This method calculates (10) exactly.
  • The computational complexity of calculating (10) in learning the parameters is $O(nD_\mathrm{max})$, which is the same as that of building a single function tree.
To execute this method, an assumption about the prior probability distribution of $T$—Assumption A2—is required. It should be noted that the prior $p(T)$ is assumed independently of the data, in contrast to the data-driven construction of function trees from $\boldsymbol{x}^n$ and $y^n$.
Assumption A2.
Let $g_s \in [0, 1]$ denote a hyper-parameter assigned to each node $s \in \mathcal{S}$. Then, we assume the prior of $T \in \mathcal{T}_{M_{T,\boldsymbol{k}}}$ as
$$p(T) = \prod_{s \in \mathcal{I}(T)} g_s \prod_{s \in \mathcal{L}(T)} (1 - g_s), \quad (12)$$
where $g_s = 0$ for any leaf node $s$ of the meta-tree $M_{T,\boldsymbol{k}}$.
Remark 1.
It should be noted that (12) is a probability distribution, i.e., $\sum_{T \in \mathcal{T}_{M_{T,\boldsymbol{k}}}} p(T) = 1$. See Lemma A1 in Appendix C.
As we have described in Section 1, a meta-tree is used for the prediction of $y_{n+1}$ in the algorithm proposed by Reference [13]. By using a meta-tree $M_{T,\boldsymbol{k}}$, the recursive function used to calculate (10) is defined as follows.
Definition 3.
We define the following recursive function $\tilde{q}_s(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, M_{T,\boldsymbol{k}})$ for any node $s$ on the path in $M_{T,\boldsymbol{k}}$ corresponding to $\boldsymbol{x}_{n+1}$:
$$\tilde{q}_s(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, M_{T,\boldsymbol{k}}) := \begin{cases} q_s(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, \boldsymbol{k}), & \text{if } s \text{ is a leaf node of } M_{T,\boldsymbol{k}}, \\ (1 - g_{s \mid \boldsymbol{x}^n, y^n}) \, q_s(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, \boldsymbol{k}) + g_{s \mid \boldsymbol{x}^n, y^n} \, \tilde{q}_{s_\mathrm{child}}(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, M_{T,\boldsymbol{k}}), & \text{otherwise}, \end{cases} \quad (13)$$
where $s_\mathrm{child}$ is the child node of $s$ on the path corresponding to $\boldsymbol{x}_{n+1}$ in $M_{T,\boldsymbol{k}}$, and $g_{s \mid \boldsymbol{x}^i, y^i}$ is also updated recursively as
$$g_{s \mid \boldsymbol{x}^i, y^i} := \begin{cases} g_s, & \text{if } i = 0, \\ \dfrac{g_{s \mid \boldsymbol{x}^{i-1}, y^{i-1}} \, \tilde{q}_{s_\mathrm{child}}(y_i \mid \boldsymbol{x}_i, \boldsymbol{x}^{i-1}, y^{i-1}, M_{T,\boldsymbol{k}})}{\tilde{q}_s(y_i \mid \boldsymbol{x}_i, \boldsymbol{x}^{i-1}, y^{i-1}, M_{T,\boldsymbol{k}})}, & \text{otherwise}. \end{cases} \quad (14)$$
Now, (10) can be calculated as shown in the following theorem.
Theorem 2.
$\tilde{q}(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, \boldsymbol{k})$ can be calculated by
$$\tilde{q}(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, \boldsymbol{k}) = \tilde{q}_{s_\lambda}(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, M_{T,\boldsymbol{k}}), \quad (15)$$
where $s_\lambda$ is the root node of $M_{T,\boldsymbol{k}}$. In addition, $p(T \mid \boldsymbol{x}^{n+1}, y^{n+1}, \boldsymbol{k})$ can be calculated as
$$p(T \mid \boldsymbol{x}^{n+1}, y^{n+1}, \boldsymbol{k}) = \prod_{s \in \mathcal{I}(T)} g_{s \mid \boldsymbol{x}^{n+1}, y^{n+1}} \prod_{s \in \mathcal{L}(T)} (1 - g_{s \mid \boldsymbol{x}^{n+1}, y^{n+1}}). \quad (16)$$
The proof of Theorem 2 is in Appendix C. Surprisingly, by using this theorem, the computational complexity of calculating (10) is only $O(nD_\mathrm{max})$.
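To make the recursion concrete, the following Python sketch implements (13) and (14) on a meta-tree stored as nested nodes, with each node holding $g_s$ and Dirichlet statistics. The class and attribute names (MetaNode, q_tilde, update, and the binary defaults) are our own illustrative choices under the assumptions above, not code or notation from the paper.

```python
import numpy as np

class MetaNode:
    """A node of a meta-tree M_{T,k}; every node keeps g_s and Dirichlet statistics."""
    def __init__(self, k_s=None, children=None, n_y=2, alpha=0.5, g=0.5):
        self.k_s = k_s                             # feature index assigned to this node
        self.children = children                   # list of |X| MetaNode, or None at a meta-tree leaf
        self.g = 0.0 if children is None else g    # g_s = 0 at meta-tree leaves (Assumption A2)
        self.counts = np.full(n_y, alpha)          # alpha_{y|s} plus the counts seen at this node

    def q_s(self, y):
        """Leaf-wise predictive q_s(y | ...) under the Dirichlet prior (Section 2.4.1)."""
        return self.counts[y] / self.counts.sum()

    def q_tilde(self, y, x):
        """Recursive function (13), evaluated along the path determined by x."""
        if self.children is None:
            return self.q_s(y)
        child = self.children[x[self.k_s]]
        return (1.0 - self.g) * self.q_s(y) + self.g * child.q_tilde(y, x)

    def update(self, y, x):
        """Update g_{s|...} with (14), then the Dirichlet counts, along the path of x."""
        if self.children is not None:
            child = self.children[x[self.k_s]]
            self.g *= child.q_tilde(y, x) / self.q_tilde(y, x)   # (14), using the old g and counts
            child.update(y, x)
        self.counts[y] += 1.0
```

Feeding the training pairs one by one through `q_tilde` and `update` and finally evaluating `q_tilde` at the root node corresponds to (15) in Theorem 2.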

3. Proposed Method

In this section, we introduce our proposed method. Here, we reconsider the general setup where $\mathcal{M} = \mathcal{T} \times \mathcal{K}$. Recall that the Bayes decision (7) is
$$\delta^*(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}) = \mathop{\arg\max}_{y_{n+1}} \sum_{\boldsymbol{k} \in \mathcal{K}} p(\boldsymbol{k} \mid \boldsymbol{x}^n, y^n) \, \tilde{q}(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, \boldsymbol{k}), \quad (17)$$
where $\tilde{q}(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, \boldsymbol{k})$ is defined as in (10). In (17), $\tilde{q}(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, \boldsymbol{k})$ can be calculated in the same way as in Section 2.4. However, to evaluate (17), we need to calculate $\tilde{q}(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, \boldsymbol{k})$ for every $\boldsymbol{k}$ that satisfies $p(\boldsymbol{k}) > 0$. Because this computational cost increases exponentially with $D_\mathrm{max}$, it is hard to calculate all of them in general. To reduce the computational complexity of (17), we restrict the model tree candidate set to the set represented by multiple meta-trees. The advantages of building multiple meta-trees are as follows:
  • As we have explained in Section 2.4.2, the number of model trees represented by each meta-tree increases exponentially with the depth of the meta-tree. In addition, the number of model trees represented by multiple meta-trees is a constant factor larger than the number contained in a single meta-tree.
  • If the true model tree is a sub-tree of any of the meta-trees, the statistically optimal prediction—the optimal prediction based on the Bayes decision theory—can be calculated.
  • If we build $B$ meta-trees, the computational cost of learning their parameters is $O(nBD_\mathrm{max})$, which is the same as that of building $B$ function trees.
Now, we introduce the next assumption.
Assumption A3.
For $B$ meta-trees $M_{T_1,\boldsymbol{k}_1}, \ldots, M_{T_B,\boldsymbol{k}_B}$, we define $\mathcal{K}' := \{\boldsymbol{k}_1, \ldots, \boldsymbol{k}_B\}$. Then, we assume a uniform distribution on $\mathcal{K}'$ as the prior probability distribution of $\boldsymbol{k} \in \mathcal{K}'$.
An example of $\mathcal{K}'$ is shown in Figure 4.
Next, we prove the following lemma about computing the posterior probability distribution of $\boldsymbol{k}$. In fact, the recursive function defined in Definition 3 is also useful for calculating the posterior probability distribution on $\mathcal{K}'$ sequentially. The proof of Lemma 2 is in Appendix D.
Lemma 2.
By decomposing $\boldsymbol{x}^n$ and $y^n$ into the individual data points $(\boldsymbol{x}_i, y_i)$ ($i = 1, \ldots, n$) and performing the recursive calculation of $\tilde{q}_s(y_i \mid \boldsymbol{x}_i, \boldsymbol{x}^{i-1}, y^{i-1}, M_{T,\boldsymbol{k}})$ as in (13), the posterior probability distribution of $\boldsymbol{k} \in \mathcal{K}'$ is expressed as
$$p(\boldsymbol{k} \mid \boldsymbol{x}^n, y^n) \propto \prod_{i=1}^{n} \tilde{q}_{s_\lambda}(y_i \mid \boldsymbol{x}_i, \boldsymbol{x}^{i-1}, y^{i-1}, M_{T,\boldsymbol{k}}). \quad (18)$$
From Theorem 2 and Lemma 2, we immediately obtain the next theorem.
Theorem 3.
Let us consider the model tree candidate set represented by $B$ meta-trees, i.e.,
$$\bigcup_{b=1}^{B} \left( \mathcal{T}_{M_{T_b,\boldsymbol{k}_b}} \times \{\boldsymbol{k}_b\} \right). \quad (19)$$
Then, if the true model tree is included in (19), the Bayes decision $\delta^*(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1})$ can be calculated as
$$\delta^*(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}) = \mathop{\arg\max}_{y_{n+1}} \sum_{\boldsymbol{k} \in \mathcal{K}'} p(\boldsymbol{k} \mid \boldsymbol{x}^n, y^n) \, \tilde{q}(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, \boldsymbol{k}) \quad (20)$$
$$= \mathop{\arg\max}_{y_{n+1}} \sum_{\boldsymbol{k} \in \mathcal{K}'} \tilde{q}_{s_\lambda}(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, M_{T,\boldsymbol{k}}) \prod_{i=1}^{n} \tilde{q}_{s_\lambda}(y_i \mid \boldsymbol{x}_i, \boldsymbol{x}^{i-1}, y^{i-1}, M_{T,\boldsymbol{k}}). \quad (21)$$
If the true model tree is not included in (19), the right-hand side of (20) is a sub-optimal prediction of the Bayes decision.
By using Theorem 3, we can efficiently calculate the optimal (or possibly sub-optimal) prediction of the Bayes decision. We name this algorithm Meta-Tree Random Forest (MTRF). We summarize MTRF in Algorithm 1.
Algorithm 1 MTRF
  • Input: $\boldsymbol{x}_{n+1}$, $\boldsymbol{x}^n$, $y^n$, $B$, $g_s$, $\boldsymbol{\alpha}_s$
  • Output: $\delta^*(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1})$
  • Initialize the parameters $g_s$.
  • Construct $B$ meta-trees $M_{T_1,\boldsymbol{k}_1}, \ldots, M_{T_B,\boldsymbol{k}_B}$ and $\mathcal{K}' := \{\boldsymbol{k}_1, \ldots, \boldsymbol{k}_B\}$.
  • for all $\boldsymbol{k} \in \mathcal{K}'$ do
  •  for $i = 1, \ldots, n, n+1$ do
  •   Calculate $\tilde{q}_{s_\lambda}(y_i \mid \boldsymbol{x}_i, \boldsymbol{x}^{i-1}, y^{i-1}, M_{T,\boldsymbol{k}})$ with (13).
  •   if $i \neq n+1$ then
  •    Update the parameters $g_{s \mid \boldsymbol{x}^i, y^i}$ with (14).
  •   end if
  •  end for
  • end for
  • Calculate $\delta^*(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1})$ with (20).
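As a complement to Algorithm 1, the following sketch combines $B$ meta-trees, reusing the hypothetical MetaNode class from the sketch after Theorem 2. Under Assumption A3 the uniform prior over $\mathcal{K}'$ cancels in the arg max, so each meta-tree is weighted by its accumulated predictive likelihood (Lemma 2). This is an illustrative implementation of (20), not the authors' code.

```python
import numpy as np

def mtrf_predict(meta_trees, x_train, y_train, x_new, n_y=2):
    """Approximate Bayes decision (20): weight B meta-trees by prod_i q~_{s_lambda}(y_i | ...)."""
    log_weights = np.zeros(len(meta_trees))
    for b, root in enumerate(meta_trees):
        for x_i, y_i in zip(x_train, y_train):
            log_weights[b] += np.log(root.q_tilde(y_i, x_i))   # Lemma 2, accumulated in log space
            root.update(y_i, x_i)                              # renew g_{s|...} and counts with (14)
    weights = np.exp(log_weights - log_weights.max())          # normalization cancels in the arg max
    scores = np.zeros(n_y)
    for w, root in zip(weights, meta_trees):
        scores += w * np.array([root.q_tilde(y, x_new) for y in range(n_y)])
    return int(np.argmax(scores))                               # delta*(x^n, y^n, x_{n+1})
```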
Remark 2.
One way to construct multiple meta-trees is to use the ideas of Random Forest. Random Forest builds function trees from the training data $\boldsymbol{x}^n$ and $y^n$ by using an impurity measure (e.g., the entropy or the Gini coefficient). By regarding the trees $(T_1, \boldsymbol{k}_1), \ldots, (T_B, \boldsymbol{k}_B)$ built by Random Forest as meta-trees $M_{T_1,\boldsymbol{k}_1}, \ldots, M_{T_B,\boldsymbol{k}_B}$, we can construct $B$ meta-trees. In our experiments in Section 5, we adopt this method. The computational complexity of MTRF in the learning process is $O(nBD_\mathrm{max})$, which is the same as that of Random Forest. This is computable unless $n$, $B$, or $D_\mathrm{max}$ is extremely large. In addition, we can parallelize the procedure over the meta-trees.
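For instance, a hypothetical way to carry out Remark 2 with scikit-learn is sketched below: a Random Forest is fitted to binary features, and the split structure of each fitted tree is reinterpreted as a meta-tree for the MetaNode class sketched earlier (for binary 0/1 features, the left/right children of a scikit-learn split at threshold 0.5 correspond to $x = 0$ and $x = 1$). The conversion function, data placeholders, and parameter values are our own illustrative choices, not the authors' experimental code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sklearn_tree_to_meta(tree, node=0, n_y=2, alpha=0.5, g=0.5):
    """Reuse only the split structure (which feature at which node) of a fitted
    scikit-learn tree as a meta-tree; g_s and the Dirichlet counts are then
    learned by the meta-tree recursion, as in Remark 2."""
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1:                           # scikit-learn marks leaves with child index -1
        return MetaNode(n_y=n_y, alpha=alpha)
    children = [sklearn_tree_to_meta(tree, left, n_y, alpha, g),    # x_{k_s} = 0 branch
                sklearn_tree_to_meta(tree, right, n_y, alpha, g)]   # x_{k_s} = 1 branch
    return MetaNode(k_s=tree.feature[node], children=children, n_y=n_y, alpha=alpha, g=g)

# Toy usage with random binary data (placeholders for x^n and y^n).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 8))
Y_train = rng.integers(0, 2, size=200)
rf = RandomForestClassifier(n_estimators=5, max_depth=3, criterion="entropy").fit(X_train, Y_train)
meta_trees = [sklearn_tree_to_meta(est.tree_) for est in rf.estimators_]
y_hat = mtrf_predict(meta_trees, X_train, Y_train, X_train[0])
```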

4. CART and Random Forest Revisited and Comparison with MTRF

4.1. CART and Random Forest Revisited

Although CART [1] and Random Forest [2] do not assume a model tree and instead use trees to express a function $y = f(\boldsymbol{x}; T)$ (i.e., function trees), they can be regarded as methods for constructing a model tree candidate set.
First, we consider algorithms that use a single function tree, such as CART [1]. Such algorithms can be regarded as selecting a single model tree and predicting $y_{n+1}$ by using the single function tree that corresponds to that model tree. For example, the prediction of CART, denoted by $\delta_\mathrm{CART}(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1})$, is given by
$$\delta_\mathrm{CART}(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}) = \mathop{\arg\max}_{y_{n+1}} p(y_{n+1} \mid \boldsymbol{x}_{n+1}, \hat{\boldsymbol{\theta}}(\boldsymbol{x}^n, y^n), \hat{T}, \hat{\boldsymbol{k}}), \quad (22)$$
where $(\hat{T}, \hat{\boldsymbol{k}})$ and $\hat{\boldsymbol{\theta}}(\boldsymbol{x}^n, y^n)$ denote the estimated model tree and tree parameter, respectively.
Second, as another example, let us consider Random Forest [2]. Random Forest can be regarded as selecting multiple model trees and predicting $y_{n+1}$ by weighting the multiple function trees that correspond to those model trees. It should be noted that the number of function trees is equal to the number of model trees in Random Forest. The prediction of Random Forest, denoted by $\delta_\mathrm{RF}(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1})$, is given by
$$\delta_\mathrm{RF}(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}) = \mathop{\arg\max}_{y_{n+1}} \sum_{(T,\boldsymbol{k}) \in \hat{\mathcal{M}}} \frac{1}{|\hat{\mathcal{M}}|} \, p(y_{n+1} \mid \boldsymbol{x}_{n+1}, \hat{\boldsymbol{\theta}}(\boldsymbol{x}^n, y^n), T, \boldsymbol{k}), \quad (23)$$
where $\hat{\mathcal{M}} = \{(\hat{T}_1, \hat{\boldsymbol{k}}_1), \ldots, (\hat{T}_B, \hat{\boldsymbol{k}}_B)\} \subseteq \mathcal{M}$ denotes the set of model trees constructed by Random Forest.

4.2. Comparison of Random Forest with MTRF

Let us consider the situation where the trees that represent the function trees constructed by Random Forest are used as the meta-trees of MTRF. Then, the model tree candidate set is far larger in MTRF than in Random Forest.
Random Forest can be regarded as approximating (9), (10), and (17) as in (23). Further, the approximation of the weight $p(T, \boldsymbol{k} \mid \boldsymbol{x}^n, y^n) \approx 1/|\hat{\mathcal{M}}|$ is not accurate, and the optimality of each function tree's weight for predicting $y_{n+1}$ has not usually been discussed. In contrast, MTRF calculates (9) and (10) exactly and approximates (17) as in (20). Moreover, the optimal weights of $T$ and $\boldsymbol{k}$ are given by (16) and (18), respectively.

5. Experiments

Most machine learning algorithms require extensive numerical experiments to confirm an improvement that is independent of a specific dataset. In our paper, however, such an improvement is explained theoretically in Sections 3 and 4. If we redefine all the function trees constructed by any existing method as meta-trees, we necessarily improve that method in the sense of the Bayes risk function, since our model tree candidate set contains all the model trees that correspond to the original function trees. This improvement is independent of both the dataset and the original method. Therefore, we perform our experiments only on three types of datasets and compare with Random Forest; similar results are expected in other settings.

5.1. Experiment 1

Although the theoretical superiority of MTRF to Random Forest has been explained in Sections 3 and 4, we performed Experiment 1 to confirm their performance on synthetic data. The procedure of Experiment 1 is as follows:
  • We initialize the hyper-parameter $\boldsymbol{\alpha}_s = (1/|\mathcal{X}|, \ldots, 1/|\mathcal{X}|)$, which is assigned to each node $s$.
  • To make the true model trees diverse, we determine the true feature assign vector $\boldsymbol{k}$ and the true model trees as follows:
    • In the true model trees (A), we assign $x_j$ to the nodes whose depth is $j$.
      true model tree (A-1): the complete 2-ary model tree with depth 4.
      true model tree (A-2): the regular 2-ary model tree with depth 4, which has the structure in which only the left nodes have child nodes and the right nodes do not.
    • In the true model tree (B), we assign a different explanatory variable to each node.
      true model tree (B-1): the complete 2-ary model tree with depth 4.
  • We determine the parameters $\boldsymbol{\theta}_s$ according to the Dirichlet distribution.
  • We generate $n$ training data of dimension $K$ from the true model tree and 100 test data in the same way (a sketch of these two steps appears after this list). From (7), the Bayes decision takes the same value for any distribution $p(\boldsymbol{x}^n)$. Therefore, in our experiments, we set the probability distribution of $\boldsymbol{x}$ as follows: let $U(\{0,1\})$ denote the uniform distribution on $\{0, 1\}$; then,
    $$x_{ij} \overset{\mathrm{i.i.d.}}{\sim} U(\{0,1\}) \quad (i = 1, \ldots, n, \; j = 1, \ldots, K).$$
  • We perform Random Forest, which builds $B$ function trees $(T_1, \boldsymbol{k}_1), \ldots, (T_B, \boldsymbol{k}_B)$. We set its impurity to the entropy and its maximum depth to $D_\mathrm{max}$. Afterward, we build the $B$ meta-trees $M_{T_1,\boldsymbol{k}_1}, \ldots, M_{T_B,\boldsymbol{k}_B}$ (their maximum depth is also $D_\mathrm{max}$). Then, we calculate the average prediction error rates of Random Forest and MTRF, where we set the hyper-parameter $g_s = 1/2$ in MTRF.
  • We repeat Steps 3 to 5 $Q$ times.
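As referenced in Step 4, the following sketch illustrates Steps 3 and 4 for an (A-1)-style true model tree, reusing the hypothetical Node class and sample_y function from the sketch in Section 2.1. The depth-2 tree, the hyper-parameter values, and all names are illustrative assumptions, not the exact experimental code.

```python
import numpy as np

def complete_model_tree(depth, level, n_y, alpha, rng):
    """Complete binary model tree: depth-j nodes split on x_j (model tree (A)),
    and each leaf receives theta_s drawn from a symmetric Dirichlet(alpha)."""
    if depth == 0:
        return Node(theta_s=rng.dirichlet(alpha * np.ones(n_y)))
    return Node(k_s=level,
                children=[complete_model_tree(depth - 1, level + 1, n_y, alpha, rng)
                          for _ in range(2)])

rng = np.random.default_rng(0)
true_tree = complete_model_tree(depth=2, level=0, n_y=2, alpha=0.5, rng=rng)
K, n = 500, 100
X_train = rng.integers(0, 2, size=(n, K))                      # x_ij ~ U({0, 1}) i.i.d.
Y_train = np.array([sample_y(true_tree, x, rng) for x in X_train])
```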
Figure 5 shows an example of these model trees (note that, for simplicity, we illustrate the model trees with depth 2).
We set $|\mathcal{X}| = |\mathcal{Y}| = 2$, $n = 100, 200, \ldots, 1000$, $K = 500$, $B = 10$, $D_\mathrm{max} = 2, 4, 6$, and $Q = 5000$. The results are shown in Figure 6.
From Figure 6, when we compare the performance of Random Forest with that of MTRF, we can confirm that MTRF outperforms Random Forest in all conditions. These results are theoretically reasonable because the trees that represent the function trees constructed by Random Forest are meta-trees of MTRF.

5.2. Experiment 2

In Experiment 2, we used real data (the nursery school data and the mushroom data) from the UCI repository (University of California, Irvine Machine Learning Repository). The procedure of Experiment 2 is mostly the same as that of Experiment 1, but the way of sampling data is different: instead of Steps 2, 3, and 4 of Experiment 1, we randomly sampled $n + t$ data points from the whole dataset and divided them into $n$ training data and $t$ test data. We repeated this random sampling 2500 times.
The details of the nursery school data are as follows. The explanatory variables are eight discrete attributes of a family whose child applies to a nursery school (e.g., finances, health, and family structure). Each attribute takes up to five values. The targeting variable is whether the nursery school should welcome the child or not. We want to predict whether a new child should be welcomed, given the child's attributes and the training data. We set $|\mathcal{X}| = 5$, $|\mathcal{Y}| = 2$, $n = 100, 200, \ldots, 1000$, $t = 1000$, $K = 8$, $B = 5$, and $D_\mathrm{max} = 2, 3, 4$. The result is shown in Figure 7a.
The details of the mushroom data are as follows. The explanatory variables are 22 discrete attributes of a mushroom (e.g., smell, shape, surface, and color). Each attribute takes up to 10 values. The targeting variable is whether the mushroom is edible, inedible, or unknown. We want to predict whether a new mushroom is edible, given its attributes and the training data. We set $|\mathcal{X}| = 22$, $|\mathcal{Y}| = 3$, $n = 100, 200, \ldots, 1000$, $t = 1000$, $K = 22$, $B = 5$, and $D_\mathrm{max} = 2, 3, 4$. The result is shown in Figure 7b.
From Figure 7, we can confirm MTRF outperforms Random Forest in all conditions.

6. Conclusions and Future Work

In this paper, we have considered a model tree and derived the Bayes decision. We have addressed the computational complexity problem of the Bayes decision by introducing the notion of a meta-tree, and we have proposed MTRF (Meta-Tree Random Forest), which uses multiple meta-trees. As we have shown in Theorem 3, if the true model tree is included in the model tree candidate set represented by the multiple meta-trees, MTRF calculates the Bayes decision with an efficient computational cost. Even if the true model tree is not included in this model tree candidate set, MTRF gives an approximation of the Bayes decision. The advantages of MTRF have been analyzed theoretically. In Section 4, we have explained that, if we redefine all the function trees constructed by Random Forest as meta-trees, we necessarily improve it in the sense of the Bayes risk function, since our model tree candidate set contains all the model trees that correspond to the original function trees. We have performed the experiments in Section 5 to support the theoretical analysis.
As we have described in Section 1, it is desirable to construct a model tree candidate set that includes as many model trees as possible within the allowed computational cost, because if the true model tree is included in the model tree candidate set, the Bayes decision can be calculated. Hence, the main problem of MTRF is how to construct the multiple meta-trees. In Section 5, we have constructed them by using the idea of Random Forest; however, other construction methods can be considered. One possible future direction is to use other ideas, such as Boosting, to construct the meta-trees.

Author Contributions

Conceptualization, N.D., S.S., Y.N. and T.M.; Methodology, N.D., S.S., Y.N. and T.M.; Software, N.D. and Y.N.; Validation, N.D., S.S., Y.N. and T.M.; Formal Analysis, N.D., S.S., Y.N. and T.M.; Investigation, N.D., S.S., Y.N. and T.M.; Resources, N.D.; Data Curation, N.D.; Writing—Original Draft Preparation, N.D., S.S. and Y.N.; Writing—Review & Editing, N.D., S.S., Y.N. and T.M.; Visualization, N.D., S.S. and Y.N.; Supervision, T.M.; Project Administration, T.M.; Funding Acquisition, S.S. and T.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by JSPS KAKENHI Grant Numbers JP17K06446, JP19K04914 and JP19K14989.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://archive.ics.uci.edu/ml/datasets/Nursery and https://archive.ics.uci.edu/ml/datasets/Mushroom (accessed on 8 June 2021).

Acknowledgments

The authors would like to thank all the members of Matsushima Lab. for their comments and discussions.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviation is used in this manuscript:
MTRF  Meta-Tree Random Forest

Appendix A. Proof of Theorem 1

Proof of Theorem 1. 
From (4)–(6) and Bayes' theorem, we have
$$\mathrm{BR}(\delta(\cdot,\cdot,\cdot)) = \sum_{\boldsymbol{x}^n \in \mathcal{X}^{Kn}} \sum_{y^n \in \mathcal{Y}^n} \sum_{\boldsymbol{x}_{n+1} \in \mathcal{X}^K} p(\boldsymbol{x}_{n+1}) \left[ \sum_{y_{n+1} \in \mathcal{Y}} p(y_{n+1} \mid \boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}) \, l(y_{n+1}, \delta(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1})) \right] p(\boldsymbol{x}^n, y^n), \quad (A1)$$
where
$$p(y_{n+1} \mid \boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}) = \sum_{(T,\boldsymbol{k}) \in \mathcal{M}} \int_{\boldsymbol{\Theta}} p(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{\theta}, T, \boldsymbol{k}) \, p(\boldsymbol{\theta}, T, \boldsymbol{k} \mid \boldsymbol{x}^n, y^n) \, d\boldsymbol{\theta}. \quad (A2)$$
To minimize the Bayes risk, we need to find the $\delta$ that minimizes the bracketed term in (A1). Let $a = \delta(\boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1})$ and let $F(a)$ denote the bracketed term in (A1). Then,
$$F(a) = \sum_{y_{n+1} \in \mathcal{Y}} l(y_{n+1}, a) \, p(y_{n+1} \mid \boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}) = \sum_{y_{n+1} \in \mathcal{Y} \setminus \{a\}} p(y_{n+1} \mid \boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}). \quad (A3)$$
Therefore, the decision that minimizes $F(a)$ is
$$a = \mathop{\arg\max}_{y_{n+1}} p(y_{n+1} \mid \boldsymbol{x}^n, y^n, \boldsymbol{x}_{n+1}). \quad (A4)$$
We can prove (7) by substituting (A2) into (A4). □

Appendix B. Proof of Lemma 1

Proof of Lemma 1. 
Let $\boldsymbol{\Theta}_{s(\boldsymbol{x}_{n+1})}$ denote the space of the parameter assigned to $s(\boldsymbol{x}_{n+1})$. In addition, we define
$$\boldsymbol{\Theta}_{s \neq s(\boldsymbol{x}_{n+1})} := \boldsymbol{\Theta} \setminus \boldsymbol{\Theta}_{s(\boldsymbol{x}_{n+1})}. \quad (A5)$$
Then, we can rewrite the left-hand side of (11) as
$$q(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, T, \boldsymbol{k}) = \int_{\boldsymbol{\Theta}} p(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{\theta}, T, \boldsymbol{k}) \, p(\boldsymbol{\theta} \mid \boldsymbol{x}^n, y^n, T, \boldsymbol{k}) \, d\boldsymbol{\theta} = \int_{\boldsymbol{\Theta}_{s(\boldsymbol{x}_{n+1})}} \theta_{y_{n+1} \mid s(\boldsymbol{x}_{n+1})} \left( \int_{\boldsymbol{\Theta}_{s \neq s(\boldsymbol{x}_{n+1})}} p(\boldsymbol{\theta} \mid \boldsymbol{x}^n, y^n, T, \boldsymbol{k}) \, d\boldsymbol{\theta}_{s \neq s(\boldsymbol{x}_{n+1})} \right) d\boldsymbol{\theta}_{s(\boldsymbol{x}_{n+1})}, \quad (A6)$$
where the last equality follows from (1).
Looking into the inner integral in (A6), we have
$$\int_{\boldsymbol{\Theta}_{s \neq s(\boldsymbol{x}_{n+1})}} p(\boldsymbol{\theta} \mid \boldsymbol{x}^n, y^n, T, \boldsymbol{k}) \, d\boldsymbol{\theta}_{s \neq s(\boldsymbol{x}_{n+1})} \overset{(a)}{\propto} \int_{\boldsymbol{\Theta}_{s \neq s(\boldsymbol{x}_{n+1})}} p(\boldsymbol{\theta} \mid T, \boldsymbol{k}) \, p(y^n \mid \boldsymbol{x}^n, \boldsymbol{\theta}, T, \boldsymbol{k}) \, d\boldsymbol{\theta}_{s \neq s(\boldsymbol{x}_{n+1})}$$
$$= \int_{\boldsymbol{\Theta}_{s \neq s(\boldsymbol{x}_{n+1})}} \prod_{s \in \mathcal{S}} p(\boldsymbol{\theta}_s \mid \boldsymbol{\alpha}_s) \prod_{i=1}^{n} \theta_{y_i \mid s(\boldsymbol{x}_i)} \, d\boldsymbol{\theta}_{s \neq s(\boldsymbol{x}_{n+1})}$$
$$\propto p(\boldsymbol{\theta}_{s(\boldsymbol{x}_{n+1})} \mid \boldsymbol{\alpha}_{s(\boldsymbol{x}_{n+1})}) \prod_{i \in \{j \le n \,:\, s(\boldsymbol{x}_j) = s(\boldsymbol{x}_{n+1})\}} \theta_{y_i \mid s(\boldsymbol{x}_i)},$$
where (a) follows from Bayes' theorem.
Because the right-hand side of (11) can be transformed in the same way, we obtain (11). □

Appendix C. Proof of Theorem 2

Proof of Theorem 2. 
We prove Theorem 2 with mathematical induction according to the following three steps:
  • Step 1 : We prove (15) for n = 0 with Lemmas A1 and A2.
  • Step 2 : We prove (16) for n = 0 by using the result of Step 1.
  • Step 3 : If we assume (15) and (16) for n = N , we can prove (15) and (16) for n = N + 1 by repeating Steps 1 and 2.
In the following, we give a proof of each step.
Step 1: First, we prove (15) for $n = 0$. Let $\mathcal{S}(\boldsymbol{x}_i)$ ($i = 1, \ldots, n+1$) denote the set of all nodes on the path of $\boldsymbol{x}_i$ in the meta-tree $M_{T,\boldsymbol{k}}$. Let $\mathcal{T}_s := \{T \in \mathcal{T}_{M_{T,\boldsymbol{k}}} \mid s \in \mathcal{L}(T)\}$. Then, we transform $\tilde{q}(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k})$ as follows:
$$\tilde{q}(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k}) \overset{(a)}{=} \sum_{T \in \mathcal{T}_{M_{T,\boldsymbol{k}}}} p(T) \, q(y_1 \mid \boldsymbol{x}_1, T, \boldsymbol{k}) \quad (A10)$$
$$= \sum_{s \in \mathcal{S}(\boldsymbol{x}_1)} \sum_{T \in \mathcal{T}_s} p(T) \, q(y_1 \mid \boldsymbol{x}_1, T, \boldsymbol{k}) \quad (A11)$$
$$\overset{(b)}{=} \sum_{s \in \mathcal{S}(\boldsymbol{x}_1)} q_s(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k}) \sum_{T \in \mathcal{T}_s} p(T), \quad (A12)$$
where (a) follows from (10), and (b) follows from Lemma 1. The right-hand side of (A12) contains the summation with respect to $T$:
$$\sum_{T \in \mathcal{T}_s} p(T). \quad (A13)$$
Regarding (A13), we prove the following lemmas.
Lemma A1.
The prior probability distribution $p(T)$ assumed in (12) satisfies
$$\sum_{T \in \mathcal{T}_{M_{T,\boldsymbol{k}}}} p(T) = 1. \quad (A14)$$
Proof of Lemma A1.
In this proof, for simplicity, we write $\mathcal{T}$ instead of $\mathcal{T}_{M_{T,\boldsymbol{k}}}$. Before giving the formal proof, we consider an example. Let us consider the set of trees $\mathcal{T}$ shown in Figure A1. In this example, $\mathcal{T}$ is composed of five model trees. Among these model trees, $T_1$ is the tree $[s_\lambda]$ consisting only of the root node, and $T_2, \ldots, T_5$ are included in $\mathcal{T} \setminus \{[s_\lambda]\}$. Therefore, the following equations hold:
$$\sum_{T \in \mathcal{T}} p(T) = \sum_{T \in \mathcal{T}} \prod_{s \in \mathcal{I}(T)} g_s \prod_{s \in \mathcal{L}(T)} (1 - g_s) \quad (A15)$$
$$= (1 - g_{s_\lambda}) + g_{s_\lambda} \sum_{T \in \{T_2, T_3, T_4, T_5\}} \prod_{s \in \mathcal{L}(T)} (1 - g_s) \prod_{s \in \mathcal{I}(T) \setminus \{s_\lambda\}} g_s \quad (A16)$$
$$= (1 - g_{s_\lambda}) + g_{s_\lambda} \Big[ (1 - g_{s_0})(1 - g_{s_1}) + g_{s_0}(1 - g_{s_{00}})(1 - g_{s_{01}})(1 - g_{s_1}) + (1 - g_{s_0}) \, g_{s_1} (1 - g_{s_{10}})(1 - g_{s_{11}}) + g_{s_0}(1 - g_{s_{00}})(1 - g_{s_{01}}) \, g_{s_1}(1 - g_{s_{10}})(1 - g_{s_{11}}) \Big] \quad (A17)$$
$$= (1 - g_{s_\lambda}) + g_{s_\lambda} \big\{ (1 - g_{s_0}) + g_{s_0}(1 - g_{s_{00}})(1 - g_{s_{01}}) \big\} \big\{ (1 - g_{s_1}) + g_{s_1}(1 - g_{s_{10}})(1 - g_{s_{11}}) \big\} \quad (A18)$$
$$\overset{(a)}{=} (1 - g_{s_\lambda}) + g_{s_\lambda} \big\{ (1 - g_{s_0}) + g_{s_0} \big\} \big\{ (1 - g_{s_1}) + g_{s_1} \big\} \quad (A19)$$
$$= 1, \quad (A20)$$
where (a) follows from $g_s = 0$ for any leaf node $s$ of the meta-tree (see Assumption A2).
Now, we give the formal proof. Because
$$\sum_{T \in \mathcal{T}} p(T) = \sum_{T \in \mathcal{T}} \prod_{s \in \mathcal{I}(T)} g_s \prod_{s \in \mathcal{L}(T)} (1 - g_s), \quad (A21)$$
we prove
$$\sum_{T \in \mathcal{T}} \prod_{s \in \mathcal{I}(T)} g_s \prod_{s \in \mathcal{L}(T)} (1 - g_s) = 1 \quad (A22)$$
by induction.
Let $[s_\lambda]$ denote the tree that consists of only the root node $s_\lambda$ of $T$. Then, we have
$$\sum_{T \in \mathcal{T}} \prod_{s \in \mathcal{L}(T)} (1 - g_s) \prod_{s \in \mathcal{I}(T)} g_s = (1 - g_{s_\lambda}) + g_{s_\lambda} \sum_{T \in \mathcal{T} \setminus \{[s_\lambda]\}} \prod_{s \in \mathcal{L}(T)} (1 - g_s) \prod_{s \in \mathcal{I}(T) \setminus \{s_\lambda\}} g_s. \quad (A23)$$
Because each tree $T \in \mathcal{T} \setminus \{[s_\lambda]\}$ is identified by its $|\mathcal{X}|$ sub-trees whose root nodes are the child nodes of $s_\lambda$, let $s_\lambda^{\mathrm{child},j}$ denote the $j$-th child node of $s_\lambda$ for $0 \le j \le |\mathcal{X}| - 1$, and let $\mathcal{T}_{s_\lambda^{\mathrm{child},j}}$ denote the set of sub-trees whose root node is $s_\lambda^{\mathrm{child},j}$. Figure A1 shows an example of $\mathcal{T}_{s_\lambda^{\mathrm{child},0}}$ and $\mathcal{T}_{s_\lambda^{\mathrm{child},1}}$. Then, the summation on the right-hand side of (A23) is factorized as
$$\prod_{j=0}^{|\mathcal{X}|-1} \sum_{T_j \in \mathcal{T}_{s_\lambda^{\mathrm{child},j}}} \prod_{s \in \mathcal{L}(T_j)} (1 - g_s) \prod_{s \in \mathcal{I}(T_j)} g_s. \quad (A24)$$
Consequently, by applying the same decomposition recursively toward the leaf nodes of the meta-tree, where $g_s = 0$ holds, we obtain (A22); hence, Lemma A1 is proven by induction. □
Lemma A2.
Let $\mathcal{A}_s$ denote the set of all ancestor nodes of $s$. Then, we have
$$\sum_{T \in \mathcal{T}_s} p(T) = (1 - g_s) \prod_{s' \in \mathcal{A}_s} g_{s'}. \quad (A25)$$
Figure A1. An illustration of $\mathcal{T}_{s_\lambda^{\mathrm{child},0}}$ (left lower enclosure) and $\mathcal{T}_{s_\lambda^{\mathrm{child},1}}$ (right lower enclosure). In the trees $T_2, \ldots, T_5$, $s_\lambda$ has child nodes $s_\lambda^{\mathrm{child},0}$ and $s_\lambda^{\mathrm{child},1}$. Therefore, we can decompose the child nodes of $s_\lambda$ in the trees $T_2, \ldots, T_5$ into the nodes included in $\mathcal{T}_{s_\lambda^{\mathrm{child},0}}$ and the nodes included in $\mathcal{T}_{s_\lambda^{\mathrm{child},1}}$.
Proof of Lemma A2.
$$\sum_{T \in \mathcal{T}_s} p(T) \overset{(a)}{=} \sum_{T \in \mathcal{T}_s} \prod_{s' \in \mathcal{I}(T)} g_{s'} \prod_{s' \in \mathcal{L}(T)} (1 - g_{s'}) \quad (A26)$$
$$\overset{(b)}{=} (1 - g_s) \prod_{s' \in \mathcal{A}_s} g_{s'} \sum_{T \in \mathcal{T}_s} \prod_{s' \in \mathcal{I}(T) \setminus \mathcal{A}_s} g_{s'} \prod_{s' \in \mathcal{L}(T) \setminus \{s\}} (1 - g_{s'}), \quad (A27)$$
where (a) follows from Assumption A2, and (b) follows from the fact that every tree $T \in \mathcal{T}_s$ contains the node $s$ and all of its ancestor nodes.
Further, as we will prove below, it holds that
$$\sum_{T \in \mathcal{T}_s} \prod_{s' \in \mathcal{I}(T) \setminus \mathcal{A}_s} g_{s'} \prod_{s' \in \mathcal{L}(T) \setminus \{s\}} (1 - g_{s'}) = 1. \quad (A28)$$
Thus, we obtain (A25) by combining (A26) and (A28).
To prove (A28), we can apply the factorization of Lemma A1 to (A28) if we regard each node whose parent node is $s$ or an element of $\mathcal{A}_s$ as a root node, as shown in Figure A2. Let $J$ denote the number of such factorized trees, and let $r_j$ denote the root node of the $j$-th factorized tree. Then, we can rewrite (A28) as follows:
$$\sum_{T \in \mathcal{T}_s} \prod_{s' \in \mathcal{I}(T) \setminus \mathcal{A}_s} g_{s'} \prod_{s' \in \mathcal{L}(T) \setminus \{s\}} (1 - g_{s'}) = \prod_{j=0}^{J-1} \sum_{T_j \in \mathcal{T}_{r_j}} \prod_{s' \in \mathcal{L}(T_j)} (1 - g_{s'}) \prod_{s' \in \mathcal{I}(T_j)} g_{s'} = 1. \quad (A29)$$
Hence, the proof is complete. □
Now, we transform (A12) as follows.
$$\sum_{s \in \mathcal{S}(\boldsymbol{x}_1)} q_s(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k}) \sum_{T \in \mathcal{T}_s} p(T) \overset{(a)}{=} \sum_{s \in \mathcal{S}(\boldsymbol{x}_1)} q_s(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k}) \, (1 - g_s) \prod_{s' \in \mathcal{A}_s} g_{s'} \quad (A30)$$
$$= (1 - g_{s_\lambda}) \, q_{s_\lambda}(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k}) + g_{s_\lambda} \sum_{s \in \mathcal{S}(\boldsymbol{x}_1) \setminus \{s_\lambda\}} q_s(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k}) \, (1 - g_s) \prod_{s' \in \mathcal{A}_s \setminus \{s_\lambda\}} g_{s'}, \quad (A31)$$
where (a) follows from Lemma A2.
Figure A2. An illustration of $r_0$, $r_1$, $r_2$, and $r_3$. In this example, the dotted line represents the path from $s_\lambda$ to $s$, and $\mathcal{A}_s$ is the set of the two ancestor nodes of $s$. Furthermore, the dotted rectangles represent the four factorized trees $T_0$, $T_1$, $T_2$, and $T_3$.
Here, let $s_\lambda^\mathrm{child}$ denote the node that is a child node of $s_\lambda$ and an element of $\mathcal{S}(\boldsymbol{x}_1)$. Then, we can decompose
$$\sum_{s \in \mathcal{S}(\boldsymbol{x}_1) \setminus \{s_\lambda\}} q_s(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k}) \, (1 - g_s) \prod_{s' \in \mathcal{A}_s \setminus \{s_\lambda\}} g_{s'} = (1 - g_{s_\lambda^\mathrm{child}}) \, q_{s_\lambda^\mathrm{child}}(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k}) + g_{s_\lambda^\mathrm{child}} \sum_{s \in \mathcal{S}(\boldsymbol{x}_1) \setminus \{s_\lambda, s_\lambda^\mathrm{child}\}} q_s(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k}) \, (1 - g_s) \prod_{s' \in \mathcal{A}_s \setminus \{s_\lambda, s_\lambda^\mathrm{child}\}} g_{s'}. \quad (A32)$$
Equation (A32) has a structure similar to (A31). In this way, we can decompose (A31) recursively down to the leaf node included in $\mathcal{S}(\boldsymbol{x}_1)$. In addition, we can use the assumption that $g_s = 0$ for any leaf node $s$ of the meta-tree $M_{T,\boldsymbol{k}}$. Consequently, (A31) coincides with $\tilde{q}_{s_\lambda}(y_1 \mid \boldsymbol{x}_1, M_{T,\boldsymbol{k}})$, which proves (15) for $n = 0$.
Step 2: Next, we prove (16) for $n = 0$. Because the parameters $g_s$ are updated only at the nodes in $\mathcal{S}(\boldsymbol{x}_1)$, we decompose the right-hand side of (16) for $n = 0$ as follows:
$$\prod_{s \in \mathcal{I}(T)} g_{s \mid \boldsymbol{x}_1, y_1} \prod_{s \in \mathcal{L}(T)} (1 - g_{s \mid \boldsymbol{x}_1, y_1}) = \prod_{s \in \mathcal{I}(T) \cap \mathcal{S}(\boldsymbol{x}_1)} g_{s \mid \boldsymbol{x}_1, y_1} \prod_{s \in \mathcal{L}(T) \cap \mathcal{S}(\boldsymbol{x}_1)} (1 - g_{s \mid \boldsymbol{x}_1, y_1}) \cdot \prod_{s \in \mathcal{I}(T) \setminus \mathcal{S}(\boldsymbol{x}_1)} g_{s \mid \boldsymbol{x}_1, y_1} \prod_{s \in \mathcal{L}(T) \setminus \mathcal{S}(\boldsymbol{x}_1)} (1 - g_{s \mid \boldsymbol{x}_1, y_1}) \quad (A33)$$
$$= \prod_{s \in \mathcal{I}(T) \cap \mathcal{S}(\boldsymbol{x}_1)} g_{s \mid \boldsymbol{x}_1, y_1} \prod_{s \in \mathcal{L}(T) \cap \mathcal{S}(\boldsymbol{x}_1)} (1 - g_{s \mid \boldsymbol{x}_1, y_1}) \prod_{s \in \mathcal{I}(T) \setminus \mathcal{S}(\boldsymbol{x}_1)} g_s \prod_{s \in \mathcal{L}(T) \setminus \mathcal{S}(\boldsymbol{x}_1)} (1 - g_s). \quad (A34)$$
We transform the part of (A34) whose parameters $g_s$ are renewed by $\boldsymbol{x}_1$ and $y_1$ as follows:
$$\prod_{s \in \mathcal{I}(T) \cap \mathcal{S}(\boldsymbol{x}_1)} g_{s \mid \boldsymbol{x}_1, y_1} \prod_{s \in \mathcal{L}(T) \cap \mathcal{S}(\boldsymbol{x}_1)} (1 - g_{s \mid \boldsymbol{x}_1, y_1}) \overset{(a)}{=} (1 - g_{s(\boldsymbol{x}_1) \mid \boldsymbol{x}_1, y_1}) \prod_{s \in \mathcal{I}(T) \cap \mathcal{S}(\boldsymbol{x}_1)} g_{s \mid \boldsymbol{x}_1, y_1}$$
$$\overset{(b)}{=} (1 - g_{s(\boldsymbol{x}_1) \mid \boldsymbol{x}_1, y_1}) \prod_{s \in \mathcal{I}(T) \cap \mathcal{S}(\boldsymbol{x}_1)} g_s \, \frac{\tilde{q}_{s_\mathrm{child}}(y_1 \mid \boldsymbol{x}_1, M_{T,\boldsymbol{k}})}{\tilde{q}_s(y_1 \mid \boldsymbol{x}_1, M_{T,\boldsymbol{k}})}$$
$$\overset{(c)}{=} (1 - g_{s(\boldsymbol{x}_1) \mid \boldsymbol{x}_1, y_1}) \, \frac{\tilde{q}_{s(\boldsymbol{x}_1)}(y_1 \mid \boldsymbol{x}_1, M_{T,\boldsymbol{k}})}{\tilde{q}_{s_\lambda}(y_1 \mid \boldsymbol{x}_1, M_{T,\boldsymbol{k}})} \prod_{s \in \mathcal{I}(T) \cap \mathcal{S}(\boldsymbol{x}_1)} g_s$$
$$= \frac{q_{s(\boldsymbol{x}_1)}(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k})}{\tilde{q}_{s_\lambda}(y_1 \mid \boldsymbol{x}_1, M_{T,\boldsymbol{k}})} \prod_{s \in \mathcal{I}(T) \cap \mathcal{S}(\boldsymbol{x}_1)} g_s \prod_{s \in \mathcal{L}(T) \cap \mathcal{S}(\boldsymbol{x}_1)} (1 - g_s),$$
where (a) follows from the fact that $\mathcal{L}(T) \cap \mathcal{S}(\boldsymbol{x}_1)$ consists of the single node $s(\boldsymbol{x}_1)$, (b) follows from (14), (c) follows from the telescoping cancellation of the factors $\tilde{q}_s(y_1 \mid \boldsymbol{x}_1, M_{T,\boldsymbol{k}})$ along the path, and the last equality follows from applying (13) and (14) at $s(\boldsymbol{x}_1)$ together with the definition of $q_{s(\boldsymbol{x}_1)}$.
Thus, the right-hand side of (A34) is calculated as
$$\frac{q_{s(\boldsymbol{x}_1)}(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k})}{\tilde{q}_{s_\lambda}(y_1 \mid \boldsymbol{x}_1, M_{T,\boldsymbol{k}})} \prod_{s \in \mathcal{I}(T)} g_s \prod_{s \in \mathcal{L}(T)} (1 - g_s) \overset{(a)}{=} \frac{q_{s(\boldsymbol{x}_1)}(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k}) \, p(T)}{\tilde{q}_{s_\lambda}(y_1 \mid \boldsymbol{x}_1, M_{T,\boldsymbol{k}})} \overset{(b)}{=} \frac{q(y_1 \mid \boldsymbol{x}_1, T, \boldsymbol{k}) \, p(T)}{\tilde{q}(y_1 \mid \boldsymbol{x}_1, \boldsymbol{k})} \overset{(c)}{=} p(T \mid \boldsymbol{x}_1, y_1, \boldsymbol{k}),$$
where (a) follows from Assumption A2, the denominator of (b) follows from Theorem 2 for $n = 0$ (proved in Step 1), the numerator of (b) follows from the definition of $q_{s(\boldsymbol{x}_{n+1})}(y_{n+1} \mid \boldsymbol{x}_{n+1}, \boldsymbol{x}^n, y^n, \boldsymbol{k})$ given after Lemma 1, and (c) follows from Bayes' theorem.
Step 3: Finally, if we assume that (15) and (16) hold for $n = N$, we can prove (15) and (16) for $n = N + 1$ in the same way. Therefore, Theorem 2 holds. □

Appendix D. Proof of Lemma 2

Proof of Lemma 2.
$$p(\boldsymbol{k} \mid \boldsymbol{x}^n, y^n) \propto p(\boldsymbol{k}) \, p(\boldsymbol{x}^n, y^n \mid \boldsymbol{k})$$
$$= p(\boldsymbol{k}) \prod_{i=1}^{n} \tilde{q}(y_i \mid \boldsymbol{x}_i, \boldsymbol{x}^{i-1}, y^{i-1}, \boldsymbol{k}) \, p(\boldsymbol{x}_i \mid \boldsymbol{x}^{i-1}, y^{i-1}, \boldsymbol{k})$$
$$\overset{(a)}{\propto} \prod_{i=1}^{n} \tilde{q}(y_i \mid \boldsymbol{x}_i, \boldsymbol{x}^{i-1}, y^{i-1}, \boldsymbol{k})$$
$$\overset{(b)}{=} \prod_{i=1}^{n} \tilde{q}_{s_\lambda}(y_i \mid \boldsymbol{x}_i, \boldsymbol{x}^{i-1}, y^{i-1}, M_{T,\boldsymbol{k}}),$$
where (a) follows from Assumption A3 and the fact that $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$ are i.i.d. (so that $p(\boldsymbol{x}_i \mid \boldsymbol{x}^{i-1}, y^{i-1}, \boldsymbol{k})$ does not depend on $\boldsymbol{k}$), and (b) follows from Theorem 2 applied at each $i = 1, \ldots, n$. □

References

  1. Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; CRC Press: Boca Raton, FL, USA, 1984. [Google Scholar]
  2. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
  3. Athey, S.; Tibshirani, J.; Wager, S. Generalized Random Forests. Ann. Stat. 2019, 47, 1148–1178. [Google Scholar] [CrossRef] [Green Version]
  4. Friedman, J.H. Stochastic Gradient Boosting. Comput. Stat. Data Anal. 2002, 38, 367–378. [Google Scholar] [CrossRef]
  5. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16), San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar] [CrossRef] [Green Version]
  6. Schulter, S.; Wohlhart, P.; Leistner, C.; Saffari, A.; Roth, P.M.; Bischof, H. Alternating Decision Forests. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 508–515. [Google Scholar] [CrossRef]
  7. Mishina, Y.; Murata, R.; Yamauchi, Y.; Yamashita, T.; Fujiyoshi, H. Boosted Random Forest. IEICE Trans. Inf. Syst. 2015, E98.D, 1630–1636. [Google Scholar] [CrossRef] [Green Version]
  8. Bulo, S.; Kontschieder, P. Neural Decision Forests for Semantic Image Labelling. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 81–88. [Google Scholar] [CrossRef] [Green Version]
  9. Kontschieder, P.; Fiterau, M.; Criminisi, A.; Bulo, S.R. Deep Neural Decision Forests. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1467–1475. [Google Scholar] [CrossRef]
  10. Tanno, R.; Arulkumaran, K.; Alexander, D.; Criminisi, A.; Nori, A. Adaptive Neural Trees. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 6166–6175. [Google Scholar]
  11. Zhou, Z.H.; Feng, J. Deep Forest: Towards An Alternative to Deep Neural Networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), Melbourne, Australia, 19–25 August 2017; pp. 3553–3559. [Google Scholar] [CrossRef] [Green Version]
  12. Frosst, N.; Hinton, G. Distilling a Neural Network Into a Soft Decision Tree. arXiv 2017, arXiv:1711.09784. [Google Scholar]
  13. Suko, T.; Nomura, R.; Matsushima, T.; Hirasawa, S. Prediction Algorithm for Decision Tree Model. IEICE Tech. Rep. Theor. Found. Comput. 2003, 103, 93–98. (In Japanese) [Google Scholar]
  14. Berger, J.O. Statistical Decision Theory and Bayesian Analysis; Springer: New York, NY, USA, 1985. [Google Scholar] [CrossRef]
  15. Matsushima, T.; Hirasawa, S. A Bayes Coding Algorithm Using Context Tree. In Proceedings of the 1994 IEEE International Symposium on Information Theory, Trondheim, Norway, 27 June–1 July 1994; p. 386. [Google Scholar] [CrossRef]
  16. Nakahara, Y.; Matsushima, T. A Stochastic Model of Block Segmentation Based on the Quadtree and the Bayes Code for It. In Proceedings of the 2020 Data Compression Conference (DCC), Snowbird, UT, USA, 24–27 March 2020; pp. 293–302. [Google Scholar] [CrossRef]
Figure 1. An example of a meta-tree and four model trees represented by it.
Figure 2. (a) Two examples, $(T_1, \boldsymbol{k}_1, \boldsymbol{\theta}_1)$ and $(T_2, \boldsymbol{k}_2, \boldsymbol{\theta}_2)$, where $T_1$ and $T_2$ are 2-ary regular trees, $\boldsymbol{k}_1 = (k_{s_\lambda}, k_{s_0}) = (1, 2)$, and $\boldsymbol{k}_2 = (k_{s_\lambda}, k_{s_0}, k_{s_1}) = (3, 4, 2)$. (b) $p(y \mid \boldsymbol{x} = (0, 1), \boldsymbol{\theta}_1, T_1, \boldsymbol{k}_1) = \theta_{y \mid s_{01}}$.
Figure 3. An example of a meta-tree $M_{T,\boldsymbol{k}}$, where $T$ is a complete tree with $D_\mathrm{max} = 2$ and $\boldsymbol{k} = (3, 4, 2)$.
Figure 4. An example of $\mathcal{K} = \{\boldsymbol{k}_1, \boldsymbol{k}_2, \boldsymbol{k}_3\}$ and $\mathcal{K}' = \{\boldsymbol{k}_1, \boldsymbol{k}_2\}$. In this example, the prior probabilities are $p(\boldsymbol{k}_1) = p(\boldsymbol{k}_2) = 1/2$ and $p(\boldsymbol{k}_3) = 0$.
Figure 5. An example of true model tree (A-1) (a), model tree (A-2) (b), and model tree (B-1) (c).
Figure 6. Relationships between the number of training data $n$ and the average prediction error rate where the true model tree is model (A-1) (a), model (A-2) (b), and model (B-1) (c).
Figure 7. Relationships between the number of training data $n$ and the average prediction error rate on the nursery school data (a) and the mushroom data (b) from the UCI repository.

