Decision trees most commonly are univariate, i.e., they use splits based on a single attribute at each internal node. Multivariate decision trees can use splits that contain more than one attribute at each internal node. Though several methods have been developed in the literature for constructing multivariate trees, this body of work is not as well-known as that on univariate trees. We summarize below the directions work on automatically constructing multivariate trees has taken.
Most of the work on multivariate splits considered linear (oblique) trees. These are trees which have tests based on a linear combination of the attributes at some internal nodes. The problem of finding an optimal linear split (optimal with respect to any of the feature evaluation measures in Section 2.3.1) is more difficult that that of finding the optimal univariate split. In fact, finding optimal linear splits is known to be intractable for some feature evaluation rules (see Section 2.6.1 for pointers), so heuristic methods are required for finding good, albeit suboptimal, linear splits. Methods used in the literature for finding good linear tests include linear discriminant analysis, hill climbing search, linear programming, perceptron training and others.
Several authors have considered the problem of constructing tree-structured classifiers that have linear discriminants [81] at each node. You and Fu [379] used a linear discriminant at each node in the decision tree, computing the hyperplane coefficients using the Fletcher-Powell descent method [104]. Their method requires that the best set of features at each node be prespecified by a human. Friedman [110] reported that applying Fisher's linear discriminants, instead of atomic features, at some internal nodes was useful in building better trees. Qing-Yun and Fu [289] also describe a method to build linear discriminant trees. Their method uses multivariate stepwise regression to optimize the structure of the decision tree as well as to choose subsets of features to be used in the linear discriminants. More recently, use of linear discriminants at each node is considered by Loh and Vanichsetakul [216]. Unlike in [379], the variables at each stage are appropriately chosen in [216] according to the data and the type of splits desired. Other features of the tree building algorithm in [216] are: (1) it yields trees with univariate, linear combination or linear combination of polar coordinate splits, and (2) allows both ordered and unordered variables in the same linear split. Use of linear discriminants in a decision tree is considered in the remote sensing literature in [158]. A method for building linear discriminant classification trees, in which the user can decide at each node what classes need to be split, is described in [350]. John [166] recently considered linear discriminant trees in the machine learning literature.
An extension of linear discriminants are linear machines [271], which are linear structures that can discriminate between multiple classes. In the machine learning literature, Utgoff et al. explored decision trees that used linear machines at internal nodes [32,79].
Sklansky and his students developed several piecewise linear discriminants based on the principle of locally opposed clusters of objects. Wassel and Sklansky [368,335] suggested a procedure to train a linear split to minimize the error probability. Using this procedure, Sklansky and Michelotti [334] developed a system to induce a piece-wise linear classifier. Their method identifies the closest-opposed pairs of clusters in the data, and trains each linear discriminant locally. The final classifier produced by this method is a piecewise linear decision surface, not a tree. Foroutan [107] discovered that the resubstitution error rate of optimized piece-wise linear classifiers is nearly monotonic with respect to the number of features. Based on this result, Foroutan and Sklansky [108] suggest an effective feature selection procedure for linear splits that uses zero-one integer programming. Park and Sklansky [280,281] describe methods to induce linear tree classifiers and piece-wise linear discriminators. The main idea in these methods is to find hyperplanes that cut a maximal number of Tomek links. Tomek links of a data set connect opposed pairs of data points for which the circle of influence between the points doesn't contain any other points.
CART's use of linear combinations of attributes
([29], Chapter 5) is well-known. This algorithm
uses heuristic hill climbing and backward feature elimination to find
good linear combinations at each node. Murthy et al.
[263,264] described significant extensions
to CART's linear combinations algorithm, using randomized techniques.
(See Chapter
)
A perceptron is a linear function neuron [244,135] which can be trained to optimize the sum of distances of the misclassified objects to it, using a convergent procedure for adjusting its coefficients. Perceptron trees, which are decision trees with perceptrons just above the leaf nodes, were discussed in [357]. Decision trees with perceptrons at all internal nodes were described in [359,325].
Linear programming has been used for building adaptive classifiers since late 1960s [156]. Given two possibly interesecting sets of points, Duda and Hart [81] proposed a linear programming formulation for finding the split whose distance from the misclassified points is minimized. More recently, Mangasarian and Bennett used linear and quadratic programming techniques to build machine learning systems in general and decision trees in particular [228,21,19,226,20]. Use of zero-one integer programming for designing vector quantizers can be found in [213]. Brown and Pittard [34] also employed linear programming for finding optimal multivariate splits at classification tree nodes. Almost all the above papers attempt to minimize the distance of the misclassified points from the decision boundary. In that sense, these methods are more similar to perceptron training methods [244], than to decision tree splitting criteria. Mangasarian [227] describes a linear programming formulation to minimize the number of misclassified points instead of the geometric distance.
In the neural networks community, many researchers have recently considered hybrid structures between decision trees and neural nets. Though these techniques were developed as neural networks whose structure could be automatically determined, their outcome can be interpreted as decision trees with nonlinear splits. Examples of this work include [125,333,30,57,150,315,69]. Techniques very similar to those used in tree construction, such as information theoretic splitting criteria and pruning, can be found in neural tree construction also. In addition to these methods, there exist other hybrid techniques between decision trees and neural networks. Sethi [322] described a method for converting a univariate decision tree into a neural net and then retraining it, resulting in tree structured entropy nets with sigmoidal splits. An extension of entropy nets, that converts linear decision trees into neural nets was described in [279]. Decision trees with small multilayer networks at each node, implementing nonlinear, multivariate splits, were described in [132]. Jordan and Jacobs [168] described hierarchical parametric classifiers with small ``experts'' at internal nodes. Training methods for tree structured Boltzmann machines are described in [316].
Use of polynomial splits at tree nodes is explored in decision theory in [321]. In information theory, Gelfand and Ravishanker [116] describe a method to build a tree structured filter that has linear processing elements at internal nodes. Heath et al. [147,145] used simulated annealing to find the best oblique split at each tree node. Lubinsky [221,220] attempted bivariate trees, trees in which some functions of two variables can be used as tests at internal nodes. Lubinsky considered the use of linear cuts, corner cuts and rectangular cuts, using ordered and unordered variables.