next up previous
Next: 2.3.2 Multivariate splits Up: 2.3 Finding splits Previous: 2.3 Finding splits

2.3.1 Feature evaluation rules

 

When used for classification or generalization, decision trees are essentially probability estimators. Feature evaluation rules are heuristics whose aim is to produce as reliable probability estimates from training data as possible. A taxonomy, proposed by Ben-Bassat [18], is helpful in understanding the large number of existing feature evaluation criteria. Ben-Basset divides feature evaluation rules into three, not necessarily distinct, categories: rules derived from information theory, rules derived from distance measures and rules derived from dependence measures.

There exist several attribute selection criteria that do not clearly belong to any category in Ben-Basset's taxonomy. Gleser and Collen [124] and Talmon [344] used a combination of mutual information and measures. They first measured the gain in average mutual information due to a new split , and then quantified the probability that this gain is due to chance, using tables. The split that minimized was chosen by these methods. A permutation statistic was used for univariate tree construction for 2-class problems in [210]. The main advantage of this statistic is that, unlike most of the other measures, its distribution is independent of the number of training instances. As will be seen in Section 2.4, this property provides a natural measure of when to stop tree growth.

Measures that use the activity of an attribute have been explored for tree construction [253,247]. The activity of a variable is equal to the testing cost of the variable times the a priori probability that it will be tested. The computational requirements for computing activity are the same as those for the information-based measures. Quinlan and Rivest [299] suggested the use of Risannen's minimum description length [304] for deciding which splits to prefer over others and also for pruning. Kalkanis [170] pointed out that measures like information gain and Gini index are all concave (i.e., they never report a worse goodness value after trying a split than before splitting), so there is no natural way of assessing where to stop further expansion of a node. As a remedy, Kalkanis suggested the use of the upper bounds in the confidence intervals for the misclassification error as an attribute selection criterion. gif

Heath [147] used the simplest possible attribute selection criteria, based on the number of misclassified objects, for oblique decision tree induction. Their measures were called max minority and sum minority, respectively denoting the maximum and the sum of the number of misclassified points on either side of a binary split. Max minority has the theoretical advantage that the depth of the tree constructed using this measure is at worst logarithmic in the number of examples. Lubinsky [219,220] also used the number of misclassified points as a splitting criterion, calling it inaccuracy. The performance of these measures does not seem to be in general as good as the information theory or distance based measures, and additional tricks are needed to make these measures robust [219,264]. Another measure suggested by Heath , called the sum of impurities, assigns an integer to each class and measures the variance between class numbers in each partition [147,264]. An almost identical measure was used earlier in the Automatic Interaction Detection (AID) program [100].

Most of the above feature evaluation criteria assume no knowledge of the probability distribution of the training objects. The optimal decision rule at each tree node, a rule that minimizes the overall error probability, is considered in [199,200,201] assuming that complete probabilistic information about the data is known.

Comparisons

Given the large number of feature evaluation rules, a natural concern is to decide their relative effectiveness in constructing ``good'' trees. Evaluations in this direction, in statistics, pattern recognition and machine learning, have been predominantly empirical in nature, though there have been a few theoretical evaluations. We will discuss the empirical comparisons here, and defer the discussion of the latter to Section 2.6.

In spite of a large number of comparative studies, very few so far have concluded that a particular feature evaluation rule is significantly better than others. A majority of studies have concluded that there is not much difference between different measures. This is to be expected as induction per se can not rigorously justify performance on unseen instances. Any strategy that results in superior generalization accuracy on some problems is bound to have inferior performance on some other problems. gif Of course, comparisons of individual methods are still interesting because they throw light on which method can be used in what situations.

Baker and Jain [14] reported experiments comparing eleven feature evaluation criteria and concluded that the feature rankings induced by various rules are very similar. Several feature evaluation criteria, including Shannon's entropy and divergence measures, are compared using simulated data in [17], on a sequential, multi-class classification problem. The conclusions are that no feature selection rule is consistently superior to the others, and that no specific strategy for alternating different rules seems to be significantly more effective. Breiman et al. [29] conjectured that decision tree design is rather insensitive to any one from a large class of splitting rules, and it is the stopping rule that is crucial. Mingers [243] compared several attribute selection criteria, and concluded that tree quality doesn't seem to depend on the specific criterion used. He even claimed that random attribute selection criteria are as good as measures like information gain [292]. This later claim was refuted in [215], where the authors argued that random attribute selection criteria are prone to overfitting, and also fail when there are several noisy attributes.

Babic et al. [12] compared ID3 [292] and CART [29], for two clinical diagnosis problems. Miyakawa [247] compared three activity-based measures, Q, O and loss, both analytically and empirically. He showed that Q and O do not chose non-essential variables at tree nodes, and that they produce trees that are 1/4th the size of the trees produced by loss. Fayyad and Irani [95] showed that their measure C-SEP, performs better than Gini index [29] and information gain [292] for specific types of problems.

Several researchers [140,292] pointed out that information gain is biased towards attributes with a large number of possible values. Mingers [241] compared information gain and the statistic for growing the tree as well as for stop-splitting. He concluded that corrected information gain's bias towards multivalued attributes, however to such an extent that they were never chosen, and the latter produced trees that were extremely deep and hard to interpret. Quinlan suggested gain ratio [297] as a remedy for the bias of information gain. Mantaras [229] argued that gain ratio had its own set of problems, and suggested using information theory-based distance between partitions for tree construction. He formally proved that his measure is not biased towards multiple-valued attributes. However, White and Liu [374] present experiments to conclude that information gain, gain ratio and Mantaras' measure are worse than a based statistical measure, in terms of their bias towards multiple-valued attributes. A hypergeometric distribution is proposed as a means to avoid the biases of information gain, gain ratio and metrics in [231]. Kononenko recently pointed out that [189] Minimum Description Length based feature evaluation criteria have the least bias towards multi-valued attributes.



next up previous
Next: 2.3.2 Multivariate splits Up: 2.3 Finding splits Previous: 2.3 Finding splits



Sreerama Murthy
Thu Oct 19 17:40:24 EDT 1995