When used for classification or generalization, decision trees are essentially probability estimators. Feature evaluation rules are heuristics whose aim is to produce as reliable probability estimates from training data as possible. A taxonomy, proposed by Ben-Bassat [18], is helpful in understanding the large number of existing feature evaluation criteria. Ben-Basset divides feature evaluation rules into three, not necessarily distinct, categories: rules derived from information theory, rules derived from distance measures and rules derived from dependence measures.
distribution, for tree construction as
well as for deciding when to stop. De Merckt [361]
suggested an attribute selection measure that combined geometric
distance with information gain, and argued that such measures are
more appropriate for numeric attribute spaces.
Bhattacharya distance [214], Kolmogorov-Smirnoff distance
[110,307,142] and the
statistic
[16,140,241,382,374] are
some other distance-based measures that have been used for tree
induction. Class separation-based metrics developed in the machine
learning literature [95,381] are also
distance measures. A relatively simplistic method for estimating
class separation, which assumes that the values of each feature
follow a Gaussian distribution in each class, was used for tree
construction in [223].
There exist several attribute selection criteria that do not clearly
belong to any category in Ben-Basset's taxonomy. Gleser and Collen
[124] and Talmon [344] used a combination
of mutual information and
measures. They first measured the
gain in average mutual information
due to a new split
,
and then quantified the probability
that this gain is due
to chance, using
tables. The split that minimized
was chosen by these methods. A permutation statistic was
used for univariate tree construction for 2-class problems in
[210]. The main advantage of this statistic is that,
unlike most of the other measures, its distribution is independent of
the number of training instances. As will be seen in Section
2.4, this property provides a natural
measure of when to stop tree growth.
Measures that use the activity of an attribute have been
explored for tree construction [253,247]. The
activity of a variable is equal to the testing cost of the variable
times the a priori probability that it will be tested. The
computational requirements for computing activity are the same as
those for the information-based measures. Quinlan and Rivest
[299] suggested the use of Risannen's minimum
description length [304] for deciding which splits to
prefer over others and also for pruning. Kalkanis [170]
pointed out that measures like information gain and Gini index are all
concave (i.e., they never report a worse goodness value after trying a
split than before splitting), so there is no natural way of assessing
where to stop further expansion of a node. As a remedy, Kalkanis
suggested the use of the upper bounds in the confidence intervals for
the misclassification error as an attribute selection
criterion.
Heath [147] used the simplest possible attribute selection criteria, based on the number of misclassified objects, for oblique decision tree induction. Their measures were called max minority and sum minority, respectively denoting the maximum and the sum of the number of misclassified points on either side of a binary split. Max minority has the theoretical advantage that the depth of the tree constructed using this measure is at worst logarithmic in the number of examples. Lubinsky [219,220] also used the number of misclassified points as a splitting criterion, calling it inaccuracy. The performance of these measures does not seem to be in general as good as the information theory or distance based measures, and additional tricks are needed to make these measures robust [219,264]. Another measure suggested by Heath , called the sum of impurities, assigns an integer to each class and measures the variance between class numbers in each partition [147,264]. An almost identical measure was used earlier in the Automatic Interaction Detection (AID) program [100].
Most of the above feature evaluation criteria assume no knowledge of the probability distribution of the training objects. The optimal decision rule at each tree node, a rule that minimizes the overall error probability, is considered in [199,200,201] assuming that complete probabilistic information about the data is known.
Given the large number of feature evaluation rules, a natural concern is to decide their relative effectiveness in constructing ``good'' trees. Evaluations in this direction, in statistics, pattern recognition and machine learning, have been predominantly empirical in nature, though there have been a few theoretical evaluations. We will discuss the empirical comparisons here, and defer the discussion of the latter to Section 2.6.
In spite of a large number of comparative studies, very few so far
have concluded that a particular feature evaluation rule is
significantly better than others. A majority of studies have concluded
that there is not much difference between different measures. This is
to be expected as induction per se can not rigorously justify
performance on unseen instances. Any strategy that results in superior
generalization accuracy on some problems is bound to have inferior
performance on some other problems.
Of course, comparisons
of individual methods are still interesting because they throw light
on which method can be used in what situations.
Baker and Jain [14] reported experiments comparing eleven feature evaluation criteria and concluded that the feature rankings induced by various rules are very similar. Several feature evaluation criteria, including Shannon's entropy and divergence measures, are compared using simulated data in [17], on a sequential, multi-class classification problem. The conclusions are that no feature selection rule is consistently superior to the others, and that no specific strategy for alternating different rules seems to be significantly more effective. Breiman et al. [29] conjectured that decision tree design is rather insensitive to any one from a large class of splitting rules, and it is the stopping rule that is crucial. Mingers [243] compared several attribute selection criteria, and concluded that tree quality doesn't seem to depend on the specific criterion used. He even claimed that random attribute selection criteria are as good as measures like information gain [292]. This later claim was refuted in [215], where the authors argued that random attribute selection criteria are prone to overfitting, and also fail when there are several noisy attributes.
Babic et al. [12] compared ID3 [292] and CART [29], for two clinical diagnosis problems. Miyakawa [247] compared three activity-based measures, Q, O and loss, both analytically and empirically. He showed that Q and O do not chose non-essential variables at tree nodes, and that they produce trees that are 1/4th the size of the trees produced by loss. Fayyad and Irani [95] showed that their measure C-SEP, performs better than Gini index [29] and information gain [292] for specific types of problems.
Several researchers [140,292] pointed out that
information gain is biased towards attributes with a large number of
possible values. Mingers [241] compared information gain
and the
statistic for growing the tree as well as for
stop-splitting. He concluded that
corrected information
gain's bias towards multivalued attributes, however to such an extent
that they were never chosen, and the latter produced trees that were
extremely deep and hard to interpret. Quinlan suggested gain
ratio [297] as a remedy for the bias of information
gain. Mantaras [229] argued that gain ratio had its own
set of problems, and suggested using information theory-based
distance between partitions for tree construction. He formally
proved that his measure is not biased towards multiple-valued
attributes. However, White and Liu [374] present
experiments to conclude that information gain, gain ratio and
Mantaras' measure are worse than a
based statistical
measure, in terms of their bias towards multiple-valued attributes. A
hypergeometric distribution is proposed as a means to avoid the biases
of information gain, gain ratio and
metrics in
[231]. Kononenko recently pointed out that
[189] Minimum Description Length based feature
evaluation criteria have the least bias towards multi-valued
attributes.