Evaluating techniques for generating metric-based classification trees

Title: Evaluating techniques for generating metric-based classification trees
Publication Type: Journal Articles
Year of Publication: 1990
Authors: Porter A, Selby RW
Journal: Journal of Systems and Software
Pagination: 209-218
Date Published: 1990/07
ISBN Number: 0164-1212

Metric-based classification trees provide an approach for identifying user-specified classes of high-risk software components throughout the software lifecycle. Based on measurable attributes of software components and processes, this empirically guided approach derives models of problematic software components. These models, which are represented as classification trees, are used on future systems to identify components likely to share the same high-risk properties. Example high-risk component properties include being fault-prone, change-prone, or effort-prone, or containing certain types of faults. Identifying these components allows developers to focus the application of specialized techniques and tools for analyzing, testing, and constructing software. A validation study using metric data from 16 NASA systems showed that the trees had an average classification accuracy of 79.3% for fault-prone and effort-prone components in that environment.
One fundamental feature of the classification tree generation algorithm is the method used for partitioning the metric data values into mutually exclusive and exhaustive ranges. This study compares the accuracy and the complexity of trees resulting from five techniques for partitioning metric data values. The techniques are quartiles, octiles, and three methods based on least weight subsequence (LWS-χ) analysis, where χ is the upper bound on the number of partitions. The LWS-3 and LWS-5 partition techniques resulted in trees with higher accuracy (in terms of completeness and consistency) than did quartiles and octiles. LWS-3 and LWS-5 trees were not statistically different in terms of accuracy, but LWS-3 trees had lower complexity than all other methods in terms of the number of unique metrics required. The trees from the three LWS methods (LWS-3, LWS-5, and LWS-8) had lower complexity than did the trees from quartiles and octiles.
In general, the results indicate that distribution-sensitive partition techniques that use only relatively few partitions, such as the least weight subsequence techniques LWS-3 and LWS-5, can increase accuracy and decrease complexity in classification trees. Classification analysis techniques, along with other empirically based analysis techniques for large-scale software, will be supported in the Amadeus measurement and empirical analysis system.
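As an illustration of the two baseline techniques named in the abstract, the following minimal sketch partitions a metric's values into quartile or octile ranges that are mutually exclusive and exhaustive. It is not the authors' implementation: the function names, the sample metric values, and the choice of Python's "inclusive" quantile method are all assumptions made for illustration, and the LWS-based techniques are not shown.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_cutpoints(values, n_ranges):
    """Return the n_ranges - 1 interior cut points for a metric's values.

    n_ranges=4 gives quartiles, n_ranges=8 gives octiles.
    (Hypothetical helper; the "inclusive" method is an assumed choice.)
    """
    return quantiles(values, n=n_ranges, method="inclusive")


def assign_range(value, cutpoints):
    """Map a metric value to the index of the range that contains it.

    Every value falls into exactly one range, so the ranges are mutually
    exclusive and exhaustive. A value equal to a cut point is placed in
    the upper of the two adjacent ranges.
    """
    return bisect_right(cutpoints, value)


# Hypothetical metric data, e.g. source lines per component.
source_lines = [1, 2, 3, 4, 5, 6, 7, 8]

quartile_cuts = quantile_cutpoints(source_lines, 4)
octile_cuts = quantile_cutpoints(source_lines, 8)

quartile_ranges = [assign_range(v, quartile_cuts) for v in source_lines]
octile_ranges = [assign_range(v, octile_cuts) for v in source_lines]
```

Quantile cut points depend only on the rank order of the data, which places the same share of components in every range regardless of where the values cluster. The distribution-sensitive LWS techniques compared in the study choose their partitions differently.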