Learning from examples: generation and evaluation of decision trees for software resource analysis

Title: Learning from examples: generation and evaluation of decision trees for software resource analysis
Publication Type: Journal Articles
Year of Publication: 1988
Authors: Selby RW, Porter A
Journal: IEEE Transactions on Software Engineering
Pagination: 1743 - 1757
Date Published: 1988/12
ISBN Number: 0098-5589
Keywords: Analysis of variance, Artificial intelligence, Classification tree analysis, Data analysis, decision theory, Decision trees, Fault diagnosis, Information analysis, machine learning, metrics, NASA, production environment, software engineering, software modules, software resource analysis, Software systems, Termination of employment, trees (mathematics)

A general solution method for the automatic generation of decision (or classification) trees is investigated. The approach is to provide insights through in-depth empirical characterization and evaluation of decision trees for one problem domain, specifically, that of software resource data analysis. The purpose of the decision trees is to identify classes of objects (software modules) that had high development effort, i.e., in the uppermost quartile relative to past data. Sixteen software systems ranging from 3,000 to 112,000 source lines have been selected for analysis from a NASA production environment. The collection and analysis of 74 attributes (or metrics), for over 4,700 objects, capture a multitude of information about the objects: development effort, faults, changes, design style, and implementation style. A total of 9,600 decision trees are automatically generated and evaluated. The analysis focuses on the characterization and evaluation of decision tree accuracy, complexity, and composition. The decision trees correctly identified 79.3% of the software modules that had high development effort or faults, on average across all 9,600 trees. The decision trees generated from the best parameter combinations correctly identified 88.4% of the modules on average. Visualization of the results is emphasized, and sample decision trees are included.
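
The general idea in the abstract (label modules in the uppermost quartile of development effort, then grow a classification tree over their metrics) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the attribute names `size` and `changes`, the eight made-up modules, and the entropy-based split rule are not taken from the paper, whose abstract does not state which tree-generation parameters or evaluation function were used.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def upper_quartile_threshold(values):
    """Smallest value that falls in the top quarter of the sorted values."""
    ordered = sorted(values)
    return ordered[(3 * len(ordered)) // 4]


def build_tree(rows, labels, attributes):
    """Recursively split on the attribute with the lowest weighted entropy.

    Returns either a class label (leaf) or a tuple (attribute, branches).
    """
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class

    def split_entropy(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    best = min(attributes, key=split_entropy)
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        branches[value] = build_tree(sub_rows, sub_labels, remaining)
    return (best, branches)


def classify(tree, row, default=False):
    """Walk the tree; fall back to `default` for an unseen attribute value."""
    while isinstance(tree, tuple):
        attr, branches = tree
        if row[attr] not in branches:
            return default
        tree = branches[row[attr]]
    return tree


# Hypothetical module data: two categorical metrics plus effort in hours.
modules = [
    {"size": "small", "changes": "few", "effort": 10},
    {"size": "small", "changes": "many", "effort": 14},
    {"size": "large", "changes": "few", "effort": 20},
    {"size": "large", "changes": "many", "effort": 95},
    {"size": "small", "changes": "few", "effort": 8},
    {"size": "large", "changes": "many", "effort": 120},
    {"size": "small", "changes": "many", "effort": 16},
    {"size": "large", "changes": "few", "effort": 25},
]

# A module is "high effort" if it lies in the uppermost quartile.
threshold = upper_quartile_threshold([m["effort"] for m in modules])
labels = [m["effort"] >= threshold for m in modules]

tree = build_tree(modules, labels, ["size", "changes"])
predictions = [classify(tree, m) for m in modules]

# Fraction of the truly high-effort modules that the tree identifies.
high_total = sum(labels)
high_found = sum(1 for p, l in zip(predictions, labels) if p and l)
identified_fraction = high_found / high_total
```

The sketch scores the tree on the same modules it was built from, which is only a sanity check. A study like the one described would evaluate trees against modules not used to generate them.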