The Effects of Training Set Size on Decision Tree Complexity
T. Oates and D. Jensen. "The Effects of Training Set Size on Decision
Tree Complexity." Preliminary Papers of the Sixth International
Workshop on Artificial Intelligence and Statistics. January 1997.
pp. 379-390.
- Abstract
- This paper presents experiments with 19 datasets and 5 decision
tree pruning algorithms that show that increasing training set
size often results in a linear increase in tree size, even when
that additional complexity results in no significant increase
in classification accuracy. Said differently, removing randomly
selected training instances often results in trees that are substantially
smaller and just as accurate as those built on all available training
instances. This implies that decreases in tree size obtained by
more sophisticated data reduction techniques should be decomposed
into two parts: that which is due to reduction of training set
size, and the remainder, which is due to how the method selects
instances to discard. We perform this decomposition for one recent
data reduction technique, John's ROBUST-C4.5 (John 1995), and
show that a large percentage of its effect on tree size is attributable
to the fact that it simply reduces that size of the training set.
We conclude that random data reduction is a baseline against which
more sophisticated data reduction techniques should be compared.
- Text
- A Postscript version of this paper will be made available soon.