The Effects of Training Set Size on Decision Tree Complexity
Tim Oates and David Jensen, "The Effects of Training Set Size on Decision Tree Complexity." Proceedings of the Fourteenth International Conference on Machine Learning. 1997.
- Abstract
- This paper presents experiments with 19 datasets and 5 decision
tree pruning algorithms that show that increasing training set
size often results in a linear increase in tree size, even when
that additional complexity results in no significant increase
in classification accuracy. Said differently, removing randomly
selected training instances often results in trees that are substantially
smaller and just as accurate as those built on all available training
instances. This implies that decreases in tree size obtained by
more sophisticated data reduction techniques should be decomposed
into two parts: that which is due to reduction of training set
size, and the remainder, which is due to how the method selects
instances to discard. We perform this decomposition for one recent
data reduction technique, John's Robust-C4.5, and show that a
large percentage of its effect on tree size is attributable to
the fact that it simply reduces the size of the training set.
We conclude that random data reduction is a baseline against which
more sophisticated data reduction techniques should be compared.
Finally, we examine one possible cause of the pathological relationship
between tree size and training set size.
- Text
- A Postscript version of this paper is available (166K).