Large Datasets Lead to Overly Complex Models:
An Explanation and a Solution
Tim Oates and David Jensen (1998). "Large Datasets Lead to Overly Complex Models: An Explanation and a Solution." Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. August.
- Abstract
- This paper explores unexpected results that lie at the intersection
of two common themes in the KDD community: large datasets and
the goal of building compact models. Experiments with many different
datasets and several model construction algorithms (including
tree learning algorithms such as C4.5 with three different pruning
methods, and rule learning algorithms such as C4.5RULES and RIPPER)
show that increasing the amount of data used to build a model
often results in a linear increase in model size, even when that
additional complexity results in no significant increase in model
accuracy. Despite the promise of better parameter estimation held
out by large datasets, as a practical matter, models built with
large amounts of data are often needlessly complex and cumbersome.
In the case of decision trees, the cause of this pathology is
identified as a bias inherent in several common pruning techniques.
Pruning errors made low in the tree, where there is insufficient
data to make accurate parameter estimates, are propagated and
magnified higher in the tree, working against the accurate parameter
estimates that are made possible there by abundant data. We propose
a general solution to this problem based on a statistical technique
known as randomization testing, and empirically evaluate its utility.
- Text
- A PostScript version of this paper is available on request.
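
The abstract names randomization testing as the basis of the proposed solution but does not spell out the procedure. As a rough illustration of the general idea only, the Python sketch below runs a generic randomization (permutation) test on a single candidate split: it asks how often a split looks as good as the observed one when the class labels are shuffled, which breaks any real association between branch and class. This is not the authors' algorithm. The function names `split_score` and `randomization_p_value`, the scoring statistic (absolute difference in positive-class rate between two branches), the trial count, and the add-one correction are all assumptions made for illustration.

```python
import random


def split_score(branch, labels):
    """Absolute difference in positive-class rate between the two branches.

    `branch` holds 0/1 branch assignments and `labels` holds 0/1 class labels.
    This statistic is an illustrative choice, not the one used in the paper.
    """
    left = [y for b, y in zip(branch, labels) if b == 0]
    right = [y for b, y in zip(branch, labels) if b == 1]
    if not left or not right:
        return 0.0  # a degenerate split separates nothing
    return abs(sum(left) / len(left) - sum(right) / len(right))


def randomization_p_value(branch, labels, trials=1000, seed=0):
    """Estimate how often shuffled labels score at least as well as the real ones."""
    rng = random.Random(seed)
    observed = split_score(branch, labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # destroys any branch/label association
        if split_score(branch, shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)


if __name__ == "__main__":
    # A split that separates the classes perfectly.
    informative = randomization_p_value([0] * 20 + [1] * 20, [0] * 20 + [1] * 20)
    # A split that alternates branches and carries no class information.
    uninformative = randomization_p_value([0, 1] * 20, [0] * 20 + [1] * 20)
    print("informative split p-value:", informative)
    print("uninformative split p-value:", uninformative)
```

In a pruning setting, a test of this kind could be used to keep a subtree only when its estimated p-value falls below a chosen threshold, so that structure which is no better than chance is removed regardless of how much data reaches the node. The threshold, and how the test is applied across the tree, are design choices that this sketch leaves open.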