Efficient Progressive Sampling
F. Provost, D. Jensen, and T. Oates (1999). Efficient progressive sampling. Proceedings of the Fifth ACM SIG-KDD International Conference on Knowledge Discovery and Data Mining (KDD-99). pp. 23-32.
- Abstract
- Having access to massive amounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However, the correct sample size rarely is obvious. We analyze methods for progressive sampling using progressively larger samples as long as model accuracy improves. We analyze efficiency relative to induction with all instances; we show that a simple, geometric sampling schedule is asympototically optimal, and we describe how best to take into account prior expectations of accuracy convergence. We then describe the issues involved in instantiating an efficient progressive sampler, including how to detect convergence. Finally, we provide empirical results comparing a variety of progressive sampling methods. We conclude that progressive sampling can be remarkably efficient.
- Text
- A PDF version of this paper is available.