Relational Data Pre-Processing Techniques for Improved Securities Fraud Detection
A. Fast, L. Friedland, M. Maier, B. Taylor, D. Jensen, H. Goldberg and J. Komoroske. 2007. Relational data pre-processing techniques for improved securities fraud detection. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 941-949. (Also appeared as University of Massachusetts Technical Report 07-20.)
- Abstract
Commercial datasets are often large, relational and dynamic.
They contain many records of people, places, things and events
and their interactions over time. Such datasets are rarely
structured appropriately for important knowledge discovery tasks,
and they often contain variables whose meaning changes across
different subsets of the data. We describe how these challenges
were addressed in a collaborative analysis project undertaken by
the University of Massachusetts Amherst and the National
Association of Securities Dealers (NASD). We describe several
methods for data pre-processing that we applied to transform a
large, dynamic, and relational dataset describing nearly the
entirety of the U.S. securities industry, and we show how these
methods made the dataset suitable for learning statistical relational
models. To better utilize social structure, we first applied known
consolidation and link formation techniques to associate
individuals with branch office locations. In addition, we
developed an innovative technique to infer professional
associations by exploiting dynamic employment histories.
Finally, we applied normalization techniques to create a suitable
class label that adjusts for spatial, temporal, and other
heterogeneity within the data. We show how these pre-processing
techniques combine to provide the necessary foundation for
learning high-performing statistical models of fraudulent activity.
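
As a rough illustration of what inferring professional associations from dynamic employment histories could look like, the sketch below links two individuals when their employment periods at the same firm overlap. It is not the method from the paper, which the abstract does not specify. The function name `infer_associations`, the record layout `(person, firm, start_year, end_year)`, the `min_overlap` threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def infer_associations(stints, min_overlap=1):
    """Link pairs of people by the time they spent at the same firm.

    stints: iterable of (person, firm, start_year, end_year), with the
    end year exclusive (hypothetical layout, not the NASD schema).
    Returns {(person_a, person_b): total overlapping years}.
    """
    # Group employment periods by firm so only co-workers are compared.
    by_firm = defaultdict(list)
    for person, firm, start, end in stints:
        by_firm[firm].append((person, start, end))

    overlap = defaultdict(int)
    for members in by_firm.values():
        for (p1, s1, e1), (p2, s2, e2) in combinations(members, 2):
            if p1 == p2:
                continue  # repeated stints by one person are not a link
            shared = min(e1, e2) - max(s1, s2)
            if shared > 0:
                # Sort the pair so (a, b) and (b, a) share one key.
                overlap[tuple(sorted((p1, p2)))] += shared

    # Keep only pairs whose total shared time meets the threshold.
    return {pair: yrs for pair, yrs in overlap.items() if yrs >= min_overlap}


# Made-up employment histories, for illustration only.
stints = [
    ("alice", "firm_x", 2000, 2004),
    ("bob", "firm_x", 2002, 2006),
    ("alice", "firm_y", 2004, 2007),
    ("bob", "firm_y", 2006, 2008),
    ("carol", "firm_y", 2000, 2003),
]
links = infer_associations(stints)
```

In this made-up data, alice and bob overlap at two different firms, so their link accumulates time from both. Carol worked at firm_y before either of them arrived and gets no link. Raising `min_overlap` keeps only the longer-standing associations.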