Relational Data Pre-Processing Techniques for Improved Securities Fraud Detection

A. Fast, L. Friedland, M. Maier, B. Taylor, D. Jensen, H. Goldberg and J. Komoroske. 2007. Relational data pre-processing techniques for improved securities fraud detection. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 941-949. Also appeared as University of Massachusetts Technical Report 07-20.

Abstract

Commercial datasets are often large, relational and dynamic. They contain many records of people, places, things and events and their interactions over time. Such datasets are rarely structured appropriately for important knowledge discovery tasks, and they often contain variables whose meaning changes across different subsets of the data. We describe how these challenges were addressed in a collaborative analysis project undertaken by the University of Massachusetts Amherst and the National Association of Securities Dealers (NASD). We describe several methods for data pre-processing that we applied to transform a large, dynamic, and relational dataset describing nearly the entirety of the U.S. securities industry, and we show how these methods made the dataset suitable for learning statistical relational models. To better utilize social structure, we first applied known consolidation and link formation techniques to associate individuals with branch office locations. In addition, we developed an innovative technique to infer professional associations by exploiting dynamic employment histories. Finally, we applied normalization techniques to create a suitable class label that adjusts for spatial, temporal, and other heterogeneity within the data. We show how these pre-processing techniques combine to provide the necessary foundation for learning high-performing statistical models of fraudulent activity.
 
Text
A PDF version of this paper is available.

