A central activity of network analysis is identifying patterns in records about interrelated people, places, things, and events. We will develop a class of methods that can learn more accurate predictive patterns from fewer data points than existing methods. Our new methods will be enabled by statistical techniques customized for use on network data.
Many network analysis tasks pose unique challenges for data mining tools. These tasks require the analysis of "relational data": sets of interconnected records of many different types. Such data include combinations of message traffic, financial transactions, records of meetings, and reports of organizational, familial, and social ties among individuals and organizations. Representing and using the linkage among records of people, places, things, and events is essential to analyzing such data. However, few data mining techniques can even represent such linkage, and nearly all techniques assume that the data records are non-relational. That is, they assume that data are independent (knowing the value of one record tells you nothing about another record) and identically distributed (all records are drawn from a homogeneous population).
We have discovered a set of basic statistical errors made when conventional data mining algorithms are applied to relational data. Specifically, linkage among records (e.g., many financial transactions that originate from the same bank account) and autocorrelation among linked records (e.g., all those transactions going to banks located in the same country) change the confidence with which statistical inferences can be made from the data. If algorithms do not account for record linkage and autocorrelation, they will make errors far more often and will require more data to learn subtle relationships.
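One common way to see why linkage and autocorrelation change statistical confidence is the notion of effective sample size. The sketch below is an illustration only, not the project's own formulation: it assumes equal-sized groups of linked records and uses the standard design-effect formula 1 + (m - 1) * rho, where m is the number of records per group and rho is the within-group correlation. The function and parameter names are hypothetical.

```python
def effective_sample_size(n_records, records_per_group, within_group_corr):
    """Approximate number of independent records that n_records are worth.

    Assumes equal-sized groups (e.g., transactions per bank account) and a
    single within-group correlation; both are simplifying assumptions.
    """
    design_effect = 1 + (records_per_group - 1) * within_group_corr
    return n_records / design_effect


# 1000 transactions, 50 per account, at three levels of autocorrelation.
print(effective_sample_size(1000, 50, 0.0))            # 1000.0 (independent records)
print(round(effective_sample_size(1000, 50, 0.5), 1))  # 39.2
print(effective_sample_size(1000, 50, 1.0))            # 20.0 (one per account)
```

Under these assumptions, a test that treats all 1000 records as independent would overstate its evidence considerably once the within-group correlation is non-trivial.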
We propose to develop techniques that use resampling and randomization testing to produce statistical estimates that are customized to the linkage, autocorrelation, and other features of specific relational data sets. We have used these approaches in earlier work on the statistical effects of searching massive spaces of possible models, and we believe that they are among the very few techniques that can adjust for the complex interrelationships present in many types of network data.
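A randomization test that respects linkage can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual method: each record belongs to one group (for example, a bank account), the attribute being tested is a 0/1 label defined per group, and the test statistic is the absolute difference in mean outcome between the two labels. Labels are shuffled across groups rather than across individual records, so the linkage structure is preserved in every permutation. All names are hypothetical.

```python
import random
from statistics import mean


def mean_difference(labels, outcomes):
    """Absolute difference in mean outcome between label-1 and label-0 records."""
    ones = [y for x, y in zip(labels, outcomes) if x == 1]
    zeros = [y for x, y in zip(labels, outcomes) if x == 0]
    return abs(mean(ones) - mean(zeros))


def group_permutation_p_value(groups, group_label, outcomes,
                              n_permutations=1000, seed=0):
    """Randomization p-value that shuffles labels across groups.

    groups[i]      -- group id of record i
    group_label[g] -- 0/1 label of group g (both labels must occur)
    outcomes[i]    -- numeric outcome of record i
    """
    rng = random.Random(seed)
    observed = mean_difference([group_label[g] for g in groups], outcomes)

    # Only groups that actually have records take part in the shuffle.
    ids = sorted(set(groups))
    labels = [group_label[g] for g in ids]

    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)  # reassign labels to whole groups
        shuffled = dict(zip(ids, labels))
        stat = mean_difference([shuffled[g] for g in groups], outcomes)
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Toy data: 10 groups of 5 records; label-1 groups have much higher outcomes.
groups = [g for g in range(10) for _ in range(5)]
group_label = {g: 1 if g < 5 else 0 for g in range(10)}
outcomes = [10.0 * group_label[g] + 0.1 * (i % 7) for i, g in enumerate(groups)]
print(group_permutation_p_value(groups, group_label, outcomes))
```

Shuffling labels across individual records instead would treat the records within a group as independent, which is exactly the assumption that linkage and autocorrelation violate.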