My work has produced several practical technologies, many of which are implemented in my laboratory's Proximity software. Some of the most important technologies I have developed with students and colleagues include:
Relational Dependency Networks: RDNs are undirected graphical models of relational data that offer simple parameter estimation, efficient structure learning, and collective inference. Jen Neville and I introduced RDNs in a recent workshop paper, and they are an area of particularly active research in KDL.
Relational Probability Trees: RPTs extend the feature space of traditional probability estimation trees to learn accurate conditional probability distributions in relational data. Jen Neville, Lisa Friedland, Michael Hay, and I explain their construction and compare their performance to several other models in a recent paper. Several of our examinations of statistical biases in relational learners have also been evaluated using RPTs.
Relational Bayesian Classifiers: RBCs extend traditional Bayesian classifiers to relational data. In a recent paper, Jen Neville, Brian Gallagher, and I examined several alternative methods for estimating conditional probability distributions in relational data and found that the simplest method (corresponding to the independence assumption made by propositional Bayesian classifiers) produces the most robust and accurate classifier.
QGraph query language: Traditional query languages such as SQL are not ideal for relational knowledge discovery. Many of the most desirable analytic operations and ad hoc explorations require entire subgraphs rather than individual entities or relations, and many analysts are ill-equipped to write complex SQL. To address this need, Hannah Blau, Neil Immerman, and I designed QGraph, a visual query language whose queries are expressed as annotated graphs and return collections of subgraphs.
Relational multiple-instance learning: Multiple-instance learning addresses situations where groups of instances are labeled, rather than individual instances. Amy McGovern and I have applied multiple-instance learning to relational data.
Progressive sampling: Foster Provost, Tim Oates, and I have devised simple methods for progressive sampling. These methods provide an efficient way to select the minimum sample size required to learn a model that (approximately) maximizes accuracy.
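The general idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published method: the geometric schedule, the stopping rule (stop once the holdout accuracy gain falls below a tolerance `tol`), the toy nearest-centroid learner, and every function name here are hypothetical choices made for the example.

```python
import random

def geometric_schedule(n0, factor, n_max):
    """Yield sample sizes n0, n0*factor, n0*factor^2, ..., ending at n_max.
    Assumes factor is an integer >= 2."""
    n = n0
    while n < n_max:
        yield n
        n = int(n * factor)
    yield n_max

def train_centroids(rows):
    """Toy learner (an assumption): the per-class mean of one numeric feature."""
    sums, counts = {}, {}
    for x, y in rows:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def accuracy(model, rows):
    """Fraction of rows whose nearest class centroid matches the label."""
    hits = sum(
        1 for x, y in rows
        if min(model, key=lambda c: abs(model[c] - x)) == y
    )
    return hits / len(rows)

def progressive_sample(train, holdout, n0=20, factor=2, tol=0.01):
    """Grow the training sample until the holdout accuracy gain drops below tol.
    `train` is assumed to be shuffled already. Returns [(sample_size, accuracy)]."""
    history = []
    prev_acc = None
    for n in geometric_schedule(n0, factor, len(train)):
        acc = accuracy(train_centroids(train[:n]), holdout)
        history.append((n, acc))
        if prev_acc is not None and acc - prev_acc < tol:
            break  # the learning curve has (approximately) flattened
        prev_acc = acc
    return history

def make_toy_data(n_per_class, seed=0):
    """Hypothetical data: two Gaussian classes on a single feature."""
    rng = random.Random(seed)
    rows = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    rows += [(rng.gauss(3.0, 1.0), 1) for _ in range(n_per_class)]
    rng.shuffle(rows)
    return rows
```

For example, `progressive_sample(data[:800], data[800:])` with `data = make_toy_data(500)` returns the sequence of sample sizes tried together with the holdout accuracy at each size; the last entry marks where the sketch judged the learning curve to have converged.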
Randomization tests: Randomization tests (also known as permutation tests) provide high-quality approximations of sampling distributions, allowing hypothesis testing in situations where the sampling distribution cannot be derived theoretically. My dissertation research (and a 1991 KDD paper) was the first to use randomization tests to adjust for the effects of multiple comparison procedures in induction algorithms. This work has informed the design and implementation of practical algorithms for learning and data mining, including TBA and a version of C4.5 with randomization pruning. Jen Neville, Matt Rattigan, and I have also applied randomization tests to relational data.
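One common way to use a randomization test to adjust for multiple comparisons is to permute the class labels and record the maximum score over all candidate features on each permutation. The Python sketch below illustrates that idea only; it is not the procedure used in TBA or in the modified C4.5, and the score function (absolute difference in class means), the binary 0/1 labels, and all names are assumptions chosen for the example.

```python
import random

def score(feature, labels):
    """Absolute difference in the feature's mean between class 1 and class 0."""
    pos = [x for x, y in zip(feature, labels) if y == 1]
    neg = [x for x, y in zip(feature, labels) if y == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

def max_score(features, labels):
    """Best score over all candidate features (the multiple comparison)."""
    return max(score(f, labels) for f in features)

def adjusted_p_value(features, labels, n_permutations=1000, seed=0):
    """Estimate how often shuffled labels yield a maximum score at least as
    large as the observed maximum. Shuffling keeps the class counts fixed,
    so both classes stay non-empty on every permutation."""
    rng = random.Random(seed)
    observed = max_score(features, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if max_score(features, shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)
```

Because each permutation takes the maximum over every feature, the reference distribution already reflects the fact that the best of many candidates was selected, so a feature that merely looks good by chance among many is less likely to be judged significant.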
Methods for detecting non-local effects in multi-agent systems: The emergent behavior of multi-agent systems depends largely on whether the work of one agent affects the work of another. Such "non-local effects" are the essence of emergent behavior. Michael Atighetchi, Regis Vincent, Victor Lesser, and I devised techniques to learn these non-local effects from very small amounts of data.
Other work: I have made minor contributions to several other technical developments, including methods for automatic construction of timelines, predicting stock prices from news stories, and clustering relational data.