Chapter 1. Introduction

Table of Contents

Conventional Knowledge Discovery
Relational Knowledge Discovery
Proximity Advantages

Proximity is an environment for knowledge discovery in relational data. It helps human analysts discover new knowledge by analyzing complex data sets about people, places, things, and events. New developments in this area are vital because of the growing interest in analyzing the Web, social networks, telecommunications and computer networks, relational and object-oriented databases, multi-agent systems, and other sources of structured and semi-structured data.

Proximity provides novel algorithms that help manage, explore, sample, model, and visualize data. It implements methods for learning statistical models that describe the probabilistic dependencies in relational data and that can estimate probability distributions over unseen data. Proximity is an open-source application written in Java, and it makes substantial use of MonetDB [Boncz and Kersten, 1995], [Boncz, 2002], an open-source, vertical (column-oriented) database system designed for high performance on semi-structured data.

Conventional Knowledge Discovery

First-generation tools for knowledge discovery are already widely deployed in business, science, and government. These tools help epidemiologists identify emerging diseases, help engineers improve industrial processes, and help credit-card companies spot fraud.

Unfortunately, much of the technical work in knowledge discovery, and its underlying statistical theory, assumes that data records are structurally homogeneous and statistically independent. For example, to analyze a set of patient records to determine useful diagnostic rules for a new disease, traditional techniques would assume that the records provide the same type of information about each patient and that knowing something about one patient tells you nothing about another. Good work in epidemiology, however, regularly considers records of many types (e.g., patients, workplaces, industrial chemicals) as well as relationships among those records (genetic and social relationships among patients, occupational exposure of patients to chemicals, etc.).
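
The contrast between the two views can be sketched in a few lines of Python. This is only an illustration, not part of Proximity: the patient records, the workplace names, and the helper functions flat_view and illness_rate_by_workplace are all hypothetical and invented for this example. The single-table view treats each patient as an independent row, while the relational view keeps the links between patients and workplaces and so exposes a dependency among patients who share a workplace.

```python
from collections import defaultdict

# Hypothetical toy data, invented for illustration only.
patients = {
    "p1": {"age": 34, "ill": True},
    "p2": {"age": 51, "ill": True},
    "p3": {"age": 29, "ill": False},
    "p4": {"age": 47, "ill": False},
}

# Relationships between records of different types (patient -> workplace).
works_at = [
    ("p1", "plant_a"),
    ("p2", "plant_a"),
    ("p3", "office_b"),
    ("p4", "office_b"),
]


def flat_view(patients):
    """Single-table view: one row per patient, with no links to other records."""
    return [(pid, rec["age"], rec["ill"]) for pid, rec in sorted(patients.items())]


def illness_rate_by_workplace(patients, works_at):
    """Relational view: fraction of ill patients at each workplace."""
    counts = defaultdict(lambda: [0, 0])  # workplace -> [ill count, total count]
    for pid, workplace in works_at:
        counts[workplace][0] += int(patients[pid]["ill"])
        counts[workplace][1] += 1
    return {w: ill / total for w, (ill, total) in counts.items()}


rows = flat_view(patients)
rates = illness_rate_by_workplace(patients, works_at)
```

In the flat rows, nothing connects p1 to p2. Only when the workplace links are followed does it become apparent that, in this made-up data, illness is concentrated at a single workplace, so knowing something about one patient does tell you something about another.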

Ignoring this relational information vastly oversimplifies many problems and can make their deep structure all but undiscoverable. Indeed, the importance of such relational information is precisely what led computer scientists to create relational databases and knowledge representations based on first-order logic. To date, however, most technologies for knowledge discovery have lagged behind these decades-old innovations: they address only the data contained in a single database table and express concepts only in representations roughly equivalent to propositional logic.

Addressing fully relational tasks has raised a remarkable array of new problems in statistical inference, required the development of new technologies for knowledge discovery, and posed new questions about the assessment and management of these technologies. The need to investigate these interconnected questions has driven the work of the Knowledge Discovery Laboratory (KDL) at the University of Massachusetts Amherst, and the desire to disseminate our findings led us to create Proximity.