Knowledge Discovery Laboratory
About PROXIMITY

PROXIMITY is a system for relational knowledge discovery designed and implemented by the Knowledge Discovery Laboratory in the Department of Computer Science at the University of Massachusetts Amherst.

What is relational knowledge discovery?

Knowledge discovery is the extraction of non-trivial, previously unknown, and useful information from data. Relational knowledge discovery is knowledge discovery in relational data. For the purposes of this discussion, a relational data structure is a labeled, directed graph in which objects from the domain of discourse are connected by links that represent relationships between pairs of objects. Both objects and links can have an arbitrary number of attributes.

Figure 1 shows a fragment of a relational data structure from the cinematic domain. The objects include movies (e.g., Network and The Thomas Crown Affair), people (Faye Dunaway), studios (MGM), and awards (Oscars). The links include (among others) the ActorIn relationship between a person and a movie; the StudioOf relationship between a studio and a movie; and the Nominated relationship between a movie or a person and an award.

Figure 1: Graphical data fragment from a movie database

In this diagram, the labels on the objects indicate the value of the attribute name, and the labels on links indicate the value of the attribute type. Not shown in the figure are other attributes of objects, such as the gender of an actor, the year a movie was released, or the location of a studio. Similarly, links could have additional attributes, such as the salary an actor received for starring in a given movie.
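As a rough illustration, the fragment described above can be written down as a small labeled, directed graph. This is a minimal Python sketch, not PROXIMITY's storage format: the `Obj` and `Link` classes and the `linked_names` helper are hypothetical, and because the figure is not reproduced here, the exact link endpoints are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    oid: int
    attrs: dict = field(default_factory=dict)


@dataclass
class Link:
    lid: int
    source: int  # oid of the object the link starts from
    target: int  # oid of the object the link points to
    attrs: dict = field(default_factory=dict)


# Objects carry a "name" attribute; links carry a "type" attribute.
objects = {
    1: Obj(1, {"name": "Faye Dunaway"}),
    2: Obj(2, {"name": "Network"}),
    3: Obj(3, {"name": "The Thomas Crown Affair"}),
    4: Obj(4, {"name": "MGM"}),
}
links = [
    Link(1, 1, 2, {"type": "ActorIn"}),
    Link(2, 1, 3, {"type": "ActorIn"}),
    Link(3, 4, 2, {"type": "StudioOf"}),  # assumed endpoint, for illustration
]


def linked_names(oid, link_type):
    """Names of objects reached from `oid` over outgoing links of one type."""
    return [
        objects[link.target].attrs["name"]
        for link in links
        if link.source == oid and link.attrs.get("type") == link_type
    ]


movies = linked_names(1, "ActorIn")
```

Here `movies` holds the names of the movies that object 1 is connected to through ActorIn links.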

PROXIMITY offers a more flexible data representation than many other approaches to knowledge discovery and machine learning. Other techniques assume their input data instances are structurally homogeneous, identically distributed, and statistically independent. These assumptions lead to a “flat-file” data format in which each instance is represented by a vector of features, and all the vectors have the same length. Relational knowledge discovery is designed for the many real-life domains that do not meet these criteria. Objects in a PROXIMITY database can be structurally heterogeneous. They can have a variable number of attributes, and a variable number of values for each attribute. The relationships encoded by links violate the assumption that each object is statistically independent of the others.

Relational knowledge discovery is particularly applicable when data about objects and events are used to make inferences about organizations and activities. Examples of such cases include:

  • Epidemiology (relating people, animals, homes, and workplaces);
  • Fraud detection (relating people, transactions, businesses, and accounts); and
  • Actions of complex computer systems (relating users, automated agents, machines, and sessions).

What is a PROXIMITY database?

A PROXIMITY database consists of objects, directed binary links, and attributes that record characteristics of the objects and links. An object or link can have zero or more attributes. An attribute is a <name, value> pair where the attribute value is a set. For a given attribute name, an object or link can have no values (the empty set), one value, or multiple values. For example, a movie might have several titles (working title, release title, translated title), one title only, or none at all. A value of the empty set indicates that either the attribute is not relevant for that object (or link) or the attribute is relevant but no information is recorded.
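The set-valued attribute model can be sketched in a few lines of Python. The `Item` class below is hypothetical and only mirrors the description above: each attribute name maps to a set of values, and a missing attribute reads as the empty set. The titles are placeholders.

```python
from collections import defaultdict


class Item:
    """An object or link whose attributes map a name to a set of values."""

    def __init__(self):
        self._attrs = defaultdict(set)

    def add(self, name, value):
        self._attrs[name].add(value)

    def values(self, name):
        # An attribute that was never set reads as the empty set.
        return set(self._attrs.get(name, ()))


movie = Item()
movie.add("title", "Placeholder Working Title")
movie.add("title", "Placeholder Release Title")

titles = movie.values("title")  # two values for one attribute name
year = movie.values("year")     # empty set: not relevant or not recorded
```

Note that the empty set does not distinguish "not relevant" from "relevant but unrecorded", which matches the description above.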

The PROXIMITY data schema differs from those commonly used in relational databases. A traditional database schema stores records for each type of object (movies, persons, studios, awards) in a separate table. Each table contains a fixed number of fields that store attributes (a movie might have fields for name, year of release, etc.). A traditional schema is specified once, before the database is constructed. It can be difficult to change the schema, change the type of an individual object, or insert records for objects whose type is uncertain. Thus, a traditional schema doesn't easily support the iterative structuring or “sense-making” activities that are central to knowledge discovery.

In contrast, the schema used by PROXIMITY escapes the rigid record typing of a traditional schema, and it enables analysts to introduce and transform data structures as analysis progresses. Specifically, the PROXIMITY schema supports:

  • Flexible typing — The type of an object or link is an attribute like any other. An object or link can have no type, one type, or many types, and types of objects and links can be easily changed. In a conventional relational database, changing an object's type would require moving records from one table to another, and altering the fields of an existing record to match the new table.

  • Attribute creation — New attributes can be created easily, by creating a new table containing the values and the identifiers of the objects or links to which those values belong. Objects of unknown type can be inserted into the database and attribute values can be added to them as data about the object accumulate. Adding an attribute in a traditional schema would require adding a column to each table to which the attribute applied, even when values are unknown for some records.

  • Set-valued attributes — The PROXIMITY schema makes it simple to implement attributes whose values are sets. Indeed, all attribute values in a PROXIMITY database are sets. Thus, a person can have no name, one name, or many names. Set-valued attributes can be used to represent real-world features such as aliasing in names (e.g., “William”, “Bill”, and “Shorty”) and multiple roles of a person in an organization (e.g., President and CEO).

  • Efficient scaling — While nearly all operations with the PROXIMITY schema require a join between at least two tables, the tables are extremely simple. As a result, an analyst can create hundreds or even thousands of attributes with little or no impact on query speed. In a traditional schema, the width of the table would increase dramatically, decreasing the number of records that can be paged into main memory at one time.
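The four properties above can be illustrated with a small SQLite sketch in which every attribute lives in its own narrow table. The table and column names (`objects`, `attr_<name>`, `id`, `value`) and the helper functions are assumptions made for illustration; they are not PROXIMITY's actual table layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO objects VALUES (?)", [(1,), (2,), (3,)])


def create_attribute(name, rows):
    """Attribute creation: one new (id, value) table per attribute.

    `name` is interpolated into SQL, so only trusted names are safe here.
    """
    cur.execute(f"CREATE TABLE attr_{name} (id INTEGER, value TEXT)")
    cur.executemany(f"INSERT INTO attr_{name} VALUES (?, ?)", rows)


def values_of(name, item_id):
    cur.execute(
        f"SELECT value FROM attr_{name} WHERE id = ? ORDER BY value",
        (item_id,),
    )
    return [row[0] for row in cur.fetchall()]


# Flexible typing: the type is just another attribute; object 3 has none yet.
create_attribute("objtype", [(1, "person"), (2, "movie")])
# Set-valued attributes: several rows with the same id.
create_attribute("name", [(1, "William"), (1, "Bill"), (1, "Shorty"), (2, "Network")])

# Giving object 3 a type later is a single insert, not a move between tables.
cur.execute("INSERT INTO attr_objtype VALUES (?, ?)", (3, "studio"))

# Most questions need a join between two narrow tables.
cur.execute(
    "SELECT n.value FROM attr_objtype AS t "
    "JOIN attr_name AS n ON n.id = t.id "
    "WHERE t.value = 'person' ORDER BY n.value"
)
person_names = [row[0] for row in cur.fetchall()]
```

Adding a hundred more attributes in this layout adds a hundred small tables but leaves the existing ones, and queries over them, unchanged.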

How does PROXIMITY support relational knowledge discovery?

PROXIMITY supports multiple phases of the knowledge discovery process: creating a database, exploring and visualizing the data, transforming the relational structure, calculating new attributes, and learning and applying a model.

Create a database

The import module converts data from conventional relational database tables into a PROXIMITY database according to the directives in an import specification file. The user can create a new PROXIMITY database or extend an existing one. The import specification file is written in XML. It states how to map the source data schema to PROXIMITY objects, links, and attributes. For a given set of source tables, there are many possible mappings to a PROXIMITY database, each corresponding to a different import specification file.
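To make the idea of a mapping concrete, here is one possible mapping written directly in Python rather than in the XML import specification format, which is not shown here. The source tables, column names, and the `new_object` helper are all hypothetical; a different mapping (for example, turning each cast row into an object instead of a link) would be equally valid.

```python
# Hypothetical source tables, as lists of row dictionaries.
movie_rows = [{"movie_id": 10, "title": "Network"}]
person_rows = [{"person_id": 20, "name": "Faye Dunaway"}]
cast_rows = [{"person_id": 20, "movie_id": 10}]

objects = {}     # object id -> {attribute name -> set of values}
links = []       # (source object id, target object id, attributes)
key_to_oid = {}  # (source table, source key) -> object id


def new_object(table, key, attrs):
    """Create an object and remember which source row it came from."""
    oid = len(objects) + 1
    objects[oid] = {name: {value} for name, value in attrs.items()}
    key_to_oid[(table, key)] = oid
    return oid


# Mapping chosen here: entity rows become objects, join rows become links.
for row in movie_rows:
    new_object("movies", row["movie_id"], {"objtype": "movie", "name": row["title"]})
for row in person_rows:
    new_object("persons", row["person_id"], {"objtype": "person", "name": row["name"]})
for row in cast_rows:
    source = key_to_oid[("persons", row["person_id"])]
    target = key_to_oid[("movies", row["movie_id"])]
    links.append((source, target, {"type": {"ActorIn"}}))
```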

Explore and visualize data

The visual query language QGraph enables the data analyst to search a PROXIMITY database for recurrent patterns and significant combinations of attribute values. QGraph serves the same function in PROXIMITY that SQL serves in a traditional relational database. With QGraph, the analyst describes the pattern of interest and the query processor finds all matching instances in the database. A QGraph query is a labeled graph in which the vertices correspond to objects and the edges to links. The query specifies the desired configuration of objects and links, boolean conditions on their attribute values, and global constraints relating one object or link to another. To match the query, a database subgraph must have the correct structure and satisfy all the boolean conditions and constraints. The collection of matching subgraphs becomes the input for later stages of the knowledge discovery process: visualization, structural transformation, attribute calculation, sampling, and classification. After running a query, the analyst can examine any subgraph in the resulting collection with the graphical user interface.
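The matching step can be sketched as a brute-force search over a tiny in-memory graph. This is not QGraph's syntax or the query processor's algorithm; the data layout, the `has` condition helper, and the `match` function are hypothetical and only show what "correct structure plus satisfied conditions" means.

```python
from itertools import permutations

# Objects and links with set-valued attributes (illustrative data).
objects = {
    1: {"objtype": {"person"}, "name": {"Faye Dunaway"}},
    2: {"objtype": {"movie"}, "name": {"Network"}},
    3: {"objtype": {"movie"}, "name": {"The Thomas Crown Affair"}},
    4: {"objtype": {"studio"}, "name": {"MGM"}},
}
links = [
    (1, 2, {"type": {"ActorIn"}}),
    (1, 3, {"type": {"ActorIn"}}),
    (4, 2, {"type": {"StudioOf"}}),
]


def has(attr, value):
    """Boolean condition: the attribute's value set contains `value`."""
    return lambda attrs: value in attrs.get(attr, set())


# Query: a person vertex connected to a movie vertex by an ActorIn edge.
query_vertices = {"actor": has("objtype", "person"), "movie": has("objtype", "movie")}
query_edges = [("actor", "movie", has("type", "ActorIn"))]


def match(query_vertices, query_edges):
    """Return every binding of query vertices to distinct database objects."""
    names = list(query_vertices)
    results = []
    for combo in permutations(objects, len(names)):
        binding = dict(zip(names, combo))
        if not all(query_vertices[n](objects[o]) for n, o in binding.items()):
            continue
        if all(
            any(s == binding[a] and t == binding[b] and cond(attrs)
                for s, t, attrs in links)
            for a, b, cond in query_edges
        ):
            results.append(binding)
    return results


matches = match(query_vertices, query_edges)
```

Each entry in `matches` identifies one matching subgraph; a real query processor would avoid enumerating every combination of objects.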

Learn and apply a model

One type of model supported in PROXIMITY is a simple Bayesian classification algorithm that we have adapted to the context of relational data [Neville, Jensen, and Gallagher, 2003]. The relational Bayesian classifier (RBC) is trained on one collection of subgraphs and then applied to another. Each subgraph in the collection contains one target object to be classified; the other objects and links in the subgraph contribute to the target's attributes. For example, suppose we want to infer movie genre and each input subgraph contains a movie with all its actors. We could add to the movie a binary attribute for each possible genre, indicating whether any member of the cast has acted in another movie of that genre. Attributes of surrounding objects and links in the subgraph as well as those of the target object are available to the classifier. By preserving the relational structure of the data, we can exploit the connections between objects to improve classification accuracy.
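The genre example can be sketched as a feature-construction step. The code below computes only the binary genre attributes described above, not the RBC itself; the movie identifiers, cast members, and genres are made up for illustration.

```python
# Hypothetical data: cast per movie, and known genres for some movies.
cast = {"M1": {"A", "B"}, "M2": {"A"}, "M3": {"C"}}
genre = {"M2": "drama", "M3": "comedy"}  # M1 is the target; its genre is unknown
GENRES = ["comedy", "drama"]


def genre_indicators(target):
    """One binary attribute per genre for the target movie.

    The attribute is True when some cast member of the target has acted
    in another movie known to belong to that genre.
    """
    features = {}
    for g in GENRES:
        features[g] = any(
            genre.get(other) == g and bool(cast[other] & cast[target])
            for other in cast
            if other != target
        )
    return features


features = genre_indicators("M1")
```

Actor A links M1 to the drama M2, so the drama attribute is set, while no cast member of M1 appears in the comedy M3.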

A relational probability tree (RPT) is another type of model supported by PROXIMITY [Neville, Jensen, Friedland, and Hay, 2003]. RPTs are a form of classification tree for relational data. To build a probabilistic model, they selectively consider attributes of related objects and links, as well as complex relational aggregates of these attributes. These relational aggregations include mode/average, count, and proportion. Each of these aggregations can operate on related objects and each dynamically determines the best threshold. In addition to the aggregation functions, RPTs also consider structural features of the data by including the degree of an object, that is, the number of links into or out of an object. Advantages of this model include a correction for the autocorrelation and degree disparity properties found in many relational data sets [Jensen and Neville, 2002; Jensen, Neville, and Hay, 2003]. Traditional classification-tree algorithms cannot correct for these biases, so they build larger trees with lower accuracy. Other advantages include ease of model understanding (the model is a series of hierarchical rules) and the ability to dynamically select predictive features and thresholds.
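The aggregation functions named above can be sketched as plain Python helpers. This is only an illustration of the kinds of features an RPT might test at a node: the function names, the sample values, and the fixed thresholds are assumptions, whereas an RPT chooses thresholds during learning.

```python
from collections import Counter
from statistics import mean


def count(values, predicate):
    return sum(1 for v in values if predicate(v))


def proportion(values, predicate):
    return count(values, predicate) / len(values) if values else 0.0


def mode(values):
    return Counter(values).most_common(1)[0][0]


def degree(oid, links):
    """Number of links into or out of the object."""
    return sum(1 for source, target in links if oid in (source, target))


# Hypothetical attribute values of the actors related to one movie.
actor_ages = [30, 45, 60]
actor_genders = ["f", "m", "f"]
links = [(1, 2), (1, 3), (4, 2)]

# Candidate features of the kind a tree node might split on.
features = {
    "count(age > 40) >= 2": count(actor_ages, lambda a: a > 40) >= 2,
    "proportion(age > 40) >= 0.5": proportion(actor_ages, lambda a: a > 40) >= 0.5,
    "average(age) > 50": mean(actor_ages) > 50,
    "mode(gender) == 'f'": mode(actor_genders) == "f",
    "degree >= 2": degree(1, links) >= 2,
}
```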

A relational dependency network (RDN) [Neville and Jensen, 2003; Neville and Jensen, 2004] is a graphical model that extends the concept of a dependency network [Heckerman, et al., 2000] to relational domains. RDNs approximate a joint probability distribution over the attributes of objects in a network with a set of conditional probability distributions. The RDN learning algorithm is based on pseudolikelihood techniques, which estimate a set of conditional probability distributions independently. This approach avoids the complexities of estimating a full joint distribution and can incorporate existing techniques for learning conditional probability distributions of relational data (e.g., RBCs or RPTs). RDNs support collective classification [Taskar, Abbeel, and Koller, 2002; Neville and Jensen, 2004] (using inferences about one entity in a relational data set to influence inferences about related entities), allowing the model to exploit relational autocorrelation dependencies to improve classification accuracy. Gibbs sampling inference techniques are used to recover a full joint distribution and to estimate probabilities of interest.
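The inference step can be illustrated with a toy Gibbs sampler over binary labels on a four-node graph. This is a sketch under strong assumptions, not PROXIMITY's implementation: the graph, the observed labels, and the hand-written `conditional` function (which stands in for a conditional distribution that an RBC or RPT would learn) are all hypothetical.

```python
import random

# Hypothetical graph: node -> neighboring nodes. Nodes 0 and 1 have known labels.
neighbors = {0: [2], 1: [2], 2: [0, 1, 3], 3: [2]}
observed = {0: 1, 1: 1}


def conditional(node, state):
    """Stand-in for a learned P(label = 1 | labels of related objects).

    The probability rises with the fraction of neighbors labeled 1,
    which mimics relational autocorrelation.
    """
    linked = neighbors[node]
    fraction = sum(state[n] for n in linked) / len(linked)
    return 0.1 + 0.8 * fraction


def gibbs(num_sweeps=2000, burn_in=200, seed=0):
    """Estimate marginal P(label = 1) for each unobserved node."""
    rng = random.Random(seed)
    hidden = [n for n in neighbors if n not in observed]
    state = dict(observed)
    for n in hidden:
        state[n] = rng.randint(0, 1)  # arbitrary starting labels
    totals = {n: 0 for n in hidden}
    for sweep in range(num_sweeps):
        for n in hidden:
            # Resample each label given the current labels of its neighbors.
            state[n] = 1 if rng.random() < conditional(n, state) else 0
        if sweep >= burn_in:
            for n in hidden:
                totals[n] += state[n]
    kept = num_sweeps - burn_in
    return {n: totals[n] / kept for n in hidden}


marginals = gibbs()
```

Because node 2 is linked to two objects already labeled 1, its estimated marginal leans toward 1, and that in turn influences the estimate for node 3. This is the collective effect described above.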

We will be adding new models to PROXIMITY as they become ready. We will announce the addition of new models on our “What's New” page. You can also sign up for our PROXIMITY announcements mailing list.
