PROXIMITY is a system for relational
knowledge discovery designed and implemented by the Knowledge Discovery
Laboratory in the Department of Computer
Science at the University of
Massachusetts Amherst.
What is relational knowledge discovery?
Knowledge discovery is the extraction of non-trivial,
previously unknown, and useful information from
data. Relational knowledge discovery is knowledge discovery
in relational data. For the purposes of this discussion, a
relational data structure is a labeled, directed graph in
which objects from the domain of discourse are
connected by links that represent relationships
between pairs of objects. Both objects and links can have an
arbitrary number of attributes.
Figure 1 shows a fragment of a relational data structure
from the cinematic domain. The objects include movies (e.g.,
Network and The Thomas Crown Affair), people
(Faye Dunaway), studios (MGM), and awards (Oscars). The
links include (among others) the ActorIn
relationship between a person and a movie; the
StudioOf relationship between a studio and a movie;
and the Nominated relationship between a movie or a
person and an award.
Figure 1: Graphical data fragment from a movie database
In this diagram, the labels on the objects indicate the value
of the attribute name, and the labels on links indicate
the value of the attribute type. Not shown in the figure
are other attributes of objects, such as the gender of an actor,
the year a movie was released, or the location of a
studio. Similarly, links could have additional attributes, such
as the salary an actor received for starring in a given
movie.
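The structure described above can be sketched in a few lines of Python. This is only an illustration of a labeled, directed graph with attributes on both objects and links; the class names, identifiers, and the `linked_from` helper are hypothetical and are not part of PROXIMITY, and the link endpoints are chosen for illustration rather than copied from Figure 1.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    """An object (graph vertex) with an arbitrary number of attributes."""
    oid: int
    attrs: dict = field(default_factory=dict)


@dataclass
class Link:
    """A directed link from object `src` to object `dst`, with its own attributes."""
    lid: int
    src: int
    dst: int
    attrs: dict = field(default_factory=dict)


# The 'name' attribute is the label drawn on each object in the figure.
objects = {
    1: Obj(1, {"name": "Network"}),
    2: Obj(2, {"name": "The Thomas Crown Affair"}),
    3: Obj(3, {"name": "Faye Dunaway"}),
    4: Obj(4, {"name": "MGM"}),
}

# The 'type' attribute is the label drawn on each link in the figure.
# Endpoints are illustrative assumptions, not a transcription of Figure 1.
links = [
    Link(10, src=3, dst=1, attrs={"type": "ActorIn"}),
    Link(11, src=3, dst=2, attrs={"type": "ActorIn"}),
    Link(12, src=4, dst=1, attrs={"type": "StudioOf"}),
]


def linked_from(oid, link_type):
    """Names of the objects reached from `oid` by links of the given type."""
    return sorted(
        objects[link.dst].attrs["name"]
        for link in links
        if link.src == oid and link.attrs.get("type") == link_type
    )


dunaway_movies = linked_from(3, "ActorIn")
```

Further attributes, such as a movie's release year or the salary on an ActorIn link, would simply be additional entries in the corresponding `attrs` dictionary.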
PROXIMITY offers a more flexible data
representation than many other approaches to knowledge discovery
and machine learning. Many of those techniques assume that their input data
instances are structurally homogeneous, identically distributed,
and statistically independent. These assumptions lead to a
“flat-file” data format in which each instance is
represented by a vector of features, and all the vectors have the
same length. Relational knowledge discovery is designed for the
many real-life domains that do not meet these criteria. Objects
in a PROXIMITY database can be structurally
heterogeneous. They can have a variable number of attributes, and
a variable number of values for each attribute. The relationships
encoded by links violate the assumption that each object is
statistically independent of the others.
Relational knowledge discovery is particularly applicable when
data about objects and events are used to make inferences about
organizations and activities. Examples of such cases include:
- Epidemiology (relating people, animals, homes, and workplaces);
- Fraud detection (relating people, transactions,
businesses, and accounts);
- Actions of complex computer systems (relating users,
automated agents, machines, and sessions).
What is a PROXIMITY database?
A PROXIMITY database consists of objects,
directed binary links, and attributes that record
characteristics of the objects and links. An object or link
can have zero or more attributes. An attribute is a <name,
value> pair where the attribute value is a set. For a given
attribute name, an object or link can have no values (the
empty set), one value, or multiple values. For example, a
movie might have several titles (working title, release
title, translated title), one title only, or none at all. A
value of the empty set indicates that either the attribute is
not relevant for that object (or link) or the attribute is
relevant but no information is recorded.
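As a minimal sketch of set-valued attributes, the following Python class stores every attribute value as a set, so that "no value", "one value", and "many values" are all handled the same way. The class `AttributeStore`, its methods, and the placeholder titles are assumptions made for illustration; they do not describe PROXIMITY's actual interface.

```python
from collections import defaultdict


class AttributeStore:
    """Maps (attribute name, item id) to a set of values."""

    def __init__(self):
        # attribute name -> item id -> set of values
        self._values = defaultdict(lambda: defaultdict(set))

    def add(self, item_id, name, value):
        """Add one value to the named attribute of an object or link."""
        self._values[name][item_id].add(value)

    def get(self, item_id, name):
        """Return the value set; the empty set means 'not relevant or not recorded'."""
        by_item = self._values.get(name)
        if by_item is None or item_id not in by_item:
            return frozenset()
        return frozenset(by_item[item_id])


store = AttributeStore()
# A movie with several titles (placeholder strings, not real data).
for title in ("Working Title", "Release Title", "Translated Title"):
    store.add("movie-1", "title", title)
# A movie with exactly one title.
store.add("movie-2", "title", "Only Title")
# "movie-3" has no title recorded, so its value is the empty set.
```

Note that the empty set returned for "movie-3" cannot distinguish an irrelevant attribute from a missing value, which matches the two readings given above.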
The PROXIMITY data schema differs from
those commonly used in relational databases. A traditional
database schema stores records for each type of object
(movies, persons, studios, awards) in a separate table. Each
table contains a fixed number of fields that store attributes
(a movie might have fields for name, year of release,
etc.). A traditional schema is specified once, before the
database is constructed. It can be difficult to change the
schema, change the type of an individual object, or insert
records for objects whose type is uncertain. Thus, a
traditional schema doesn't easily support the iterative
structuring or “sense-making” activities that are
central to knowledge discovery.
In contrast, the schema used by PROXIMITY
escapes the rigid record typing of a traditional schema, and
it enables analysts to introduce and transform data
structures as analysis progresses. Specifically, the
PROXIMITY schema supports:
- Flexible typing: The type of an
object or link is an attribute like any other. An object or
link can have no type, one type, or many types, and types of
objects and links can be easily changed. In a conventional
relational database, changing an object's type would require
moving records from one table to another, and altering the
fields of an existing record to match the new table.
- Attribute creation: New attributes can be created easily, by creating a new table containing the values and the identifiers of the objects or links to which those values belong. Objects of unknown type can be inserted into the database and attribute values can be added to them as data about the object accumulate. Adding an attribute in a traditional schema would require adding a column to each table to which the attribute applied, even when values are unknown for some records.
- Set-valued attributes: The
PROXIMITY schema makes it simple to implement
attributes whose values are sets. Indeed, all attribute
values in a PROXIMITY database are
sets. Thus, a person can have no name, one name, or many
names. Set-valued attributes can be used to represent
real-world features such as aliasing in names (e.g.,
“William”, “Bill”, and
“Shorty”) and multiple roles of a person in an
organization (e.g., President and CEO).
- Efficient scaling: While nearly all
operations with the PROXIMITY schema require
a join between at least two tables, the tables are extremely
simple. As a result, an analyst can create hundreds or even
thousands of attributes with little or no impact on query
speed. In a traditional schema, the width of the table would
increase dramatically, decreasing the number of records that
can be paged into main memory at one time.
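The schema properties listed above can be illustrated with a small SQLite sketch: one narrow table per attribute, holding item identifiers and values. The table and column names (`objects`, `links`, `obj_type`, `obj_name`, `item_id`, `value`) and the helper functions are hypothetical; the sketch shows the general idea only and is not PROXIMITY's actual table layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE objects (id INTEGER PRIMARY KEY);
    CREATE TABLE links (id INTEGER PRIMARY KEY, from_id INTEGER, to_id INTEGER);
    """
)


def create_attribute(name):
    """Creating an attribute means creating one narrow (item_id, value) table."""
    # `name` is interpolated into SQL, so it must be a trusted identifier.
    conn.execute(f"CREATE TABLE {name} (item_id INTEGER, value TEXT)")


def add_values(name, item_id, values):
    """Several rows for the same item give that item a set-valued attribute."""
    conn.executemany(
        f"INSERT INTO {name} VALUES (?, ?)", [(item_id, v) for v in values]
    )


# Two objects; object 2 is inserted with no type and no name.
conn.executemany("INSERT INTO objects VALUES (?)", [(1,), (2,)])

create_attribute("obj_type")  # the type is an attribute like any other
create_attribute("obj_name")
add_values("obj_type", 1, ["person"])
add_values("obj_name", 1, ["William", "Bill", "Shorty"])

# Most operations join at least two of these simple tables.
person_names = [
    row[0]
    for row in conn.execute(
        """
        SELECT n.value
        FROM obj_type AS t JOIN obj_name AS n ON n.item_id = t.item_id
        WHERE t.value = 'person'
        ORDER BY n.value
        """
    )
]

# Objects whose type is still unknown remain valid rows in `objects`.
untyped = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM objects "
        "WHERE id NOT IN (SELECT item_id FROM obj_type) ORDER BY id"
    )
]
```

In this layout, changing an object's type is a row-level update in `obj_type`, and adding a new attribute never widens an existing table.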
How does PROXIMITY support relational
knowledge discovery?
PROXIMITY supports multiple phases of the
knowledge discovery process: creating a database, exploring
and visualizing the data, transforming the relational
structure, calculating new attributes,
and learning and applying a model.
- Create a database: The
import module converts data from conventional relational
database tables into a PROXIMITY database
according to the directives in an import specification
file. The user can create a new PROXIMITY
database or extend an existing one. The import specification
file is written in XML. It states how to map the source data
schema to PROXIMITY objects, links, and
attributes. For a given set of source tables, there are many
possible mappings to a PROXIMITY database,
each corresponding to a different import specification
file.
- Explore and visualize data:
The visual query language QGraph
enables the data analyst to search a
PROXIMITY database for recurrent patterns and
significant combinations of attribute values. QGraph serves
the same function in PROXIMITY that SQL
serves in a traditional relational database. With QGraph,
the analyst describes what they are looking for and the query
processor finds all matching instances in the database. A
QGraph query is a labeled graph in which the vertices
correspond to objects and the edges to links. The query
specifies the desired configuration of objects and links,
boolean conditions on their attribute values, and global
constraints relating one object or link to another. To match
the query, a database subgraph must have the correct
structure and satisfy all the boolean conditions and
constraints. The collection of matching subgraphs becomes
the input for later stages of the knowledge discovery
process: visualization, structural transformation, attribute
calculation, sampling, and classification. After running a
query, the analyst can examine any subgraph in the resulting
collection with the graphical user interface.
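To make the matching idea concrete, here is a plain-Python sketch of the simplest possible pattern: two vertices joined by one edge, with a Boolean condition on each element and an optional constraint across them. It is not QGraph syntax and not PROXIMITY's query processor; the function `match_edge_pattern`, the data layout, and the attribute names are assumptions made for illustration.

```python
def match_edge_pattern(objects, links, src_cond, link_cond, dst_cond,
                       constraint=None):
    """Return (src_id, link_id, dst_id) for every link matching the pattern.

    `objects` maps object id -> attribute dict (values are sets).
    `links` maps link id -> (src_id, dst_id, attribute dict).
    The three *_cond functions test one element's attributes; `constraint`
    optionally relates the attributes of the elements to one another.
    """
    matches = []
    for link_id, (src, dst, link_attrs) in sorted(links.items()):
        src_attrs, dst_attrs = objects[src], objects[dst]
        if not (src_cond(src_attrs) and link_cond(link_attrs)
                and dst_cond(dst_attrs)):
            continue
        if constraint and not constraint(src_attrs, link_attrs, dst_attrs):
            continue
        matches.append((src, link_id, dst))
    return matches


objects = {
    1: {"type": {"movie"}, "name": {"Network"}, "year": {1976}},
    2: {"type": {"movie"}, "name": {"The Thomas Crown Affair"}, "year": {1968}},
    3: {"type": {"person"}, "name": {"Faye Dunaway"}},
}
links = {
    10: (3, 1, {"type": {"ActorIn"}}),
    11: (3, 2, {"type": {"ActorIn"}}),
}

# Pattern: a person linked by ActorIn to a movie released before 1970.
early_roles = match_edge_pattern(
    objects,
    links,
    src_cond=lambda a: "person" in a.get("type", set()),
    link_cond=lambda a: "ActorIn" in a.get("type", set()),
    dst_cond=lambda a: "movie" in a.get("type", set())
    and any(y < 1970 for y in a.get("year", set())),
)
```

Each returned triple identifies one matching subgraph; the collection of such triples plays the role of the query result that later stages consume.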
Learn and apply a model
One type of model supported in
PROXIMITY is a simple Bayesian classification
algorithm that we have adapted to the context of relational
data [Neville,
Jensen, and Gallagher, 2003]. The relational Bayesian
classifier (RBC) is trained on one collection of subgraphs
and then applied to another. Each subgraph in the collection
contains one target object to be classified; the other
objects and links in the subgraph contribute to the target's
attributes. For example, suppose we want to infer movie
genre and each input subgraph contains a movie with all its
actors. We could add to the movie a binary attribute for
each possible genre, indicating whether any member of the
cast has acted in another movie of that genre. Attributes of
surrounding objects and links in the subgraph as well as
those of the target object are available to the
classifier. By preserving the relational structure of the
data, we can exploit the connections between objects to
improve classification accuracy.
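The derived attribute in the movie-genre example can be sketched as follows. The code computes, for a target movie, one binary flag per genre that records whether any cast member has acted in another movie of that genre. The function name, the data layout, and the toy identifiers (`m1`, `a1`, and so on) are hypothetical, and the sketch covers only this feature construction step, not the classifier itself.

```python
def cast_genre_flags(target, cast_of, genres_of, all_genres):
    """For each genre, report whether any cast member of `target`
    has acted in a different movie of that genre.

    `cast_of` maps movie id -> set of actor ids.
    `genres_of` maps movie id -> set of known genres.
    """
    flags = {genre: False for genre in all_genres}
    target_cast = cast_of[target]
    for movie, cast in cast_of.items():
        if movie == target:
            continue  # only *other* movies count
        if cast & target_cast:  # at least one shared actor
            for genre in genres_of.get(movie, set()):
                if genre in flags:
                    flags[genre] = True
    return flags


# Toy data: m1 is the movie whose genre we want to infer.
cast_of = {
    "m1": {"a1", "a2"},
    "m2": {"a1", "a3"},   # shares a1 with m1
    "m3": {"a4"},         # shares nobody with m1
}
genres_of = {
    "m2": {"drama"},
    "m3": {"comedy"},
}

flags = cast_genre_flags("m1", cast_of, genres_of, ["comedy", "drama", "western"])
```

The resulting flags would be attached to the target movie as ordinary attributes, alongside the attributes of the surrounding objects and links.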
A relational probability tree (RPT) is another type of model
supported by PROXIMITY [Neville,
Jensen, Friedland, and Hay, 2003]. RPTs are a form of
classification trees for relational data that selectively
consider attributes of related objects and links as well as
complex relational aggregates of these attributes to build a
probabilistic model. These relational aggregations include
mode/average, count, and proportion. Each of these aggregations
can operate on related objects and each dynamically determines
the best threshold. In addition to the aggregation functions,
RPTs also consider structural features of the data by including
the degree of an object, or number of links into or out of an
object. Advantages of this model include a correction for the
autocorrelation and degree disparity properties found in many
relational data sets [Jensen
& Neville 2002; Jensen,
Neville, & Hay 2003]. Traditional classification tree
building algorithms cannot correct for these biases and tend to
build larger trees with lower accuracy. Other advantages
include ease of model understanding (the model is a series of
hierarchical rules) and the ability to dynamically select
predictive features and thresholds.
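The aggregations named above can be illustrated with a few small Python functions. Each one reduces the attribute values of a target object's related objects to a single feature. The function names and toy data are assumptions made for illustration, and the sketch leaves out how an RPT chooses thresholds and features during tree construction.

```python
from collections import Counter
from statistics import mean


def count_feature(values, wanted):
    """Number of related items whose attribute equals `wanted`."""
    return sum(1 for v in values if v == wanted)


def proportion_feature(values, wanted):
    """Fraction of related items whose attribute equals `wanted`."""
    return count_feature(values, wanted) / len(values) if values else 0.0


def mode_feature(values):
    """Most common value of a discrete attribute among related items."""
    return Counter(values).most_common(1)[0][0] if values else None


def average_feature(values):
    """Mean of a numeric attribute among related items."""
    return mean(values) if values else None


def degree(links, oid):
    """Number of links into or out of object `oid` (a self-loop counts once)."""
    return sum(1 for src, dst in links if src == oid or dst == oid)


# Toy data: attributes of the actors linked to one target movie.
actor_genders = ["F", "M", "M", "M"]
actor_ages = [30, 40, 50, 60]
links = [("a1", "m1"), ("a2", "m1"), ("a3", "m1"), ("a4", "m1"), ("s1", "m1")]

# A thresholded aggregate becomes a binary test of the kind used at a tree node.
# The threshold 0.5 is fixed by hand here purely for illustration.
mostly_male_cast = proportion_feature(actor_genders, "M") > 0.5
```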
A relational dependency network (RDN) [Neville
and Jensen, 2003; Neville
and Jensen, 2004] is a graphical model that extends the
concept of a dependency network [Heckerman
et al., 2000] to relational domains. RDNs approximate a
joint probability distribution over the attributes of objects
in a network with a set of conditional probability
distributions.
The RDN learning algorithm is based on
pseudolikelihood techniques, which estimate a set of conditional
probability distributions independently.
This approach avoids
the complexities of estimating a full joint distribution and
can incorporate existing techniques for learning conditional
probability distributions of relational data (e.g.,
RBCs or RPTs).
RDNs support collective classification
[Taskar,
Abbeel, and Koller, 2002;
Neville
and Jensen, 2004]
(using inferences about one
entity in a relational data set to influence inferences
about related entities), allowing the model to exploit
relational autocorrelation dependencies to improve
classification accuracy.
Gibbs sampling inference techniques are used to recover
a full joint distribution and to estimate probabilities of
interest.
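The role of Gibbs sampling in collective classification can be sketched on a tiny graph with binary labels. Each unknown label is repeatedly resampled from a conditional distribution that depends on the current labels of its neighbors, and the samples are averaged to estimate marginal probabilities. The conditional used here, a smoothed fraction of neighbors labeled 1, is a hand-written stand-in for a learned conditional model such as an RBC or RPT; all names and data are hypothetical and this is not PROXIMITY's implementation.

```python
import random


def conditional_p1(node, labels, neighbors):
    """Stand-in conditional: P(label = 1 | neighbor labels), Laplace-smoothed."""
    nbrs = neighbors[node]
    ones = sum(labels[n] for n in nbrs)
    return (ones + 1) / (len(nbrs) + 2)


def gibbs_marginals(neighbors, observed, n_iter=2000, burn_in=200, seed=0):
    """Estimate P(label = 1) for each unobserved node by Gibbs sampling."""
    rng = random.Random(seed)
    unknown = [n for n in neighbors if n not in observed]
    labels = dict(observed)
    for n in unknown:
        labels[n] = rng.randint(0, 1)  # arbitrary starting state
    counts = {n: 0 for n in unknown}
    for it in range(n_iter):
        # Resample each unknown label given the current labels of its neighbors.
        for n in unknown:
            p1 = conditional_p1(n, labels, neighbors)
            labels[n] = 1 if rng.random() < p1 else 0
        if it >= burn_in:
            for n in unknown:
                counts[n] += labels[n]
    kept = n_iter - burn_in
    return {n: counts[n] / kept for n in unknown}


# Nodes a and b are observed with label 1; x and y are to be inferred.
neighbors = {
    "a": ["x"],
    "b": ["x"],
    "x": ["a", "b", "y"],
    "y": ["x"],
}
observed = {"a": 1, "b": 1}

marginals = gibbs_marginals(neighbors, observed)
```

Because the inference about `x` feeds into the inference about `y` and vice versa, the two labels are estimated jointly rather than one at a time, which is the essence of collective classification.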
We will be adding new models to PROXIMITY
as they become ready. We will announce the addition of new
models on our “What's
New” page. You can also sign up for our
PROXIMITY announcements mailing list.