Traditional machine learning and knowledge discovery techniques identify probabilistic dependencies only among the attributes of a single record. Proximity’s modeling algorithms extend these techniques to include attributes of related entities and characteristics of the surrounding relational structure of the data.
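The contrast between single-record and relational features can be illustrated with a minimal sketch. It is not Proximity's API: the toy data, the `papers` and `cites` names, and the `relational_features` helper are all assumptions made for illustration. It shows an intrinsic attribute, an aggregate over the attributes of related entities, and two purely structural characteristics.

```python
from statistics import mean

# Hypothetical toy data: papers with one intrinsic attribute, plus citation links.
papers = {
    "p1": {"year": 2001},
    "p2": {"year": 1998},
    "p3": {"year": 2003},
}
cites = [("p1", "p2"), ("p3", "p1"), ("p3", "p2")]  # (citing, cited)


def relational_features(pid):
    """Build a feature dict for one paper from its own record and its neighborhood."""
    cited = [dst for src, dst in cites if src == pid]
    citing = [src for src, dst in cites if dst == pid]
    return {
        # Intrinsic attribute: all that a single-record technique would see.
        "year": papers[pid]["year"],
        # Aggregated attribute of related entities (None when there are none).
        "mean_cited_year": mean(papers[c]["year"] for c in cited) if cited else None,
        # Characteristics of the surrounding relational structure.
        "in_degree": len(citing),
        "out_degree": len(cited),
    }


features_p3 = relational_features("p3")
```

A single-record learner would be limited to `year`; the remaining entries exist only because the links between records are taken into account.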
To enable efficient model creation, Proximity employs unusual technologies for data storage and access. Its core database uses the decomposition storage model [Copeland and Khoshafian, 1985], a method of vertical fragmentation that stores each attribute separately and thus allows for a highly flexible data schema. Knowledge discovery virtually requires such a schema, because substantial reinterpretations of the data are frequent and highly desirable. Proximity also uses QGraph, a visual query language that returns graph fragments with highly variable structure rather than sets of individual records with homogeneous structure. Visualization tools in Proximity allow users to browse the data as a graph, examining both the attributes of individual records and the higher-level structure of relationships that interconnect records.
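The idea behind the decomposition storage model can be sketched in a few lines. This is a toy in-memory illustration under stated assumptions, not Proximity's database layer: the `DecomposedStore` class, its method names, and the sample values are hypothetical. Each attribute is held in its own two-column table that maps a surrogate identifier to a value, and a record is reassembled by joining those tables on the identifier.

```python
from collections import defaultdict


class DecomposedStore:
    """Toy decomposition-storage-model store: one (id -> value) table per attribute."""

    def __init__(self):
        # attribute name -> {surrogate id -> value}
        self.columns = defaultdict(dict)

    def set(self, oid, attr, value):
        """Write one attribute value; the attribute's table is created on first use."""
        self.columns[attr][oid] = value

    def record(self, oid):
        """Reassemble a record by joining the attribute tables on the surrogate id."""
        return {attr: col[oid] for attr, col in self.columns.items() if oid in col}


store = DecomposedStore()
store.set(1, "name", "A. Author")
store.set(2, "name", "B. Author")
store.set(1, "paper_count", 12)
# Introducing a new attribute later requires no schema change
# and leaves the existing attribute tables untouched.
store.set(2, "affiliation", "Example University")
```

Because no table fixes the full set of attributes in advance, objects of the same kind may carry different attributes, which is the flexibility that repeated reinterpretation of the data calls for. The cost is that reading a whole record requires one lookup per attribute table.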
Algorithms for constructing statistical models that estimate conditional and joint probability distributions are implemented on top of Proximity’s database infrastructure. These algorithms construct relational probability trees [Neville et al., 2003], relational dependency networks [Neville and Jensen, 2003, 2004], and relational Bayesian classifiers [Neville, Jensen and Gallagher, 2003]. Each of these models is constructed by analyzing a data sample created with a QGraph query, and each is implemented as a set of operations run on the underlying database.