Relational Probability Trees

A relational probability tree (RPT) [Neville et al., 2003] is a form of classification tree for relational data that considers attributes of related objects and links, and automatically constructs complex relational aggregates of these attributes to build a probabilistic model. These relational aggregations include mode, average, count, and proportion. Each of these aggregations can operate on related objects or links and each dynamically determines the best threshold. In addition to the aggregation functions, RPTs also consider structural features of the data. Specifically, RPTs can include degree features (counts of named objects or links in a subgraph) in the resulting models. Advantages of this model include ease of model understanding (the model is a series of hierarchical rules) and the ability to dynamically select predictive features and thresholds.
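The four aggregations named above can be illustrated in plain Python, independent of the Proximity API. This is a minimal sketch: the function names, the sample values, and the thresholds are hypothetical, and the code assumes each bag of values is non-empty. It shows only how a set of values from related objects collapses into a single value that a tree node can test.

```python
from collections import Counter

def agg_mode(values):
    """Most frequent value; ties go to the value seen first."""
    counts = Counter(values)
    best = max(counts.values())
    for v in values:
        if counts[v] == best:
            return v

def agg_average(values):
    """Arithmetic mean of numeric values."""
    return sum(values) / float(len(values))

def agg_count(values, predicate):
    """Number of values that satisfy the predicate."""
    return sum(1 for v in values if predicate(v))

def agg_proportion(values, predicate):
    """Fraction of values that satisfy the predicate."""
    return agg_count(values, predicate) / float(len(values))

# Hypothetical attribute values on the pages related to one core page.
outlinks = [3, 12, 25, 7, 12]
pagetypes = ["Course", "Student", "Course", "Other", "Course"]

mode_pagetype = agg_mode(pagetypes)                                # "Course"
avg_outlinks = agg_average(outlinks)                               # 11.8
n_busy = agg_count(outlinks, lambda v: v > 10)                     # 3
frac_course = agg_proportion(pagetypes, lambda v: v == "Course")   # 0.6

# A tree node tests an aggregate against a threshold (threshold is hypothetical).
node_test = n_busy >= 2                                            # True
```

In an RPT the thresholds are chosen during learning rather than fixed by hand as they are here.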

Proximity’s RPT code has been modularized to make it easier to maintain and extend. Model creation is split into learning and pruning modules, each with its own set of components for specifying particular performance parameters.

The learning module includes several component modules that control the learning phase:

The pruning module provides the ability to apply different pruning strategies to the completed tree. Implementing pruning strategies is planned for future releases; the current release includes only the DefaultPruningModule, which does no pruning.

The example below performs the same classification task as the previous RBC example—we want to infer whether or not a web page is a student page. But this time we use the value of the pagetype attribute to identify known student pages in the training phase. This attribute can take one of six (string) values: Student, Course, Faculty, Staff, ResearchProject, and Other. We use the same training and test sets as in the RBC example.

Code example: run-1d-clusters-rpt.py

This section describes the script found in $PROX_HOME/doc/user/tutorial/examples/run-1d-clusters-rpt.py.

Import the necessary class definitions.

from kdl.prox.model2.common.sources import *
from kdl.prox.model2.rpt import RPT

As before, sample 0 serves as the training set and sample 1 is the test set.

trainContainer = prox.getContainer("1d-clusters/samples/0")
testContainer = prox.getContainer("1d-clusters/samples/1")

Our classification task remains the same—we want to predict the value of one of the attributes for the central object in our 1d-clusters subgraphs. This object is named core_page in the subgraphs in both our test and training containers.

coreItemName = 'core_page'

This time, we’re predicting the value of the pagetype attribute.

attrToPredict = 'pagetype'

Create an AttributeSource instance that stores both the core item name and the class label (attribute) we want to predict for that item.

classLabel = AttributeSource(coreItemName, attrToPredict)

As before, we specify the set of sources to be considered in determining the model’s features. See “Code example: run-1d-clusters-rbc.py” for more information on specifying attribute sources.

In addition to attribute sources, models can also consider structural features such as the degree of an object. To tell Proximity to also consider structural attributes, we provide an ItemSource instance for the appropriate subgraph item. In this example, we want to consider structural features for the related objects (linked_to_page and linked_from_page), but not for the core object. Although the ('linked_to_page', 'page_num_inlinks') attribute source provides similar information, these sources are not the same. The value of page_num_inlinks for related linked_to_page objects does not reflect the full degree of the object (it counts only in links and ignores out links), and the value is stored as an attribute on the object. It must therefore be defined as an attribute source. The full degree of the linked objects is not available as an attribute value and must be calculated for each object. It must therefore be specified as an item (structural) source.

inputSources = [
   AttributeSource('core_page', 'url_server_info'),
   AttributeSource('core_page', 'url_hierarchy1b'),
   AttributeSource('linked_from_page', 'page_num_outlinks'),
   AttributeSource('linked_to_page', 'page_num_inlinks'),
   ItemSource('linked_from_page'),
   ItemSource('linked_to_page')]
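The difference between a stored attribute value and a calculated degree can be sketched in plain Python. The link list, the object IDs, and the degree() helper below are all hypothetical, and this section does not show how Proximity itself computes structural features. The sketch only contrasts reading a value from an attribute with deriving a value from the link structure.

```python
# Hypothetical links in one subgraph, as (from_object_id, to_object_id) pairs.
links = [(1, 2), (1, 3), (4, 1), (2, 3), (5, 2)]

# Hypothetical stored attribute values (in links only), keyed by object ID.
page_num_inlinks = {1: 1, 2: 2, 3: 2}

def degree(object_id, links):
    """Count every link that touches the object, in either direction."""
    return sum(1 for src, dst in links
               if src == object_id or dst == object_id)

stored = page_num_inlinks[2]   # read directly from the attribute: 2
computed = degree(2, links)    # calculated from the link structure: 3
```

Object 2 has two in links and one out link, so the stored in-link count and the calculated degree disagree, which is why the two kinds of source are declared separately.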

Proximity uses these attributes to construct the specific features used by the RPT model. Relational features typically identify an attribute and a test for the values of that attribute. For example, an RPT feature might test to see if an attribute is equal to a designated value. Relational features can also examine and aggregate the set of values found in linked objects. For example, the RPT might use the number of linked_to_page objects that have more than 10 outlinks as a feature in the trained model.
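As a rough illustration of how such a feature divides the training data, the following sketch applies the count feature described above to a handful of made-up subgraphs. The data, the dictionary layout, and the count threshold of 2 are assumptions for illustration only; an RPT selects its features and thresholds during learning.

```python
from collections import Counter

# Hypothetical training subgraphs: the core page's class label plus the
# page_num_outlinks value of each linked_to_page object.
subgraphs = [
    {"pagetype": "Student", "linked_outlinks": [2, 4]},
    {"pagetype": "Student", "linked_outlinks": [1, 15]},
    {"pagetype": "Course",  "linked_outlinks": [12, 30, 11]},
    {"pagetype": "Other",   "linked_outlinks": [14, 22, 3]},
    {"pagetype": "Course",  "linked_outlinks": [40, 18]},
]

def count_feature(subgraph, value_threshold=10, count_threshold=2):
    """True when at least count_threshold linked pages have more than
    value_threshold outlinks."""
    n = sum(1 for v in subgraph["linked_outlinks"] if v > value_threshold)
    return n >= count_threshold

def split(subgraphs, feature):
    """Partition subgraphs into those that pass the test and those that fail."""
    yes = [s for s in subgraphs if feature(s)]
    no = [s for s in subgraphs if not feature(s)]
    return yes, no

yes_branch, no_branch = split(subgraphs, count_feature)
yes_labels = Counter(s["pagetype"] for s in yes_branch)  # Course: 2, Other: 1
no_labels = Counter(s["pagetype"] for s in no_branch)    # Student: 2
```

A feature is useful to the tree when the class labels on the two sides of the split differ markedly, as they do in this toy data.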

We instantiate the model by calling the appropriate constructor.

print "Beginning modeling section"
print "Instantiating model..."
rpt = RPT()

This instantiates the model using default values for all parameters. You can override these defaults by specifying particular modules to be used with the model. For this example, we want a maximum tree depth of three, so we add that specification to the stopping module.

rpt.learningModule.stoppingModule.setMaxDepth(3)

Train (learn) the model on the training set.

print "Learning model..."
rpt.learn(trainContainer, classLabel, inputSources)

Write the trained model to an XML file. The file is written to the current working directory, which is $PROX_HOME if you are following the tutorial.

xmlFileName = 'ProxWebKB_RPT.xml'
rpt.save(xmlFileName)
print "RPT written to ", xmlFileName

Unlike the RBC, the learned RPT model is designed to be human interpretable. The XML file for an RPT describes a probability estimation tree, which can be viewed in the Proximity Database Browser (see Exercise 7.3, below). The tree represents a series of questions to ask about a web page and the pages in its relational neighborhood. The leaf nodes contain the predicted class label for pages that correspond to the matching path through the tree.
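The idea of a tree as a series of questions can be sketched with a small hand-built example. The nested-dictionary layout, the questions, the attribute values, and the probabilities below are all hypothetical and are not Proximity's XML format. Leaves here hold a class probability distribution, and the predicted label is taken to be the most probable class.

```python
# Hypothetical tree: internal nodes hold a yes/no question about a page,
# leaves hold a class probability distribution.
tree = {
    "question": lambda page: page["url_hierarchy1b"] == "courses",
    "yes": {"leaf": {"Course": 0.9, "Other": 0.1}},
    "no": {
        "question": lambda page: page["num_linked_to"] >= 5,
        "yes": {"leaf": {"Faculty": 0.5, "Student": 0.3, "Other": 0.2}},
        "no": {"leaf": {"Student": 0.7, "Other": 0.3}},
    },
}

def classify(node, page):
    """Follow yes/no branches until a leaf is reached."""
    while "leaf" not in node:
        node = node["yes"] if node["question"](page) else node["no"]
    dist = node["leaf"]
    label = max(dist, key=dist.get)  # most probable class at this leaf
    return label, dist

# A made-up page that answers "no" to both questions.
page = {"url_hierarchy1b": "~jsmith", "num_linked_to": 2}
label, dist = classify(tree, page)   # label is "Student"
```

Reading the path from the root to the leaf gives the rule that produced the prediction, which is what makes this kind of model easy to inspect.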

Apply the model to the test set.

print "Applying model..."
predictions = rpt.apply(testContainer)

Tell the Predictions instance where to find the true values for the class labels. The true values are required for evaluating the model’s predictions.

predictions.setTrueLabels(testContainer, classLabel)

To save the predicted values, save them as attributes on the subgraphs. The savePredictions() method silently overwrites any existing values for this subgraph attribute.

print "Writing predictions..."
rptAttrName = "rpt_pagetype_prediction"
predictions.savePredictions(testContainer.getSubgraphAttrs(), rptAttrName)

Evaluate the model. Accuracy and area under the ROC curve approach 1.0 as the results improve. Conditional log likelihood is useful only for relative comparison of comparable entities, with higher values signifying better performance.

print "Computing accuracy (ACC)..."
acc = 1 - predictions.getZeroOneLoss()
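The relationship between zero-one loss and accuracy can be sketched in plain Python. This assumes that zero-one loss is the fraction of instances that are misclassified, which is consistent with accuracy approaching 1.0 as results improve. The zero_one_loss() function and the labels below are hypothetical and are not part of the Proximity API.

```python
def zero_one_loss(true_labels, predicted_labels):
    """Fraction of instances whose predicted label differs from the true one."""
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / float(len(true_labels))

# Made-up true and predicted pagetype values for five core pages.
true_labels = ["Student", "Other", "Course", "Student", "Other"]
predicted = ["Student", "Other", "Other", "Other", "Other"]

loss = zero_one_loss(true_labels, predicted)   # 2 of 5 wrong: 0.4
accuracy = 1 - loss                            # 0.6
```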

Computing area under the ROC curve requires a binary classification problem. We identify the positive instances and group all other class values into the negative instances. In this example, student pages (pages with a value of Student for the pagetype attribute) are positive instances.

print "Computing area under ROC curve (AUC)..."
auc = predictions.getAUC("Student")
print "Computing conditional likelihood (CLL)..."
cll = predictions.getConditionalLogLikelihood()
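The two remaining measures can be sketched in plain Python under stated assumptions. For AUC, the sketch groups every class other than the positive one into the negative set and computes the fraction of positive/negative pairs that the model ranks correctly, counting ties as half. For conditional log likelihood, it assumes the natural-log probabilities assigned to the true labels are summed, and that no true label receives probability zero. Proximity's own calculations may differ in detail; the function names and data are hypothetical.

```python
import math

def auc_one_vs_rest(true_labels, distributions, positive):
    """Probability that a random positive instance is scored above a
    random negative one, using the predicted probability of `positive`."""
    pairs = list(zip(true_labels, distributions))
    pos = [d.get(positive, 0.0) for t, d in pairs if t == positive]
    neg = [d.get(positive, 0.0) for t, d in pairs if t != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

def conditional_log_likelihood(true_labels, distributions):
    """Sum of log probabilities assigned to the true labels (always <= 0)."""
    return sum(math.log(d[t]) for t, d in zip(true_labels, distributions))

# Made-up true labels and predicted class distributions for four core pages.
true_labels = ["Student", "Other", "Course", "Student"]
distributions = [
    {"Student": 0.7, "Course": 0.1, "Other": 0.2},
    {"Student": 0.2, "Course": 0.2, "Other": 0.6},
    {"Student": 0.4, "Course": 0.5, "Other": 0.1},
    {"Student": 0.3, "Course": 0.3, "Other": 0.4},
]

auc = auc_one_vs_rest(true_labels, distributions, "Student")   # 0.75
cll = conditional_log_likelihood(true_labels, distributions)   # negative
```

Because the log likelihood is a sum over instances, its magnitude grows with the size of the test set, which is why it is meaningful only when comparing results on the same data.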

Print a summary of the evaluation results.

print "RPT results:"
print "  ACC: ", str(acc)
print "  AUC: ", str(auc)
print "  CLL: ", str(cll)

Exercise 7.2. Learning and applying the relational probability tree model:

This script requires entities created in Exercise 5.7 and Exercise 6.4. You must have completed these exercises before running the script in the current exercise.

Before beginning, make sure that you are serving the ProxWebKB database using Mserver. Start the Proximity Database Browser if it is not already running.

  1. From the Script menu, choose Run Script. Proximity displays the Open dialog.

  2. Navigate to the $PROX_HOME/doc/user/tutorial/examples directory and choose run-1d-clusters-rpt.py. Click Open.

    Proximity opens a window to show you the output from the script along with a trace of the script execution. The run-1d-clusters-rpt.py script may take several minutes or longer to run. Your output should look similar to the following (leading information showing elapsed time and execution thread has been omitted from the trace for brevity):

    Status: starting running script:
        /proximity/doc/user/tutorial/examples/run-1d-clusters-rpt.py
    Beginning modeling section
    Instantiating model...
    Inducing model...
    INFO kdl.prox.model2.rpt.RPT - Creating feature tables
    INFO kdl.prox.model2.rpt.RPT - Done creating feature tables: 792 features.
    INFO kdl.prox.model2.rpt.modules.learning.DefaultLearningModule
        - Choosing split for 2068 subgs
    
        portion of trace deleted
    
    
    RPT written to  ProxWebKB_RPT.xml
    Applying model...
    Writing predictions...
    Computing accuracy (ACC)...
    Computing area under ROC curve (AUC)...
    Computing conditional likelihood (CLL)...
    RPT results: 
      ACC:  0.8355104015481374
      AUC:  0.8423346214179187
      CLL:  -1312.0900625717052
    Status: finished running script
    

    Note that some parts of the RPT model, such as choosing between two equivalent features, are non-deterministic, so your results may differ slightly from those shown above. You can close this window after the script finishes.

  3. Examine the values predicted by the RPT. Drill down through the container hierarchy in the Proximity Database Browser to display the list of subgraphs for the /1d-clusters/samples/1 container. Click a subgraph ID, then click attrs to display the attributes for that subgraph. The example below shows the value for subgraph 0: the model predicts that the core page for this subgraph (object 1) has a pagetype of Other. You can compare this to the actual value of pagetype by examining the attribute values for object 1.

    Recall that the RPT only makes predictions for core objects in the test container. Therefore, after the script runs, only the subgraphs in the 1d-clusters/samples/1 container have a value for rpt_pagetype_prediction.