A relational probability tree (RPT) [Neville et al., 2003] is a form of classification tree for relational data that considers attributes of related objects and links, and automatically constructs complex relational aggregates of these attributes to build a probabilistic model. These relational aggregations include mode, average, count, and proportion. Each of these aggregations can operate on related objects or links and each dynamically determines the best threshold. In addition to the aggregation functions, RPTs also consider structural features of the data. Specifically, RPTs can include degree features (counts of named objects or links in a subgraph) in the resulting models. Advantages of this model include ease of model understanding (the model is a series of hierarchical rules) and the ability to dynamically select predictive features and thresholds.
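As a rough illustration of the four aggregations named above, the following standalone Python sketch computes each one over the values of a single attribute collected from related objects. The function names and the sample values are hypothetical and are not part of the Proximity API; the sketch assumes a non-empty list of values.

```python
from collections import Counter


def mode(values):
    """Most frequent value among the related items (ties broken arbitrarily)."""
    return Counter(values).most_common(1)[0][0]


def average(values):
    """Mean of a numeric attribute over the related items."""
    return sum(values) / float(len(values))


def count(values, predicate):
    """Number of related items whose value satisfies the predicate."""
    return sum(1 for v in values if predicate(v))


def proportion(values, predicate):
    """Fraction of related items whose value satisfies the predicate."""
    return count(values, predicate) / float(len(values))


# Hypothetical out-link counts for the pages linked to one core page.
outlinks = [3, 12, 25, 7, 12]

most_common = mode(outlinks)                         # 12
mean_value = average(outlinks)                       # 11.8
num_large = count(outlinks, lambda v: v > 10)        # 3
frac_large = proportion(outlinks, lambda v: v > 10)  # 0.6
```

A feature in the tree then compares one of these aggregate values against a threshold, for example "the proportion of linked pages with more than 10 out links is at least 0.5".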
Proximity’s RPT code has been modularized to make the code easier to maintain and extend. Model creation is split into learning and pruning modules, each with its own set of components that specify particular performance parameters.
The learning module includes several component modules that control the learning phase:
The splitting module evaluates the possible splits at each branch point in the tree, selecting the best split according to scores calculated by the scoring module.
The scoring module computes a score for each possible split in the tree. The results of the scoring module are used to select the best split by the splitting module.
The significance module determines whether the split chosen by the splitting module reaches the specified level of statistical significance. If not, that point in the tree becomes a leaf node.
The stopping module determines when to stop looking for additional splits in the tree. The current release only includes a single implementation, DefaultStoppingModule, which stops splitting a tree after it reaches the specified depth (with a default depth of three).
The pruning module provides the ability to apply different pruning strategies to the completed tree. Implementing pruning strategies is planned for future releases; the current release includes only the DefaultPruningModule, which does no pruning.
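The way the splitting, scoring, significance, and stopping modules cooperate can be sketched in a few lines of standalone Python. This is not Proximity's implementation: the chi-square score, the one-degree-of-freedom significance test, the function names, and the toy data are all assumptions chosen for illustration, and the sketch handles only binary features and two class labels.

```python
import math
from collections import Counter


def score_split(rows, feature):
    """Scoring: chi-square statistic of the feature/label contingency table."""
    n = float(len(rows))
    observed = Counter((feats[feature], label) for feats, label in rows)
    by_value = Counter(feats[feature] for feats, _ in rows)
    by_label = Counter(label for _, label in rows)
    chi2 = 0.0
    for value in by_value:
        for label in by_label:
            expected = by_value[value] * by_label[label] / n
            chi2 += (observed[(value, label)] - expected) ** 2 / expected
    return chi2


def choose_split(rows, features):
    """Splitting: pick the candidate feature with the highest score."""
    return max((score_split(rows, f), f) for f in features)


def is_significant(chi2, alpha=0.05):
    """Significance: chi-square p-value with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0)) <= alpha


def should_stop(depth, max_depth=3):
    """Stopping: refuse further splits once the depth limit is reached."""
    return depth >= max_depth


def leaf(rows):
    """A leaf stores the class distribution of the rows that reach it."""
    counts = Counter(label for _, label in rows)
    return dict((label, c / float(len(rows))) for label, c in counts.items())


def learn(rows, features, depth=0, max_depth=3):
    if should_stop(depth, max_depth):
        return leaf(rows)
    chi2, feature = choose_split(rows, features)
    if not is_significant(chi2):
        return leaf(rows)  # an insignificant split becomes a leaf
    yes = [r for r in rows if r[0][feature]]
    no = [r for r in rows if not r[0][feature]]
    return {"test": feature,
            "yes": learn(yes, features, depth + 1, max_depth),
            "no": learn(no, features, depth + 1, max_depth)}


def make(edu, photo, label, n):
    """Build n identical toy rows (hypothetical features and labels)."""
    return [({"edu_host": edu, "has_photo": photo}, label)] * n


rows = (make(True, True, "Student", 4) + make(True, False, "Student", 4) +
        make(True, True, "Other", 1) + make(True, False, "Other", 1) +
        make(False, True, "Other", 4) + make(False, False, "Other", 4) +
        make(False, True, "Student", 1) + make(False, False, "Student", 1))

tree = learn(rows, ["edu_host", "has_photo"])
```

On this toy data the root splits on edu_host, and both children become leaves because no remaining split is significant.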
The example below performs the same classification task as the previous RBC example—we want to infer whether or not a web page is a student page. But this time we use the value of the pagetype attribute to identify known student pages in the training phase. This attribute can take one of six (string) values: Student, Course, Faculty, Staff, ResearchProject, and Other. We use the same training and test sets as in the RBC example.
This section describes the script found in $PROX_HOME/doc/user/tutorial/examples/run-1d-clusters-rpt.py.
Import the necessary class definitions.
from kdl.prox.model2.common.sources import *
from kdl.prox.model2.rpt import RPT
As before, sample 0 serves as the training set and sample 1 is the test set.
trainContainer = prox.getContainer("1d-clusters/samples/0")
testContainer = prox.getContainer("1d-clusters/samples/1")
Our classification task remains the same—we want to predict the value of one of the attributes for the central object in our 1d-clusters subgraphs. This object is named core_page in the subgraphs in both our test and training containers.
coreItemName = 'core_page'
This time, we’re predicting the value of the pagetype attribute.
attrToPredict = 'pagetype'
Create an AttributeSource instance that stores both the core item name and the class label (attribute) we want to predict for that item.
classLabel = AttributeSource(coreItemName, attrToPredict)
As before, we specify the set of sources to be considered in determining the model’s features. See “Code example: run-1d-clusters-rbc.py” for more information on specifying attribute sources.
In addition to attribute sources, models can also consider structural features such as the degree of an object. To tell Proximity to also consider structural attributes, we provide an ItemSource instance for the appropriate subgraph item. In this example, we want to consider structural features for the related objects (linked_to_page and linked_from_page), but not for the core object.
Although the ('linked_to_page', 'page_num_inlinks') attribute source provides similar information, these sources are not the same. The value of page_num_inlinks for related linked_to_page objects does not consider the full degree of the object (it ignores out links and only counts in links), and the value is stored as an attribute on the object. It must therefore be defined as an attribute source. The full degree of the linked objects is not available as an attribute value but must be calculated for each object. It must therefore be specified as an item (structural) source.
inputSources = [ \
    AttributeSource('core_page', 'url_server_info'), \
    AttributeSource('core_page', 'url_hierarchy1b'), \
    AttributeSource('linked_from_page', 'page_num_outlinks'), \
    AttributeSource('linked_to_page', 'page_num_inlinks'), \
    ItemSource("linked_from_page"), \
    ItemSource("linked_to_page")]
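To make the distinction between a stored attribute and a calculated structural value concrete, here is a small standalone sketch. The link list, the page names, and the stored in-link counts are hypothetical, and the degree() helper is not a Proximity function.

```python
# Hypothetical links as (source page, target page) pairs.
links = [("a", "b"), ("c", "b"), ("b", "d"), ("b", "e"), ("b", "a")]

# Hypothetical stored attribute: the in-link count recorded on each page.
page_num_inlinks = {"a": 1, "b": 2, "c": 0, "d": 1, "e": 1}


def degree(page, links):
    """Full degree: every link that touches the page, in either direction."""
    return sum(1 for src, dst in links if page in (src, dst))


stored_value = page_num_inlinks["b"]   # 2, read directly from the attribute
computed_value = degree("b", links)    # 5, calculated from the link structure
```

The stored value can simply be looked up, which is why it is declared as an attribute source; the full degree has to be derived from the links, which is why it is requested through an item source.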
Proximity uses these attributes to construct the specific features used by the RPT model. Relational features typically identify an attribute and a test for the values of that attribute. For example, an RPT feature might test to see if an attribute is equal to a designated value. Relational features can also examine and aggregate the set of values found in linked objects. For example, the RPT might use the number of linked_to_page objects that have more than 10 outlinks as a feature in the trained model.
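The count feature mentioned above can be sketched as a test that partitions subgraphs into two branches. The subgraph layout, the helper names, and the thresholds below are hypothetical illustrations, not Proximity data structures or calls.

```python
# Hypothetical subgraphs: each maps an item name to the attribute
# dictionaries of the objects that matched that name.
subgraphs = {
    0: {"linked_to_page": [{"page_num_outlinks": 14},
                           {"page_num_outlinks": 2},
                           {"page_num_outlinks": 31}]},
    1: {"linked_to_page": [{"page_num_outlinks": 5}]},
    2: {"linked_to_page": []},
}


def count_feature(subgraph, item, attr, value_threshold):
    """Count the item's objects whose attribute exceeds value_threshold."""
    return sum(1 for obj in subgraph[item] if obj[attr] > value_threshold)


def split(subgraphs, item, attr, value_threshold, count_threshold):
    """Partition subgraph IDs by whether the count reaches count_threshold."""
    yes, no = [], []
    for sid in sorted(subgraphs):
        n = count_feature(subgraphs[sid], item, attr, value_threshold)
        (yes if n >= count_threshold else no).append(sid)
    return yes, no


# "At least 2 linked_to_page objects have more than 10 outlinks."
yes, no = split(subgraphs, "linked_to_page", "page_num_outlinks", 10, 2)
# yes == [0], no == [1, 2]
```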
We instantiate the model by calling the appropriate constructor.
print "Beginning modeling section"
print "Instantiating model..."
rpt = RPT()
This instantiates the model using default values for all parameters. You can override these defaults by specifying particular modules to be used with the model. For this example, we want a maximum tree depth of three, so we add that specification to the stopping module.
rpt.learningModule.stoppingModule.setMaxDepth(3)
Train (learn) the model on the training set.
print "Learning model..."
rpt.learn(trainContainer, classLabel, inputSources)
Write the trained model to an XML file. The file is written to the current working directory, which is $PROX_HOME if you are following the tutorial.
xmlFileName = 'ProxWebKB_RPT.xml'
rpt.save(xmlFileName)
print "RPT written to ", xmlFileName
Unlike the RBC, the learned RPT model is designed to be human interpretable. The XML file for an RPT describes a probability estimation tree, which can be viewed in the Proximity Database Browser (see Exercise 7.3, below). The tree represents a series of questions to ask about a web page and the pages in its relational neighborhood. The leaf nodes contain the predicted class label for pages that correspond to the matching path through the tree.
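Reading such a tree amounts to answering one question per level until a leaf is reached. The sketch below uses a hypothetical in-memory tree, not the XML format that Proximity writes; the questions, attribute values, and probabilities are invented for illustration, and it assumes each leaf stores a class distribution from which the predicted label is taken.

```python
def classify(node, subgraph):
    """Follow yes/no answers from the root to a leaf; return its distribution."""
    while "question" in node:
        node = node["yes"] if node["question"](subgraph) else node["no"]
    return node["distribution"]


# Hypothetical tree: internal nodes ask a question, leaves hold probabilities.
tree = {
    "question": lambda sg: sg["num_linked_to_page"] >= 5,
    "yes": {"distribution": {"Student": 0.1, "Other": 0.9}},
    "no": {
        "question": lambda sg: sg["url_hierarchy1b"] == "people",
        "yes": {"distribution": {"Student": 0.8, "Other": 0.2}},
        "no": {"distribution": {"Student": 0.3, "Other": 0.7}},
    },
}

# Hypothetical summary of one subgraph.
page = {"num_linked_to_page": 2, "url_hierarchy1b": "people"}

distribution = classify(tree, page)
predicted = max(distribution, key=distribution.get)   # "Student"
```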
Apply the model to the test set.
print "Applying model..."
predictions = rpt.apply(testContainer)
Tell the Predictions instance where to find the true values for the class labels. The true values are required for evaluating the model’s predictions.
predictions.setTrueLabels(testContainer, classLabel)
To save the predicted values, save them as attributes on the subgraphs. The savePredictions() method silently overwrites any existing values for this subgraph attribute.
print "Writing predictions..."
rptAttrName = "rpt_pagetype_prediction"
predictions.savePredictions(testContainer.getSubgraphAttrs(), rptAttrName)
Evaluate the model. Accuracy and area under the ROC curve approach 1.0 as the results improve. Conditional log likelihood is useful only for relative comparison of comparable models (for example, two models applied to the same test set), with higher values signifying better performance.
print "Computing accuracy (ACC)..."
acc = (1 - predictions.getZeroOneLoss())
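The relationship between zero-one loss and accuracy can be shown with a standalone sketch. It assumes that zero-one loss is reported as the fraction of misclassified instances, which is what subtracting it from 1 implies; the function and the sample labels are hypothetical and do not reproduce Proximity's implementation.

```python
def zero_one_loss(predicted, actual):
    """Fraction of instances whose predicted label differs from the true label."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / float(len(actual))


# Hypothetical predictions and true labels for five pages.
predicted = ["Student", "Other", "Other", "Course", "Student"]
actual = ["Student", "Other", "Faculty", "Course", "Other"]

loss = zero_one_loss(predicted, actual)   # 2 of 5 wrong, so 0.4
accuracy = 1 - loss                       # 3 of 5 right
```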
Computing area under the ROC curve requires a binary classification problem. We identify the positive instances and group all other class values into the negative instances. In this example, student pages (pages with a value of Student for the pagetype attribute) are positive instances.
print "Computing area under ROC curve (AUC)..."
auc = predictions.getAUC("Student")
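The one-versus-rest grouping can be illustrated with a standalone sketch that computes AUC as the fraction of positive/negative pairs ranked correctly, counting ties as half. This pairwise formulation is a standard way to compute AUC but is not necessarily how Proximity computes it; the scores and labels are hypothetical, and the sketch assumes at least one positive and one negative instance.

```python
def auc(scores, labels, positive):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Every label other than "Student" is treated as a negative instance.
labels = ["Student", "Other", "Course", "Student", "Faculty"]
scores = [0.9, 0.4, 0.7, 0.6, 0.1]   # hypothetical P(Student) for each page

student_auc = auc(scores, labels, "Student")   # 5 of 6 pairs ranked correctly
```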
print "Computing conditional likelihood (CLL)..."
cll = predictions.getConditionalLogLikelihood()
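As a sketch of what a conditional log likelihood measures, the following standalone code sums the log of the probability that the model assigned to each instance's true label. The summed (unnormalized) form is an assumption, the distributions are hypothetical, and a probability of zero for a true label would raise an error in this simple version.

```python
import math


def conditional_log_likelihood(distributions, actual):
    """Sum of log P(true label) over all instances."""
    return sum(math.log(dist[label])
               for dist, label in zip(distributions, actual))


# Hypothetical predicted class distributions and true labels.
distributions = [{"Student": 0.8, "Other": 0.2},
                 {"Student": 0.3, "Other": 0.7},
                 {"Student": 0.5, "Other": 0.5}]
actual = ["Student", "Other", "Student"]

cll = conditional_log_likelihood(distributions, actual)
# log(0.8) + log(0.7) + log(0.5) = log(0.28), about -1.273
```

The value is never positive, and it moves toward zero as the model assigns higher probability to the true labels.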
Print a summary of the evaluation results.
print "RPT results:"
print " ACC: ", str(acc)
print " AUC: ", str(auc)
print " CLL: ", str(cll)
Exercise 7.2. Learning and applying the relational probability tree model:
This script requires entities created in Exercise 5.7 and Exercise 6.4. You must have completed these exercises before running the script in the current exercise.
Before beginning, make sure that you are serving the ProxWebKB database using Mserver. Start the Proximity Database Browser if it is not already running.
From the Script menu, choose Run Script. Proximity displays the Open dialog.
Navigate to the $PROX_HOME/doc/user/tutorial/examples directory and
choose run-1d-clusters-rpt.py.
Click Open.
Proximity opens a window to show you the output from the script along with a trace of the script execution. The run-1d-clusters-rpt.py script may take several minutes or longer to run.
Your output should look similar to the following (leading information showing elapsed time and execution thread has been omitted from the trace for brevity):
Status: starting running script: /proximity/doc/user/tutorial/examples/run-1d-clusters-rpt.py
Beginning modeling section
Instantiating model...
Learning model...
INFO kdl.prox.model2.rpt.RPT - Creating feature tables
INFO kdl.prox.model2.rpt.RPT - Done creating feature tables: 792 features.
INFO kdl.prox.model2.rpt.modules.learning.DefaultLearningModule - Choosing split for 2068 subgs
[portion of trace deleted]
RPT written to ProxWebKB_RPT.xml
Applying model...
Writing predictions...
Computing accuracy (ACC)...
Computing area under ROC curve (AUC)...
Computing conditional likelihood (CLL)...
RPT results:
ACC: 0.8355104015481374
AUC: 0.8423346214179187
CLL: -1312.0900625717052
Status: finished running script
Note that some parts of the RPT model, such as choosing between two equivalent features, are non-deterministic, so your results may differ slightly from those shown above. You can close this window after the script finishes.
Examine the values predicted by the RPT. Drill down through the container hierarchy in the Proximity Database Browser to display the list of subgraphs for the /1d-clusters/samples/1 container. Click a subgraph ID, then click attrs to display the attributes for that subgraph. The example below shows the value for subgraph 0, which shows that the model predicts that the core page for this subgraph (object 1) has a pagetype of Other. You can compare this to the actual value of pagetype by examining the attribute values for object 1.
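[Figure: attributes for subgraph 0 in the Proximity Database Browser, showing the rpt_pagetype_prediction value]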
Recall that the RPT only makes predictions for core objects in the test container. Therefore, after running the script, only the subgraphs in the 1d-clusters/samples/1 container have a value for rpt_pagetype_prediction.