Researchers frequently need to explore how particular data characteristics affect their models. To help construct datasets exhibiting specific properties, such as autocorrelation or degree disparity, Proximity can generate synthetic data with one of several types of graph structure.
In all cases, data generation follows the same two steps:
1. Generate the empty graph structure.
2. Generate attribute values based on user-supplied prior probabilities.
Because the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively.
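To illustrate why values have to be assigned collectively, here is a minimal plain-Python sketch. It is not Proximity's implementation and uses no Proximity API. It assumes a tiny hand-built graph and a hypothetical dependency rule in which an object's label tends to match the majority label of its related objects; the names `resample_labels`, `prior_pos`, and `agreement` are invented for this example. Labels are first drawn from the prior, then repeatedly resampled given the current labels of related objects (a Gibbs-style sweep).

```python
import random


def resample_labels(neighbors, prior_pos=0.5, agreement=0.9, sweeps=20, seed=0):
    """Assign binary labels jointly by repeated resampling.

    neighbors: dict mapping each object id to a list of related object ids.
    prior_pos: prior probability of the label "+"; used for initialization,
               for objects with no neighbors, and when neighbors are tied.
    agreement: hypothetical probability of taking the neighbors' majority label.
    """
    rng = random.Random(seed)

    # Initialize every label from the prior alone.
    labels = {obj: "+" if rng.random() < prior_pos else "-" for obj in neighbors}

    # Revisit each object, conditioning on its neighbors' current labels.
    for _ in range(sweeps):
        for obj, related in neighbors.items():
            positives = sum(labels[n] == "+" for n in related)
            if 2 * positives > len(related):
                p_pos = agreement          # majority of neighbors are "+"
            elif 2 * positives < len(related):
                p_pos = 1.0 - agreement    # majority of neighbors are "-"
            else:
                p_pos = prior_pos          # tie, or no neighbors at all
            labels[obj] = "+" if rng.random() < p_pos else "-"
    return labels


# A four-object chain: a - b - c - d (illustrative data only).
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
labels = resample_labels(chain)
```

Because each label is drawn conditional on its neighbors, no object's value can be fixed in isolation; the sweep is repeated so that later assignments can influence earlier ones.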
Details for each of these steps are presented in the context of a script that generates i.i.d. data, shown below.
The i.i.d. graph generation process creates a set of independent connected components containing two object types: S and T. Each component contains a single S object that is linked to a variable number of T objects.
We typically treat the S objects as the target objects to be classified and use the T objects as peripheral objects during classification. Each S object is assigned a single discrete attribute (s_class_label in this example) that can be used as a class label. To generate datasets with degree disparity, the assignment of class labels is conditioned on the degree of the object in the graph. The generation process can also add further discrete attributes to the S and T objects.
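The structure and degree-conditioned labeling described above can be sketched in plain Python. This is an illustration under stated assumptions, not Proximity's generator: the function names `generate_iid_structure` and `assign_class_labels`, the `degree_choices` parameter, and the probability values are all hypothetical.

```python
import random


def generate_iid_structure(num_components, degree_choices, seed=0):
    """Build independent components: one S object linked to several T objects.

    degree_choices: hypothetical list of allowed T-object counts per S object.
    Returns a dict mapping each S id to the list of T ids it links to.
    """
    rng = random.Random(seed)
    links = {}
    next_t = 0
    for s in range(num_components):
        degree = rng.choice(degree_choices)
        # T ids are never reused, so the components stay disconnected.
        links[f"S{s}"] = [f"T{next_t + i}" for i in range(degree)]
        next_t += degree
    return links


def assign_class_labels(links, p_pos_given_degree, seed=0):
    """Draw each S object's class label conditioned on its degree.

    p_pos_given_degree: dict mapping a degree to P(label == "+" | degree).
    """
    rng = random.Random(seed)
    labels = {}
    for s, t_objects in links.items():
        p_pos = p_pos_given_degree[len(t_objects)]
        labels[s] = "+" if rng.random() < p_pos else "-"
    return labels


# Four components; each S object has either 1 or 3 linked T objects.
links = generate_iid_structure(4, degree_choices=[1, 3])
# High-degree S objects are more likely to be "+" (illustrative values).
s_class_label = assign_class_labels(links, {1: 0.2, 3: 0.8})
```

Making the label probability depend on degree is what produces degree disparity: objects of one class end up, on average, with more links than objects of the other.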
The following script creates four connected components using two different degree distributions. Both S and T objects are given one additional attribute each (s_attr0_label and
These attribute values are conditioned by the models in