Synthetic Data Generation

Researchers frequently need to explore the effects of certain data characteristics on their models. To help construct datasets exhibiting specific properties, such as autocorrelation or degree disparity, Proximity can generate synthetic data with one of several types of graph structure.

In all cases, data generation follows the same two steps:

  1. Generate the empty graph structure.

  2. Generate attribute values based on user-supplied prior probabilities.

Because the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively.

Details for each of these steps are presented in the context of a script that generates i.i.d. data, shown below.

Generating i.i.d. data

The i.i.d. graph generation process creates a set of independent connected components containing two object types: S and T. Each component contains a single S object that is linked to a variable number (≥ 0) of T objects. The number of linked T objects (the degree of S) follows a normal distribution with user-specified mean and standard deviation. You can specify multiple normal distributions to create S objects having different degree distributions.
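
The structure step described above can be illustrated with a short standalone sketch. This is not Proximity's implementation or API: the function name generate_iid_components, the (count, mean, stddev) tuple format, the object identifiers, and the choice to round and clamp sampled degrees at zero are all assumptions made for illustration.

```python
import random


def generate_iid_components(degree_specs, seed=None):
    """Build independent components, each with one S object and its T objects.

    degree_specs is a hypothetical list of (count, mean, stddev) tuples,
    one tuple per degree distribution.
    """
    rng = random.Random(seed)
    components = []
    for count, mean, stddev in degree_specs:
        for _ in range(count):
            # Sample the degree of S from a normal distribution.
            # Assumption: round to an integer and clamp negatives to zero.
            degree = max(0, round(rng.gauss(mean, stddev)))
            s_index = len(components)
            t_ids = ["t%d_%d" % (s_index, j) for j in range(degree)]
            components.append({"s": "s%d" % s_index, "t": t_ids})
    return components


# Four components drawn from two different degree distributions.
components = generate_iid_components([(2, 3.0, 1.0), (2, 10.0, 2.0)], seed=0)
for component in components:
    print(component["s"], "degree:", len(component["t"]))
```

Because every T object belongs to exactly one component, no links cross component boundaries, which is what makes the resulting S objects independent of one another.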

We typically consider the S objects to be the target objects to be classified, with the T objects used as peripheral objects during classification. Each S object is assigned a single, discrete attribute (s_class_label in this example) that can be used as a class label. To generate datasets with degree disparity, the assignment of class labels is conditioned on the degree of the object in the graph. The generation process can also add further discrete attributes to both the S and T objects.
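
To make the idea of degree-conditioned labels concrete, the following sketch assigns a class label whose probability depends on whether the degree of an S object reaches a threshold. The threshold rule, the function name assign_class_label, the label values "+" and "-", and the probabilities are all hypothetical; the model Proximity actually uses to condition labels on degree is not shown here.

```python
import random


def assign_class_label(degree, threshold, p_pos_low, p_pos_high, rng):
    """Return "+" or "-" with a probability that depends on degree.

    Assumption: a single threshold splits objects into low-degree and
    high-degree groups, each with its own probability of a "+" label.
    """
    p_positive = p_pos_high if degree >= threshold else p_pos_low
    return "+" if rng.random() < p_positive else "-"


rng = random.Random(0)
degrees = [2, 3, 9, 11]  # illustrative degrees of four S objects
labels = [assign_class_label(d, 6, 0.2, 0.8, rng) for d in degrees]
for degree, label in zip(degrees, labels):
    print("degree", degree, "s_class_label", label)
```

When the two probabilities differ, high-degree and low-degree objects end up with different class distributions, which is the degree disparity the text refers to. Setting both probabilities to the same value removes the dependence on degree.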

The following script creates four connected components using two different degree distributions. Both S and T objects are given one additional attribute each (s_attr0_label and t_attr1_label, respectively). These attribute values are conditioned by the models in s-attr-rpt.xml and t-attr-rpt.xml.