Class ClusteringDataGenerator

java.lang.Object
org.tribuo.clustering.example.ClusteringDataGenerator

public abstract class ClusteringDataGenerator extends Object
Generates three example train and test datasets, used for unit testing. They don't necessarily have sensible cluster boundaries, because they exist to test the machinery rather than accuracy.

Also provides a dataset generator which returns a dataset sampled from a mixture of 2-dimensional Gaussians.

  • Constructor Details

    • ClusteringDataGenerator

      public ClusteringDataGenerator()
  • Method Details

    • gaussianClusters

      public static Dataset<ClusterID> gaussianClusters(long size, long seed)
      Generates a dataset drawn from a mixture of five 2-dimensional Gaussians.
      Parameters:
      size - The number of points to sample for the dataset.
      seed - The RNG seed.
      Returns:
      A dataset sampled from the Gaussian mixture.
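
To illustrate what sampling from a seeded mixture of five 2-dimensional Gaussians looks like, here is a minimal standalone sketch. It does not use Tribuo and is not the library's implementation: the class name `GaussianMixtureSketch`, the component means, the uniform mixing weights, and the unit standard deviation are all assumptions chosen for illustration, and it takes an `int` size where the real method takes a `long`.

```java
import java.util.Arrays;
import java.util.Random;

public class GaussianMixtureSketch {
    // Hypothetical component means; the real generator's parameters are not documented here.
    static final double[][] MEANS = {
        {0.0, 0.0}, {5.0, 5.0}, {-5.0, 5.0}, {5.0, -5.0}, {-5.0, -5.0}
    };
    // Assumed shared standard deviation with no correlation between dimensions.
    static final double STD_DEV = 1.0;

    /** Samples {@code size} points; each row is {x, y, componentIndex}. */
    static double[][] sample(int size, long seed) {
        Random rng = new Random(seed);
        double[][] points = new double[size][];
        for (int i = 0; i < size; i++) {
            // Pick a component uniformly at random (assumed equal mixing weights).
            int k = rng.nextInt(MEANS.length);
            // Draw each coordinate around that component's mean.
            double x = MEANS[k][0] + STD_DEV * rng.nextGaussian();
            double y = MEANS[k][1] + STD_DEV * rng.nextGaussian();
            points[i] = new double[] {x, y, k};
        }
        return points;
    }

    public static void main(String[] args) {
        double[][] a = sample(1000, 42L);
        double[][] b = sample(1000, 42L);
        System.out.println("rows: " + a.length);
        // The same seed reproduces the same points, which keeps unit tests stable.
        System.out.println("same seed, same data: " + Arrays.deepEquals(a, b));
    }
}
```

Fixing the seed is what makes a generator like this usable in unit tests: two calls with the same `size` and `seed` produce identical data.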
    • denseTrainTest

      public static com.oracle.labs.mlrg.olcut.util.Pair<Dataset<ClusterID>,Dataset<ClusterID>> denseTrainTest()
      Generates a train/test dataset pair which is dense in the features. Each example has 4 features, {A,B,C,D}, and there are 4 clusters, {0,1,2,3}.
      Returns:
      A pair of datasets.
    • denseTrainTest

      public static com.oracle.labs.mlrg.olcut.util.Pair<Dataset<ClusterID>,Dataset<ClusterID>> denseTrainTest(double negate)
      Generates a train/test dataset pair which is dense in the features. Each example has 4 features, {A,B,C,D}, and there are 4 clusters, {0,1,2,3}.
      Parameters:
      negate - Supply -1.0 to negate some feature values.
      Returns:
      A pair of datasets.
    • sparseTrainTest

      public static com.oracle.labs.mlrg.olcut.util.Pair<Dataset<ClusterID>,Dataset<ClusterID>> sparseTrainTest()
      Generates a pair of datasets where the features are sparse and unknown features appear in the test data. The datasets have the same 4 clusters, {0,1,2,3}.
      Returns:
      A pair of datasets.
    • sparseTrainTest

      public static com.oracle.labs.mlrg.olcut.util.Pair<Dataset<ClusterID>,Dataset<ClusterID>> sparseTrainTest(double negate)
      Generates a pair of datasets where the features are sparse and unknown features appear in the test data. The datasets have the same 4 clusters, {0,1,2,3}.
      Parameters:
      negate - Supply -1.0 to negate some feature values.
      Returns:
      A pair of datasets.
    • invalidSparseExample

      public static Example<ClusterID> invalidSparseExample()
      Generates an example with the feature ids 1, 5, and 8, which do not intersect with the ids used elsewhere in this class. This should make the example empty at prediction time.
      Returns:
      An example with features {1:1.0,5:5.0,8:8.0}.
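
The reason such an example ends up empty can be sketched without Tribuo. The snippet below assumes a model keeps only the features it saw during training and drops the rest; that filtering step, the class name `UnknownFeatureSketch`, and the helper `restrictToKnown` are hypothetical and stand in for whatever the real prediction code does. The known feature names {A,B,C,D} are taken from the dense datasets described above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class UnknownFeatureSketch {
    /** Keeps only the features whose names appear in the known (training-time) set. */
    static Map<String, Double> restrictToKnown(Map<String, Double> features, Set<String> known) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : features.entrySet()) {
            if (known.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Feature names used by the dense datasets in this class.
        Set<String> known = new HashSet<>(Arrays.asList("A", "B", "C", "D"));

        // Mirrors the documented return value {1:1.0,5:5.0,8:8.0}.
        Map<String, Double> example = new LinkedHashMap<>();
        example.put("1", 1.0);
        example.put("5", 5.0);
        example.put("8", 8.0);

        // No name overlaps with the known set, so nothing survives the filter.
        System.out.println("empty after filtering: " + restrictToKnown(example, known).isEmpty());
    }
}
```

An example that becomes empty this way is a useful negative test case, since it checks how a model behaves when none of an example's features are recognised.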
    • emptyExample

      public static Example<ClusterID> emptyExample()
      Generates an example with no features.
      Returns:
      An example with no features.