Class LibSVMDataSource<T extends Output<T>>

java.lang.Object
org.tribuo.datasource.LibSVMDataSource<T>
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>, Iterable<Example<T>>, ConfigurableDataSource<T>, DataSource<T>

public final class LibSVMDataSource<T extends Output<T>> extends Object implements ConfigurableDataSource<T>
A DataSource which can read LibSVM formatted data.

It also provides a static save method, writeLibSVMFormat, which writes a dataset out in LibSVM format.

This class can read LibSVM files that are zero-indexed or one-indexed; the parsed examples are available immediately after construction. When loading test data, pass the maxFeatureID from the training data (or the number of features in the model) so that the feature names are padded with the same number of leading zeros as the training feature names.
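To make the zero-indexed versus one-indexed distinction concrete, here is a minimal, JDK-only sketch of reading the `index:value` pairs from one LibSVM line (`<label> <index>:<value> ...`). The class `LibSVMLineSketch` and its method are hypothetical names used only for illustration; this is not Tribuo's parser, and the validation rule it applies is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Tribuo.
final class LibSVMLineSketch {
    /**
     * Parses the index:value pairs of a single LibSVM line.
     * The first token is the label/target and is skipped here.
     * Assumption of this sketch: indices below the file's base
     * (0 for zero-indexed, 1 for one-indexed) are rejected.
     */
    static Map<Integer, Double> parseFeatures(String line, boolean zeroIndexed) {
        String[] tokens = line.trim().split("\\s+");
        int minIndex = zeroIndexed ? 0 : 1;
        Map<Integer, Double> features = new LinkedHashMap<>();
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            if (pair.length != 2) {
                throw new IllegalArgumentException("Malformed pair: " + tokens[i]);
            }
            int index = Integer.parseInt(pair[0]);
            if (index < minIndex) {
                throw new IllegalArgumentException(
                    "Index " + index + " is below the minimum " + minIndex);
            }
            features.put(index, Double.parseDouble(pair[1]));
        }
        return features;
    }

    public static void main(String[] args) {
        // One-indexed line: label 1, features 1 and 3.
        System.out.println(parseFeatures("1 1:0.5 3:2.0", false)); // prints {1=0.5, 3=2.0}
    }
}
```

A line such as `1 0:0.5` is valid only when the file is treated as zero-indexed, which is why the indexing base has to be known when the file is read.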

  • Constructor Details

    • LibSVMDataSource

      public LibSVMDataSource(Path path, OutputFactory<T> outputFactory) throws IOException
      Constructs a LibSVMDataSource from the supplied path and output factory.
      Parameters:
      path - The path to load.
      outputFactory - The output factory to use.
      Throws:
      IOException - If the file could not be read or is in an invalid format.
    • LibSVMDataSource

      public LibSVMDataSource(Path path, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID) throws IOException
      Constructs a LibSVMDataSource from the supplied path and output factory.

      Also allows control over the maximum feature id and whether the file is zero-indexed. The maximum feature id determines the zero-padding used when converting the integer feature numbers into Tribuo's String feature names, so it is important to set it when loading test data to ensure that the names line up with the training names. For example, if there are 110 features but the test dataset only contains features 0-90, then without setting maxFeatureID = 110 the features will be named "00" through "90" rather than the expected "000" through "090", leading to a mismatch.

      Parameters:
      path - The path to load.
      outputFactory - The output factory to use.
      zeroIndexed - Are the features in this file indexed from zero?
      maxFeatureID - The maximum feature ID allowed.
      Throws:
      IOException - If the file could not be read or is in an invalid format.
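
      The padding behaviour described for this constructor can be sketched in plain Java. This is an illustration under the assumption that the name width equals the number of decimal digits in maxFeatureID; `FeaturePaddingSketch` and `featureName` are hypothetical names, and Tribuo's internal formatting may differ in detail.

      ```java
      import java.util.Locale;

      // Hypothetical illustration only; not part of Tribuo.
      final class FeaturePaddingSketch {
          /**
           * Zero-pads featureID to the number of decimal digits in maxFeatureID.
           * Assumption: the width is derived only from maxFeatureID.
           */
          static String featureName(int featureID, int maxFeatureID) {
              int width = String.valueOf(maxFeatureID).length();
              // Locale.ROOT keeps the digits ASCII regardless of the default locale.
              return String.format(Locale.ROOT, "%0" + width + "d", featureID);
          }

          public static void main(String[] args) {
              // Test file whose largest feature id is 90, padded from its own maximum.
              System.out.println(featureName(90, 90));  // prints 90
              // The same feature, padded using the training maximum of 110.
              System.out.println(featureName(90, 110)); // prints 090
          }
      }
      ```

      With the width taken from the test file alone, feature 90 is named "90" and does not match the training name "090"; supplying the training maximum restores the match.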
    • LibSVMDataSource

      public LibSVMDataSource(URL url, OutputFactory<T> outputFactory) throws IOException
      Constructs a LibSVMDataSource from the supplied URL and output factory.
      Parameters:
      url - The URL to load.
      outputFactory - The output factory to use.
      Throws:
      IOException - If the URL could not be loaded or the data is in an invalid format.
    • LibSVMDataSource

      public LibSVMDataSource(URL url, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID) throws IOException
      Constructs a LibSVMDataSource from the supplied URL and output factory.

      Also allows control over the maximum feature id and whether the file is zero-indexed. The maximum feature id determines the zero-padding used when converting the integer feature numbers into Tribuo's String feature names, so it is important to set it when loading test data to ensure that the names line up with the training names. For example, if there are 110 features but the test dataset only contains features 0-90, then without setting maxFeatureID = 110 the features will be named "00" through "90" rather than the expected "000" through "090", leading to a mismatch.

      Parameters:
      url - The URL to load.
      outputFactory - The output factory to use.
      zeroIndexed - Are the features in this file indexed from zero?
      maxFeatureID - The maximum feature ID allowed.
      Throws:
      IOException - If the URL could not be loaded or the data is in an invalid format.
  • Method Details

    • postConfig

      public void postConfig() throws IOException
      Used by the OLCUT configuration system, and should not be called by external code.
      Specified by:
      postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
      Throws:
      IOException
    • isZeroIndexed

      public boolean isZeroIndexed()
      Returns true if this dataset is zero indexed, false otherwise (i.e., it starts from 1).
      Returns:
      True if zero indexed.
    • getMaxFeatureID

      public int getMaxFeatureID()
      Gets the maximum feature ID found.
      Returns:
      The maximum feature id.
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • getOutputFactory

      public OutputFactory<T> getOutputFactory()
      Description copied from interface: DataSource
      Returns the OutputFactory associated with this Output subclass.
      Specified by:
      getOutputFactory in interface DataSource<T extends Output<T>>
      Returns:
      The output factory.
    • getProvenance

      public DataSourceProvenance getProvenance()
      Specified by:
      getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>
    • size

      public int size()
      The number of examples.
      Returns:
      The number of examples.
    • iterator

      public Iterator<Example<T>> iterator()
      Specified by:
      iterator in interface Iterable<Example<T>>
    • writeLibSVMFormat

      public static <T extends Output<T>> void writeLibSVMFormat(Dataset<T> dataset, PrintStream out, boolean zeroIndexed, Function<T,Number> transformationFunc)
      Writes out a dataset in LibSVM format.

      Can write either zero-indexed or one-indexed feature numbers.

      Type Parameters:
      T - The type of the Output.
      Parameters:
      dataset - The dataset to write out.
      out - A stream to write it to.
      zeroIndexed - If true, start the feature numbers from zero; otherwise start from one.
      transformationFunc - A function which transforms an Output into a number.
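
The roles of zeroIndexed and transformationFunc can be illustrated with a small JDK-only sketch that writes a single row. It assumes features are already keyed by zero-based integer ids and that pairs are written in ascending index order; `LibSVMWriterSketch` and `writeRow` are hypothetical names, and this is not the implementation of writeLibSVMFormat.

```java
import java.io.PrintStream;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical illustration only; not part of Tribuo.
final class LibSVMWriterSketch {
    /**
     * Writes one row as "<number> <index>:<value> ...".
     * Assumption: zeroBasedFeatures maps zero-based feature ids to values.
     */
    static <T> void writeRow(T output, Map<Integer, Double> zeroBasedFeatures,
                             PrintStream out, boolean zeroIndexed,
                             Function<T, Number> transformationFunc) {
        StringBuilder sb = new StringBuilder();
        // The output (for example a label) is converted to a number first.
        sb.append(transformationFunc.apply(output));
        // One-indexed files shift every feature id up by one.
        int offset = zeroIndexed ? 0 : 1;
        // TreeMap gives ascending index order.
        for (Map.Entry<Integer, Double> e : new TreeMap<>(zeroBasedFeatures).entrySet()) {
            sb.append(' ').append(e.getKey() + offset).append(':').append(e.getValue());
        }
        out.print(sb.append('\n'));
    }

    public static void main(String[] args) {
        Map<Integer, Double> features = new TreeMap<>();
        features.put(0, 0.5);
        features.put(2, 2.0);
        // Maps the string label "spam" to 1 and anything else to 0.
        LibSVMWriterSketch.<String>writeRow("spam", features, System.out, false,
                s -> s.equals("spam") ? 1 : 0); // prints 1 1:0.5 3:2.0
    }
}
```

The transformation function is needed because LibSVM files store a numeric target, while an Output such as a label has no inherent numeric value.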