org.tribuo.datasource.LibSVMDataSource<T>

All Implemented Interfaces: com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>, Iterable<Example<T>>, ConfigurableDataSource<T>, DataSource<T>

public final class LibSVMDataSource<T extends Output<T>> extends Object implements ConfigurableDataSource<T>

A DataSource which can read LibSVM formatted data.

It also provides a static save method which writes LibSVM format data.

This class can read LibSVM files which are zero-indexed or one-indexed. The file is parsed during construction, so the parsed result is available as soon as the constructor returns. When loading test data, pass the maxFeatureID from the training data (or the number of features in the model) so that the feature names are formatted with the appropriate number of leading zeros.
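The effect of the leading zeros can be illustrated with a minimal sketch. It assumes the name width equals the number of decimal digits in the maximum feature ID, which matches the "00" versus "000" example given for the four-argument constructors; `FeatureNamePadding`, `width` and `featureName` are hypothetical names used only for illustration and are not part of Tribuo.

```java
import java.util.Locale;

// Illustrative sketch only; these helpers are hypothetical and not Tribuo's implementation.
final class FeatureNamePadding {
    // Assumption: the padded width is the number of decimal digits in maxFeatureID.
    static int width(int maxFeatureID) {
        return String.valueOf(maxFeatureID).length();
    }

    // Zero-pads a feature ID to the given width.
    static String featureName(int featureID, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", featureID);
    }

    public static void main(String[] args) {
        // Width taken from the test file alone (largest ID 90): prints "07".
        System.out.println(featureName(7, width(90)));
        // Width taken from the training data (maxFeatureID 110): prints "007".
        System.out.println(featureName(7, width(110)));
    }
}
```

Under this assumption the same feature ID yields two different names, which is why the test data would no longer line up with the training data.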

Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static final class | LibSVMDataSource.LibSVMDataSourceProvenance | The provenance for a LibSVMDataSource. |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| LibSVMDataSource(URL url, OutputFactory<T> outputFactory) | Constructs a LibSVMDataSource from the supplied URL and output factory. |
| LibSVMDataSource(URL url, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID) | Constructs a LibSVMDataSource from the supplied URL and output factory. |
| LibSVMDataSource(Path path, OutputFactory<T> outputFactory) | Constructs a LibSVMDataSource from the supplied path and output factory. |
| LibSVMDataSource(Path path, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID) | Constructs a LibSVMDataSource from the supplied path and output factory. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| int | getMaxFeatureID() | Gets the maximum feature ID found. |
| OutputFactory<T> | getOutputFactory() | Returns the OutputFactory associated with this Output subclass. |
| DataSourceProvenance | getProvenance() | |
| boolean | isZeroIndexed() | Returns true if this dataset is zero indexed, false otherwise (i.e., it starts from 1). |
| Iterator<Example<T>> | iterator() | |
| void | postConfig() | Used by the OLCUT configuration system, and should not be called by external code. |
| int | size() | The number of examples. |
| String | toString() | |
| static <T extends Output<T>> void | writeLibSVMFormat(Dataset<T> dataset, PrintStream out, boolean zeroIndexed, Function<T,Number> transformationFunc) | Writes out a dataset in LibSVM format. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface java.lang.Iterable
forEach, spliterator

Constructor Details
- LibSVMDataSource
  
  public LibSVMDataSource(Path path, OutputFactory<T> outputFactory) throws IOException
  
  Constructs a LibSVMDataSource from the supplied path and output factory.
  
  Parameters:
  
  path - The path to load.
  
  outputFactory - The output factory to use.
  
  Throws:
  
  IOException - If the file could not be read or is in an invalid format.
- LibSVMDataSource
  
  public LibSVMDataSource(Path path, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID) throws IOException
  
  Constructs a LibSVMDataSource from the supplied path and output factory.
  Also allows control over the maximum feature ID and whether the file is zero indexed. The maximum feature ID is used in the padding calculation that converts the integer feature numbers into Tribuo's String feature names. Set it when loading test data so that the names line up with the training names. For example, if there are 110 features but the test dataset only has features 0-90, then without setting maxFeatureID = 110 the features will be named "00" through "90" rather than the expected "000" through "090", leading to a mismatch.
  
  Parameters:
  
  path - The path to load.
  
  outputFactory - The output factory to use.
  
  zeroIndexed - Are the features in this file indexed from zero?
  
  maxFeatureID - The maximum feature ID allowed.
  
  Throws:
  
  IOException - If the file could not be read or is in an invalid format.
- LibSVMDataSource
  
  public LibSVMDataSource(URL url, OutputFactory<T> outputFactory) throws IOException
  
  Constructs a LibSVMDataSource from the supplied URL and output factory.
  
  Parameters:
  
  url - The URL to load.
  
  outputFactory - The output factory to use.
  
  Throws:
  
  IOException - If the URL could not be loaded or is in an invalid format.
- LibSVMDataSource
  
  public LibSVMDataSource(URL url, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID) throws IOException
  
  Constructs a LibSVMDataSource from the supplied URL and output factory.
  Also allows control over the maximum feature ID and whether the file is zero indexed. The maximum feature ID is used in the padding calculation that converts the integer feature numbers into Tribuo's String feature names. Set it when loading test data so that the names line up with the training names. For example, if there are 110 features but the test dataset only has features 0-90, then without setting maxFeatureID = 110 the features will be named "00" through "90" rather than the expected "000" through "090", leading to a mismatch.
  
  Parameters:
  
  url - The URL to load.
  
  outputFactory - The output factory to use.
  
  zeroIndexed - Are the features in this file indexed from zero?
  
  maxFeatureID - The maximum feature ID allowed.
  
  Throws:
  
  IOException - If the URL could not be loaded or is in an invalid format.
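To make the zeroIndexed flag concrete, here is a minimal sketch of reading one line in the general LibSVM layout (`label index:value index:value ...`). It is not Tribuo's parser: `LibSVMLineSketch` and `parseFeatures` are hypothetical names, and shifting one-indexed IDs down to a zero-based ID is an assumption made for the illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; not Tribuo's parsing code.
final class LibSVMLineSketch {
    // Parses the "index:value" pairs of one line into a map keyed by zero-based feature ID.
    // Assumption: one-indexed files are shifted down by one so both layouts share IDs.
    static Map<Integer, Double> parseFeatures(String line, boolean zeroIndexed) {
        String[] fields = line.trim().split("\\s+");
        Map<Integer, Double> features = new TreeMap<>();
        // fields[0] holds the label; the feature pairs start at fields[1].
        for (int i = 1; i < fields.length; i++) {
            int colon = fields[i].indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Expected index:value, found " + fields[i]);
            }
            int id = Integer.parseInt(fields[i].substring(0, colon));
            double value = Double.parseDouble(fields[i].substring(colon + 1));
            features.put(zeroIndexed ? id : id - 1, value);
        }
        return features;
    }

    public static void main(String[] args) {
        // The same example written with one-indexed and zero-indexed feature numbers.
        System.out.println(parseFeatures("1 1:0.5 3:2.0", false));
        System.out.println(parseFeatures("1 0:0.5 2:2.0", true));
    }
}
```

Both calls in `main` yield the same map, which shows why the flag has to match the file: reading a one-indexed file as zero-indexed would move every feature up by one position.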
Method Details
- postConfig
  
  public void postConfig() throws IOException
  
  Used by the OLCUT configuration system, and should not be called by external code.
  
  Specified by:
  
  postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
  
  Throws:
  
  IOException
- isZeroIndexed
  
  public boolean isZeroIndexed()
  
  Returns true if this dataset is zero indexed, false otherwise (i.e., it starts from 1).
  
  Returns:
  
  True if zero indexed.
- getMaxFeatureID
  
  public int getMaxFeatureID()
  
  Gets the maximum feature ID found.
  
  Returns:
  
  The maximum feature id.
- toString
  
  public String toString()
  
  Overrides:
  
  toString in class Object
- getOutputFactory
  
  public OutputFactory<T> getOutputFactory()
  
  Description copied from interface: DataSource
  
  Returns the OutputFactory associated with this Output subclass.
  
  Specified by:
  
  getOutputFactory in interface DataSource<T extends Output<T>>
  
  Returns:
  
  The output factory.
- getProvenance
  
  public DataSourceProvenance getProvenance()
  
  Specified by:
  
  getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>
- size
  
  public int size()
  
  The number of examples.
  
  Returns:
  
  The number of examples.
- iterator
  
  public Iterator<Example<T>> iterator()
  
  Specified by:
  
  iterator in interface Iterable<Example<T>>
- writeLibSVMFormat
  
  public static <T extends Output<T>> void writeLibSVMFormat(Dataset<T> dataset, PrintStream out, boolean zeroIndexed, Function<T,Number> transformationFunc)
  
  Writes out a dataset in LibSVM format.
  Can write either zero indexed or one indexed.
  
  Type Parameters:
  
  T - The type of the Output.
  
  Parameters:
  
  dataset - The dataset to write out.
  
  out - A stream to write it to.
  
  zeroIndexed - If true start the feature numbers from zero, otherwise start from one.
  
  transformationFunc - A function which transforms an Output into a number.
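The roles of zeroIndexed and transformationFunc can be sketched for a single example as follows. This is an illustration under stated assumptions rather than the code behind writeLibSVMFormat: `LibSVMWriteSketch` and `formatLine` are hypothetical, the output is stood in for by a plain String, and the features are assumed to be supplied as parallel arrays of zero-based IDs and values.

```java
import java.util.function.Function;

// Illustrative sketch only; not the implementation of writeLibSVMFormat.
final class LibSVMWriteSketch {
    // Formats one example as "label index:value index:value ...".
    // ids are assumed to be zero-based; one is added when zeroIndexed is false.
    static <T> String formatLine(T output, int[] ids, double[] values,
                                 boolean zeroIndexed, Function<T, Number> transformationFunc) {
        StringBuilder sb = new StringBuilder();
        // The transformation function turns the output into the numeric label.
        sb.append(transformationFunc.apply(output));
        int offset = zeroIndexed ? 0 : 1;
        for (int i = 0; i < ids.length; i++) {
            sb.append(' ').append(ids[i] + offset).append(':').append(values[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical mapping from a String label to a number.
        Function<String, Number> toNumber = s -> s.equals("spam") ? 1 : 0;
        int[] ids = {0, 2};
        double[] values = {0.5, 2.0};
        System.out.println(formatLine("spam", ids, values, false, toNumber)); // 1 1:0.5 3:2.0
        System.out.println(formatLine("ham", ids, values, true, toNumber));   // 0 0:0.5 2:2.0
    }
}
```

Passing the mapping as a function keeps the writer independent of the output type, which is presumably why the real method accepts a Function<T,Number> rather than fixing a conversion.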
