Class LibSVMDataSource<T extends Output<T>>
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>, Iterable<Example<T>>, ConfigurableDataSource<T>, DataSource<T>
This class reads LibSVM format files, which may be zero-indexed or one-indexed; the parsed examples are available once construction completes. It also provides a static method, writeLibSVMFormat, which writes a dataset in LibSVM format.
When loading test data, pass the maxFeatureID from the training data (or the number of features in the model) so that the feature names are formatted with the same number of leading zeros as the training feature names.
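As a rough illustration of the file format this class reads, the sketch below parses a single LibSVM line of the form "label id:value id:value ..." using only the JDK. The class and method names (LibSVMLineSketch, label, features) are hypothetical and are not part of Tribuo. The check that a one-indexed file contains no feature ID 0 is an assumption about reasonable validation, not a description of what LibSVMDataSource does internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tribuo: parses one line of LibSVM text.
final class LibSVMLineSketch {
    private LibSVMLineSketch() {}

    // The label is the first whitespace-separated token on the line.
    static String label(String line) {
        return line.trim().split("\\s+")[0];
    }

    // Parses the "id:value" tokens that follow the label.
    // Assumption: IDs below the base index (0 or 1) are treated as errors.
    static Map<Integer, Double> features(String line, boolean zeroIndexed) {
        String[] tokens = line.trim().split("\\s+");
        int minID = zeroIndexed ? 0 : 1;
        Map<Integer, Double> features = new LinkedHashMap<>();
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            if (pair.length != 2) {
                throw new IllegalArgumentException("Malformed feature token: " + tokens[i]);
            }
            int id = Integer.parseInt(pair[0]);
            if (id < minID) {
                throw new IllegalArgumentException("Feature ID " + id + " is below the base index " + minID);
            }
            features.put(id, Double.parseDouble(pair[1]));
        }
        return features;
    }
}
```

For the line "1 1:0.5 3:2.0" read as one-indexed, label returns "1" and features returns a map holding 1 -> 0.5 and 3 -> 2.0.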
-
Constructor Summary
LibSVMDataSource(URL url, OutputFactory<T> outputFactory)
- Constructs a LibSVMDataSource from the supplied URL and output factory.
LibSVMDataSource(URL url, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID)
- Constructs a LibSVMDataSource from the supplied URL and output factory.
LibSVMDataSource(Path path, OutputFactory<T> outputFactory)
- Constructs a LibSVMDataSource from the supplied path and output factory.
LibSVMDataSource(Path path, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID)
- Constructs a LibSVMDataSource from the supplied path and output factory.
-
Method Summary
int getMaxFeatureID()
- Gets the maximum feature ID found.
OutputFactory<T> getOutputFactory()
- Returns the OutputFactory associated with this Output subclass.
DataSourceProvenance getProvenance()
boolean isZeroIndexed()
- Returns true if this dataset is zero indexed, false otherwise (i.e., it starts from 1).
Iterator<Example<T>> iterator()
void postConfig()
- Used by the OLCUT configuration system, and should not be called by external code.
int size()
- The number of examples.
String toString()
static <T extends Output<T>> void writeLibSVMFormat(Dataset<T> dataset, PrintStream out, boolean zeroIndexed, Function<T, Number> transformationFunc)
- Writes out a dataset in LibSVM format.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Constructor Details
-
LibSVMDataSource
public LibSVMDataSource(Path path, OutputFactory<T> outputFactory) throws IOException
Constructs a LibSVMDataSource from the supplied path and output factory.
- Parameters:
path - The path to load.
outputFactory - The output factory to use.
- Throws:
IOException - If the file could not be read or is in an invalid format.
-
LibSVMDataSource
public LibSVMDataSource(Path path, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID) throws IOException
Constructs a LibSVMDataSource from the supplied path and output factory.
This constructor also controls the maximum feature ID and whether the file is zero-indexed. The maximum feature ID is used in the padding calculation that converts the integer feature numbers into Tribuo's String feature names, so it is important to set it when loading test data to ensure that the names line up with the training names. For example, if there are 110 features but the test dataset only contains features 0-90, then without setting maxFeatureID = 110 the features will be named "00" through "90" rather than the expected "000" through "090", leading to a mismatch.
- Parameters:
path - The path to load.
outputFactory - The output factory to use.
zeroIndexed - Are the features in this file indexed from zero?
maxFeatureID - The maximum feature ID allowed.
- Throws:
IOException - If the file could not be read or is in an invalid format.
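The padding mismatch described above can be reproduced with a small JDK-only sketch. It assumes the padding width equals the number of decimal digits in maxFeatureID, which matches the "00" versus "000" example but is not confirmed here as Tribuo's exact calculation. FeaturePaddingSketch, paddingWidth and featureName are hypothetical names.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tribuo: illustrates zero-padded feature names.
final class FeaturePaddingSketch {
    private FeaturePaddingSketch() {}

    // Assumption: the width is the number of decimal digits in maxFeatureID.
    static int paddingWidth(int maxFeatureID) {
        if (maxFeatureID < 0) {
            throw new IllegalArgumentException("maxFeatureID must be non-negative");
        }
        return Integer.toString(maxFeatureID).length();
    }

    // Formats a feature ID with leading zeros up to the padding width.
    static String featureName(int featureID, int maxFeatureID) {
        String format = "%0" + paddingWidth(maxFeatureID) + "d";
        return String.format(Locale.ROOT, format, featureID);
    }
}
```

Under this assumption, featureName(90, 90) gives "90" while featureName(90, 110) gives "090", so a test set loaded without the training maxFeatureID would produce names that do not match the training names.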
-
LibSVMDataSource
public LibSVMDataSource(URL url, OutputFactory<T> outputFactory) throws IOException
Constructs a LibSVMDataSource from the supplied URL and output factory.
- Parameters:
url - The URL to load.
outputFactory - The output factory to use.
- Throws:
IOException - If the URL could not be loaded or is in an invalid format.
-
LibSVMDataSource
public LibSVMDataSource(URL url, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID) throws IOException
Constructs a LibSVMDataSource from the supplied URL and output factory.
This constructor also controls the maximum feature ID and whether the file is zero-indexed. The maximum feature ID is used in the padding calculation that converts the integer feature numbers into Tribuo's String feature names, so it is important to set it when loading test data to ensure that the names line up with the training names. For example, if there are 110 features but the test dataset only contains features 0-90, then without setting maxFeatureID = 110 the features will be named "00" through "90" rather than the expected "000" through "090", leading to a mismatch.
- Parameters:
url - The URL to load.
outputFactory - The output factory to use.
zeroIndexed - Are the features in this file indexed from zero?
maxFeatureID - The maximum feature ID allowed.
- Throws:
IOException - If the URL could not be loaded or is in an invalid format.
-
-
Method Details
-
postConfig
public void postConfig() throws IOException
Used by the OLCUT configuration system, and should not be called by external code.
- Specified by:
postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
- Throws:
IOException
-
isZeroIndexed
public boolean isZeroIndexed()
Returns true if this dataset is zero indexed, false otherwise (i.e., it starts from 1).
- Returns:
True if zero indexed.
-
getMaxFeatureID
public int getMaxFeatureID()
Gets the maximum feature ID found.
- Returns:
The maximum feature ID.
-
toString
-
getOutputFactory
Description copied from interface: DataSource
Returns the OutputFactory associated with this Output subclass.
- Specified by:
getOutputFactory in interface DataSource<T extends Output<T>>
- Returns:
The output factory.
-
getProvenance
-
size
public int size()
The number of examples.
- Returns:
The number of examples.
-
iterator
-
writeLibSVMFormat
public static <T extends Output<T>> void writeLibSVMFormat(Dataset<T> dataset, PrintStream out, boolean zeroIndexed, Function<T, Number> transformationFunc)
Writes out a dataset in LibSVM format. Can write either zero indexed or one indexed.
- Type Parameters:
T - The type of the Output.
- Parameters:
dataset - The dataset to write out.
out - A stream to write it to.
zeroIndexed - If true start the feature numbers from zero, otherwise start from one.
transformationFunc - A function which transforms an Output into a number.
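To make the roles of zeroIndexed and transformationFunc concrete, here is a minimal JDK-only sketch that formats one LibSVM line. It is not Tribuo's implementation: LibSVMWriteSketch and formatLine are hypothetical names, the features are assumed to arrive as a sorted map keyed by zero-based position, and the output type is left as a plain generic T rather than a Tribuo Output.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.function.Function;

// Hypothetical helper, not part of Tribuo: formats one example as LibSVM text.
final class LibSVMWriteSketch {
    private LibSVMWriteSketch() {}

    // Assumption: feature keys are zero-based positions in ascending order.
    static <T> String formatLine(T output,
                                 SortedMap<Integer, Double> features,
                                 boolean zeroIndexed,
                                 Function<T, Number> transformationFunc) {
        StringBuilder sb = new StringBuilder();
        // The transformation function turns the output into the numeric label.
        sb.append(transformationFunc.apply(output));
        // One-indexed files shift every feature number up by one.
        int offset = zeroIndexed ? 0 : 1;
        for (Map.Entry<Integer, Double> e : features.entrySet()) {
            sb.append(' ').append(e.getKey() + offset).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```

With features {0 -> 0.5, 2 -> 2.0} and a function mapping the output to 1, the sketch produces "1 0:0.5 2:2.0" when zeroIndexed is true and "1 1:0.5 3:2.0" when it is false.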
-