public final class LibSVMDataSource<T extends Output<T>> extends Object implements ConfigurableDataSource<T>
A data source which reads LibSVM formatted data. It also provides a static method, writeLibSVMFormat, which writes a dataset in LibSVM format.
This class can read LibSVM files which are zero-indexed or one-indexed, and the parsed result is available after construction. When loading test data, pass the maxFeatureID from the training data (or the number of features in the model) so that the feature names are padded with the same number of leading zeros as the training feature names.
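The leading-zero padding described above can be illustrated with a small standalone sketch. This is not Tribuo's implementation: the class FeatureNamePadding and its methods are hypothetical names, and the sketch assumes the name width is simply the number of decimal digits in the maximum feature ID.

```java
import java.util.Locale;

// Hypothetical illustration of zero-padded feature names; not part of Tribuo.
class FeatureNamePadding {

    // Assumption: the padded width is the digit count of the maximum feature ID.
    static int width(int maxFeatureID) {
        return String.valueOf(Math.max(maxFeatureID, 0)).length();
    }

    // Formats a feature ID with leading zeros so every name has the same width.
    static String name(int featureID, int maxFeatureID) {
        return String.format(Locale.ROOT, "%0" + width(maxFeatureID) + "d", featureID);
    }

    public static void main(String[] args) {
        // Width taken from the test data alone (largest ID seen is 90).
        System.out.println(name(90, 90));   // prints 90
        // Width taken from the training data (maxFeatureID = 110).
        System.out.println(name(90, 110));  // prints 090
        System.out.println(name(0, 110));   // prints 000
    }
}
```

Under these assumptions, the same feature ID maps to "90" or "090" depending on the maximum supplied, which is why the training value should be reused when loading test data.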
Modifier and Type | Class and Description |
---|---|
static class | LibSVMDataSource.LibSVMDataSourceProvenance - The provenance for a LibSVMDataSource. |
Constructor and Description |
---|
LibSVMDataSource(Path path, OutputFactory<T> outputFactory) - Constructs a LibSVMDataSource from the supplied path and output factory. |
LibSVMDataSource(Path path, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID) - Constructs a LibSVMDataSource from the supplied path and output factory. |
LibSVMDataSource(URL url, OutputFactory<T> outputFactory) - Constructs a LibSVMDataSource from the supplied URL and output factory. |
LibSVMDataSource(URL url, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID) - Constructs a LibSVMDataSource from the supplied URL and output factory. |
Modifier and Type | Method and Description |
---|---|
int | getMaxFeatureID() - Gets the maximum feature ID found. |
OutputFactory<T> | getOutputFactory() - Returns the OutputFactory associated with this Output subclass. |
DataSourceProvenance | getProvenance() |
boolean | isZeroIndexed() - Returns true if this dataset is zero indexed, false otherwise (i.e., it starts from 1). |
Iterator<Example<T>> | iterator() |
void | postConfig() - Used by the OLCUT configuration system, and should not be called by external code. |
int | size() - The number of examples. |
String | toString() |
static <T extends Output<T>> void | writeLibSVMFormat(Dataset<T> dataset, PrintStream out, boolean zeroIndexed, Function<T,Number> transformationFunc) - Writes out a dataset in LibSVM format. |
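Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable: forEach, spliterator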
public LibSVMDataSource(Path path, OutputFactory<T> outputFactory) throws IOException
Parameters:
path - The path to load.
outputFactory - The output factory to use.
Throws:
IOException - If the file could not be read or is an invalid format.
public LibSVMDataSource(Path path, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID) throws IOException
Constructs a LibSVMDataSource from the supplied path and output factory. Also allows control over the maximum feature ID and whether the file is zero indexed.
The maximum feature ID is used in the padding calculation that converts the integer feature numbers into Tribuo's String feature names. It is important to set when loading test data so that the names line up with the training names. For example, if there are 110 features but the test dataset only contains features 0-90, then without setting maxFeatureID = 110 all the features will be named "00" through "90" rather than the expected "000" through "090", leading to a mismatch.
Parameters:
path - The path to load.
outputFactory - The output factory to use.
zeroIndexed - Are the features in this file indexed from zero?
maxFeatureID - The maximum feature ID allowed.
Throws:
IOException - If the file could not be read or is an invalid format.
public LibSVMDataSource(URL url, OutputFactory<T> outputFactory) throws IOException
Parameters:
url - The URL to load.
outputFactory - The output factory to use.
Throws:
IOException - If the URL could not be loaded or is in an invalid format.
public LibSVMDataSource(URL url, OutputFactory<T> outputFactory, boolean zeroIndexed, int maxFeatureID) throws IOException
Constructs a LibSVMDataSource from the supplied URL and output factory. Also allows control over the maximum feature ID and whether the file is zero indexed.
The maximum feature ID is used in the padding calculation that converts the integer feature numbers into Tribuo's String feature names. It is important to set when loading test data so that the names line up with the training names. For example, if there are 110 features but the test dataset only contains features 0-90, then without setting maxFeatureID = 110 all the features will be named "00" through "90" rather than the expected "000" through "090", leading to a mismatch.
Parameters:
url - The URL to load.
outputFactory - The output factory to use.
zeroIndexed - Are the features in this file indexed from zero?
maxFeatureID - The maximum feature ID allowed.
Throws:
IOException - If the URL could not be loaded or is in an invalid format.
public void postConfig() throws IOException
Specified by:
postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
Throws:
IOException
public boolean isZeroIndexed()
public int getMaxFeatureID()
public OutputFactory<T> getOutputFactory()
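Description copied from interface: DataSource
Returns the OutputFactory associated with this Output subclass.
Specified by:
getOutputFactory in interface DataSource<T extends Output<T>>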
public DataSourceProvenance getProvenance()
Specified by:
getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>
public int size()
public static <T extends Output<T>> void writeLibSVMFormat(Dataset<T> dataset, PrintStream out, boolean zeroIndexed, Function<T,Number> transformationFunc)
Can write either zero indexed or one indexed.
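As a rough illustration of the output layout, the sketch below formats one example as a LibSVM line of the form "label index:value index:value". It is a standalone approximation, not the Tribuo implementation: the class LibSVMLineSketch, the formatLine method, and the use of a Map keyed by zero-based feature index are all assumptions made for this example.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch of LibSVM line formatting; not part of Tribuo.
class LibSVMLineSketch {

    // Assumption: features are held as zero-based index -> value pairs.
    static <T> String formatLine(T output,
                                 Map<Integer, Double> features,
                                 boolean zeroIndexed,
                                 Function<T, Number> transformationFunc) {
        StringBuilder sb = new StringBuilder();
        // The output is converted to a number before it is written.
        sb.append(transformationFunc.apply(output));
        // One-indexed files shift every feature index up by one.
        int offset = zeroIndexed ? 0 : 1;
        // A TreeMap copy writes the features in ascending index order.
        for (Map.Entry<Integer, Double> e : new TreeMap<>(features).entrySet()) {
            sb.append(' ').append(e.getKey() + offset).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Double> features = new TreeMap<>();
        features.put(0, 1.5);
        features.put(3, 2.0);
        Function<String, Number> toNumber = s -> "spam".equals(s) ? 1 : 0;
        System.out.println(formatLine("spam", features, false, toNumber)); // prints 1 1:1.5 4:2.0
        System.out.println(formatLine("spam", features, true, toNumber));  // prints 1 0:1.5 3:2.0
    }
}
```

The transformation function plays the role described for transformationFunc: it maps an output object to the number written at the start of each line.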
Type Parameters:
T - The type of the Output.
Parameters:
dataset - The dataset to write out.
out - A stream to write it to.
zeroIndexed - If true start the feature numbers from zero, otherwise start from one.
transformationFunc - A function which transforms an Output into a number.
Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.