Class CSVLoader<T extends Output<T>>

java.lang.Object
org.tribuo.data.csv.CSVLoader<T>
Type Parameters:
T - The type of the output generated.

public class CSVLoader<T extends Output<T>> extends Object
Loads a DataSource or Dataset from a CSV file.

The delimiter and quote characters are user-controlled, so this class can parse CSVs, TSVs, semicolon-separated data, and any other format that uses a single-character delimiter.
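To make the role of the two characters concrete, here is a minimal sketch of splitting one line with a configurable separator and quote. It is an illustration only and is not Tribuo's parser: the class name DelimiterSketch and the method splitLine are hypothetical, and escaped quotes inside a field are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a toy single-character delimiter parser, not Tribuo's CSV reader. */
final class DelimiterSketch {

    /**
     * Splits a line on the separator, treating separators that appear
     * between a pair of quote characters as part of the field.
     */
    static List<String> splitLine(String line, char separator, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == quote) {
                // Entering or leaving a quoted region; the quote itself is dropped.
                inQuotes = !inQuotes;
            } else if (c == separator && !inQuotes) {
                // An unquoted separator ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // The last field has no trailing separator.
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitLine("a;\"b;c\";d", ';', '"'));        // [a, b;c, d]
        System.out.println(splitLine("1.0\t2.5\tsetosa", '\t', '"')); // [1.0, 2.5, setosa]
    }
}
```

The same routine covers semicolon-separated and tab-separated input because only the two character arguments change, which is the flexibility the constructors expose.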

This class is a simple loader intended only for numerical CSV files with a String response field. If you need more complex processing, the response field isn't present, or you don't wish to use all of the columns as features, then you should use CSVDataSource and build a RowProcessor to cope with your specific input format.

CSVLoader is thread-safe and immutable.

Multi-output responses such as MultiLabel or Regressor can be processed in two ways: as a single column of separated values, or as multiple columns. If there is a single response column, its value is passed directly to the OutputFactory. If there are multiple response columns, the name of each column is concatenated with its value, and the list of concatenated values is passed to the OutputFactory.
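The two response-handling paths described above can be sketched with plain collections. This is a hedged illustration of the documented behaviour rather than Tribuo's implementation: the class ResponseFormatSketch, the method formatResponses, and the column names dim1 and dim2 are hypothetical, and the 'column-name=column-value' form is taken from the method documentation on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: mimics the documented response handling, not Tribuo's code. */
final class ResponseFormatSketch {

    /**
     * Builds the strings that would be handed to an OutputFactory for one row.
     * A single response column passes its value through unchanged; multiple
     * columns produce "column-name=column-value" entries in set iteration order.
     */
    static List<String> formatResponses(Map<String, String> row, Set<String> responseNames) {
        List<String> formatted = new ArrayList<>();
        for (String name : responseNames) {
            String value = row.get(name);
            if (value == null) {
                throw new IllegalArgumentException("Missing response column: " + name);
            }
            formatted.add(responseNames.size() == 1 ? value : name + "=" + value);
        }
        return formatted;
    }

    public static void main(String[] args) {
        // One parsed row, keyed by column name (hypothetical data).
        Map<String, String> row = new LinkedHashMap<>();
        row.put("height", "1.5");
        row.put("dim1", "0.25");
        row.put("dim2", "3.0");

        Set<String> single = new LinkedHashSet<>(Arrays.asList("dim1"));
        Set<String> multiple = new LinkedHashSet<>(Arrays.asList("dim1", "dim2"));

        System.out.println(formatResponses(row, single));   // [0.25]
        System.out.println(formatResponses(row, multiple)); // [dim1=0.25, dim2=3.0]
    }
}
```

How the resulting strings are interpreted depends on the OutputFactory in use, so the factory must accept the form that matches the chosen column layout.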

  • Constructor Details

    • CSVLoader

      public CSVLoader(char separator, char quote, OutputFactory<T> outputFactory)
      Creates a CSVLoader using the supplied separator, quote and output factory.
      Parameters:
      separator - The separator character.
      quote - The quote character.
      outputFactory - The output factory.
    • CSVLoader

      public CSVLoader(char separator, OutputFactory<T> outputFactory)
      Creates a CSVLoader using the supplied separator and output factory. Sets the quote to CSVIterator.QUOTE.
      Parameters:
      separator - The separator character.
      outputFactory - The output factory.
    • CSVLoader

      public CSVLoader(OutputFactory<T> outputFactory)
      Creates a CSVLoader using the supplied output factory. Sets the separator to CSVIterator.SEPARATOR and the quote to CSVIterator.QUOTE.
      Parameters:
      outputFactory - The output factory.
  • Method Details

    • load

      public MutableDataset<T> load(Path csvPath, String responseName) throws IOException
      Loads a DataSource from the specified csv file then wraps it in a dataset.
      Parameters:
      csvPath - The path to load.
      responseName - The name of the response variable.
      Returns:
      A dataset containing the csv data.
      Throws:
      IOException - If the read failed.
    • load

      public MutableDataset<T> load(Path csvPath, String responseName, String[] header) throws IOException
      Loads a DataSource from the specified csv file then wraps it in a dataset.
      Parameters:
      csvPath - The path to load.
      responseName - The name of the response variable.
      header - The header of the CSV if it's not present in the file.
      Returns:
      A dataset containing the csv data.
      Throws:
      IOException - If the read failed.
    • load

      public MutableDataset<T> load(Path csvPath, Set<String> responseNames) throws IOException
      Loads a DataSource from the specified csv file then wraps it in a dataset.

      The responseNames set is traversed in iteration order to emit outputs, and should be an ordered set to ensure reproducibility.

      If there are multiple elements in responseNames then the responses are processed into the form 'column-name=column-value' before being passed to the OutputFactory for conversion into an Output.

      Parameters:
      csvPath - The path to load.
      responseNames - The names of the response variables.
      Returns:
      A dataset containing the csv data.
      Throws:
      IOException - If the read failed.
    • load

      public MutableDataset<T> load(Path csvPath, Set<String> responseNames, String[] header) throws IOException
      Loads a DataSource from the specified csv file then wraps it in a dataset.

      The responseNames set is traversed in iteration order to emit outputs, and should be an ordered set to ensure reproducibility.

      If there are multiple elements in responseNames then the responses are processed into the form 'column-name=column-value' before being passed to the OutputFactory for conversion into an Output.

      Parameters:
      csvPath - The path to load.
      responseNames - The names of the response variables.
      header - The header of the CSV if it's not present in the file.
      Returns:
      A dataset containing the csv data.
      Throws:
      IOException - If the read failed.
    • loadDataSource

      public DataSource<T> loadDataSource(Path csvPath, String responseName) throws IOException
      Loads a DataSource from the specified csv path.
      Parameters:
      csvPath - The csv to load from.
      responseName - The name of the response variable.
      Returns:
      A datasource containing the csv data.
      Throws:
      IOException - If the disk read failed.
    • loadDataSource

      public DataSource<T> loadDataSource(URL csvPath, String responseName) throws IOException
      Loads a DataSource from the specified csv URL.
      Parameters:
      csvPath - The csv to load from.
      responseName - The name of the response variable.
      Returns:
      A datasource containing the csv data.
      Throws:
      IOException - If the read failed.
    • loadDataSource

      public DataSource<T> loadDataSource(Path csvPath, String responseName, String[] header) throws IOException
      Loads a DataSource from the specified csv path.
      Parameters:
      csvPath - The csv to load from.
      responseName - The name of the response variable.
      header - The header of the CSV if it's not present in the file.
      Returns:
      A datasource containing the csv data.
      Throws:
      IOException - If the disk read failed.
    • loadDataSource

      public DataSource<T> loadDataSource(URL csvPath, String responseName, String[] header) throws IOException
      Loads a DataSource from the specified csv URL.
      Parameters:
      csvPath - The csv to load from.
      responseName - The name of the response variable.
      header - The header of the CSV if it's not present in the file.
      Returns:
      A datasource containing the csv data.
      Throws:
      IOException - If the read failed.
    • loadDataSource

      public DataSource<T> loadDataSource(Path csvPath, Set<String> responseNames) throws IOException
      Loads a DataSource from the specified csv path.

      The responseNames set is traversed in iteration order to emit outputs, and should be an ordered set to ensure reproducibility.

      If there are multiple elements in responseNames then the responses are processed into the form 'column-name=column-value' before being passed to the OutputFactory for conversion into an Output.

      Parameters:
      csvPath - The csv to load from.
      responseNames - The names of the response variables.
      Returns:
      A datasource containing the csv data.
      Throws:
      IOException - If the disk read failed.
    • loadDataSource

      public DataSource<T> loadDataSource(URL csvPath, Set<String> responseNames) throws IOException
      Loads a DataSource from the specified csv URL.

      The responseNames set is traversed in iteration order to emit outputs, and should be an ordered set to ensure reproducibility.

      If there are multiple elements in responseNames then the responses are processed into the form 'column-name=column-value' before being passed to the OutputFactory for conversion into an Output.

      Parameters:
      csvPath - The csv to load from.
      responseNames - The names of the response variables.
      Returns:
      A datasource containing the csv data.
      Throws:
      IOException - If the read failed.
    • loadDataSource

      public DataSource<T> loadDataSource(Path csvPath, Set<String> responseNames, String[] header) throws IOException
      Loads a DataSource from the specified csv path.

      The responseNames set is traversed in iteration order to emit outputs, and should be an ordered set to ensure reproducibility.

      If there are multiple elements in responseNames then the responses are processed into the form 'column-name=column-value' before being passed to the OutputFactory for conversion into an Output.

      Parameters:
      csvPath - The csv to load from.
      responseNames - The names of the response variables.
      header - The header of the CSV if it's not present in the file.
      Returns:
      A datasource containing the csv data.
      Throws:
      IOException - If the disk read failed.
    • loadDataSource

      public DataSource<T> loadDataSource(URL csvPath, Set<String> responseNames, String[] header) throws IOException
      Loads a DataSource from the specified csv URL.

      The responseNames set is traversed in iteration order to emit outputs, and should be an ordered set to ensure reproducibility.

      If there are multiple elements in responseNames then the responses are processed into the form 'column-name=column-value' before being passed to the OutputFactory for conversion into an Output.

      Parameters:
      csvPath - The csv to load from.
      responseNames - The names of the response variables.
      header - The header of the CSV if it's not present in the file.
      Returns:
      A datasource containing the csv data.
      Throws:
      IOException - If the read failed.
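
The overloads that accept a Set<String> of response names traverse it in iteration order, so the choice of Set implementation determines the order in which outputs are emitted. The sketch below shows two standard-library ways to get a deterministic order. The class and method names are hypothetical, and the column names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: building response name sets with a deterministic iteration order. */
final class ResponseOrderSketch {

    /** Keeps the caller's order: iteration follows insertion order. */
    static Set<String> inGivenOrder(String... names) {
        return new LinkedHashSet<>(Arrays.asList(names));
    }

    /** Sorts the names: iteration follows natural String ordering. */
    static Set<String> inSortedOrder(String... names) {
        return new TreeSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        System.out.println(new ArrayList<>(inGivenOrder("width", "height", "depth")));  // [width, height, depth]
        System.out.println(new ArrayList<>(inSortedOrder("width", "height", "depth"))); // [depth, height, width]
    }
}
```

Either set can be passed as the responseNames argument. A plain HashSet makes no guarantee about iteration order, which is why an ordered set is the safer choice when runs need to be reproducible.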