BERTFeatureExtractor (Tribuo 4.1.1 API)

java.lang.Object
- org.tribuo.interop.onnx.extractors.BERTFeatureExtractor<T>

Type Parameters:

T - The output type.

All Implemented Interfaces:

com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, AutoCloseable, TextFeatureExtractor<T>, TextPipeline
```
public class BERTFeatureExtractor<T extends Output<T>>
extends Object
implements AutoCloseable, TextFeatureExtractor<T>, TextPipeline
```
Builds examples and sequence examples using features from BERT.
Assumes that the BERT model is in ONNX format, generated by HuggingFace Transformers and exported using the Transformers export tool.
The tokenizer is expected to be a HuggingFace Transformers tokenizer config JSON file.

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`BERTFeatureExtractor.BERTFeatureExtractorOptions` CLI options for running BERT.
`static class`	`BERTFeatureExtractor.OutputPooling` The type of output pooling to perform.

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`ATTENTION_MASK`
`static String`	`CLASSIFICATION_TOKEN`
`static String`	`CLS_OUTPUT`
`static String`	`INPUT_IDS`
`static long`	`MASK_VALUE`
`static String`	`SEPARATOR_TOKEN`
`static String`	`TOKEN_METADATA`
`static String`	`TOKEN_OUTPUT`
`static String`	`TOKEN_TYPE_IDS`
`static long`	`TOKEN_TYPE_VALUE`
`static String`	`UNKNOWN_TOKEN`

Constructor Summary

Constructors
Constructor and Description
`BERTFeatureExtractor(OutputFactory<T> outputFactory, Path modelPath, Path tokenizerPath)` Constructs a BERTFeatureExtractor.
`BERTFeatureExtractor(OutputFactory<T> outputFactory, Path modelPath, Path tokenizerPath, BERTFeatureExtractor.OutputPooling pooling, int maxLength, boolean useCUDA)` Constructs a BERTFeatureExtractor.

Method Summary

Modifier and Type	Method and Description
`void`	`close()`
`Example<T>`	`extract(T output, String data)` Tokenizes the input using the loaded tokenizer, truncates the token list if it's longer than `maxLength` - 2 (to account for [CLS] and [SEP] tokens), and then passes the token list to `extractExample(java.util.List<java.lang.String>)`.
`Example<T>`	`extractExample(List<String> tokens)` Passes the tokens through BERT, replacing any unknown tokens with the [UNK] token.
`Example<T>`	`extractExample(List<String> tokens, T output)` Passes the tokens through BERT, replacing any unknown tokens with the [UNK] token.
`SequenceExample<T>`	`extractSequenceExample(List<String> tokens, boolean stripSentenceMarkers)` Passes the tokens through BERT, replacing any unknown tokens with the [UNK] token.
`SequenceExample<T>`	`extractSequenceExample(List<String> tokens, List<T> output, boolean stripSentenceMarkers)` Passes the tokens through BERT, replacing any unknown tokens with the [UNK] token.
`int`	`getMaxLength()` Returns the maximum length this BERT will accept.
`com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance`	`getProvenance()`
`Set<String>`	`getVocab()` Returns the vocabulary that this BERTFeatureExtractor understands.
`static void`	`main(String[] args)` Test harness for running a BERT model and inspecting the output.
`void`	`postConfig()`
`List<Feature>`	`process(String tag, String data)` Tokenizes the input using the loaded tokenizer, truncates the token list if it's longer than `maxLength` - 2 (to account for [CLS] and [SEP] tokens), and then passes the token list to `extractExample(java.util.List<java.lang.String>)`.
`void`	`reconfigureOrtSession(ai.onnxruntime.OrtSession.SessionOptions options)` Reconstructs the OrtSession using the supplied options.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - INPUT_IDS
```
public static final String INPUT_IDS
```
    See Also:
    
    Constant Field Values
  - ATTENTION_MASK
```
public static final String ATTENTION_MASK
```
    See Also:
    
    Constant Field Values
  - TOKEN_TYPE_IDS
```
public static final String TOKEN_TYPE_IDS
```
    See Also:
    
    Constant Field Values
  - TOKEN_OUTPUT
```
public static final String TOKEN_OUTPUT
```
    See Also:
    
    Constant Field Values
  - CLS_OUTPUT
```
public static final String CLS_OUTPUT
```
    See Also:
    
    Constant Field Values
  - CLASSIFICATION_TOKEN
```
public static final String CLASSIFICATION_TOKEN
```
    See Also:
    
    Constant Field Values
  - SEPARATOR_TOKEN
```
public static final String SEPARATOR_TOKEN
```
    See Also:
    
    Constant Field Values
  - UNKNOWN_TOKEN
```
public static final String UNKNOWN_TOKEN
```
    See Also:
    
    Constant Field Values
  - TOKEN_METADATA
```
public static final String TOKEN_METADATA
```
    See Also:
    
    Constant Field Values
  - MASK_VALUE
```
public static final long MASK_VALUE
```
    See Also:
    
    Constant Field Values
  - TOKEN_TYPE_VALUE
```
public static final long TOKEN_TYPE_VALUE
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - BERTFeatureExtractor
```
public BERTFeatureExtractor(OutputFactory<T> outputFactory,
                            Path modelPath,
                            Path tokenizerPath)
```
    Constructs a BERTFeatureExtractor.
    
    Parameters:
    
    outputFactory - The output factory to use for building any unknown outputs.
    
    modelPath - The path to the BERT model in ONNX format.
    
    tokenizerPath - The path to a HuggingFace tokenizer JSON file.
  - BERTFeatureExtractor
```
public BERTFeatureExtractor(OutputFactory<T> outputFactory,
                            Path modelPath,
                            Path tokenizerPath,
                            BERTFeatureExtractor.OutputPooling pooling,
                            int maxLength,
                            boolean useCUDA)
```
    Constructs a BERTFeatureExtractor.
    
    Parameters:
    
    outputFactory - The output factory to use for building any unknown outputs.
    
    modelPath - The path to the BERT model in ONNX format.
    
    tokenizerPath - The path to a HuggingFace tokenizer JSON file.
    
    pooling - The pooling type for extracted Examples.
    
    maxLength - The maximum number of wordpieces.
    
    useCUDA - Set to true to enable CUDA.
- Method Detail
  - postConfig
```
public void postConfig()
                throws com.oracle.labs.mlrg.olcut.config.PropertyException
```
    Specified by:
    
    postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
    
    Throws:
    
    com.oracle.labs.mlrg.olcut.config.PropertyException
  - getProvenance
```
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
```
    Specified by:
    
    getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
  - reconfigureOrtSession
```
public void reconfigureOrtSession(ai.onnxruntime.OrtSession.SessionOptions options)
                           throws ai.onnxruntime.OrtException
```
    Reconstructs the OrtSession using the supplied options. This allows the use of different computation backends and configurations.
    
    Parameters:
    
    options - The new session options.
    
    Throws:
    
    ai.onnxruntime.OrtException - If the native runtime failed to rebuild itself.
  - getMaxLength
```
public int getMaxLength()
```
    Returns the maximum length this BERT will accept.
    
    Returns:
    
    The maximum number of tokens, including [CLS] and [SEP]. The number of input tokens that can be supplied is therefore 2 less than this value.
  - getVocab
```
public Set<String> getVocab()
```
    Returns the vocabulary that this BERTFeatureExtractor understands.
    
    Returns:
    
    The vocabulary.
  - extractExample
```
public Example<T> extractExample(List<String> tokens)
```
    Passes the tokens through BERT, replacing any unknown tokens with the [UNK] token.
    The features of the returned example are dense, and come from the [CLS] token.
    Throws IllegalArgumentException if the list is longer than getMaxLength(). Throws IllegalStateException if the BERT model failed to produce an output.
    
    Parameters:
    
    tokens - The input tokens. Should be tokenized using the Tokenizer this BERT expects.
    
    Returns:
    
    A dense example representing the pooled output from BERT for the input tokens.
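
    The token preparation described above can be sketched in plain Java. This is an illustrative sketch, not the Tribuo implementation: the class `UnknownTokenSketch`, the method `prepare`, and the choice to reject inputs longer than `maxLength - 2` (following the note on `getMaxLength()`) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; hypothetical names, not part of Tribuo. */
final class UnknownTokenSketch {
    static final String CLS = "[CLS]";
    static final String SEP = "[SEP]";
    static final String UNK = "[UNK]";

    /**
     * Replaces out-of-vocabulary tokens with [UNK] and wraps the result
     * in [CLS] ... [SEP].
     * Assumption: the budget for input tokens is maxLength - 2.
     */
    static List<String> prepare(List<String> tokens, Set<String> vocab, int maxLength) {
        if (tokens.size() > maxLength - 2) {
            throw new IllegalArgumentException(
                "Too many tokens: " + tokens.size() + ", limit is " + (maxLength - 2));
        }
        List<String> prepared = new ArrayList<>(tokens.size() + 2);
        prepared.add(CLS);
        for (String token : tokens) {
            // Any token the vocabulary does not contain becomes [UNK].
            prepared.add(vocab.contains(token) ? token : UNK);
        }
        prepared.add(SEP);
        return prepared;
    }
}
```

    With the real class, the vocabulary would come from `getVocab()` and the limit from `getMaxLength()`.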
  - extractExample
```
public Example<T> extractExample(List<String> tokens,
                                 T output)
```
    Passes the tokens through BERT, replacing any unknown tokens with the [UNK] token.
    The features of the returned example are dense, and are controlled by the output pooling field.
    Throws IllegalArgumentException if the list is longer than getMaxLength(). Throws IllegalStateException if the BERT model failed to produce an output.
    
    Parameters:
    
    tokens - The input tokens. Should be tokenized using the Tokenizer this BERT expects.
    
    output - The ground truth output for this example.
    
    Returns:
    
    A dense example representing the pooled output from BERT for the input tokens.
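
    To make the idea of output pooling concrete, the sketch below reduces a matrix of per-token vectors to a single dense vector. This page does not list the members of `BERTFeatureExtractor.OutputPooling`, so the two strategies shown (`CLS` and `MEAN`), the class `PoolingSketch`, and the decision to average over every row passed in are assumptions for illustration only.

```java
import java.util.Arrays;

/** Illustrative sketch only; hypothetical names, not part of Tribuo. */
final class PoolingSketch {
    /** Hypothetical pooling strategies, assumed for this example. */
    enum Pooling { CLS, MEAN }

    /**
     * Pools per-token vectors into one vector.
     * tokenEmbeddings[i] is the vector for token i; row 0 is assumed to be [CLS].
     */
    static double[] pool(double[][] tokenEmbeddings, Pooling pooling) {
        if (tokenEmbeddings.length == 0) {
            throw new IllegalArgumentException("No token embeddings supplied");
        }
        int dim = tokenEmbeddings[0].length;
        switch (pooling) {
            case CLS: {
                // Use the vector produced for the [CLS] token.
                return Arrays.copyOf(tokenEmbeddings[0], dim);
            }
            case MEAN: {
                // Average each dimension over all supplied rows.
                double[] mean = new double[dim];
                for (double[] row : tokenEmbeddings) {
                    for (int j = 0; j < dim; j++) {
                        mean[j] += row[j];
                    }
                }
                for (int j = 0; j < dim; j++) {
                    mean[j] /= tokenEmbeddings.length;
                }
                return mean;
            }
            default:
                throw new IllegalStateException("Unknown pooling " + pooling);
        }
    }
}
```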
  - extractSequenceExample
```
public SequenceExample<T> extractSequenceExample(List<String> tokens,
                                                 boolean stripSentenceMarkers)
```
    Passes the tokens through BERT, replacing any unknown tokens with the [UNK] token.
    The features of each example are dense. If stripSentenceMarkers is true, the [CLS] and [SEP] tokens are removed before example generation. If it is false, they are kept and assigned the unknown output.
    Throws IllegalArgumentException if the list is longer than getMaxLength(). Throws IllegalStateException if the BERT model failed to produce an output.
    
    Parameters:
    
    tokens - The input tokens. Should be tokenized using the Tokenizer this BERT expects.
    
    stripSentenceMarkers - Remove the [CLS] and [SEP] tokens from the returned example.
    
    Returns:
    
    A dense sequence example representing the token level output from BERT.
  - extractSequenceExample
```
public SequenceExample<T> extractSequenceExample(List<String> tokens,
                                                 List<T> output,
                                                 boolean stripSentenceMarkers)
```
    Passes the tokens through BERT, replacing any unknown tokens with the [UNK] token.
    The features of each example are dense. The output list must have one entry per token. If stripSentenceMarkers is true, the [CLS] and [SEP] tokens are removed before example generation. If it is false, they are kept and assigned the unknown output.
    Throws IllegalArgumentException if the list is longer than getMaxLength(). Throws IllegalStateException if the BERT model failed to produce an output.
    
    Parameters:
    
    tokens - The input tokens. Should be tokenized using the Tokenizer this BERT expects.
    
    output - The ground truth output for this example.
    
    stripSentenceMarkers - Remove the [CLS] and [SEP] tokens from the returned example.
    
    Returns:
    
    A dense sequence example representing the token level output from BERT.
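
    The effect of `stripSentenceMarkers` can be sketched without the model. The example below assumes the model returns one row per wordpiece with the [CLS] row first and the [SEP] row last; `SentenceMarkerSketch`, `selectRows`, and `alignOutputs` are hypothetical names, and the real implementation may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; hypothetical names, not part of Tribuo. */
final class SentenceMarkerSketch {
    /**
     * Chooses which embedding rows become examples.
     * rows is assumed to hold the [CLS] row first and the [SEP] row last.
     */
    static List<double[]> selectRows(List<double[]> rows, boolean stripSentenceMarkers) {
        if (rows.size() < 2) {
            throw new IllegalArgumentException("Expected at least the [CLS] and [SEP] rows");
        }
        if (stripSentenceMarkers) {
            // Drop the first and last rows, keeping only the input tokens.
            return new ArrayList<>(rows.subList(1, rows.size() - 1));
        }
        return new ArrayList<>(rows);
    }

    /**
     * Aligns ground truth outputs with the selected rows. When the markers
     * are kept, they receive the supplied unknown output.
     */
    static <T> List<T> alignOutputs(List<T> outputs, T unknownOutput, boolean stripSentenceMarkers) {
        List<T> aligned = new ArrayList<>(outputs.size() + 2);
        if (!stripSentenceMarkers) {
            aligned.add(unknownOutput); // output for [CLS]
        }
        aligned.addAll(outputs);
        if (!stripSentenceMarkers) {
            aligned.add(unknownOutput); // output for [SEP]
        }
        return aligned;
    }
}
```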
  - close
```
public void close()
           throws ai.onnxruntime.OrtException
```
    Specified by:
    
    close in interface AutoCloseable
    
    Throws:
    
    ai.onnxruntime.OrtException
  - extract
```
public Example<T> extract(T output,
                          String data)
```
    Tokenizes the input using the loaded tokenizer, truncates the token list if it's longer than maxLength - 2 (to account for [CLS] and [SEP] tokens), and then passes the token list to extractExample(java.util.List<java.lang.String>).
    
    Specified by:
    
    extract in interface TextFeatureExtractor<T extends Output<T>>
    
    Parameters:
    
    output - The output object.
    
    data - The input text.
    
    Returns:
    
    An example containing BERT embedding features and the requested output.
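
    The truncation step described above amounts to keeping at most `maxLength - 2` tokens so that [CLS] and [SEP] still fit. A minimal sketch, assuming tokens are simply cut from the end of the list; `TruncationSketch` and `truncate` are hypothetical names, not Tribuo API.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; hypothetical names, not part of Tribuo. */
final class TruncationSketch {
    /**
     * Keeps at most maxLength - 2 tokens, leaving room for [CLS] and [SEP].
     * Assumption: surplus tokens are dropped from the end of the list.
     */
    static List<String> truncate(List<String> tokens, int maxLength) {
        int budget = maxLength - 2;
        if (budget < 0) {
            throw new IllegalArgumentException("maxLength must be at least 2, found " + maxLength);
        }
        if (tokens.size() > budget) {
            return new ArrayList<>(tokens.subList(0, budget));
        }
        return new ArrayList<>(tokens);
    }
}
```

    For example, with a maximum length of 5, a list of five tokens is cut to its first three, while a list of two tokens is returned unchanged.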
  - process
```
public List<Feature> process(String tag,
                             String data)
```
    Tokenizes the input using the loaded tokenizer, truncates the token list if it's longer than maxLength - 2 (to account for [CLS] and [SEP] tokens), and then passes the token list to extractExample(java.util.List<java.lang.String>).
    
    Specified by:
    
    process in interface TextPipeline
    
    Parameters:
    
    tag - A tag to prefix all the generated feature names with.
    
    data - The input text.
    
    Returns:
    
    The BERT features for the supplied data.
  - main
```
public static void main(String[] args)
                 throws IOException,
                        ai.onnxruntime.OrtException
```
    Test harness for running a BERT model and inspecting the output.
    
    Parameters:
    
    args - The CLI arguments.
    
    Throws:
    
    IOException - If the files couldn't be read or written to.
    
    ai.onnxruntime.OrtException - If the BERT model failed to load, or threw an exception during computation.
