public class Subtokenizer extends Tokenizer_ImplBase
This tokenizer is named 'Subtokenizer' because it deliberately over-generates tokens, which can then serve as input to another tokenization approach. In particular, its output can feed BIO-style tokenization with a classifier: each token generated by this tokenizer is assigned a label such as B-TOKEN (begins a token) or I-TOKEN (continues a token), either from a given set of gold-standard tokens or from the predictions of a classifier.
Please see the corresponding unit tests for examples of how this tokenizer produces tokens.
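The BIO labeling described above can be illustrated with a minimal, self-contained sketch. It does not use this class: the pattern `SUBTOKEN`, the class name `BioLabelSketch`, and both helper methods are hypothetical stand-ins, and the actual `subtokensRegex` may split text differently. The sketch also assumes every subtoken lies inside some gold-standard token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BioLabelSketch {

  // Hypothetical over-generating pattern (NOT the real subtokensRegex):
  // a run of letters, a run of digits, or any single non-whitespace character.
  public static final Pattern SUBTOKEN = Pattern.compile("\\p{L}+|\\p{N}+|\\S");

  /** Returns the {begin, end} character offsets of each subtoken in the text. */
  public static List<int[]> subtokenSpans(String text) {
    List<int[]> spans = new ArrayList<>();
    Matcher m = SUBTOKEN.matcher(text);
    while (m.find()) {
      spans.add(new int[] {m.start(), m.end()});
    }
    return spans;
  }

  /**
   * Labels a subtoken B-TOKEN if it starts at the same offset as a gold token,
   * otherwise I-TOKEN. Assumes each subtoken falls inside some gold token.
   */
  public static List<String> bioLabels(List<int[]> subtokens, int[][] goldSpans) {
    List<String> labels = new ArrayList<>();
    for (int[] sub : subtokens) {
      String label = "I-TOKEN";
      for (int[] gold : goldSpans) {
        if (sub[0] == gold[0]) {
          label = "B-TOKEN";
          break;
        }
      }
      labels.add(label);
    }
    return labels;
  }

  public static void main(String[] args) {
    String text = "It cost $3.50";
    // Gold-standard tokens: "It", "cost", "$", "3.50"
    int[][] gold = {{0, 2}, {3, 7}, {8, 9}, {9, 13}};

    List<int[]> spans = subtokenSpans(text);
    List<String> labels = bioLabels(spans, gold);
    for (int i = 0; i < spans.size(); i++) {
      int[] s = spans.get(i);
      System.out.println(text.substring(s[0], s[1]) + "\t" + labels.get(i));
    }
  }
}
```

With this stand-in pattern, the gold token "3.50" is over-split into the subtokens "3", "." and "50", which are labeled B-TOKEN, I-TOKEN and I-TOKEN. A downstream classifier trained on such labels could then learn to merge them back into one token.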
| Modifier and Type | Field and Description |
|---|---|
| static Pattern | multipleWhitespacePattern |
| static String | multipleWhitespaceRegex |
| static Pattern | subtokensPattern |
| static String | subtokensRegex |
| Constructor and Description |
|---|
| Subtokenizer() |
| Modifier and Type | Method and Description |
|---|---|
| String[] | getTokenTexts(String text) |
Methods inherited from class Tokenizer_ImplBase: getTokens
public static Pattern multipleWhitespacePattern
public static String multipleWhitespaceRegex
public static Pattern subtokensPattern
public static String subtokensRegex
public Subtokenizer()
public String[] getTokenTexts(String text)
getTokenTexts in class Tokenizer_ImplBase
Copyright © 2014. All rights reserved.