public class RuleBasedTokenizer extends Object implements Tokenizer
| Modifier and Type | Field and Description |
|---|---|
| static Pattern | AlphaAposAlpha: Alphabetic apostrophe and alphabetic. |
| static Pattern | alphaAposNonAlpha: Alphabetic apostrophe and non alpha. |
| static Pattern | asciiHex: Non printable control characters. |
| static Pattern | beginLink: Re-tokenize beginning of link. |
| static Pattern | commaNoDigit: Comma and no digit. |
| static Pattern | detokenParagraphs: De-tokenize paragraph marks. |
| static Pattern | digitCommaNoDigit: Digit comma and non digit. |
| static Pattern | dotmultiDot: Multi dot pattern and extra dot. |
| static Pattern | dotmultiDotAny: Dot multi pattern followed by anything. |
| static Pattern | doubleSpaces |
| static Pattern | endLink |
| static Pattern | endOfSentenceApos: Tokenize apostrophes occurring at the end of the string. |
| static Pattern | englishApos: Split English apostrophes. |
| static Pattern | multiDots: Multidots. |
| static Pattern | noAlphaAposNoAlpha: No alphabetic apostrophe and no alphabetic. |
| static Pattern | noAlphaDigitAposAlpha: Non alpha, digit, apostrophe and alpha. |
| static Pattern | noDigitComma: No digit comma. |
| static Pattern | noDigitCommaDigit: Non digit comma and digit. |
| static Pattern | qexc: Question and exclamation marks (do not separate if multiple). |
| static Pattern | replacement |
| static Pattern | spaceDashSpace: Dashes or slashes preceded or followed by space. |
| static Pattern | specials: Tokenize everything but these characters. |
| static String | TLD: Top level domains for stopping the wrongLink pattern below. |
| static Pattern | wrongLink: Detect wrongly tokenized links. |
| static Pattern | yearApos: Digit apostrophe and s (for 1990's). |
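
The field descriptions for yearApos and qexc suggest rules that separate a trailing 's from a year and keep runs of question or exclamation marks together. The following is a minimal sketch of how such rules might be written with java.util.regex. The class name PunctuationSketch, the regular expressions, and the space-separated output are all assumptions made for illustration; they are not the expressions used by RuleBasedTokenizer.

```java
import java.util.regex.Pattern;

class PunctuationSketch {
    // Hypothetical stand-in for a yearApos-style rule: digit, apostrophe, "s".
    static final Pattern YEAR_APOS = Pattern.compile("(\\d)'(s)");
    // Hypothetical stand-in for a qexc-style rule: a run of ? or ! stays one unit.
    static final Pattern QEXC = Pattern.compile("([?!]+)");
    // Collapses the extra spaces that the rules above may introduce.
    static final Pattern DOUBLE_SPACES = Pattern.compile(" {2,}");

    static String separate(String text) {
        // Put a space between the year and its 's suffix.
        String out = YEAR_APOS.matcher(text).replaceAll("$1 '$2");
        // Surround each run of ?/! with spaces without splitting the run.
        out = QEXC.matcher(out).replaceAll(" $1 ");
        return DOUBLE_SPACES.matcher(out).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints: Back in the 1990 's ?!
        System.out.println(separate("Back in the 1990's?!"));
    }
}
```

Applying the rules in sequence and cleaning up whitespace at the end mirrors the presence of a doubleSpaces field above, although the order used by the actual tokenizer is not stated on this page.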
| Constructor and Description |
|---|
| RuleBasedTokenizer(String text, Properties properties): Construct a rule based tokenizer. |
public static Pattern replacement
public static Pattern doubleSpaces
public static Pattern asciiHex
public static Pattern specials
public static Pattern qexc
public static Pattern spaceDashSpace
public static Pattern multiDots
public static Pattern dotmultiDot
public static Pattern dotmultiDotAny
public static Pattern noDigitComma
public static Pattern commaNoDigit
public static Pattern digitCommaNoDigit
public static Pattern noDigitCommaDigit
public static final String TLD
public static Pattern wrongLink
public static Pattern beginLink
public static Pattern endLink
public static Pattern noAlphaAposNoAlpha
public static Pattern noAlphaDigitAposAlpha
public static Pattern alphaAposNonAlpha
public static Pattern AlphaAposAlpha
public static Pattern englishApos
public static Pattern yearApos
public static Pattern endOfSentenceApos
public static Pattern detokenParagraphs
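
The four comma fields (noDigitComma, commaNoDigit, digitCommaNoDigit, noDigitCommaDigit) are described only by the characters around the comma. One plausible reading is that a comma is split off unless it sits between two digits, as in 1,000. The sketch below illustrates that reading under stated assumptions: the class CommaSketch and both regular expressions are hypothetical and are not taken from the library.

```java
import java.util.regex.Pattern;

class CommaSketch {
    // Hypothetical rule: a comma preceded by a non-digit character.
    static final Pattern NON_DIGIT_COMMA = Pattern.compile("([^\\d]),");
    // Hypothetical rule: a comma followed by a non-digit character.
    static final Pattern COMMA_NON_DIGIT = Pattern.compile(",([^\\d])");

    static String separateCommas(String text) {
        // Split off commas that follow a non-digit ("Yes," -> "Yes , ").
        String out = NON_DIGIT_COMMA.matcher(text).replaceAll("$1 , ");
        // Split off commas that precede a non-digit ("2015, we" -> "2015 , we").
        out = COMMA_NON_DIGIT.matcher(out).replaceAll(" , $1");
        // Commas between two digits match neither rule and stay attached.
        return out.replaceAll(" {2,}", " ").trim();
    }

    public static void main(String[] args) {
        // prints: Yes , it cost 1,000 euros , roughly
        System.out.println(separateCommas("Yes, it cost 1,000 euros, roughly"));
    }
}
```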
public RuleBasedTokenizer(String text, Properties properties)
text - the text used for offset calculation
properties - the options
public static void normalizeTokens(List<List<Token>> tokens, String lang)
tokens - the tokens
lang - the language
Copyright © 2015 IXA pipes. All rights reserved.