public class Tokenizer extends Object
The rules generally follow those of the Penn Tree Bank, with two exceptions: hyphenated items are split, with the hyphen as a separate token, and single quotes (') are always treated as separate tokens unless they are part of a standard suffix ('s, 'm, 'd, 're, 've, n't, 'll).
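The hyphen and suffix rules can be illustrated with a small standalone sketch. This is not the Tokenizer source: the class name SuffixSplitSketch and its methods are hypothetical, it handles a single whitespace-delimited word, and it ignores all other punctuation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the rules described above; not part of the Tokenizer class.
final class SuffixSplitSketch {
    // The standard suffixes listed in the class description.
    private static final String[] SUFFIXES = {"n't", "'ll", "'re", "'ve", "'s", "'m", "'d"};

    /** Splits one whitespace-delimited word into tokens. */
    static List<String> split(String word) {
        List<String> out = new ArrayList<>();
        String lower = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            // A suffix only counts when some stem precedes it.
            if (lower.length() > suffix.length() && lower.endsWith(suffix)) {
                int cut = word.length() - suffix.length();
                out.addAll(splitPunct(word.substring(0, cut)));
                out.add(word.substring(cut)); // the suffix stays in one piece
                return out;
            }
        }
        return splitPunct(word);
    }

    /** Makes every hyphen and every remaining single quote its own token. */
    private static List<String> splitPunct(String s) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '-' || c == '\'') {
                if (current.length() > 0) {
                    out.add(current.toString());
                    current.setLength(0);
                }
                out.add(String.valueOf(c));
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }
}
```

Under these assumptions, SuffixSplitSketch.split("state-of-the-art") yields [state, -, of, -, the, -, art], split("don't") yields [do, n't], and split("'hello'") yields [', hello, '].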
For a capitalized word, we set the feature case=cap, except that at the beginning of a sentence, the token is marked case=forcedCap. In addition, words following a ``, ", `, or _ are marked forcedCap.
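The case rule can be sketched as follows. This is a hedged illustration, not the Tokenizer source: the class CaseFeatureSketch is hypothetical, it assumes the input array holds the tokens of exactly one sentence (so index 0 is the sentence start), and it uses null for tokens that are not capitalized, which the description above does not specify.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the case feature rule; not part of the Tokenizer class.
final class CaseFeatureSketch {
    // Tokens after which a capitalized word is marked forcedCap.
    private static final Set<String> FORCING =
            new HashSet<>(Arrays.asList("``", "\"", "`", "_"));

    /** Returns one case value per token: "cap", "forcedCap", or null (assumed) if not capitalized. */
    static String[] caseFeatures(String[] sentenceTokens) {
        String[] features = new String[sentenceTokens.length];
        for (int i = 0; i < sentenceTokens.length; i++) {
            String token = sentenceTokens[i];
            if (token.isEmpty() || !Character.isUpperCase(token.charAt(0))) {
                continue; // not capitalized: leave the feature unset
            }
            boolean forced = (i == 0) || FORCING.contains(sentenceTokens[i - 1]);
            features[i] = forced ? "forcedCap" : "cap";
        }
        return features;
    }
}
```

For the tokens The, court, cited, ``, Brown, v., Board, '' this sketch marks The and Brown as forcedCap and Board as cap, and leaves the remaining tokens unset.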
The tokenizer is loosely based on the version for OAK.
| Constructor and Description |
|---|
| Tokenizer() |

| Modifier and Type | Method and Description |
|---|---|
| static Annotation[] | gatherTokens(Document doc, Span span): returns an array containing all token annotations in span of doc. |
| static String[] | gatherTokenStrings(Document doc, Span span): returns an array of Strings corresponding to all the tokens in span of doc. |
| static void | main(String[] args): performs a very simple validation of the tokenizer, returning a success or failure indication. |
| static int | skipWS(Document doc, int posn, int end): advances to the next non-whitespace character in a document. |
| static int | skipWS(String text, int posn, int end): advances to the next non-whitespace character in a String. |
| static int | skipWSX(Document doc, int posn, int end): advances to the next non-whitespace character in a document, skipping any XML tags. |
| static int | skipWSX(String text, int posn, int end) |
| static void | tokenize(Document doc, Span span): tokenizes the portion of Document doc covered by span. |
| static String[] | tokenize(String text): tokenizes the argument string. |
| static void | tokenizeOnWS(Document doc, Span span): tokenizes portion 'span' of 'doc', splitting only on white space. |

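public static void tokenize(Document doc, Span span)
tokenizes the portion of Document doc covered by span.

public static String[] tokenize(String text)
tokenizes the argument string.

public static void tokenizeOnWS(Document doc, Span span)
tokenizes portion 'span' of 'doc', splitting only on white space.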
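
Splitting only on white space, as tokenizeOnWS does, can be sketched on a plain String. This is a standalone illustration under stated assumptions: the class WhitespaceTokenSketch and the method tokenOffsets are hypothetical, the sketch returns offsets instead of adding annotations to a Document, and it does not model how the Tokenizer records token spans.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of whitespace-only tokenization; not part of the Tokenizer class.
final class WhitespaceTokenSketch {
    /**
     * Returns {start, end} offsets (end exclusive) of each maximal run of
     * non-whitespace characters inside text[start, end).
     */
    static List<int[]> tokenOffsets(String text, int start, int end) {
        List<int[]> spans = new ArrayList<>();
        int i = start;
        while (i < end) {
            // Skip the whitespace before the next token.
            while (i < end && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= end) {
                break;
            }
            int tokenStart = i;
            // Consume the token itself.
            while (i < end && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            spans.add(new int[] {tokenStart, i});
        }
        return spans;
    }
}
```

For the 17-character string "  Hello,  world! " with start 0 and end 17, the sketch returns the offsets {2, 8} and {10, 16}. Punctuation stays attached to its word because only white space separates tokens.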

public static int skipWS(Document doc, int posn, int end)
advances to the next non-whitespace character in a document. posn is a character position within Document doc. Returns posn (if that character position is occupied by a non-whitespace character), or the position of the next non-whitespace character, or end if all the characters up to end are whitespace.

public static int skipWS(String text, int posn, int end)
advances to the next non-whitespace character in a String. posn is a character position within String text. Returns posn (if that character position is occupied by a non-whitespace character), or the position of the next non-whitespace character, or end if all the characters up to end are whitespace.

public static int skipWSX(Document doc, int posn, int end)
advances to the next non-whitespace character in a document, skipping any XML tags.

public static int skipWSX(String text, int posn, int end)
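
The return-value contract of skipWS, and the tag skipping of skipWSX, can be sketched on plain Strings. This is a minimal sketch, not the Tokenizer source: the class SkipWsSketch is hypothetical, and treating any run from '<' to the next '>' as an XML tag, and leaving an unterminated '<' as ordinary text, are assumptions made for illustration.

```java
// Hypothetical illustration of the skipWS / skipWSX contracts; not part of the Tokenizer class.
final class SkipWsSketch {
    /** Returns the first position in [posn, end) holding a non-whitespace character, or end. */
    static int skipWS(String text, int posn, int end) {
        while (posn < end && Character.isWhitespace(text.charAt(posn))) {
            posn++;
        }
        return posn;
    }

    /** Like skipWS, but also steps over anything of the form <...> (assumed tag syntax). */
    static int skipWSX(String text, int posn, int end) {
        while (posn < end) {
            char c = text.charAt(posn);
            if (Character.isWhitespace(c)) {
                posn++;
            } else if (c == '<') {
                int close = text.indexOf('>', posn);
                if (close < 0 || close >= end) {
                    return posn; // assumption: an unterminated '<' is ordinary text
                }
                posn = close + 1;
            } else {
                return posn;
            }
        }
        return posn;
    }
}
```

With these definitions, skipWS("  ab", 0, 4) returns 2, skipWS("a   ", 1, 4) returns 4 (the value of end), and skipWSX("  <b> x", 0, 7) returns 6, the position of x.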
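
public static Annotation[] gatherTokens(Document doc, Span span)
returns an array containing all token annotations in span of doc.

public static String[] gatherTokenStrings(Document doc, Span span)
returns an array of Strings corresponding to all the tokens in span of doc.

public static void main(String[] args)
performs a very simple validation of the tokenizer, returning a success or failure indication.
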
Copyright © 2016 New York University. All rights reserved.