Simple English document tokenizer that splits up words on whitespace
or punctuation, but keeps word-internal punctuation within the word.
Skips whitespace.
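
The splitting rule described above can be illustrated with a small, self-contained sketch. This is not the library's implementation: the object name `WordTokenizerSketch`, the regular expression, and the choice of apostrophe, period, and hyphen as word-internal punctuation are all assumptions made for illustration.

```scala
// Hypothetical stand-in for illustration only; not SimpleEnglishTokenizer itself.
object WordTokenizerSketch extends (String => Iterable[String]) {
  // First alternative: a run of letters/digits, optionally continued by an
  // internal ', . or - that is immediately followed by more letters/digits.
  // Second alternative: any single character that is neither whitespace nor
  // a letter/digit, emitted as its own token.
  // Whitespace matches neither alternative, so it is skipped.
  private val Token =
    """[\p{L}\p{N}]+(?:['.\-][\p{L}\p{N}]+)*|[^\s\p{L}\p{N}]""".r

  def apply(text: String): Iterable[String] = Token.findAllIn(text).toList
}

object WordTokenizerSketchDemo {
  def main(args: Array[String]): Unit = {
    val tokens = WordTokenizerSketch("It's a state-of-the-art tokenizer, isn't it?")
    // prints: It's | a | state-of-the-art | tokenizer | , | isn't | it | ?
    println(tokens.mkString(" | "))
  }
}
```

In this sketch the comma and question mark become separate tokens, while the apostrophes and hyphens stay inside their words. The real tokenizer's patterns may treat edge cases differently.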

Because this class may improve over time in non-backwards-compatible ways,
the default behavior of SimpleEnglishTokenizer.apply() is to return an
instance of SimpleEnglishTokenizer.V1. To get an instance of the
old version (based on patterns by Steven Bethard), you can call
SimpleEnglishTokenizer.V0().
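
The versioning scheme described above (a default factory that returns the newest version, with older versions still reachable by name) might be structured roughly as follows. Every name here (`SketchTokenizer`, `VersionedSketch`) and both tokenizing rules are hypothetical; in particular, the V0 body below is a plain whitespace split and does not reproduce the Bethard patterns.

```scala
// Hypothetical names throughout; this mirrors the versioning pattern only.
trait SketchTokenizer extends (String => Iterable[String])

object VersionedSketch {
  // Older behavior, frozen so existing callers keep getting the same tokens.
  class V0 extends SketchTokenizer {
    def apply(text: String): Iterable[String] =
      text.split("\\s+").filter(_.nonEmpty).toList
  }
  object V0 { def apply(): SketchTokenizer = new V0 }

  // Newer behavior: punctuation is split off unless it is word-internal.
  class V1 extends SketchTokenizer {
    private val Token =
      """[\p{L}\p{N}]+(?:['.\-][\p{L}\p{N}]+)*|[^\s\p{L}\p{N}]""".r
    def apply(text: String): Iterable[String] = Token.findAllIn(text).toList
  }
  object V1 { def apply(): SketchTokenizer = new V1 }

  // The default factory tracks the newest version and may change over time.
  def apply(): SketchTokenizer = V1()
}
```

Under this layout, code that needs stable output pins a version explicitly, for example `VersionedSketch.V0()`, while code that wants the current best behavior calls `VersionedSketch()`.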
Linear Supertypes
Tokenizer, Serializable, Serializable, (String) ⇒ Iterable[String], AnyRef, Any