A word segmenter backed by Java's BreakIterator.
Finds all occurrences of the given pattern in the document.
Splits the input document according to the given pattern.
Simple English document tokenizer that splits up words on whitespace or punctuation, but keeps word-internal punctuation within the word.
Abstract trait for tokenizers, which act as functions from a String to an Iterable[String].
Tokenizes by splitting on the regular expression \s+.
PTBTokenizer tokenizes sentences into Penn Treebank-style tokens.
Companion object for Tokenizer that supports automatic TextSerialization of Tokenizer and its subtypes.
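
The summaries above can be illustrated with a minimal sketch. All of the names below (SketchTokenizer, SketchWhitespaceTokenizer, and so on) are hypothetical stand-ins invented for this example, not the library's actual classes; the sketch only assumes what the descriptions state, namely that a tokenizer is a function from a String to an Iterable[String], and it uses nothing beyond the Scala and Java standard libraries.

```scala
import java.text.BreakIterator
import java.util.Locale
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the abstract trait: a function String => Iterable[String].
trait SketchTokenizer extends (String => Iterable[String])

// Splits on the regular expression \s+ and drops any empty piece.
object SketchWhitespaceTokenizer extends SketchTokenizer {
  def apply(doc: String): Iterable[String] =
    doc.trim.split("\\s+").toIndexedSeq.filter(_.nonEmpty)
}

// Returns every occurrence of the pattern in the document.
class SketchRegexSearchTokenizer(pattern: String) extends SketchTokenizer {
  private val regex = pattern.r
  def apply(doc: String): Iterable[String] = regex.findAllIn(doc).toIndexedSeq
}

// Returns the pieces of the document that lie between matches of the pattern.
class SketchRegexSplitTokenizer(pattern: String) extends SketchTokenizer {
  def apply(doc: String): Iterable[String] = doc.split(pattern).toIndexedSeq
}

// Word segmentation via java.text.BreakIterator. Segments without a letter
// or digit (whitespace, bare punctuation) are skipped; that filter is an
// assumption of this sketch, not documented behavior.
class SketchBreakIteratorTokenizer(locale: Locale = Locale.ENGLISH) extends SketchTokenizer {
  def apply(doc: String): Iterable[String] = {
    val it = BreakIterator.getWordInstance(locale)
    it.setText(doc)
    val out = ArrayBuffer.empty[String]
    var start = it.first()
    var end = it.next()
    while (end != BreakIterator.DONE) {
      val piece = doc.substring(start, end)
      if (piece.exists(_.isLetterOrDigit)) out += piece
      start = end
      end = it.next()
    }
    out.toIndexedSeq
  }
}
```

Because each sketch tokenizer is an ordinary function, it can be applied directly, as in SketchWhitespaceTokenizer("a b  c"), or passed to map over a collection of documents. Search and split are complementary: the first keeps what the pattern matches, the second keeps what lies between the matches.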