chalk.text

package tokenize

Type Members

  1. class JavaWordTokenizer extends Tokenizer

    A word segmenter backed by Java's BreakIterator.

  2. case class RegexSearchTokenizer(pattern: String) extends Tokenizer with Product with Serializable

    Finds all occurrences of the given pattern in the document.

  3. case class RegexSplitTokenizer(pattern: String) extends Tokenizer with Product with Serializable

    Splits the input document according to the given pattern.

  4. trait SimpleEnglishTokenizer extends Tokenizer

    Simple English document tokenizer that splits up words on whitespace or punctuation, but keeps word-internal punctuation within the word.

  5. trait Tokenizer extends (String) ⇒ Iterable[String] with Serializable

    Abstract trait for tokenizers, which act as functions from a String to an Iterable[String].

  6. class WhitespaceTokenizer extends RegexSplitTokenizer

    Tokenizes by splitting on the regular expression \s+.
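
The type members above all share one shape: a tokenizer is a function from a String to an Iterable[String]. The following is a minimal, self-contained sketch of that shape using only the Scala and Java standard libraries. The names mirror the signatures listed above, but the bodies are assumptions made for illustration and are not chalk's actual implementation. In particular, dropping whitespace-only pieces in the BreakIterator version is an assumption.

```scala
import java.text.BreakIterator
import scala.collection.mutable.ArrayBuffer

// Illustrative stand-ins only; the bodies are assumptions, not chalk's source.
object TokenizeSketch {

  // A tokenizer is a function from a document to its tokens.
  trait Tokenizer extends (String => Iterable[String]) with Serializable

  // Split style: the pattern describes the separators between tokens.
  case class RegexSplitTokenizer(pattern: String) extends Tokenizer {
    def apply(doc: String): Iterable[String] =
      doc.split(pattern).toIndexedSeq
  }

  // Search style: the pattern describes the tokens themselves.
  case class RegexSearchTokenizer(pattern: String) extends Tokenizer {
    def apply(doc: String): Iterable[String] =
      pattern.r.findAllIn(doc).toIndexedSeq
  }

  // Whitespace tokenization is the split style specialised to \s+.
  class WhitespaceTokenizer extends RegexSplitTokenizer("\\s+")

  // Word segmentation driven by java.text.BreakIterator.
  class JavaWordTokenizer extends Tokenizer {
    def apply(doc: String): Iterable[String] = {
      val breaker = BreakIterator.getWordInstance()
      breaker.setText(doc)
      val tokens = ArrayBuffer.empty[String]
      var start = breaker.first()
      var end = breaker.next()
      while (end != BreakIterator.DONE) {
        val piece = doc.substring(start, end)
        // Assumption: segments made up only of whitespace are not tokens.
        if (piece.exists(c => !c.isWhitespace)) tokens += piece
        start = end
        end = breaker.next()
      }
      tokens.toIndexedSeq
    }
  }
}
```

With these stand-ins, TokenizeSketch.RegexSplitTokenizer(",")("a,b,c") yields the tokens a, b and c, while TokenizeSketch.RegexSearchTokenizer("\\d+")("tel 555 0199") yields 555 and 0199. The two regex tokenizers are duals: one names what to discard, the other names what to keep.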

Value Members

  1. object JavaWordTokenizer extends JavaWordTokenizer

  2. object PTBTokenizer extends Tokenizer

    Tokenizes sentences into Penn Treebank (PTB) style tokens.

  3. object SimpleEnglishTokenizer extends Serializable

  4. object Tokenizer extends Serializable

    Companion object for Tokenizer that supports automatic TextSerialization of Tokenizer and its subtypes.

  5. object WhitespaceTokenizer extends Serializable
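
Several of the value members above are singleton objects that are themselves tokenizers (for example, an object extending the class of the same name). Because a tokenizer is a function, such an object can be passed anywhere a String => Iterable[String] is expected. The sketch below shows that pattern with a hypothetical CommaTokenizer, which is not part of chalk and exists only for illustration.

```scala
// Illustrative only: CommaTokenizer is a hypothetical name, not a chalk type.
object SingletonSketch {

  trait Tokenizer extends (String => Iterable[String]) with Serializable

  // A class plus a singleton instance of it, mirroring the
  // "object X extends X" pattern in the listing above.
  class CommaTokenizer extends Tokenizer {
    def apply(doc: String): Iterable[String] =
      doc.split(",").map(_.trim).toIndexedSeq
  }
  object CommaTokenizer extends CommaTokenizer

  val docs = List("a, b", "c")

  // The singleton is passed directly as a function value.
  val tokenized: List[Iterable[String]] = docs.map(CommaTokenizer)

  // Function1.andThen composes a post-processing step onto the tokenizer.
  val upperCased: String => Iterable[String] =
    CommaTokenizer.andThen(_.map(_.toUpperCase))
}
```

Keeping a ready-made instance in an object of the same name avoids constructing a new tokenizer at every call site while still allowing subclasses of the class.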
