package epic.preprocess

Type Members

  1. class JavaSentenceSegmenter extends SentenceSegmenter

    A sentence segmenter backed by Java's BreakIterator. Given an input string, it returns an iterator over its sentences.

  2. class JavaWordTokenizer extends Tokenizer

    A word segmenter backed by Java's BreakIterator. Given an input string, it returns the word tokens it contains. Spaces are not returned; punctuation is.

  3. case class MLSentenceSegmenter(inf: ClassificationInference) extends SentenceSegmenter with Serializable with Product with Serializable
    Annotations
    @SerialVersionUID()
  4. class NewLineSentenceSegmenter extends SentenceSegmenter
  5. case class RegexSearchTokenizer(pattern: String) extends Tokenizer with Product with Serializable

    Finds all occurrences of the given pattern in the document.

  6. case class RegexSplitTokenizer(pattern: String) extends Tokenizer with Product with Serializable

    Splits the input document according to the given pattern. The text matched by the pattern (the split points themselves) is not returned.

  7. class SegmentingIterator extends Iterator[Span]
  8. trait SentenceSegmenter extends StringAnalysisFunction[Any, Sentence] with (String) ⇒ Iterable[String] with Serializable

  9. class StreamSentenceSegmenter extends AnyRef

    TODO

  10. trait Tokenizer extends StringAnalysisFunction[Sentence, Token] with Serializable with (String) ⇒ IndexedSeq[String]

    Abstract trait for tokenizers, which annotate sentence-segmented text with tokens. Tokenizers work with both raw strings and epic.slab.StringSlabs.

    Annotations
    @SerialVersionUID()
  11. class TreebankTokenizer extends Tokenizer with Serializable
    Annotations
    @SerialVersionUID()
  12. class WhitespaceTokenizer extends RegexSplitTokenizer

    Tokenizes by splitting on the regular expression \s+.
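
To illustrate the BreakIterator-backed behaviour described for JavaSentenceSegmenter and JavaWordTokenizer, here is a minimal standalone sketch that uses only java.text.BreakIterator. It is not this package's implementation: the object BreakIteratorSketch, its methods, and the choice of Locale.US as the default are all hypothetical names and assumptions made for the example.

```scala
import java.text.BreakIterator
import java.util.Locale
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, not part of epic.preprocess.
object BreakIteratorSketch {

  // Collects the substrings between consecutive boundaries reported by `it`.
  private def pieces(text: String, it: BreakIterator): IndexedSeq[String] = {
    it.setText(text)
    val out = ArrayBuffer.empty[String]
    var start = it.first()
    var end = it.next()
    while (end != BreakIterator.DONE) {
      out += text.substring(start, end)
      start = end
      end = it.next()
    }
    out.toIndexedSeq
  }

  // Sentence segmentation: one string per sentence, surrounding whitespace trimmed.
  def sentences(text: String, locale: Locale = Locale.US): IndexedSeq[String] =
    pieces(text, BreakIterator.getSentenceInstance(locale))
      .map(_.trim)
      .filter(_.nonEmpty)

  // Word tokenization: whitespace-only pieces are dropped, punctuation is kept.
  def words(text: String, locale: Locale = Locale.US): IndexedSeq[String] =
    pieces(text, BreakIterator.getWordInstance(locale))
      .filter(_.exists(c => !c.isWhitespace))
}
```

With this sketch, BreakIteratorSketch.words("Hello, world!") keeps the comma and the exclamation mark as separate tokens and discards the space, which matches the "doesn't return spaces, does return punctuation" behaviour documented for JavaWordTokenizer.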

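The difference between RegexSearchTokenizer (tokens are the matches) and RegexSplitTokenizer (tokens are the text between matches) can be sketched with plain Scala regular expressions. This is an assumption-laden illustration, not the package's code: the object RegexTokenizerSketch and its methods are hypothetical, and dropping empty strings after a split is a choice made here for the example.

```scala
// Hypothetical helper, not part of epic.preprocess.
object RegexTokenizerSketch {

  // "Search" style: every match of the pattern becomes a token.
  def searchTokens(pattern: String, text: String): IndexedSeq[String] =
    pattern.r.findAllIn(text).toIndexedSeq

  // "Split" style: the pattern marks separators; only the text between
  // matches is kept. Empty pieces are dropped (an assumption of this sketch).
  def splitTokens(pattern: String, text: String): IndexedSeq[String] =
    text.split(pattern).filter(_.nonEmpty).toIndexedSeq

  // Whitespace tokenization as a special case of splitting on \s+.
  def whitespaceTokens(text: String): IndexedSeq[String] =
    splitTokens("\\s+", text)
}
```

For example, searchTokens("\\w+", "Hello, world!") keeps only the two word matches, while splitTokens(",\\s*", "a, b,c") returns the three pieces around the commas and never the commas themselves.
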
Value Members

  1. object JavaSentenceSegmenter extends JavaSentenceSegmenter
  2. object JavaWordTokenizer extends JavaWordTokenizer
  3. object MLSentenceSegmenter extends Serializable
  4. object RegexSentenceSegmenter extends SentenceSegmenter

    A simple regex sentence segmenter.

  5. object SegmentSentences
  6. object StreamSentenceSegmenter
  7. object TreebankTokenizer extends TreebankTokenizer
  8. object WhitespaceTokenizer extends Serializable
