package preprocess
Type Members
-
class
JavaSentenceSegmenter extends SentenceSegmenter
A Sentence Segmenter backed by Java's BreakIterator. Given an input string, it returns an iterator over sentences.
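As a rough illustration of what BreakIterator-backed sentence segmentation looks like, the sketch below uses only `java.text.BreakIterator` from the JDK. It is not this class's implementation: the object `SentenceSketch`, the helper `sentences`, the `Locale.US` default, and the trimming of each sentence are all assumptions made for the example.

```scala
import java.text.BreakIterator
import java.util.Locale

// Hypothetical helper, not part of the documented package.
object SentenceSketch {
  def sentences(text: String, locale: Locale = Locale.US): Iterator[String] = {
    val breaker = BreakIterator.getSentenceInstance(locale)
    breaker.setText(text)
    val out = Vector.newBuilder[String]
    // Boundaries are character offsets into `text`; walk them pairwise.
    var start = breaker.first()
    var end = breaker.next()
    while (end != BreakIterator.DONE) {
      // Assumption: trim surrounding whitespace and skip empty pieces.
      val sentence = text.substring(start, end).trim
      if (sentence.nonEmpty) out += sentence
      start = end
      end = breaker.next()
    }
    out.result().iterator
  }
}
```

Calling `SentenceSketch.sentences("Hello world. How are you?")` yields one string per detected sentence. Where the boundaries fall depends on the locale rules built into the JDK.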
-
class
JavaWordTokenizer extends Tokenizer
A Word Segmenter backed by Java's BreakIterator. Given an input string, it returns an iterator over words. It does not return whitespace, but it does return punctuation.
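The behavior described here (skip whitespace, keep punctuation) can be approximated with the JDK's word-level BreakIterator. This is a minimal sketch under stated assumptions, not the class's actual code: `WordSketch`, `words`, and the `Locale.US` default are hypothetical names and choices introduced for illustration.

```scala
import java.text.BreakIterator
import java.util.Locale

// Hypothetical helper, not part of the documented package.
object WordSketch {
  def words(text: String, locale: Locale = Locale.US): IndexedSeq[String] = {
    val breaker = BreakIterator.getWordInstance(locale)
    breaker.setText(text)
    val out = Vector.newBuilder[String]
    var start = breaker.first()
    var end = breaker.next()
    while (end != BreakIterator.DONE) {
      val piece = text.substring(start, end)
      // Drop whitespace-only pieces; keep words and punctuation.
      if (!piece.forall(_.isWhitespace)) out += piece
      start = end
      end = breaker.next()
    }
    out.result()
  }
}
```

For example, `WordSketch.words("Hello, world!")` returns the two words and the two punctuation marks as separate elements, with the space between them dropped.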
-
case class
MLSentenceSegmenter(inf: ClassificationInference) extends SentenceSegmenter with Serializable with Product with Serializable
- Annotations
- @SerialVersionUID()
- class NewLineSentenceSegmenter extends SentenceSegmenter
-
case class
RegexSearchTokenizer(pattern: String) extends Tokenizer with Product with Serializable
Finds all occurrences of the given pattern in the document.
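The idea of tokenizing by pattern search can be sketched with the Scala standard library's `Regex.findAllIn`. The object `RegexSearchSketch` and the function `findAll` are hypothetical names for this example, and the sketch makes no claim about how RegexSearchTokenizer is implemented internally.

```scala
// Hypothetical helper, not part of the documented package.
object RegexSearchSketch {
  // Every non-overlapping match of `pattern` becomes one token, in order.
  def findAll(pattern: String, text: String): IndexedSeq[String] =
    pattern.r.findAllIn(text).toIndexedSeq
}
```

With the pattern `\w+`, for instance, only runs of word characters are returned and everything between them is discarded.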
-
case class
RegexSplitTokenizer(pattern: String) extends Tokenizer with Product with Serializable
Splits the input document according to the given pattern. The matched separators themselves are not returned.
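Splitting on a pattern is the complement of searching for one: the text between matches is kept and the matches are dropped. The sketch below shows this with `String.split`. The names `RegexSplitSketch` and `split` are hypothetical, and filtering out empty pieces is an assumption of this example rather than documented behavior of RegexSplitTokenizer.

```scala
// Hypothetical helper, not part of the documented package.
object RegexSplitSketch {
  def split(pattern: String, text: String): IndexedSeq[String] =
    // String.split can yield a leading empty string when the text starts
    // with a separator; this sketch assumes such pieces are unwanted.
    text.split(pattern).toIndexedSeq.filter(_.nonEmpty)
}
```

Using the pattern `\s+` here gives whitespace tokenization, which is the pattern WhitespaceTokenizer is documented to split on.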
- class SegmentingIterator extends Iterator[Span]
- trait SentenceSegmenter extends StringAnalysisFunction[Any, Sentence] with (String) ⇒ Iterable[String] with Serializable
-
class
StreamSentenceSegmenter extends AnyRef
TODO
-
trait
Tokenizer extends StringAnalysisFunction[Sentence, Token] with Serializable with (String) ⇒ IndexedSeq[String]
Abstract trait for tokenizers, which annotate sentence-segmented text with tokens. Tokenizers work with both raw strings and epic.slab.StringSlabs.
- Annotations
- @SerialVersionUID()
-
class
TreebankTokenizer extends Tokenizer with Serializable
- Annotations
- @SerialVersionUID()
-
class
WhitespaceTokenizer extends RegexSplitTokenizer
Tokenizes by splitting on the regular expression \s+.
Value Members
- object JavaSentenceSegmenter extends JavaSentenceSegmenter
- object JavaWordTokenizer extends JavaWordTokenizer
- object MLSentenceSegmenter extends Serializable
-
object
RegexSentenceSegmenter extends SentenceSegmenter
A simple regex sentence segmenter.
- object SegmentSentences
- object StreamSentenceSegmenter
- object TreebankTokenizer extends TreebankTokenizer
- object WhitespaceTokenizer extends Serializable