Class AnalyzingSentenceTokenizer

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public class AnalyzingSentenceTokenizer
    extends org.apache.lucene.analysis.Tokenizer
    A Tokenizer that splits the input into sentences and emits only those sentences that do not contain too many stopwords. Sentences that contain many commas are split into their comma-separated parts, and each part is analyzed separately. If the input contains only a single sentence, it is always emitted.
    Author:
    Shopping24 GmbH
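
The filtering behaviour described above can be illustrated with a minimal, standalone sketch. It is not the class's implementation and does not use Lucene: the class name `StopwordSentenceFilterSketch`, the naive punctuation-based sentence splitter, and the definition of the ratio as stopwords divided by words are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: a hypothetical stand-in, not the AnalyzingSentenceTokenizer code. */
final class StopwordSentenceFilterSketch {

    /** Naive splitter (assumption): breaks after '.', '!' or '?' followed by whitespace. */
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    /** Assumed ratio: stopwords divided by all words; 0 for a sentence without words. */
    static float stopwordRatio(String sentence, Set<String> stopWords) {
        int words = 0;
        int stops = 0;
        for (String w : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (w.isEmpty()) {
                continue;
            }
            words++;
            if (stopWords.contains(w)) {
                stops++;
            }
        }
        return words == 0 ? 0f : (float) stops / words;
    }

    /** Keeps sentences whose stopword ratio does not exceed the limit. */
    static List<String> filter(String text, Set<String> stopWords, float maxStopwordRatio) {
        List<String> sentences = splitSentences(text);
        if (sentences.size() <= 1) {
            return sentences; // a single sentence is always emitted
        }
        List<String> kept = new ArrayList<>();
        for (String s : sentences) {
            if (stopwordRatio(s, stopWords) <= maxStopwordRatio) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("the", "is", "on", "it", "a");
        String text = "The cat is on the mat. Durable steel frame with padded seat.";
        // The first sentence has 4 stopwords in 6 words and is dropped at a limit of 0.5.
        System.out.println(filter(text, stopWords, 0.5f));
    }
}
```

In this sketch, passing only "The cat is on the mat." returns that sentence unchanged, mirroring the rule that a lone sentence is always emitted.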
    • Nested Class Summary

      • Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource

        org.apache.lucene.util.AttributeSource.State
    • Field Summary

      • Fields inherited from class org.apache.lucene.analysis.Tokenizer

        input
      • Fields inherited from class org.apache.lucene.analysis.TokenStream

        DEFAULT_TOKEN_ATTRIBUTE_FACTORY
    • Constructor Summary

      Constructors 
      Constructor Description
      AnalyzingSentenceTokenizer(org.apache.lucene.util.AttributeFactory factory, boolean removeBadSentences, org.apache.lucene.analysis.CharArraySet stopWords, float commaWordThreshold, float maxStopwordRatio, int minSentenceLength)
      Constructs a tokenizer using the given AttributeFactory; the input is supplied afterwards via setReader.
    • Method Summary

      Modifier and Type    Method                      Description
      void                 end()                       Sets the final offset; does not reset internal state.
      boolean              incrementToken()
      protected boolean    incrementTokenInternal()
      void                 reset()                     Method is called after the input has been set.
      • Methods inherited from class org.apache.lucene.analysis.Tokenizer

        close, correctOffset, setReader
      • Methods inherited from class org.apache.lucene.util.AttributeSource

        addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
    • Constructor Detail

      • AnalyzingSentenceTokenizer

        public AnalyzingSentenceTokenizer(org.apache.lucene.util.AttributeFactory factory,
                                          boolean removeBadSentences,
                                          org.apache.lucene.analysis.CharArraySet stopWords,
                                          float commaWordThreshold,
                                          float maxStopwordRatio,
                                          int minSentenceLength)
        Constructs a tokenizer using the given AttributeFactory; the input is supplied afterwards via setReader.
        Parameters:
        factory - the AttributeFactory used to create attribute instances.
        removeBadSentences - if true, sentences with too many stopwords are filtered out.
        stopWords - the set of stopwords used to judge each sentence.
        commaWordThreshold - the "comma density" threshold; a sentence that exceeds it is split into sub-sentences that are analyzed individually.
        maxStopwordRatio - the maximum ratio of stopwords in a sentence; a sentence that exceeds it is filtered out.
        minSentenceLength - the minimum number of words a sentence must contain to be analyzed; shorter sentences are not analyzed and are always emitted.
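
How the parameters might interact for a single sentence can be sketched as follows. This is a hedged illustration, not the actual implementation: the class `SentenceThresholdSketch` is hypothetical, "comma density" is assumed to mean commas divided by words, the checks are assumed to run in the order shown, and a plain `Set<String>` stands in for `CharArraySet` so the example runs without Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: hypothetical per-sentence decision logic under stated assumptions. */
final class SentenceThresholdSketch {
    private final boolean removeBadSentences;
    private final Set<String> stopWords;
    private final float commaWordThreshold;
    private final float maxStopwordRatio;
    private final int minSentenceLength;

    SentenceThresholdSketch(boolean removeBadSentences, Set<String> stopWords,
                            float commaWordThreshold, float maxStopwordRatio,
                            int minSentenceLength) {
        this.removeBadSentences = removeBadSentences;
        this.stopWords = stopWords;
        this.commaWordThreshold = commaWordThreshold;
        this.maxStopwordRatio = maxStopwordRatio;
        this.minSentenceLength = minSentenceLength;
    }

    /** Lower-cased words of the text, split on anything that is not a letter or digit. */
    private static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    /** True if the share of stopwords among the words exceeds maxStopwordRatio. */
    private boolean tooManyStopwords(List<String> words) {
        if (words.isEmpty()) {
            return false;
        }
        long stops = words.stream().filter(stopWords::contains).count();
        return (float) stops / words.size() > maxStopwordRatio;
    }

    /** Returns the pieces of the sentence that would be emitted (possibly none). */
    List<String> analyze(String sentence) {
        List<String> out = new ArrayList<>();
        List<String> w = words(sentence);
        // Filtering disabled, or sentence too short to analyze: emit unchanged.
        if (!removeBadSentences || w.size() < minSentenceLength) {
            out.add(sentence);
            return out;
        }
        // Assumed definition of comma density: commas per word.
        long commas = sentence.chars().filter(c -> c == ',').count();
        boolean commaHeavy = (float) commas / w.size() > commaWordThreshold;
        String[] parts = commaHeavy ? sentence.split(",") : new String[] { sentence };
        for (String part : parts) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty() && !tooManyStopwords(words(trimmed))) {
                out.add(trimmed);
            }
        }
        return out;
    }
}
```

With stopwords {the, is, on, it, a}, a comma threshold of 0.2, a maximum ratio of 0.5 and a minimum length of 3, this sketch emits "It is." unchanged (too short to analyze), drops "The cat is on the mat.", and reduces "Steel frame, it is a, padded seat." to the parts "Steel frame" and "padded seat.".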
    • Method Detail

      • end

        public void end()
                 throws IOException
        Sets the final offset; does not reset internal state.
        Overrides:
        end in class org.apache.lucene.analysis.TokenStream
        Throws:
        IOException
      • reset

        public void reset()
                   throws IOException
        Method is called after the input has been set. This should reset all internal state and adjust to the new input.
        Overrides:
        reset in class org.apache.lucene.analysis.Tokenizer
        Throws:
        IOException
      • incrementToken

        public final boolean incrementToken()
                                     throws IOException
        Specified by:
        incrementToken in class org.apache.lucene.analysis.TokenStream
        Returns:
        true to indicate to the caller to read the current attribute state and false to indicate the end of the token stream.
        Throws:
        IOException
      • incrementTokenInternal

        protected boolean incrementTokenInternal()
                                          throws IOException
        Returns:
        true if the current attribute state should be emitted
        Throws:
        IOException