Package com.s24.search.solr.analyzers
Class AnalyzingSentenceTokenizer
java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
      org.apache.lucene.analysis.Tokenizer
        com.s24.search.solr.analyzers.AnalyzingSentenceTokenizer

All Implemented Interfaces:
Closeable, AutoCloseable
public class AnalyzingSentenceTokenizer
extends org.apache.lucene.analysis.Tokenizer

Tokenizer which splits the input into sentences and emits only those sentences that do not contain too many stopwords. Sentences that contain many commas are split into their comma-separated parts and analyzed per part. If the input contains only a single sentence, it is always emitted.

Author:
Shopping24 GmbH
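The filtering rule described above can be sketched in plain Java without any Lucene dependency. This is only an illustration under stated assumptions, not the class's actual implementation: the class name SentenceFilterSketch, the regex-based sentence and word splitting, and the comparison "ratio <= maxStopwordRatio" are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; NOT the Lucene-based code of AnalyzingSentenceTokenizer.
class SentenceFilterSketch {

    // Assumed tokenization: split on anything that is not a letter or digit.
    static List<String> splitWords(String sentence) {
        List<String> result = new ArrayList<>();
        for (String w : sentence.split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    // Fraction of words that are stopwords; 0 for a sentence without words.
    static float stopwordRatio(List<String> words, Set<String> stopWords) {
        if (words.isEmpty()) {
            return 0f;
        }
        int hits = 0;
        for (String w : words) {
            if (stopWords.contains(w.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return (float) hits / words.size();
    }

    // Sentences shorter than minSentenceLength are not analyzed and always kept.
    static boolean keep(String sentence, Set<String> stopWords,
                        float maxStopwordRatio, int minSentenceLength) {
        List<String> words = splitWords(sentence);
        if (words.size() < minSentenceLength) {
            return true;
        }
        return stopwordRatio(words, stopWords) <= maxStopwordRatio;
    }

    // Assumed sentence splitting: break after '.', '!' or '?' followed by whitespace.
    static List<String> filter(String input, Set<String> stopWords,
                               float maxStopwordRatio, int minSentenceLength) {
        List<String> kept = new ArrayList<>();
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return kept;
        }
        String[] sentences = trimmed.split("(?<=[.!?])\\s+");
        if (sentences.length == 1) {
            kept.add(sentences[0]); // a single sentence is always emitted
            return kept;
        }
        for (String s : sentences) {
            if (keep(s, stopWords, maxStopwordRatio, minSentenceLength)) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("it", "is", "the", "a", "of", "and"));
        String text = "It is the a of the and. Solr tokenizers split text into searchable terms.";
        System.out.println(filter(text, stop, 0.5f, 3));
    }
}
```

In this sketch the first sentence consists entirely of stopwords and is dropped, while the second is kept. The same stopword-only sentence passed alone would be emitted, mirroring the single-sentence rule.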
Constructor Summary
Constructor: AnalyzingSentenceTokenizer(org.apache.lucene.util.AttributeFactory factory, boolean removeBadSentences, org.apache.lucene.analysis.CharArraySet stopWords, float commaWordThreshold, float maxStopwordRatio, int minSentenceLength)
Description: Construct a token stream processing the given input using the given AttributeFactory.
Method Summary
Modifier and Type    Method                      Description
void                 end()                       Sets the final offset, does not reset internal state.
boolean              incrementToken()
protected boolean    incrementTokenInternal()
void                 reset()                     Method is called after the input has been set.
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
Constructor Detail
AnalyzingSentenceTokenizer
public AnalyzingSentenceTokenizer(org.apache.lucene.util.AttributeFactory factory, boolean removeBadSentences, org.apache.lucene.analysis.CharArraySet stopWords, float commaWordThreshold, float maxStopwordRatio, int minSentenceLength)

Construct a token stream processing the given input using the given AttributeFactory.

Parameters:
factory - the factory.
removeBadSentences - if true, sentences with too many stopwords are filtered out.
stopWords - the stopwords.
commaWordThreshold - the threshold that defines the "comma density" that, if exceeded, causes a sentence to be split into sub-sentences that are analyzed individually.
maxStopwordRatio - if the ratio of stopwords exceeds this threshold, the sentence is filtered out.
minSentenceLength - a sentence must contain at least this many words, otherwise it is not analyzed and always emitted.
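The role of commaWordThreshold can be illustrated with a small plain-Java sketch. The documentation does not define "comma density" precisely, so the sketch assumes it is the number of commas divided by the number of whitespace-separated words. The class name CommaSplitSketch and both method names are hypothetical and do not belong to this library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; the real tokenizer may compute comma density differently.
class CommaSplitSketch {

    // ASSUMPTION: density = commas / whitespace-separated words.
    static float commaDensity(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return 0f;
        }
        int commas = 0;
        for (char c : trimmed.toCharArray()) {
            if (c == ',') {
                commas++;
            }
        }
        int words = trimmed.split("\\s+").length;
        return (float) commas / words;
    }

    // Returns the sentence unchanged, or its comma-separated parts if the density
    // exceeds the threshold. Each part would then be analyzed on its own.
    static List<String> parts(String sentence, float commaWordThreshold) {
        if (commaDensity(sentence) <= commaWordThreshold) {
            return Collections.singletonList(sentence.trim());
        }
        List<String> result = new ArrayList<>();
        for (String p : sentence.split(",")) {
            String t = p.trim();
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parts("red shoes, blue jeans, green hats, and more", 0.2f));
        System.out.println(parts("We ship worldwide, usually within three business days", 0.2f));
    }
}
```

With a threshold of 0.2, the first sentence (3 commas over 8 words) is split into four parts, while the second (1 comma over 8 words) stays whole.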
Method Detail
end
public void end() throws IOException

Sets the final offset, does not reset internal state.

Overrides:
end in class org.apache.lucene.analysis.TokenStream
Throws:
IOException
reset
public void reset() throws IOException

Method is called after the input has been set. This should reset all internal state and adjust to the new input.

Overrides:
reset in class org.apache.lucene.analysis.Tokenizer
Throws:
IOException
incrementToken
public final boolean incrementToken() throws IOException

Specified by:
incrementToken in class org.apache.lucene.analysis.TokenStream
Returns:
true if the caller should read the current attribute state, false if the end of the token stream has been reached.
Throws:
IOException
incrementTokenInternal
protected boolean incrementTokenInternal() throws IOException

Returns:
true if the current attribute state should be emitted
Throws:
IOException