lux.index.analysis
Class TextOffsetTokenStream
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
lux.index.analysis.TextOffsetTokenStream
- All Implemented Interfaces:
- Closeable
- Direct Known Subclasses:
- AttributeTokenStream, ElementTokenStream, XmlTextTokenStream
public abstract class TextOffsetTokenStream
- extends org.apache.lucene.analysis.TokenStream
This TokenStream records the offsets (character positions in the original text) of every token.
It records the start offset of each text node and, wherever the length of the serialized XML
differs from the length of the text, the offset just after the discrepancy. For example, if a
character entity (like &amp;) occurs in the XML, it is translated to "&" in the text, and a
character offset is recorded for the character just following the "&".
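The offset bookkeeping described above can be illustrated with a minimal, self-contained sketch. It is not the lux Offsets class: the names OffsetSketch, addTextNode and toXmlOffset are hypothetical, and only the five predefined XML entities are decoded. It records one (text offset, XML offset) pair at the start of a text node and one just after each entity, then maps a token's text offset back to its position in the serialized XML.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the lux Offsets API.
class OffsetSketch {
    // Parallel lists: offset in the decoded text -> offset in the serialized XML.
    final List<Integer> textOffsets = new ArrayList<>();
    final List<Integer> xmlOffsets = new ArrayList<>();
    final StringBuilder text = new StringBuilder();

    // Decode one serialized text node that begins at xmlStart in the document.
    void addTextNode(String serialized, int xmlStart) {
        record(text.length(), xmlStart); // start offset of the text node
        int i = 0;
        while (i < serialized.length()) {
            int semi = serialized.charAt(i) == '&' ? serialized.indexOf(';', i) : -1;
            if (semi > i) {
                text.append(decode(serialized.substring(i + 1, semi)));
                i = semi + 1;
                // The lengths now differ: record the offset just after the entity.
                record(text.length(), xmlStart + i);
            } else {
                text.append(serialized.charAt(i++));
            }
        }
    }

    // Map an offset in the decoded text back to the serialized XML.
    int toXmlOffset(int textOffset) {
        for (int k = textOffsets.size() - 1; k >= 0; k--) {
            if (textOffsets.get(k) <= textOffset) {
                return xmlOffsets.get(k) + (textOffset - textOffsets.get(k));
            }
        }
        throw new IllegalArgumentException("no offset recorded before " + textOffset);
    }

    private void record(int textOffset, int xmlOffset) {
        textOffsets.add(textOffset);
        xmlOffsets.add(xmlOffset);
    }

    // Assumption: only the five predefined XML entities are handled.
    private static char decode(String name) {
        switch (name) {
            case "amp": return '&';
            case "lt": return '<';
            case "gt": return '>';
            case "quot": return '"';
            case "apos": return '\'';
            default: throw new IllegalArgumentException("unsupported entity: " + name);
        }
    }

    public static void main(String[] args) {
        OffsetSketch s = new OffsetSketch();
        // Text node of <a>AT&amp;T rocks</a>; its content starts at XML offset 3.
        s.addTextNode("AT&amp;T rocks", 3);
        System.out.println(s.text);           // decoded text
        System.out.println(s.toXmlOffset(3)); // XML offset of the second "T"
    }
}
```

In this sketch the second "T" sits at offset 3 in the decoded text but at offset 10 in the serialized XML, which is the kind of correction a highlighter would need to mark the right characters in the original document.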
| Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource |
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State |
| Field Summary |
| protected Reader | charStream |
| protected Iterator<net.sf.saxon.s9api.XdmNode> | contentIter |
| protected net.sf.saxon.s9api.XdmNode | curNode |
| protected static net.sf.saxon.s9api.XdmSequenceIterator | EMPTY |
| protected org.apache.lucene.analysis.tokenattributes.CharTermAttribute | termAtt |
| Constructor Summary |
| TextOffsetTokenStream(String fieldName, org.apache.lucene.analysis.Analyzer analyzer, org.apache.lucene.analysis.TokenStream wrapped, net.sf.saxon.s9api.XdmNode doc, Offsets offsets) |
| Methods inherited from class org.apache.lucene.analysis.TokenStream |
close, end |
| Methods inherited from class org.apache.lucene.util.AttributeSource |
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState |
curNode
protected net.sf.saxon.s9api.XdmNode curNode
contentIter
protected Iterator<net.sf.saxon.s9api.XdmNode> contentIter
termAtt
protected org.apache.lucene.analysis.tokenattributes.CharTermAttribute termAtt
charStream
protected Reader charStream
EMPTY
protected static final net.sf.saxon.s9api.XdmSequenceIterator EMPTY
TextOffsetTokenStream
public TextOffsetTokenStream(String fieldName,
org.apache.lucene.analysis.Analyzer analyzer,
org.apache.lucene.analysis.TokenStream wrapped,
net.sf.saxon.s9api.XdmNode doc,
Offsets offsets)
resetTokenizer
protected boolean resetTokenizer(CharSequence text)
reset
public void reset()
throws IOException
- Overrides:
reset in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
reset
public void reset(Reader reader)
throws IOException
- Throws:
IOException
incrementToken
public boolean incrementToken()
throws IOException
- Specified by:
incrementToken in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
getWrappedTokenStream
public org.apache.lucene.analysis.TokenStream getWrappedTokenStream()
- Returns:
- the underlying stream of text tokens, to which this stream adds XML-related attributes.
setWrappedTokenStream
protected void setWrappedTokenStream(org.apache.lucene.analysis.TokenStream wrapped)
incrementWrappedTokenStream
protected boolean incrementWrappedTokenStream()
throws IOException
- Throws:
IOException
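The relationship between this class and its wrapped stream can be sketched as a simple decorator. This is a rough illustration under stated assumptions, not the lux or Lucene implementation: MiniTokenStream, ListTokenStream, ElementTaggingStream and qualifiedTerm are hypothetical names, and the "XML-related attribute" is reduced to an element name prefixed to each term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a token stream; not the Lucene TokenStream API.
interface MiniTokenStream {
    boolean incrementToken(); // advance to the next token; false when exhausted
    String term();            // text of the current token
}

// Plain text tokens, playing the role of the wrapped stream.
class ListTokenStream implements MiniTokenStream {
    private final Iterator<String> it;
    private String current;

    ListTokenStream(List<String> tokens) { it = tokens.iterator(); }

    public boolean incrementToken() {
        if (!it.hasNext()) return false;
        current = it.next();
        return true;
    }

    public String term() { return current; }
}

// Decorator: forwards to the wrapped stream and adds one XML-related value.
class ElementTaggingStream implements MiniTokenStream {
    private MiniTokenStream wrapped;
    private final String elementName; // assumed extra attribute

    ElementTaggingStream(MiniTokenStream wrapped, String elementName) {
        this.wrapped = wrapped;
        this.elementName = elementName;
    }

    MiniTokenStream getWrapped() { return wrapped; }

    protected void setWrapped(MiniTokenStream wrapped) { this.wrapped = wrapped; }

    // Advance only the underlying stream of text tokens.
    protected boolean incrementWrapped() { return wrapped.incrementToken(); }

    public boolean incrementToken() { return incrementWrapped(); }

    public String term() { return wrapped.term(); }

    // The added information: the current term qualified by its element name.
    String qualifiedTerm() { return elementName + ":" + wrapped.term(); }
}

class WrappedStreamSketch {
    // Drain the stream, collecting the qualified form of every token.
    static List<String> collect(ElementTaggingStream stream) {
        List<String> out = new ArrayList<>();
        while (stream.incrementToken()) {
            out.add(stream.qualifiedTerm());
        }
        return out;
    }

    public static void main(String[] args) {
        MiniTokenStream base = new ListTokenStream(Arrays.asList("hello", "world"));
        System.out.println(collect(new ElementTaggingStream(base, "title")));
    }
}
```

The design point is that tokenization stays in the wrapped stream, so any text analysis chain can be reused while the outer stream contributes only the XML context.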
Copyright © 2013. All Rights Reserved.