public abstract class TextOffsetTokenStream
extends org.apache.lucene.analysis.TokenStream
This TokenStream records the offsets (character positions in the original text) of every token. It records the start offset of each text node and, whenever the length of the serialized XML differs from the length of the text, the offset just after the discrepancy. For example, if a character entity (like &amp;amp;) occurs in the XML, it is translated to "&" in the text, and a character offset is recorded for the character just following the "&".
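The offset bookkeeping described above can be illustrated with a small, self-contained sketch. It is not this class's implementation: the class `OffsetSketch`, its methods `recordOffsets` and `toXmlOffset`, and the `int[]{textOffset, xmlOffset}` pair representation are all hypothetical names chosen for illustration. The sketch also assumes a single text node and handles only the five predefined XML entities.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the library.
final class OffsetSketch {

    // Decodes the five predefined XML entities; returns null for anything else.
    private static String decode(String entity) {
        switch (entity) {
            case "amp":  return "&";
            case "lt":   return "<";
            case "gt":   return ">";
            case "quot": return "\"";
            case "apos": return "'";
            default:     return null;
        }
    }

    // Appends the decoded text to textOut. Each time an entity makes the
    // serialized XML longer than the text, records {textOffset, xmlOffset}
    // for the character just after the entity.
    static List<int[]> recordOffsets(String xml, StringBuilder textOut) {
        List<int[]> deltas = new ArrayList<>();
        int i = 0;
        while (i < xml.length()) {
            char c = xml.charAt(i);
            if (c == '&') {
                int semi = xml.indexOf(';', i);
                String decoded = semi < 0 ? null : decode(xml.substring(i + 1, semi));
                if (decoded != null) {
                    textOut.append(decoded);
                    i = semi + 1;
                    deltas.add(new int[] { textOut.length(), i });
                    continue;
                }
            }
            // Ordinary character (or unrecognized entity): copy it unchanged.
            textOut.append(c);
            i++;
        }
        return deltas;
    }

    // Maps an offset in the decoded text back to an offset in the serialized
    // XML, using the last recorded pair at or before textOffset.
    static int toXmlOffset(List<int[]> deltas, int textOffset) {
        int baseText = 0;
        int baseXml = 0;
        for (int[] d : deltas) {
            if (d[0] > textOffset) {
                break;
            }
            baseText = d[0];
            baseXml = d[1];
        }
        return baseXml + (textOffset - baseText);
    }
}
```

For the serialized input `a &amp; b`, the decoded text is `a & b`, and one pair is recorded: text offset 3 corresponds to XML offset 7, the position just after the entity. Storing offsets only at discrepancies keeps the record small, since every other position can be derived by adding the distance from the nearest preceding pair.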
| Modifier and Type | Field and Description |
|---|---|
| protected Reader | charStream |
| protected Iterator<net.sf.saxon.s9api.XdmNode> | contentIter |
| protected net.sf.saxon.s9api.XdmNode | curNode |
| protected static net.sf.saxon.s9api.XdmSequenceIterator | EMPTY |
| protected org.apache.lucene.analysis.tokenattributes.CharTermAttribute | termAtt |
| Constructor and Description |
|---|
| TextOffsetTokenStream(String fieldName, org.apache.lucene.analysis.Analyzer analyzer, org.apache.lucene.analysis.TokenStream wrapped, net.sf.saxon.s9api.XdmNode doc, Offsets offsets) |
| Modifier and Type | Method and Description |
|---|---|
| org.apache.lucene.analysis.TokenStream | getWrappedTokenStream() |
| boolean | incrementToken() |
| protected boolean | incrementWrappedTokenStream() |
| void | reset() |
| void | reset(Reader reader) |
| protected boolean | resetTokenizer(CharSequence text) |
| protected void | setWrappedTokenStream(org.apache.lucene.analysis.TokenStream wrapped) |
Methods inherited from class org.apache.lucene.util.AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
protected net.sf.saxon.s9api.XdmNode curNode
protected Iterator<net.sf.saxon.s9api.XdmNode> contentIter
protected org.apache.lucene.analysis.tokenattributes.CharTermAttribute termAtt
protected Reader charStream
protected static final net.sf.saxon.s9api.XdmSequenceIterator EMPTY
protected boolean resetTokenizer(CharSequence text)
public void reset()
throws IOException
Overrides: reset in class org.apache.lucene.analysis.TokenStream
Throws: IOException
public void reset(Reader reader)
throws IOException
Throws: IOException
public boolean incrementToken()
throws IOException
Specified by: incrementToken in class org.apache.lucene.analysis.TokenStream
Throws: IOException
public org.apache.lucene.analysis.TokenStream getWrappedTokenStream()
protected void setWrappedTokenStream(org.apache.lucene.analysis.TokenStream wrapped)
protected boolean incrementWrappedTokenStream()
throws IOException
Throws: IOException
Copyright © 2013. All Rights Reserved.