lux.index.analysis
Class TextOffsetTokenStream

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by lux.index.analysis.TextOffsetTokenStream
All Implemented Interfaces:
Closeable
Direct Known Subclasses:
AttributeTokenStream, ElementTokenStream, XmlTextTokenStream

public abstract class TextOffsetTokenStream
extends org.apache.lucene.analysis.TokenStream

This TokenStream records the offsets (character positions in the original text) of every token. It records the start offset of each text node and, wherever the length of the serialized XML differs from the length of the text, the offset just after the discrepancy. For example, if a character entity (like &amp;) occurs in the XML, it is translated to "&" in the text, and a character offset is recorded for the character just following the "&".
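
The offset bookkeeping described above can be illustrated with a minimal, standalone sketch. It is not the Lux implementation: the class EntityOffsetSketch, its Result holder, and the decode method are hypothetical names introduced only for this illustration, and the sketch assumes that only the five predefined XML entities occur. It records a (text offset, XML offset) pair at the start of a text node and again just after each entity, where the serialized XML and the decoded text fall out of step.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Lux or Lucene. */
final class EntityOffsetSketch {

    /** Decoded text plus the {textOffset, xmlOffset} pairs recorded while decoding. */
    static final class Result {
        final StringBuilder text = new StringBuilder();
        final List<int[]> offsets = new ArrayList<>();
    }

    // Assumption: only the five predefined XML entities are handled.
    private static final String[][] ENTITIES = {
        {"&amp;", "&"}, {"&lt;", "<"}, {"&gt;", ">"}, {"&quot;", "\""}, {"&apos;", "'"}
    };

    /**
     * Decodes the serialized content of one text node.
     *
     * @param xml      the text node as it appears in the serialized XML
     * @param xmlStart offset of the text node within the serialized document
     */
    static Result decode(String xml, int xmlStart) {
        Result r = new Result();
        // Record the start offset of the text node.
        r.offsets.add(new int[] {0, xmlStart});
        int i = 0;
        while (i < xml.length()) {
            String[] match = null;
            for (String[] e : ENTITIES) {
                if (xml.startsWith(e[0], i)) {
                    match = e;
                    break;
                }
            }
            if (match == null) {
                // Ordinary character: text and XML advance together.
                r.text.append(xml.charAt(i));
                i++;
            } else {
                // Entity: the XML is longer than the text, so record the
                // offset of the character just after the discrepancy.
                r.text.append(match[1]);
                i += match[0].length();
                r.offsets.add(new int[] {r.text.length(), xmlStart + i});
            }
        }
        return r;
    }

    public static void main(String[] args) {
        Result r = decode("AT&amp;T rocks", 10);
        System.out.println(r.text);
        for (int[] o : r.offsets) {
            System.out.println("text " + o[0] + " -> xml " + o[1]);
        }
    }
}
```

With pairs like these, a token's position in the decoded text can be mapped back to its position in the serialized XML by finding the last recorded pair at or before the token and adding the difference between the two offsets.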


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
 
Field Summary
protected  Reader charStream
           
protected  Iterator<net.sf.saxon.s9api.XdmNode> contentIter
           
protected  net.sf.saxon.s9api.XdmNode curNode
           
protected static net.sf.saxon.s9api.XdmSequenceIterator EMPTY
           
protected  org.apache.lucene.analysis.tokenattributes.CharTermAttribute termAtt
           
 
Constructor Summary
TextOffsetTokenStream(String fieldName, org.apache.lucene.analysis.Analyzer analyzer, org.apache.lucene.analysis.TokenStream wrapped, net.sf.saxon.s9api.XdmNode doc, Offsets offsets)
           
 
Method Summary
 org.apache.lucene.analysis.TokenStream getWrappedTokenStream()
           
 boolean incrementToken()
           
protected  boolean incrementWrappedTokenStream()
           
 void reset()
           
 void reset(Reader reader)
           
protected  boolean resetTokenizer(CharSequence text)
           
protected  void setWrappedTokenStream(org.apache.lucene.analysis.TokenStream wrapped)
           
 
Methods inherited from class org.apache.lucene.analysis.TokenStream
close, end
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

curNode

protected net.sf.saxon.s9api.XdmNode curNode

contentIter

protected Iterator<net.sf.saxon.s9api.XdmNode> contentIter

termAtt

protected org.apache.lucene.analysis.tokenattributes.CharTermAttribute termAtt

charStream

protected Reader charStream

EMPTY

protected static final net.sf.saxon.s9api.XdmSequenceIterator EMPTY
Constructor Detail

TextOffsetTokenStream

public TextOffsetTokenStream(String fieldName,
                             org.apache.lucene.analysis.Analyzer analyzer,
                             org.apache.lucene.analysis.TokenStream wrapped,
                             net.sf.saxon.s9api.XdmNode doc,
                             Offsets offsets)
Method Detail

resetTokenizer

protected boolean resetTokenizer(CharSequence text)

reset

public void reset()
           throws IOException
Overrides:
reset in class org.apache.lucene.analysis.TokenStream
Throws:
IOException

reset

public void reset(Reader reader)
           throws IOException
Throws:
IOException

incrementToken

public boolean incrementToken()
                       throws IOException
Specified by:
incrementToken in class org.apache.lucene.analysis.TokenStream
Throws:
IOException

getWrappedTokenStream

public org.apache.lucene.analysis.TokenStream getWrappedTokenStream()
Returns:
the underlying stream of text tokens, to which this stream adds XML-related attributes.

setWrappedTokenStream

protected void setWrappedTokenStream(org.apache.lucene.analysis.TokenStream wrapped)

incrementWrappedTokenStream

protected boolean incrementWrappedTokenStream()
                                       throws IOException
Throws:
IOException


Copyright © 2013. All Rights Reserved.