lux.index
Class XmlIndexer

java.lang.Object
  extended by lux.index.XmlIndexer

public class XmlIndexer
extends Object

Indexes XML documents. The constructor accepts a set of flags that define the fields known to XmlIndexer. The fields are represented by instances of XmlField. Instances of XmlField are immutable; they hold no data and serve merely as markers. Additional fields can be added using addField().

A field may be associated with a StAXHandler; the indexer is responsible for feeding the handlers with StAX (XML) events. Some fields may share the same handler. The association between field and handler is implicit: the field calls an XmlIndexer getter to retrieve the handler. This class is not thread-safe.

This is all kind of a mess, and not readily extensible. If you want to add a new type of field (a new XmlField instance), you have to modify the indexer, which has knowledge of all the possible fields. This is not a good design. Also, not every combination of indexing options will actually work; we need to consider which things one might actually want to turn on and off. We could make each field act as a StAXHandler factory, but for efficiency some fields share the same handler instance. For now, we leave things as they are; we'll refactor as we add more fields.

Indexing is triggered by a call to indexDocument(). read(InputStream) parses the document and gathers the values, which are retrieved by calling XmlField.getFieldValues(XmlIndexer) for each field.
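
The handler arrangement described in this class comment (one parse, with every StAX event fed to each registered handler) can be sketched with the JDK's own StAX API. This is only an illustration under assumptions: EventHandler, ElementNameCollector, TextCollector and feed() are hypothetical names invented for the sketch and are not the Lux StAXHandler interface or any XmlIndexer method.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxFanOutSketch {

    /** Hypothetical stand-in for a StAX handler; not the Lux StAXHandler. */
    interface EventHandler {
        void handleEvent(XMLStreamReader reader, int eventType);
    }

    /** Collects element local names in document order. */
    static final class ElementNameCollector implements EventHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void handleEvent(XMLStreamReader reader, int eventType) {
            if (eventType == XMLStreamConstants.START_ELEMENT) {
                names.add(reader.getLocalName());
            }
        }
    }

    /** Accumulates all character data in the document. */
    static final class TextCollector implements EventHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void handleEvent(XMLStreamReader reader, int eventType) {
            if (eventType == XMLStreamConstants.CHARACTERS) {
                text.append(reader.getText());
            }
        }
    }

    /** Parses the XML once and passes every event to every handler. */
    static void feed(String xml, List<EventHandler> handlers) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int eventType = reader.next();
                    for (EventHandler handler : handlers) {
                        handler.handleEvent(reader, eventType);
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        ElementNameCollector names = new ElementNameCollector();
        TextCollector text = new TextCollector();
        feed("<doc><title>Hi</title><p>there</p></doc>", List.of(names, text));
        System.out.println(names.names);
        System.out.println(text.text);
    }
}
```

Because each handler only sees events and keeps its own state, two fields could read their values from the same handler instance after the parse, which is the sharing the comment above alludes to.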


Constructor Summary
XmlIndexer()
          Make a new instance with default options.
XmlIndexer(IndexConfiguration config)
          Make a new instance with the given configuration.
XmlIndexer(IndexConfiguration indexConfig, Compiler compiler)
          Make a new instance with the given options and Compiler.
XmlIndexer(long options)
          Make a new instance with the given options.
 
Method Summary
 org.apache.lucene.document.Document createLuceneDocument()
           
 net.sf.saxon.s9api.XdmValue evaluateXPath(String xpath)
          Primarily for internal use.
 IndexConfiguration getConfiguration()
           
 byte[] getDocumentBytes()
           
 String getDocumentText()
           
 XmlPathMapper getPathMapper()
          Primarily for internal use.
 SaxonDocBuilder getSaxonDocBuilder()
          Primarily for internal use.
 String getURI()
           
 net.sf.saxon.s9api.XdmNode getXdmNode()
           
 net.sf.saxon.s9api.XPathCompiler getXPathCompiler()
          Primarily for internal use.
 void index(InputStream xml, String inputUri)
          Index the document read from the stream, caching field values to be written to the Lucene index.
 void index(net.sf.saxon.om.NodeInfo doc, String inputUri)
          Index the document supplied as a Saxon NodeInfo, caching field values to be written to the Lucene index.
 void index(Reader xml, String inputUri)
          Index the document read from the Reader, caching field values to be written to the Lucene index.
 void indexDocument(org.apache.lucene.index.IndexWriter indexWriter, String docUri, InputStream xmlStream)
          Index and write a document to the Lucene index.
 void indexDocument(org.apache.lucene.index.IndexWriter indexWriter, String path, net.sf.saxon.om.NodeInfo node)
          Index and write a document to the Lucene index.
 void indexDocument(org.apache.lucene.index.IndexWriter indexWriter, String docUri, String xml)
          Index and write a document to the Lucene index.
protected  void init()
          Initialize the indexer; an extension of the constructors.
 org.apache.lucene.index.IndexWriter newIndexWriter(org.apache.lucene.store.Directory dir)
          Constructs a new Lucene IndexWriter for the given index directory, supplied with the proper analyzers for each field.
 void reset()
          Clear out the internal storage cached by index() when indexing a document.
 void storeDocument(org.apache.lucene.index.IndexWriter indexWriter, String docUri, byte[] bytes)
          Store the given bytes as a document without attempting to parse or index it.
 void storeDocument(org.apache.lucene.index.IndexWriter indexWriter, String docUri, InputStream input)
          Fully read the stream and store it as a document without attempting to parse or index it.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

XmlIndexer

public XmlIndexer()
Make a new instance with default options.


XmlIndexer

public XmlIndexer(IndexConfiguration config)
Make a new instance with the given configuration. Options in the configuration control how documents are indexed, and which kinds of indexed values will be available after indexing a document.

Parameters:
config - the index configuration to use

XmlIndexer

public XmlIndexer(long options)
Make a new instance with the given options. Used mostly for testing.

Parameters:
options - the index configuration options to use
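
As a rough illustration of passing options as a single long, the sketch below assumes the value is a bitmask in which each option occupies one bit. The flag names and bit positions are hypothetical and chosen only for the example; they are not the constants defined by Lux.

```java
final class OptionFlagsSketch {

    // Hypothetical flag names and bit positions, for illustration only;
    // they are not the constants defined by Lux.
    static final long EXAMPLE_STORE_DOCUMENT = 1L << 0;
    static final long EXAMPLE_INDEX_PATHS = 1L << 1;
    static final long EXAMPLE_INDEX_FULLTEXT = 1L << 2;

    /** Returns true if every bit of the flag is set in options. */
    static boolean isEnabled(long options, long flag) {
        return (options & flag) == flag;
    }

    public static void main(String[] args) {
        // Combine two options into one long value with bitwise OR.
        long options = EXAMPLE_STORE_DOCUMENT | EXAMPLE_INDEX_PATHS;
        System.out.println(isEnabled(options, EXAMPLE_INDEX_PATHS));
        System.out.println(isEnabled(options, EXAMPLE_INDEX_FULLTEXT));
    }
}
```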

XmlIndexer

public XmlIndexer(IndexConfiguration indexConfig,
                  Compiler compiler)
Make a new instance with the given options and Compiler. The runtime uses this to index documents from its nodes directly, without serializing and parsing.

Parameters:
indexConfig - the index configuration options to use
compiler - the indexer will make XPath that is compatible with this compiler
Method Detail

init

protected void init()
Initialize the indexer; an extension of the constructors. Creates subsidiary objects required for indexing based on the index options.


newIndexWriter

public org.apache.lucene.index.IndexWriter newIndexWriter(org.apache.lucene.store.Directory dir)
                                                   throws IOException
Constructs a new Lucene IndexWriter for the given index directory, supplied with the proper analyzers for each field. The directory must exist; if there is no index in the directory, a new one will be created. If there is an existing index, it will be locked for writing until the writer is closed.

Parameters:
dir - the directory where the index is stored
Returns:
the IndexWriter
Throws:
IOException - if there is a problem with the index

getXPathCompiler

public net.sf.saxon.s9api.XPathCompiler getXPathCompiler()
Primarily for internal use.

Returns:
an XPathCompiler

evaluateXPath

public net.sf.saxon.s9api.XdmValue evaluateXPath(String xpath)
                                          throws net.sf.saxon.s9api.SaxonApiException
Primarily for internal use.

Parameters:
xpath - an XPath expression to evaluate
Returns:
the result of evaluating the XPath expression with the last indexed document as context
Throws:
net.sf.saxon.s9api.SaxonApiException - if there is an error during compilation or evaluation

index

public void index(InputStream xml,
                  String inputUri)
           throws XMLStreamException
Index the document read from the stream, caching field values to be written to the Lucene index.

Parameters:
xml - the document, as a byte-based InputStream
inputUri - the uri to assign to the document
Throws:
XMLStreamException

index

public void index(Reader xml,
                  String inputUri)
           throws XMLStreamException
Index the document read from the Reader, caching field values to be written to the Lucene index.

Parameters:
xml - the document, as a character-based Reader
inputUri - the uri to assign to the document
Throws:
XMLStreamException

index

public void index(net.sf.saxon.om.NodeInfo doc,
                  String inputUri)
           throws XMLStreamException
Index the document supplied as a Saxon NodeInfo, caching field values to be written to the Lucene index.

Parameters:
doc - the document (or element) as a Saxon NodeInfo
inputUri - the uri to assign to the document
Throws:
XMLStreamException

reset

public void reset()
Clear out the internal storage cached by index() when indexing a document.


getURI

public String getURI()
Returns:
the uri cached from the last invocation of index()

getXdmNode

public net.sf.saxon.s9api.XdmNode getXdmNode()
Returns:
the document cached from the last invocation of index(), as a Saxon XdmNode. This will be null if the indexer options don't require the generation of an XdmNode.

getDocumentText

public String getDocumentText()
Returns:
the document cached from the last invocation of index(), as a String. This will be null if the indexer options don't require the generation of a serialized document. The document is always re-serialized after parsing.

getDocumentBytes

public byte[] getDocumentBytes()
Returns:
the document bytes; this will be non-null if storeDocument(IndexWriter, String, InputStream) was called.

indexDocument

public void indexDocument(org.apache.lucene.index.IndexWriter indexWriter,
                          String docUri,
                          String xml)
                   throws XMLStreamException,
                          IOException
Index and write a document to the Lucene index.

Parameters:
indexWriter - the Lucene IndexWriter for the index to write to
docUri - the uri to assign to the document; any scheme will be stripped: only the path is stored in the index
xml - the text of an xml document to index
Throws:
XMLStreamException - if there is an error parsing the document
IOException - if there is an error writing to the index
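
To make the docUri note concrete (any scheme is stripped and only the path is stored), here is a minimal sketch using java.net.URI. It is an assumption about what such stripping could look like; pathOnly() is a hypothetical helper, and the way Lux actually derives the stored path is not shown in this documentation.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class DocUriSketch {

    /**
     * Hypothetical helper: returns the path component of a URI,
     * dropping any scheme and authority.
     */
    static String pathOnly(String docUri) {
        try {
            String path = new URI(docUri).getPath();
            // Opaque URIs (for example "mailto:a@b") have no path; keep the input as-is.
            return path != null ? path : docUri;
        } catch (URISyntaxException e) {
            // Not a parseable URI; keep the input unchanged.
            return docUri;
        }
    }

    public static void main(String[] args) {
        System.out.println(pathOnly("file:///data/docs/a.xml"));
        System.out.println(pathOnly("/c.xml"));
    }
}
```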

indexDocument

public void indexDocument(org.apache.lucene.index.IndexWriter indexWriter,
                          String docUri,
                          InputStream xmlStream)
                   throws XMLStreamException,
                          IOException
Index and write a document to the Lucene index.

Parameters:
indexWriter - the Lucene IndexWriter for the index to write to
docUri - the uri to assign to the document; any scheme will be stripped: only the path is stored in the index
xmlStream - a stream from which the text of an xml document is to be read
Throws:
XMLStreamException - if there is an error parsing the document
IOException - if there is an error writing to the index

storeDocument

public void storeDocument(org.apache.lucene.index.IndexWriter indexWriter,
                          String docUri,
                          InputStream input)
                   throws IOException
Fully read the stream and store it as a document without attempting to parse or index it. Used for binary and other non-XML text.

Parameters:
indexWriter - the Lucene IndexWriter for the index to write to
docUri - the uri to assign to the document; any scheme will be stripped: only the path is stored in the index
input - the stream to read the document from
Throws:
IOException - if there is an error writing to the index
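
"Fully read the stream" means draining the InputStream into memory before anything is stored. A minimal sketch of that step follows, using only the JDK; readFully() is a hypothetical helper written for this example and is not part of XmlIndexer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ReadFullySketch {

    /** Hypothetical helper: reads the stream to its end and returns all bytes. */
    static byte[] readFully(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            // read() returns -1 once the end of the stream is reached.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {0, 1, 2, (byte) 0xFF};
        System.out.println(readFully(new ByteArrayInputStream(data)).length);
    }
}
```

Since the bytes are never parsed, the same approach works for binary content as well as non-XML text.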

storeDocument

public void storeDocument(org.apache.lucene.index.IndexWriter indexWriter,
                          String docUri,
                          byte[] bytes)
                   throws IOException
Store the given bytes as a document without attempting to parse or index it. Used for binary and other non-XML text.

Parameters:
indexWriter - the Lucene IndexWriter for the index to write to
docUri - the uri to assign to the document; any scheme will be stripped: only the path is stored in the index
bytes - the document bytes to store
Throws:
IOException - if there is an error writing to the index

indexDocument

public void indexDocument(org.apache.lucene.index.IndexWriter indexWriter,
                          String path,
                          net.sf.saxon.om.NodeInfo node)
                   throws XMLStreamException,
                          IOException
Index and write a document to the Lucene index.

Parameters:
indexWriter - the Lucene IndexWriter for the index to write to
path - the uri to assign to the document
node - an xml document to index, as a Saxon NodeInfo
Throws:
XMLStreamException - if there is an error parsing the document
IOException - if there is an error writing to the index

createLuceneDocument

public org.apache.lucene.document.Document createLuceneDocument()
Returns:
a Lucene Document created from the field values stored in this indexer. The document is ready to be inserted into Lucene via IndexWriter.addDocument(java.lang.Iterable).

getSaxonDocBuilder

public SaxonDocBuilder getSaxonDocBuilder()
Primarily for internal use.

Returns:
the SaxonDocBuilder used by the indexer to construct XdmNodes.

getPathMapper

public XmlPathMapper getPathMapper()
Primarily for internal use.

Returns:
the XmlPathMapper used by the indexer to gather node paths.

getConfiguration

public IndexConfiguration getConfiguration()
Returns:
the index configuration


Copyright © 2013. All Rights Reserved.