org.ow2.weblab.service.normaliser.tika
Class TikaExtractorService

java.lang.Object
  extended by org.ow2.weblab.service.normaliser.tika.TikaExtractorService
All Implemented Interfaces:
org.ow2.weblab.core.services.Analyser

public class TikaExtractorService
extends java.lang.Object
implements org.ow2.weblab.core.services.Analyser

The Tika extractor is quite simple, since it does not handle the structure of documents (sheets in Excel, paragraphs in Word, etc.). That structure could have been represented as various MediaUnits.

To do:
Rewrite the class comment, which is not good.

Field Summary
protected  org.ow2.weblab.content.api.ContentManager contentManager
          The ContentManager to use.
protected  org.apache.commons.logging.Log logger
          The logger to be used inside this class.
protected  boolean removeContent
          Whether or not to remove content.
protected  TikaConfiguration serviceConfig
          The configuration to be used for the service.
protected  java.text.DateFormat simpleDateFormat
          The formatter used to annotate dates (like 2011-12-31)
protected  org.apache.tika.config.TikaConfig tikaConfig
          The configuration of Tika itself.
 
Constructor Summary
TikaExtractorService(TikaConfiguration conf)
          The only constructor of this class; it requires a configuration.
 
Method Summary
protected static java.util.List<java.lang.String> addUnitOnValues(java.util.List<java.lang.String> values, java.lang.String unit)
          Adds the unit to each value of the list.
protected  void annotate(org.ow2.weblab.core.model.Document document, java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
          Annotates document with the predicates and literals contained in toAnnot.
protected  org.ow2.weblab.core.model.Document checkArgs(org.ow2.weblab.core.services.analyser.ProcessArgs args)
          Get the document inside the process args or throw an InvalidParameterException if not possible.
 java.util.Map<java.lang.String,java.util.List<java.lang.String>> extractTextAndMetadata(org.ow2.weblab.core.model.Document document, java.io.File contentFile, boolean forceAutoDetectParser)
           
protected  java.util.Map<java.lang.String,java.util.List<java.lang.String>> fillMapWithMetadata(org.apache.tika.metadata.Metadata metadata)
          The method converts the metadata extracted by Tika into a Map of predicates with their values that can be annotated.
 org.ow2.weblab.core.services.analyser.ProcessReturn process(org.ow2.weblab.core.services.analyser.ProcessArgs args)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected final org.apache.commons.logging.Log logger
The logger to be used inside this class.


contentManager

protected final org.ow2.weblab.content.api.ContentManager contentManager
The ContentManager to use. Various implementations exist. They are defined through a configuration file.


serviceConfig

protected final TikaConfiguration serviceConfig
The configuration to be used for the service.


tikaConfig

protected final org.apache.tika.config.TikaConfig tikaConfig
The configuration of Tika itself.


removeContent

protected final boolean removeContent
Whether or not to remove content. This flag is computed once to avoid recalculating it on each call to the process method. True if and only if the reader of the content manager is not a file AND


simpleDateFormat

protected final java.text.DateFormat simpleDateFormat
The formatter used to annotate dates (like 2011-12-31)
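
As an illustration of the date shape mentioned above, here is a minimal sketch. The pattern "yyyy-MM-dd" is an assumption inferred only from the "2011-12-31" example; the class name, the method name and the choice of UTC are hypothetical and are not taken from TikaExtractorService.

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

class DateAnnotationSketch {

    // Assumed pattern, inferred only from the "2011-12-31" example in the field comment.
    static final String ASSUMED_PATTERN = "yyyy-MM-dd";

    /** Formats a date as year-month-day, evaluated in UTC (the time zone is an assumption). */
    static String formatForAnnotation(Date date) {
        // A fresh formatter per call: SimpleDateFormat instances are not thread-safe.
        DateFormat format = new SimpleDateFormat(ASSUMED_PATTERN, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }

    public static void main(String[] args) {
        Calendar calendar = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        calendar.clear();                          // reset all fields, including time of day
        calendar.set(2011, Calendar.DECEMBER, 31); // midnight UTC on 31 December 2011
        System.out.println(formatForAnnotation(calendar.getTime())); // prints 2011-12-31
    }
}
```

Since java.text.SimpleDateFormat is not synchronized, a service that shares one formatter across concurrent process calls may need external synchronization; whether this class does so is not stated here.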

Constructor Detail

TikaExtractorService

public TikaExtractorService(TikaConfiguration conf)
                     throws org.apache.tika.exception.TikaException,
                            java.io.IOException
The only constructor of this class; it requires a configuration.

Parameters:
conf - The service configuration.
Throws:
java.io.IOException - If an error occurs accessing the Tika configuration or instantiating the content manager.
org.apache.tika.exception.TikaException - If an error occurs reading the Tika configuration.
Method Detail

process

public org.ow2.weblab.core.services.analyser.ProcessReturn process(org.ow2.weblab.core.services.analyser.ProcessArgs args)
                                                            throws org.ow2.weblab.core.services.InvalidParameterException,
                                                                   org.ow2.weblab.core.services.ContentNotAvailableException,
                                                                   org.ow2.weblab.core.services.UnexpectedException
Specified by:
process in interface org.ow2.weblab.core.services.Analyser
Throws:
org.ow2.weblab.core.services.InvalidParameterException
org.ow2.weblab.core.services.ContentNotAvailableException
org.ow2.weblab.core.services.UnexpectedException

checkArgs

protected org.ow2.weblab.core.model.Document checkArgs(org.ow2.weblab.core.services.analyser.ProcessArgs args)
                                                throws org.ow2.weblab.core.services.InvalidParameterException
Get the document inside the process args or throw an InvalidParameterException if not possible.

Parameters:
args - The ProcessArgs of the process method.
Returns:
The Document that must be contained by args.
Throws:
org.ow2.weblab.core.services.InvalidParameterException - If resource in args is null or not a Document.
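
The validation contract described above can be sketched as follows. This is only an illustration: the nested Resource, Document, ProcessArgs and InvalidParameterException types are simplified stand-ins declared locally so that the example compiles on its own; they are not the WebLab classes. The null check on args itself is an assumption.

```java
class CheckArgsSketch {

    // Simplified stand-ins for the WebLab model types (hypothetical, for illustration only).
    static class Resource { }
    static class Document extends Resource { }
    static class ProcessArgs {
        Resource resource;
    }
    static class InvalidParameterException extends Exception {
        InvalidParameterException(String message) {
            super(message);
        }
    }

    /** Returns the Document held by args, or throws if there is none. */
    static Document checkArgs(ProcessArgs args) throws InvalidParameterException {
        if (args == null) {
            // Assumption: a null ProcessArgs is rejected in the same way.
            throw new InvalidParameterException("ProcessArgs was null.");
        }
        Resource resource = args.resource;
        if (resource == null) {
            throw new InvalidParameterException("Resource in ProcessArgs was null.");
        }
        if (!(resource instanceof Document)) {
            throw new InvalidParameterException("Resource in ProcessArgs was not a Document.");
        }
        return (Document) resource;
    }

    public static void main(String[] unused) throws InvalidParameterException {
        ProcessArgs args = new ProcessArgs();
        args.resource = new Document();
        System.out.println(checkArgs(args) != null); // prints true
    }
}
```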

extractTextAndMetadata

public java.util.Map<java.lang.String,java.util.List<java.lang.String>> extractTextAndMetadata(org.ow2.weblab.core.model.Document document,
                                                                                               java.io.File contentFile,
                                                                                               boolean forceAutoDetectParser)
                                                                                        throws org.ow2.weblab.core.services.UnexpectedException,
                                                                                               org.ow2.weblab.core.services.ContentNotAvailableException
Parameters:
document - The document to be filled with MediaUnit units.
contentFile - The file to be parsed.
forceAutoDetectParser - Whether to let Tika guess the parser to use from the file content, or to use the existing MIME type on the document (dc:format) to select the appropriate parser.
Throws:
org.ow2.weblab.core.services.UnexpectedException - If the Tika parser fails.
org.ow2.weblab.core.services.ContentNotAvailableException - If the file is not reachable. (This should not occur, since access to the file has been checked beforehand.)
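
The effect of forceAutoDetectParser can be pictured with the small decision sketch below. It does not call Tika: it only returns a string naming the strategy that would be chosen. The class, method and constant names are hypothetical, and falling back to auto-detection when the document carries no dc:format value is an assumption, not documented behaviour.

```java
class ParserSelectionSketch {

    // Marker meaning "let Tika detect the type from the file content" (hypothetical name).
    static final String AUTO_DETECT = "auto-detect";

    /**
     * Returns AUTO_DETECT or the MIME type that would drive parser selection.
     *
     * @param dcFormat the dc:format value already annotated on the document, may be null
     * @param forceAutoDetectParser true to ignore dcFormat and rely on content detection
     */
    static String selectParserKey(String dcFormat, boolean forceAutoDetectParser) {
        if (forceAutoDetectParser) {
            return AUTO_DETECT;
        }
        if (dcFormat == null || dcFormat.trim().isEmpty()) {
            // Assumption: without a usable MIME type, detection is the only option left.
            return AUTO_DETECT;
        }
        return dcFormat.trim();
    }

    public static void main(String[] args) {
        System.out.println(selectParserKey("application/pdf", false)); // prints application/pdf
        System.out.println(selectParserKey("application/pdf", true));  // prints auto-detect
    }
}
```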

annotate

protected void annotate(org.ow2.weblab.core.model.Document document,
                        java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Annotates document with the predicates and literals contained in toAnnot.

Parameters:
document - The Document to be annotated.
toAnnot - The Map of predicates and their literal values.

fillMapWithMetadata

protected java.util.Map<java.lang.String,java.util.List<java.lang.String>> fillMapWithMetadata(org.apache.tika.metadata.Metadata metadata)
Converts the metadata extracted by Tika into a Map of predicates and their values that can be annotated. Some Tika properties are mapped to DublinCore and DCTerms properties; in addition, for every metadata entry, a "dirty" predicate is created using the base URI.

Parameters:
metadata - The dirty map of metadata extracted by Tika.
Returns:
A map of RDF predicates and their literal values.
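
To make the two kinds of predicates concrete, here is a minimal sketch of such a conversion. A plain Map stands in for org.apache.tika.metadata.Metadata so that the example has no dependencies. The base URI, the mapping table and all names are hypothetical; only the Dublin Core namespace URIs are real. Property names are appended to the base URI without any escaping, which real code would likely need.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetadataMappingSketch {

    // Hypothetical base URI used to build the "dirty" predicates.
    static final String BASE_URI = "http://example.org/tika/";

    // Hypothetical mapping from raw property names to Dublin Core predicates.
    static final Map<String, String> KNOWN = new HashMap<String, String>();
    static {
        KNOWN.put("title", "http://purl.org/dc/elements/1.1/title");
        KNOWN.put("created", "http://purl.org/dc/terms/created");
    }

    /** Converts raw name/values pairs into a map of predicate URIs and literal values. */
    static Map<String, List<String>> fillMap(Map<String, List<String>> raw) {
        Map<String, List<String>> result = new LinkedHashMap<String, List<String>>();
        for (Map.Entry<String, List<String>> entry : raw.entrySet()) {
            List<String> values = entry.getValue();
            if (values == null || values.isEmpty()) {
                continue; // nothing to annotate for this property
            }
            // Every property gets a "dirty" predicate built from the base URI.
            result.put(BASE_URI + entry.getKey(), new ArrayList<String>(values));
            // Known properties are additionally exposed under a Dublin Core predicate.
            String mapped = KNOWN.get(entry.getKey());
            if (mapped != null) {
                result.put(mapped, new ArrayList<String>(values));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> raw = new LinkedHashMap<String, List<String>>();
        raw.put("title", Arrays.asList("Annual report"));
        raw.put("Page-Count", Arrays.asList("3"));
        for (Map.Entry<String, List<String>> e : fillMap(raw).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

In this sketch, "title" yields two predicates (the dirty one and dc:title), while "Page-Count" yields only the dirty one.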

addUnitOnValues

protected static java.util.List<java.lang.String> addUnitOnValues(java.util.List<java.lang.String> values,
                                                                  java.lang.String unit)
Adds the unit to each value of the list.

Parameters:
values - the List of values
unit - the unit to add
Returns:
the List of values with unit
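
A helper with this signature might look like the sketch below. How the real method joins value and unit is not documented here; the single-space separator and the choice to return a new list rather than modify the input are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UnitSuffixSketch {

    /** Returns a new list in which every value is followed by the given unit. */
    static List<String> addUnitOnValues(List<String> values, String unit) {
        List<String> result = new ArrayList<String>(values.size());
        for (String value : values) {
            // Assumption: value and unit are separated by a single space.
            result.add(value + " " + unit);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sizes = Arrays.asList("1024", "2048");
        System.out.println(addUnitOnValues(sizes, "bytes")); // prints [1024 bytes, 2048 bytes]
    }
}
```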


Copyright © 2004-2012. All Rights Reserved.