org.ow2.weblab.services.normaliser.tika
Class TikaExtractorService

java.lang.Object
  extended by org.ow2.weblab.services.normaliser.tika.TikaExtractorService
All Implemented Interfaces:
org.ow2.weblab.core.services.Analyser

public class TikaExtractorService
extends java.lang.Object
implements org.ow2.weblab.core.services.Analyser

The Tika extractor is quite simple: it does not handle the structure of documents (sheets in Excel, paragraphs in Word, etc.). That structure could have been represented as separate MediaUnits.

To do:
Some properties should perhaps be moved to a configuration file.

Field Summary
static java.lang.String BASE_URI_PROPERTY_NAME
           
static java.lang.String CONFIG_FILE
          Properties file
protected  org.ow2.weblab.content.ContentManager contentManager
          The BinaryFolderContentManager to use
static java.lang.String OVERRIDE_METADATA_PROPERTY_NAME
           
static java.lang.String REMOVE_COTNENT_PROPERTY_NAME
           
static java.lang.String XHTML_FOLDER_PROPERTY_NAME
           
static java.lang.String XHTML_SAVE
           
 
Constructor Summary
TikaExtractorService()
          The default and only constructor.
 
Method Summary
protected static java.util.List<java.lang.String> addUnitOnValues(java.util.List<java.lang.String> values, java.lang.String unit)
          Adds the unit to each value of the list.
protected static void annotate(org.ow2.weblab.core.model.Document document, java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
          Annotates document with the predicates and literals contained in toAnnot.
protected static org.ow2.weblab.core.model.Document checkArgs(org.ow2.weblab.core.services.analyser.ProcessArgs args)
           
protected static void cleanMap(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
          Modifies the Map passed as parameter.
protected static java.lang.String convertToISO8601Date(java.lang.String inDateStr)
           
static void extractTextAndMetadata(org.ow2.weblab.core.model.Document document, java.io.File file, java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot, boolean forceAutoDetectParser)
           
protected static java.util.Map<java.lang.String,java.util.List<java.lang.String>> fillMapWithMetadata(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot, org.apache.tika.metadata.Metadata metadata)
          The method converts the metadata extracted by Tika into a Map of predicates with their values that can be annotated.
protected static org.apache.tika.config.TikaConfig getTikaConfig()
           
protected  void loadTikaServiceProps()
           
 org.ow2.weblab.core.services.analyser.ProcessReturn process(org.ow2.weblab.core.services.analyser.ProcessArgs args)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CONFIG_FILE

public static final java.lang.String CONFIG_FILE
Properties file

See Also:
Constant Field Values

BASE_URI_PROPERTY_NAME

public static final java.lang.String BASE_URI_PROPERTY_NAME
See Also:
Constant Field Values

REMOVE_COTNENT_PROPERTY_NAME

public static final java.lang.String REMOVE_COTNENT_PROPERTY_NAME
See Also:
Constant Field Values

OVERRIDE_METADATA_PROPERTY_NAME

public static final java.lang.String OVERRIDE_METADATA_PROPERTY_NAME
See Also:
Constant Field Values

XHTML_FOLDER_PROPERTY_NAME

public static final java.lang.String XHTML_FOLDER_PROPERTY_NAME
See Also:
Constant Field Values

XHTML_SAVE

public static final java.lang.String XHTML_SAVE
See Also:
Constant Field Values

contentManager

protected org.ow2.weblab.content.ContentManager contentManager
The BinaryFolderContentManager to use

Constructor Detail

TikaExtractorService

public TikaExtractorService()
The default and only constructor. It loads the content manager and initializes the list of date predicates.

Method Detail

process

public org.ow2.weblab.core.services.analyser.ProcessReturn process(org.ow2.weblab.core.services.analyser.ProcessArgs args)
                                                            throws org.ow2.weblab.core.services.AccessDeniedException,
                                                                   org.ow2.weblab.core.services.ContentNotAvailableException,
                                                                   org.ow2.weblab.core.services.InsufficientResourcesException,
                                                                   org.ow2.weblab.core.services.InvalidParameterException,
                                                                   org.ow2.weblab.core.services.ServiceNotConfiguredException,
                                                                   org.ow2.weblab.core.services.UnexpectedException,
                                                                   org.ow2.weblab.core.services.UnsupportedRequestException
Specified by:
process in interface org.ow2.weblab.core.services.Analyser
Throws:
org.ow2.weblab.core.services.AccessDeniedException
org.ow2.weblab.core.services.ContentNotAvailableException
org.ow2.weblab.core.services.InsufficientResourcesException
org.ow2.weblab.core.services.InvalidParameterException
org.ow2.weblab.core.services.ServiceNotConfiguredException
org.ow2.weblab.core.services.UnexpectedException
org.ow2.weblab.core.services.UnsupportedRequestException

annotate

protected static void annotate(org.ow2.weblab.core.model.Document document,
                               java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Annotates document with the predicates and literals contained in toAnnot.

Parameters:
document - The Document to be annotated.
toAnnot - The Map of predicate and their literal values.

extractTextAndMetadata

public static void extractTextAndMetadata(org.ow2.weblab.core.model.Document document,
                                          java.io.File file,
                                          java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot,
                                          boolean forceAutoDetectParser)
                                   throws org.ow2.weblab.core.services.UnexpectedException,
                                          org.ow2.weblab.core.services.ContentNotAvailableException
Throws:
org.ow2.weblab.core.services.UnexpectedException
org.ow2.weblab.core.services.ContentNotAvailableException

cleanMap

protected static void cleanMap(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Modifies the Map passed as parameter. Converts dates into the W3C ISO8601 standard format and removes empty properties.

Parameters:
toAnnot - The Map of predicates and values to clean: empty Strings and empty Lists are removed, and dates are converted into the W3C ISO8601 standard format.
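
The removal half of this clean-up can be sketched as below. This is only an illustration under assumptions, not the service's code: the class name CleanMapSketch and the method name removeEmptyEntries are hypothetical, blank (whitespace-only) values are assumed to count as empty, and the date conversion is left out because the list of date predicates is not documented here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the "remove empty properties" part of cleanMap.
class CleanMapSketch {

    // Modifies the map in place: drops null or blank values, then drops
    // every predicate whose list of values ends up empty.
    static void removeEmptyEntries(Map<String, List<String>> toAnnot) {
        Iterator<Map.Entry<String, List<String>>> it = toAnnot.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, List<String>> entry = it.next();
            List<String> kept = new ArrayList<>();
            if (entry.getValue() != null) {
                for (String value : entry.getValue()) {
                    // Assumption: whitespace-only values are treated as empty.
                    if (value != null && !value.trim().isEmpty()) {
                        kept.add(value);
                    }
                }
            }
            if (kept.isEmpty()) {
                it.remove(); // safe removal while iterating
            } else {
                entry.setValue(kept);
            }
        }
    }
}
```

Removing through the iterator avoids a ConcurrentModificationException, which a plain toAnnot.remove(key) inside the loop would risk.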

fillMapWithMetadata

protected static java.util.Map<java.lang.String,java.util.List<java.lang.String>> fillMapWithMetadata(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot,
                                                                                                      org.apache.tika.metadata.Metadata metadata)
Converts the metadata extracted by Tika into a Map of predicates and their values that can be annotated. It maps some Tika properties to DublinCore and DCTerms ones and, for every metadata entry, it also creates a dirty predicate using the base URI.

Parameters:
toAnnot - The empty map of predicate values.
metadata - The dirty map of metadata extracted by Tika.
Returns:
A map of RDF predicates and their literal values.
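
The idea of a "dirty" predicate built from the base URI, alongside a mapping to a standard vocabulary, might look like the sketch below. Everything in it is an assumption made for illustration: a plain Map of String arrays stands in for org.apache.tika.metadata.Metadata so the example needs no Tika dependency, the class FillMapSketch and its KNOWN table are hypothetical, and the single "title" mapping is an example rather than the list of properties this service actually maps. Property names are concatenated as-is, so names that are not valid URI fragments are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for fillMapWithMetadata, using plain collections.
class FillMapSketch {

    // Assumed mapping table from raw property names to DublinCore predicates.
    private static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("title", "http://purl.org/dc/elements/1.1/title");
    }

    static Map<String, List<String>> fill(Map<String, List<String>> toAnnot,
                                          Map<String, String[]> metadata,
                                          String baseUri) {
        for (Map.Entry<String, String[]> entry : metadata.entrySet()) {
            List<String> values = Arrays.asList(entry.getValue());
            // Every property gets a "dirty" predicate: base URI + raw name.
            add(toAnnot, baseUri + entry.getKey(), values);
            // Properties with a known equivalent are also stored under it.
            String mapped = KNOWN.get(entry.getKey());
            if (mapped != null) {
                add(toAnnot, mapped, values);
            }
        }
        return toAnnot;
    }

    private static void add(Map<String, List<String>> map, String predicate, List<String> values) {
        map.computeIfAbsent(predicate, key -> new ArrayList<>()).addAll(values);
    }
}
```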

checkArgs

protected static org.ow2.weblab.core.model.Document checkArgs(org.ow2.weblab.core.services.analyser.ProcessArgs args)
                                                       throws org.ow2.weblab.core.services.InvalidParameterException
Parameters:
args - The ProcessArgs of the process method.
Returns:
The Document that must be contained by args.
Throws:
org.ow2.weblab.core.services.InvalidParameterException - If resource in args is not a Document.

convertToISO8601Date

protected static java.lang.String convertToISO8601Date(java.lang.String inDateStr)
Parameters:
inDateStr - The input date, which might be in one of three different formats, for instance the Office one (e.g. Mon Jan 05 16:53:20 CET 2009) or already in ISO8601 format. Otherwise the date is logged as an error and replaced by the empty String.
Returns:
The date in ISO8601 format
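
A minimal sketch of such a conversion follows, using only the JDK. It is not the service's implementation: the class Iso8601DateSketch is hypothetical, only the two formats named above are handled (the third is not specified in this page), the ISO8601 output is assumed to be normalised to UTC with a trailing Z, and logging is reduced to System.err.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical stand-in for convertToISO8601Date.
class Iso8601DateSketch {

    // Layout produced by java.util.Date#toString(), e.g. "Mon Jan 05 16:53:20 CET 2009".
    private static final String OFFICE_PATTERN = "EEE MMM dd HH:mm:ss zzz yyyy";
    // Assumed target layout: ISO8601 in UTC with a literal 'Z' suffix.
    private static final String ISO_PATTERN = "yyyy-MM-dd'T'HH:mm:ss'Z'";

    static String convert(String inDateStr) {
        if (inDateStr == null) {
            return "";
        }
        String trimmed = inDateStr.trim();
        // Try the ISO8601 layout first, then the Office layout.
        Date parsed = tryParse(ISO_PATTERN, trimmed);
        if (parsed == null) {
            parsed = tryParse(OFFICE_PATTERN, trimmed);
        }
        if (parsed == null) {
            System.err.println("Unable to parse date: " + inDateStr);
            return "";
        }
        return newFormat(ISO_PATTERN).format(parsed);
    }

    // SimpleDateFormat is not thread-safe, so a fresh instance is built per call.
    private static SimpleDateFormat newFormat(String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format;
    }

    private static Date tryParse(String pattern, String value) {
        try {
            return newFormat(pattern).parse(value);
        } catch (ParseException e) {
            return null;
        }
    }
}
```

Locale.ENGLISH is set explicitly because the day and month names in the Office layout are English; relying on the platform default locale would make parsing fail on non-English systems.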

addUnitOnValues

protected static java.util.List<java.lang.String> addUnitOnValues(java.util.List<java.lang.String> values,
                                                                  java.lang.String unit)
Adds the unit to each value of the list.

Parameters:
values - the List of values
unit - the unit to add
Returns:
the List of values with unit
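
As an illustration only, the behaviour described could be sketched as follows. The class AddUnitSketch is hypothetical, and the single space placed between value and unit is an assumption; this page does not say how the unit is attached.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for addUnitOnValues.
class AddUnitSketch {

    // Returns a new list; the input list is left untouched.
    static List<String> addUnitOnValues(List<String> values, String unit) {
        List<String> withUnit = new ArrayList<>(values.size());
        for (String value : values) {
            // Assumption: value and unit are separated by one space.
            withUnit.add(value + " " + unit);
        }
        return withUnit;
    }
}
```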

getTikaConfig

protected static org.apache.tika.config.TikaConfig getTikaConfig()
                                                          throws org.ow2.weblab.core.services.AccessDeniedException
Throws:
org.ow2.weblab.core.services.AccessDeniedException

loadTikaServiceProps

protected void loadTikaServiceProps()


Copyright © 2004-2011. All Rights Reserved.