org.ow2.weblab.services.normaliser.tika
Class TikaExtractorService

java.lang.Object
  extended by org.ow2.weblab.services.normaliser.tika.TikaExtractorService
All Implemented Interfaces:
org.weblab_project.services.analyser.Analyser

public class TikaExtractorService
extends java.lang.Object
implements org.weblab_project.services.analyser.Analyser

The Tika extractor is quite simple, since it does not handle the structure of documents (sheets in Excel, paragraphs in Word, etc.). That structure could have been represented as separate MediaUnits.

To do:
Some properties should perhaps be moved to a configuration file.

Field Summary
static java.lang.String BASE_URI_PROPERTY_NAME
           
static java.lang.String CONFIG_FILE
          Properties file
static java.lang.String OVERRIDE_METADATA_PROPERTY_NAME
           
static java.lang.String REMOVE_COTNENT_PROPERTY_NAME
           
 
Constructor Summary
TikaExtractorService()
          The default and only constructor.
 
Method Summary
protected static java.util.List<java.lang.String> addUnitOnValues(java.util.List<java.lang.String> values, java.lang.String unit)
          Adds the unit to each value of the list.
protected static void annotate(org.weblab_project.core.model.ComposedUnit cu, java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
          Annotates cu with the predicates and literals contained in toAnnot.
protected static org.weblab_project.core.model.ComposedUnit checkArgs(org.weblab_project.services.analyser.types.ProcessArgs args)
           
protected static void cleanMap(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
          Modifies the Map passed as a parameter.
protected static java.lang.String convertToISO8601Date(java.lang.String inDateStr)
           
static void extractTextAndMetadata(org.weblab_project.core.model.ComposedUnit cu, java.io.File file, java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot, boolean forceAutoDetectParser)
           
protected static java.util.Map<java.lang.String,java.util.List<java.lang.String>> fillMapWithMetadata(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot, org.apache.tika.metadata.Metadata metadata)
          The method converts the metadata extracted by Tika into a Map of predicates with their values that can be annotated.
protected static org.apache.tika.config.TikaConfig getTikaConfig()
           
protected  void loadTikaServiceProps()
           
 org.weblab_project.services.analyser.types.ProcessReturn process(org.weblab_project.services.analyser.types.ProcessArgs args)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CONFIG_FILE

public static final java.lang.String CONFIG_FILE
Properties file

See Also:
Constant Field Values

BASE_URI_PROPERTY_NAME

public static final java.lang.String BASE_URI_PROPERTY_NAME
See Also:
Constant Field Values

REMOVE_COTNENT_PROPERTY_NAME

public static final java.lang.String REMOVE_COTNENT_PROPERTY_NAME
See Also:
Constant Field Values

OVERRIDE_METADATA_PROPERTY_NAME

public static final java.lang.String OVERRIDE_METADATA_PROPERTY_NAME
See Also:
Constant Field Values
Constructor Detail

TikaExtractorService

public TikaExtractorService()
The default and only constructor. It loads the content manager and initialises the list of date predicates.

Method Detail

process

public org.weblab_project.services.analyser.types.ProcessReturn process(org.weblab_project.services.analyser.types.ProcessArgs args)
                                                                 throws org.weblab_project.services.analyser.ProcessException
Specified by:
process in interface org.weblab_project.services.analyser.Analyser
Throws:
org.weblab_project.services.analyser.ProcessException

annotate

protected static void annotate(org.weblab_project.core.model.ComposedUnit cu,
                               java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Annotates cu with the predicates and literals contained in toAnnot.

Parameters:
cu - The Composed Unit to be annotated.
toAnnot - The Map of predicates and their literal values.

extractTextAndMetadata

public static void extractTextAndMetadata(org.weblab_project.core.model.ComposedUnit cu,
                                          java.io.File file,
                                          java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot,
                                          boolean forceAutoDetectParser)
                                   throws org.weblab_project.services.analyser.ProcessException
Throws:
org.weblab_project.services.analyser.ProcessException

cleanMap

protected static void cleanMap(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Modifies the Map passed as a parameter. Converts dates into the W3C ISO8601 standard format and removes empty properties.

Parameters:
toAnnot - The Map of predicates and values to clean: empty Strings and empty Lists are removed, and dates are converted into the W3C ISO8601 standard format.
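
The "remove empty properties" part of this description can be illustrated with a minimal sketch. It is not the service's actual implementation: the class name CleanMapSketch and the method name clean are hypothetical, date conversion is deliberately left out, and treating whitespace-only values as empty is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class CleanMapSketch {

    // Hypothetical stand-in for cleanMap: drops null/blank values, then drops
    // predicates whose value list is null or has become empty.
    // The value lists must be mutable (e.g. ArrayList).
    public static void clean(Map<String, List<String>> toAnnot) {
        Iterator<Map.Entry<String, List<String>>> it = toAnnot.entrySet().iterator();
        while (it.hasNext()) {
            List<String> values = it.next().getValue();
            if (values != null) {
                // Assumption: whitespace-only strings count as empty.
                values.removeIf(v -> v == null || v.trim().isEmpty());
            }
            if (values == null || values.isEmpty()) {
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> toAnnot = new HashMap<>();
        toAnnot.put("title", new ArrayList<>(Arrays.asList("Report", "")));
        toAnnot.put("subject", new ArrayList<>(Arrays.asList("")));
        clean(toAnnot);
        System.out.println(toAnnot);
    }
}
```

The map is modified in place, matching the void return type of cleanMap.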

fillMapWithMetadata

protected static java.util.Map<java.lang.String,java.util.List<java.lang.String>> fillMapWithMetadata(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot,
                                                                                                      org.apache.tika.metadata.Metadata metadata)
The method converts the metadata extracted by Tika into a Map of predicates with their values that can be annotated. It maps some Tika properties to their DublinCore and DCTerms equivalents and, for every metadata entry, also creates a "dirty" predicate using the base URI.

Parameters:
toAnnot - The empty map of predicates and values to fill.
metadata - The dirty map of metadata extracted by Tika.
Returns:
A map of RDF predicates and their literal values.
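
The two kinds of predicates described above can be sketched as follows. This is an illustration under stated assumptions, not the service's code: a plain Map<String, String[]> stands in for Tika's Metadata object so that the example has no dependencies, the base URI http://example.org/tika# and the single "title" mapping are hypothetical, and the names FillMapSketch and fill are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FillMapSketch {

    // Hypothetical base URI; the real one is read from the service properties.
    static final String BASE_URI = "http://example.org/tika#";

    // Hypothetical mapping from a raw property name to a Dublin Core predicate.
    static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("title", "http://purl.org/dc/elements/1.1/title");
    }

    // raw is a stand-in for the metadata extracted by Tika (name -> values).
    public static Map<String, List<String>> fill(Map<String, List<String>> toAnnot,
                                                 Map<String, String[]> raw) {
        for (Map.Entry<String, String[]> entry : raw.entrySet()) {
            List<String> values = Arrays.asList(entry.getValue());
            // Every entry gets a "dirty" predicate built from the base URI.
            toAnnot.computeIfAbsent(BASE_URI + entry.getKey(), k -> new ArrayList<>())
                   .addAll(values);
            // Known properties are additionally mapped to a standard predicate.
            String standard = KNOWN.get(entry.getKey());
            if (standard != null) {
                toAnnot.computeIfAbsent(standard, k -> new ArrayList<>()).addAll(values);
            }
        }
        return toAnnot;
    }

    public static void main(String[] args) {
        Map<String, String[]> raw = new LinkedHashMap<>();
        raw.put("title", new String[] {"Annual report"});
        raw.put("Page-Count", new String[] {"12"});
        System.out.println(fill(new LinkedHashMap<>(), raw));
    }
}
```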

checkArgs

protected static org.weblab_project.core.model.ComposedUnit checkArgs(org.weblab_project.services.analyser.types.ProcessArgs args)
                                                               throws org.weblab_project.services.analyser.ProcessException
Parameters:
args - The ProcessArgs of the process method.
Returns:
The ComposedUnit that must be contained by args.
Throws:
org.weblab_project.services.analyser.ProcessException - If resource in args is not a ComposedUnit.

convertToISO8601Date

protected static java.lang.String convertToISO8601Date(java.lang.String inDateStr)
Parameters:
inDateStr - The input date, which may be in one of two formats: the Office one, e.g. Mon Jan 05 16:53:20 CET 2009, or ISO8601 already. Any other format is logged as an error and replaced by the empty String.
Returns:
The date in ISO8601 format
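
The conversion described here can be sketched with java.time. This is a hedged illustration only: the class was published before java.time existed, so the real method almost certainly uses a different API. The names Iso8601Sketch and toIso8601 are hypothetical, and normalising Office dates to UTC is an assumption about the output form.

```java
import java.time.OffsetDateTime;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

public class Iso8601Sketch {

    // Pattern for the Office style, e.g. "Mon Jan 05 16:53:20 CET 2009".
    private static final DateTimeFormatter OFFICE_FORMAT =
            DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss zzz yyyy", Locale.ENGLISH);

    public static String toIso8601(String inDateStr) {
        try {
            // Already ISO8601 with an offset: returned unchanged.
            OffsetDateTime.parse(inDateStr);
            return inDateStr;
        } catch (DateTimeParseException notIso) {
            // Not ISO8601; fall through and try the Office format.
        }
        try {
            // Assumption: the result is expressed in UTC.
            return ZonedDateTime.parse(inDateStr, OFFICE_FORMAT).toInstant().toString();
        } catch (DateTimeParseException unknownFormat) {
            // Unknown format: report the error and fall back to the empty String.
            System.err.println("Unable to parse date: " + inDateStr);
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(toIso8601("Mon Jan 05 16:53:20 CET 2009"));
        System.out.println(toIso8601("2009-01-05T16:53:20+01:00"));
    }
}
```

Returning the empty String for unparseable input pairs naturally with cleanMap, which removes empty values afterwards.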

addUnitOnValues

protected static java.util.List<java.lang.String> addUnitOnValues(java.util.List<java.lang.String> values,
                                                                  java.lang.String unit)
Adds the unit to each value of the list.

Parameters:
values - the List of values
unit - the unit to add
Returns:
the List of values with unit
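
A minimal sketch of this behaviour follows. The documentation does not say how the unit is joined to each value, so plain concatenation without a separator is an assumption, as are the names AddUnitSketch and addUnit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AddUnitSketch {

    // Returns a new list in which the unit is appended to every value.
    // Assumption: plain concatenation, no separator between value and unit.
    public static List<String> addUnit(List<String> values, String unit) {
        List<String> result = new ArrayList<>(values.size());
        for (String value : values) {
            result.add(value + unit);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(addUnit(Arrays.asList("72", "300"), "dpi"));
    }
}
```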

getTikaConfig

protected static org.apache.tika.config.TikaConfig getTikaConfig()
                                                          throws org.weblab_project.services.analyser.ProcessException
Throws:
org.weblab_project.services.analyser.ProcessException

loadTikaServiceProps

protected void loadTikaServiceProps()


Copyright © 2004-2010. All Rights Reserved.