org.ow2.weblab.services.normaliser.tika
Class TikaExtractorService

java.lang.Object
  extended by org.ow2.weblab.services.normaliser.tika.TikaExtractorService
All Implemented Interfaces:
org.weblab_project.services.analyser.Analyser

public class TikaExtractorService
extends java.lang.Object
implements org.weblab_project.services.analyser.Analyser

The Tika extractor is deliberately simple: it does not handle the structure of documents (sheets in Excel, paragraphs in Word, etc.). That structure could have been represented as separate MediaUnits.

To do:
Some properties should probably be moved to a configuration file.

Constructor Summary
TikaExtractorService()
          The default and only constructor.
 
Method Summary
protected static java.util.List<java.lang.String> addUnitOnValues(java.util.List<java.lang.String> values, java.lang.String unit)
          Appends the unit to each value of the list.
protected static void annotate(org.weblab_project.core.model.ComposedUnit cu, java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
          Annotates cu with the predicates and literals contained in toAnnot.
protected static org.weblab_project.core.model.ComposedUnit checkArgs(org.weblab_project.services.analyser.types.ProcessArgs args)
           
protected static void cleanMap(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
          Modifies the Map passed as parameter.
protected static java.lang.String convertToISO8601Date(java.lang.String inDateStr)
           
static void extractTextAndMetadata(org.weblab_project.core.model.ComposedUnit cu, java.io.File file, java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot, boolean forceAutoDetectParser)
           
protected static java.util.Map<java.lang.String,java.util.List<java.lang.String>> fillMapWithMetadata(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot, org.apache.tika.metadata.Metadata metadata)
          The method converts the metadata extracted by Tika into a Map of predicates with their values that can be annotated.
protected static org.apache.tika.config.TikaConfig getTikaConfig()
           
 org.weblab_project.services.analyser.types.ProcessReturn process(org.weblab_project.services.analyser.types.ProcessArgs args)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TikaExtractorService

public TikaExtractorService()
The default and only constructor. It loads the content manager and initialises the list of date predicates.

Method Detail

process

public org.weblab_project.services.analyser.types.ProcessReturn process(org.weblab_project.services.analyser.types.ProcessArgs args)
                                                                 throws org.weblab_project.services.analyser.ProcessException
Specified by:
process in interface org.weblab_project.services.analyser.Analyser
Throws:
org.weblab_project.services.analyser.ProcessException

annotate

protected static void annotate(org.weblab_project.core.model.ComposedUnit cu,
                               java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Annotates cu with the predicates and literals contained in toAnnot.

Parameters:
cu - The Composed Unit to be annotated.
toAnnot - The Map of predicates and their literal values.

extractTextAndMetadata

public static void extractTextAndMetadata(org.weblab_project.core.model.ComposedUnit cu,
                                          java.io.File file,
                                          java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot,
                                          boolean forceAutoDetectParser)
                                   throws org.weblab_project.services.analyser.ProcessException
Throws:
org.weblab_project.services.analyser.ProcessException

cleanMap

protected static void cleanMap(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Modifies the Map passed as parameter: converts dates into the W3C ISO8601 standard format and removes empty properties.

Parameters:
toAnnot - The Map of predicates and values to clean: empty Strings and empty Lists are removed, and dates are converted into the W3C ISO8601 standard format.
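
The cleaning step described above can be sketched as follows. This is only an illustration, not the service's actual code: the class name CleanMapSketch, the datePredicates set and the dateConverter function are hypothetical stand-ins for the list of date predicates and the date conversion that the class keeps internally.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical sketch; names and parameters are assumptions, not the real API.
final class CleanMapSketch {

    /**
     * Cleans toAnnot in place: values of date predicates go through
     * dateConverter, empty values are dropped, and predicates left
     * without any value are removed from the map.
     */
    static void cleanMap(Map<String, List<String>> toAnnot,
                         Set<String> datePredicates,
                         UnaryOperator<String> dateConverter) {
        Iterator<Map.Entry<String, List<String>>> it = toAnnot.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, List<String>> entry = it.next();
            boolean isDate = datePredicates.contains(entry.getKey());
            List<String> cleaned = new ArrayList<>();
            List<String> values = entry.getValue();
            if (values != null) {
                for (String value : values) {
                    if (value == null) {
                        continue;
                    }
                    String v = value.trim();
                    if (isDate) {
                        // Assumed contract: the converter returns "" on failure.
                        v = dateConverter.apply(v);
                    }
                    if (v != null && !v.isEmpty()) {
                        cleaned.add(v);
                    }
                }
            }
            if (cleaned.isEmpty()) {
                it.remove();             // drop the empty property
            } else {
                entry.setValue(cleaned); // keep only the usable values
            }
        }
    }
}
```

Removing entries through the iterator avoids a ConcurrentModificationException while the map is being traversed.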

fillMapWithMetadata

protected static java.util.Map<java.lang.String,java.util.List<java.lang.String>> fillMapWithMetadata(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot,
                                                                                                      org.apache.tika.metadata.Metadata metadata)
The method converts the metadata extracted by Tika into a Map of predicates with their values that can be annotated. It can map some Tika properties to DublinCore and DCTerms ones, and for every metadata entry it also creates a dirty predicate using the base URI.

Parameters:
toAnnot - The empty map of predicate values.
metadata - The dirty map of metadata extracted by Tika.
Returns:
A map of RDF predicates and their literal values.
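
A minimal sketch of the mapping idea, under several assumptions: a plain Map<String, String[]> stands in for org.apache.tika.metadata.Metadata so that the example has no external dependency, BASE_URI is a made-up placeholder (the real base URI is not documented here), and the two entries in KNOWN are illustrative rather than the service's actual mapping table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the service's real mapping.
final class FillMapSketch {

    // Placeholder base URI for the raw ("dirty") predicates (assumption).
    static final String BASE_URI = "http://example.org/tika#";

    // Dublin Core elements namespace.
    static final String DC = "http://purl.org/dc/elements/1.1/";

    // Illustrative mapping from raw property names to Dublin Core predicates.
    static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("title", DC + "title");
        KNOWN.put("Author", DC + "creator");
    }

    static Map<String, List<String>> fillMapWithMetadata(
            Map<String, List<String>> toAnnot, Map<String, String[]> metadata) {
        for (Map.Entry<String, String[]> e : metadata.entrySet()) {
            List<String> values = Arrays.asList(e.getValue());
            // Every property gets a raw predicate built from the base URI.
            // A real implementation would also need to escape the name.
            add(toAnnot, BASE_URI + e.getKey(), values);
            // Known properties are additionally mapped to a standard predicate.
            String mapped = KNOWN.get(e.getKey());
            if (mapped != null) {
                add(toAnnot, mapped, values);
            }
        }
        return toAnnot;
    }

    private static void add(Map<String, List<String>> map,
                            String predicate, List<String> values) {
        map.computeIfAbsent(predicate, k -> new ArrayList<>()).addAll(values);
    }
}
```

With a real Tika Metadata object, the loop would presumably iterate over metadata.names() and read each property with metadata.getValues(name).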

checkArgs

protected static org.weblab_project.core.model.ComposedUnit checkArgs(org.weblab_project.services.analyser.types.ProcessArgs args)
                                                               throws org.weblab_project.services.analyser.ProcessException
Parameters:
args - The ProcessArgs of the process method.
Returns:
The ComposedUnit that must be contained by args.
Throws:
org.weblab_project.services.analyser.ProcessException - If resource in args is not a ComposedUnit.

convertToISO8601Date

protected static java.lang.String convertToISO8601Date(java.lang.String inDateStr)
Parameters:
inDateStr - The input date, which may be in one of two formats: the Office one (e.g. Mon Jan 05 16:53:20 CET 2009) or already ISO8601. Any other input is logged as an error and replaced by the empty String.
Returns:
The date in ISO8601 format
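
One way such a conversion could be written is sketched below. It is an assumption-laden illustration, not the actual implementation: the class name Iso8601Sketch is hypothetical, the Office format is assumed to be the java.util.Date#toString() layout, the output is normalised to UTC (the real method may keep the original offset), and errors go to System.err instead of a logger.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch of the two-format date handling.
final class Iso8601Sketch {

    // Layout of java.util.Date#toString(), e.g. "Mon Jan 05 16:53:20 CET 2009".
    private static final String OFFICE_PATTERN = "EEE MMM dd HH:mm:ss zzz yyyy";

    static String convertToISO8601Date(String inDateStr) {
        if (inDateStr == null) {
            return "";
        }
        String in = inDateStr.trim();
        if (isIso8601(in)) {
            return in; // already in the target format, keep as is
        }
        try {
            // SimpleDateFormat is not thread-safe, so instances are created per call.
            Date parsed = new SimpleDateFormat(OFFICE_PATTERN, Locale.ENGLISH).parse(in);
            SimpleDateFormat iso =
                    new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ENGLISH);
            iso.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: output in UTC
            return iso.format(parsed);
        } catch (ParseException e) {
            System.err.println("Unparseable date: " + inDateStr);
            return "";
        }
    }

    // Accepts a full date-time with offset or a plain calendar date.
    private static boolean isIso8601(String s) {
        try {
            OffsetDateTime.parse(s);
            return true;
        } catch (DateTimeParseException ignored) {
            // fall through and try the date-only form
        }
        try {
            LocalDate.parse(s);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }
}
```

For instance, the input Mon Jan 05 16:53:20 GMT 2009 yields 2009-01-05T16:53:20Z, while an unrecognised string yields the empty String. Zone abbreviations such as CET are resolved by SimpleDateFormat from the JDK's time zone name data.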

addUnitOnValues

protected static java.util.List<java.lang.String> addUnitOnValues(java.util.List<java.lang.String> values,
                                                                  java.lang.String unit)
Appends the unit to each value of the list.

Parameters:
values - the List of values
unit - the unit to add
Returns:
the List of values with unit
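
A possible shape for this helper, as a sketch only: the single-space separator and the choice to return a new list rather than modify the input are assumptions, and AddUnitSketch is a hypothetical class name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; separator and copy semantics are assumptions.
final class AddUnitSketch {

    /** Returns a new list in which every value is followed by the unit. */
    static List<String> addUnitOnValues(List<String> values, String unit) {
        List<String> withUnit = new ArrayList<>(values.size());
        for (String value : values) {
            withUnit.add(value + " " + unit);
        }
        return withUnit;
    }
}
```

Under these assumptions, the values 12 and 7 with the unit pages become 12 pages and 7 pages.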

getTikaConfig

protected static org.apache.tika.config.TikaConfig getTikaConfig()
                                                          throws org.weblab_project.services.analyser.ProcessException
Throws:
org.weblab_project.services.analyser.ProcessException


Copyright © 2004-2010. All Rights Reserved.