java.lang.Object
  org.ow2.weblab.services.normaliser.tika.TikaExtractorService
public class TikaExtractorService
The Tika extractor is quite simple, since it does not handle the structure of documents (sheets in Excel, paragraphs in Word, etc.). That structure could have been represented as several MediaUnits.
| Constructor Summary | |
|---|---|
| TikaExtractorService() | The default and only constructor. |
| Method Summary | | |
|---|---|---|
| protected static java.util.List<java.lang.String> | addUnitOnValues(java.util.List<java.lang.String> values, java.lang.String unit) | Adds the unit to each value of the list. |
| protected static void | annotate(org.weblab_project.core.model.ComposedUnit cu, java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot) | Annotates cu with the predicates and literals contained in toAnnot. |
| protected static org.weblab_project.core.model.ComposedUnit | checkArgs(org.weblab_project.services.analyser.types.ProcessArgs args) | |
| protected static void | cleanMap(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot) | Modifies the Map passed as parameter. |
| protected static java.lang.String | convertToISO8601Date(java.lang.String inDateStr) | |
| static void | extractTextAndMetadata(org.weblab_project.core.model.ComposedUnit cu, java.io.File file, java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot, boolean forceAutoDetectParser) | |
| protected static java.util.Map<java.lang.String,java.util.List<java.lang.String>> | fillMapWithMetadata(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot, org.apache.tika.metadata.Metadata metadata) | Converts the metadata extracted by Tika into a Map of predicates with their values that can be annotated. |
| protected static org.apache.tika.config.TikaConfig | getTikaConfig() | |
| org.weblab_project.services.analyser.types.ProcessReturn | process(org.weblab_project.services.analyser.types.ProcessArgs args) | |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public TikaExtractorService()
| Method Detail |
|---|
public org.weblab_project.services.analyser.types.ProcessReturn process(org.weblab_project.services.analyser.types.ProcessArgs args)
throws org.weblab_project.services.analyser.ProcessException
Specified by:
process in interface org.weblab_project.services.analyser.Analyser
Throws:
org.weblab_project.services.analyser.ProcessException
protected static void annotate(org.weblab_project.core.model.ComposedUnit cu,
java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Annotates cu with the predicates and literals contained in toAnnot.
Parameters:
cu - The ComposedUnit to be annotated.
toAnnot - The Map of predicates and their literal values.
public static void extractTextAndMetadata(org.weblab_project.core.model.ComposedUnit cu,
java.io.File file,
java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot,
boolean forceAutoDetectParser)
throws org.weblab_project.services.analyser.ProcessException
Throws:
org.weblab_project.services.analyser.ProcessException
protected static void cleanMap(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Modifies the Map passed as parameter: converts dates into the W3C ISO8601 standard format and removes empty properties.
Parameters:
toAnnot - The Map of predicates and values to be cleaned: empty Strings and Lists are removed, and dates are converted into the W3C ISO8601 standard format.
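The removal of empty properties described for cleanMap can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's actual code: the class CleanMapSketch and the method removeEmpty are hypothetical names, treating whitespace-only Strings as empty is an assumption, and the date conversion step is left out.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of TikaExtractorService.
final class CleanMapSketch {

    /**
     * Removes null, empty, or whitespace-only values from every list,
     * then drops the predicates whose list ended up empty.
     * The map and its lists are modified in place, so they must be mutable.
     */
    static void removeEmpty(Map<String, List<String>> toAnnot) {
        // First pass: clean the values of each predicate.
        for (List<String> values : toAnnot.values()) {
            if (values != null) {
                values.removeIf(v -> v == null || v.trim().isEmpty());
            }
        }
        // Second pass: drop predicates that have no values left.
        toAnnot.values().removeIf(values -> values == null || values.isEmpty());
    }
}
```

Working in place mirrors the void return type of cleanMap. With hypothetical predicate names, a map holding title with the values "Report", "" and "  ", author with the single value "", and subject with no values is reduced to one entry, title with the value "Report".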
protected static java.util.Map<java.lang.String,java.util.List<java.lang.String>> fillMapWithMetadata(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot,
org.apache.tika.metadata.Metadata metadata)
Parameters:
toAnnot - The empty map of predicate values.
metadata - The dirty map of metadata extracted by Tika.
protected static org.weblab_project.core.model.ComposedUnit checkArgs(org.weblab_project.services.analyser.types.ProcessArgs args)
throws org.weblab_project.services.analyser.ProcessException
Parameters:
args - The ProcessArgs of the process method.
Returns:
The ComposedUnit that must be contained by args.
Throws:
org.weblab_project.services.analyser.ProcessException - If the resource in args is not a ComposedUnit.
protected static java.lang.String convertToISO8601Date(java.lang.String inDateStr)
Parameters:
inDateStr - The input date, which may be in one of two formats: the Office one (e.g. Mon Jan 05 16:53:20 CET 2009) or already ISO8601. Any other date is logged as an error and replaced by the empty String.
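The two accepted input formats and the empty-String fallback can be illustrated with a short sketch. It is not the service's implementation: the class Iso8601DateSketch, the method toIso8601, and the pattern string are assumptions (the pattern matches the layout of the Office example, which is also the layout of java.util.Date.toString()). Normalising converted dates to UTC is a choice made here for simplicity, and this sketch only recognises ISO8601 values that carry a time and an offset.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical helper, not part of TikaExtractorService.
final class Iso8601DateSketch {

    // Assumed layout of the "Office" format, e.g. Mon Jan 05 16:53:20 CET 2009.
    private static final String OFFICE_PATTERN = "EEE MMM dd HH:mm:ss zzz yyyy";

    static String toIso8601(String inDateStr) {
        if (inDateStr == null) {
            return "";
        }
        String candidate = inDateStr.trim();
        try {
            // Already ISO8601 with time and offset: return it unchanged.
            OffsetDateTime.parse(candidate);
            return candidate;
        } catch (DateTimeParseException notIso) {
            // Not ISO8601 with an offset: try the Office layout next.
        }
        try {
            // SimpleDateFormat is not thread-safe, so a new instance is created per call.
            SimpleDateFormat office = new SimpleDateFormat(OFFICE_PATTERN, Locale.ENGLISH);
            // Instant.toString() renders the date in ISO8601, normalised to UTC.
            return office.parse(candidate).toInstant().toString();
        } catch (ParseException unparseable) {
            // The documented behaviour: report the error and return the empty String.
            System.err.println("Unparseable date: " + candidate);
            return "";
        }
    }
}
```

Because the result is normalised to UTC, a converted Office date comes back with a Z suffix instead of its original zone abbreviation, while an input such as 2009-01-05T16:53:20+01:00 is returned as is.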
protected static java.util.List<java.lang.String> addUnitOnValues(java.util.List<java.lang.String> values,
java.lang.String unit)
Parameters:
values - The List of values.
unit - The unit to add.
Returns:
The List of values with unit.
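A sketch of what adding a unit to each value could look like is given below. The page does not say how the unit is attached, so the single-space separator is an assumption, as are the names UnitSuffixSketch and withUnit and the example unit "px".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of TikaExtractorService.
final class UnitSuffixSketch {

    /**
     * Returns a new list in which every value is followed by the unit.
     * The separator (one space) is an assumption of this sketch.
     */
    static List<String> withUnit(List<String> values, String unit) {
        List<String> result = new ArrayList<>(values.size());
        for (String value : values) {
            result.add(value + " " + unit);
        }
        return result;
    }
}
```

Returning a new list leaves the input untouched, which matches the documented signature where the values with unit are the return value. Under these assumptions, the values 640 and 480 with the unit px yield "640 px" and "480 px".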
protected static org.apache.tika.config.TikaConfig getTikaConfig()
throws org.weblab_project.services.analyser.ProcessException
Throws:
org.weblab_project.services.analyser.ProcessException