java.lang.Object
org.ow2.weblab.services.normaliser.tika.TikaExtractorService
public class TikaExtractorService
The Tika extractor is quite simple, since it does not handle the structure of documents (sheets in Excel, paragraphs in Word, etc.). That structure could have been represented as various MediaUnits.
| Field Summary | |
|---|---|
static java.lang.String |
BASE_URI_PROPERTY_NAME
|
static java.lang.String |
CONFIG_FILE
Properties file |
protected org.ow2.weblab.content.ContentManager |
contentManager
The BinaryFolderContentManager to use |
static java.lang.String |
OVERRIDE_METADATA_PROPERTY_NAME
|
static java.lang.String |
REMOVE_COTNENT_PROPERTY_NAME
|
static java.lang.String |
XHTML_FOLDER_PROPERTY_NAME
|
static java.lang.String |
XHTML_SAVE
|
| Constructor Summary | |
|---|---|
TikaExtractorService()
The default and only constructor. |
|
| Method Summary | |
|---|---|
protected static java.util.List<java.lang.String> |
addUnitOnValues(java.util.List<java.lang.String> values,
java.lang.String unit)
Adds the unit to each value of the list. |
protected static void |
annotate(org.ow2.weblab.core.model.Document document,
java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Annotates document with the predicates and literals contained in
toAnnot. |
protected static org.ow2.weblab.core.model.Document |
checkArgs(org.ow2.weblab.core.services.analyser.ProcessArgs args)
|
protected static void |
cleanMap(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Modifies the Map passed as a parameter. |
protected static java.lang.String |
convertToISO8601Date(java.lang.String inDateStr)
|
static void |
extractTextAndMetadata(org.ow2.weblab.core.model.Document document,
java.io.File file,
java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot,
boolean forceAutoDetectParser)
|
protected static java.util.Map<java.lang.String,java.util.List<java.lang.String>> |
fillMapWithMetadata(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot,
org.apache.tika.metadata.Metadata metadata)
The method converts the metadata extracted by Tika into a Map of predicates with their values that can be annotated. |
protected static org.apache.tika.config.TikaConfig |
getTikaConfig()
|
protected void |
loadTikaServiceProps()
|
org.ow2.weblab.core.services.analyser.ProcessReturn |
process(org.ow2.weblab.core.services.analyser.ProcessArgs args)
|
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final java.lang.String CONFIG_FILE
public static final java.lang.String BASE_URI_PROPERTY_NAME
public static final java.lang.String REMOVE_COTNENT_PROPERTY_NAME
public static final java.lang.String OVERRIDE_METADATA_PROPERTY_NAME
public static final java.lang.String XHTML_FOLDER_PROPERTY_NAME
public static final java.lang.String XHTML_SAVE
protected org.ow2.weblab.content.ContentManager contentManager
The BinaryFolderContentManager to use
| Constructor Detail |
|---|
public TikaExtractorService()
| Method Detail |
|---|
public org.ow2.weblab.core.services.analyser.ProcessReturn process(org.ow2.weblab.core.services.analyser.ProcessArgs args)
throws org.ow2.weblab.core.services.AccessDeniedException,
org.ow2.weblab.core.services.ContentNotAvailableException,
org.ow2.weblab.core.services.InsufficientResourcesException,
org.ow2.weblab.core.services.InvalidParameterException,
org.ow2.weblab.core.services.ServiceNotConfiguredException,
org.ow2.weblab.core.services.UnexpectedException,
org.ow2.weblab.core.services.UnsupportedRequestException
Specified by:
process in interface org.ow2.weblab.core.services.Analyser
Throws:
org.ow2.weblab.core.services.AccessDeniedException
org.ow2.weblab.core.services.ContentNotAvailableException
org.ow2.weblab.core.services.InsufficientResourcesException
org.ow2.weblab.core.services.InvalidParameterException
org.ow2.weblab.core.services.ServiceNotConfiguredException
org.ow2.weblab.core.services.UnexpectedException
org.ow2.weblab.core.services.UnsupportedRequestException
protected static void annotate(org.ow2.weblab.core.model.Document document,
java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Annotates document with the predicates and literals contained in
toAnnot.
Parameters:
document - The Document to be annotated.
toAnnot - The Map of predicates and their literal values.
public static void extractTextAndMetadata(org.ow2.weblab.core.model.Document document,
java.io.File file,
java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot,
boolean forceAutoDetectParser)
throws org.ow2.weblab.core.services.UnexpectedException,
org.ow2.weblab.core.services.ContentNotAvailableException
Throws:
org.ow2.weblab.core.services.UnexpectedException
org.ow2.weblab.core.services.ContentNotAvailableException
protected static void cleanMap(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Modifies the Map passed as a parameter: converts dates into W3C ISO8601
standard format and removes empty properties.
Parameters:
toAnnot - The Map of predicates and values to be cleaned of empty
Strings and Lists, with dates converted into W3C ISO8601
standard format.
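The "remove empty properties" half of this cleaning step can be illustrated with a minimal sketch. It is not the service's actual implementation: the class name `CleanMapSketch`, the method name `removeEmptyProperties`, and the predicate keys used in `main` are hypothetical, and the date conversion is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the WebLab implementation.
final class CleanMapSketch {

    // Modifies the map in place: blank values are removed from each list,
    // then predicates whose list is null or has become empty are dropped.
    static void removeEmptyProperties(Map<String, List<String>> toAnnot) {
        toAnnot.values().removeIf(values -> {
            if (values == null) {
                return true;
            }
            values.removeIf(v -> v == null || v.trim().isEmpty());
            return values.isEmpty();
        });
    }

    public static void main(String[] args) {
        Map<String, List<String>> toAnnot = new HashMap<>();
        // "title" and "creator" are illustrative keys only.
        toAnnot.put("title", new ArrayList<>(Arrays.asList("Report", "")));
        toAnnot.put("creator", new ArrayList<>(Arrays.asList("", " ")));
        removeEmptyProperties(toAnnot);
        System.out.println(toAnnot); // only the "title" predicate remains
    }
}
```

The lists must be mutable for the in-place removal to work, which is why the example wraps `Arrays.asList` in an `ArrayList`.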
protected static java.util.Map<java.lang.String,java.util.List<java.lang.String>> fillMapWithMetadata(java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot,
org.apache.tika.metadata.Metadata metadata)
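Converts the metadata extracted by Tika into a Map of predicates with their values that can be annotated.
Parameters:
toAnnot - The empty map of predicate values.
metadata - The raw, uncleaned metadata extracted by Tika.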
protected static org.ow2.weblab.core.model.Document checkArgs(org.ow2.weblab.core.services.analyser.ProcessArgs args)
throws org.ow2.weblab.core.services.InvalidParameterException
Parameters:
args - The ProcessArgs of the process method.
Returns:
The ComposedUnit that must be contained by
args.
Throws:
ProcessException - If resource in args is not a
ComposedUnit.
org.ow2.weblab.core.services.InvalidParameterException
protected static java.lang.String convertToISO8601Date(java.lang.String inDateStr)
Parameters:
inDateStr - The input date, which might be in three different formats: the
Office one (e.g. Mon Jan 05 16:53:20 CET 2009) or
already in ISO8601 format. Otherwise the date is logged as an
error and replaced by the empty String.
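The behaviour described for inDateStr can be sketched with `java.time`. This is an assumption-laden illustration rather than the service's code: the class name `Iso8601Sketch`, the method name `toIso8601`, the exact pattern string, and the choice to keep the original UTC offset in the output are all hypothetical, and only the two formats actually named in the description are handled.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical sketch, not the WebLab implementation.
final class Iso8601Sketch {

    // Assumed pattern for the Office-style example "Mon Jan 05 16:53:20 CET 2009".
    private static final DateTimeFormatter OFFICE_FORMAT =
            DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss zzz yyyy", Locale.ENGLISH);

    static String toIso8601(String inDateStr) {
        if (inDateStr == null) {
            return "";
        }
        String trimmed = inDateStr.trim();
        try {
            // First attempt: the Office-style format, re-emitted with its offset.
            ZonedDateTime parsed = ZonedDateTime.parse(trimmed, OFFICE_FORMAT);
            return parsed.toOffsetDateTime().format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
        } catch (DateTimeParseException notOfficeFormat) {
            // Not an Office-style date; try ISO8601 next.
        }
        try {
            // Second attempt: already ISO8601, so it is returned unchanged.
            DateTimeFormatter.ISO_DATE_TIME.parse(trimmed);
            return trimmed;
        } catch (DateTimeParseException notIso) {
            // Unknown format: report the error and fall back to the empty String.
            System.err.println("Unable to convert date: " + inDateStr);
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(toIso8601("Mon Jan 05 16:53:20 CET 2009"));
        System.out.println(toIso8601("2009-01-05T16:53:20Z"));
    }
}
```

Fixing the locale to `Locale.ENGLISH` matters here, because the day and month abbreviations in the Office-style string would otherwise be parsed according to the JVM's default locale.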
protected static java.util.List<java.lang.String> addUnitOnValues(java.util.List<java.lang.String> values,
java.lang.String unit)
Adds the unit to each value of the list.
Parameters:
values - the List of values
unit - the unit to add
Returns:
The List of values with unit
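As a rough illustration of what adding a unit to each value could look like, here is a minimal sketch. The class name `UnitSketch`, the method name `withUnit`, and in particular the single space used as separator are assumptions; the documentation does not say how the value and the unit are joined.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the WebLab implementation.
final class UnitSketch {

    // Returns a new list in which every value is suffixed with the unit.
    // The space separator is an assumption made for this example.
    static List<String> withUnit(List<String> values, String unit) {
        List<String> result = new ArrayList<>(values.size());
        for (String value : values) {
            result.add(value + " " + unit);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(withUnit(Arrays.asList("210", "297"), "mm"));
    }
}
```

Returning a new list rather than mutating the argument keeps the sketch safe to call with fixed-size or immutable lists.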
protected static org.apache.tika.config.TikaConfig getTikaConfig()
throws org.ow2.weblab.core.services.AccessDeniedException
Throws:
org.ow2.weblab.core.services.AccessDeniedException
protected void loadTikaServiceProps()