java.lang.Object
  org.ow2.weblab.service.normaliser.tika.TikaExtractorService
public class TikaExtractorService
The Tika extractor is quite simple, since it does not handle the structure of documents (sheets in Excel, paragraphs in Word, etc.). That structure could have been represented as separate MediaUnits.
| Field Summary | |
|---|---|
| protected org.ow2.weblab.content.api.ContentManager | contentManager - The ContentManager to use. |
| protected org.apache.commons.logging.Log | logger - The logger to be used inside this class. |
| protected boolean | removeContent - Whether or not to remove content. |
| protected TikaConfiguration | serviceConfig - The configuration to be used for the service. |
| protected java.text.DateFormat | simpleDateFormat - The formatter used to annotate dates (like 2011-12-31). |
| protected org.apache.tika.config.TikaConfig | tikaConfig - The configuration of Tika itself. |
| Constructor Summary | |
|---|---|
| TikaExtractorService(TikaConfiguration conf) | The only constructor of this class; it requires a configuration. |
| Method Summary | |
|---|---|
| protected static java.util.List<java.lang.String> | addUnitOnValues(java.util.List<java.lang.String> values, java.lang.String unit) - Adds the unit to each value of the list. |
| protected void | annotate(org.ow2.weblab.core.model.Document document, java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot) - Annotates document with the predicates and literals contained in toAnnot. |
| protected org.ow2.weblab.core.model.Document | checkArgs(org.ow2.weblab.core.services.analyser.ProcessArgs args) - Gets the document inside the process args, or throws an InvalidParameterException if that is not possible. |
| java.util.Map<java.lang.String,java.util.List<java.lang.String>> | extractTextAndMetadata(org.ow2.weblab.core.model.Document document, java.io.File contentFile, boolean forceAutoDetectParser) |
| protected java.util.Map<java.lang.String,java.util.List<java.lang.String>> | fillMapWithMetadata(org.apache.tika.metadata.Metadata metadata) - Converts the metadata extracted by Tika into a Map of predicates with their values that can be annotated. |
| org.ow2.weblab.core.services.analyser.ProcessReturn | process(org.ow2.weblab.core.services.analyser.ProcessArgs args) |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
protected final org.apache.commons.logging.Log logger
protected final org.ow2.weblab.content.api.ContentManager contentManager
The ContentManager to use. Various implementations exist; they are defined through a configuration file.
protected final TikaConfiguration serviceConfig
protected final org.apache.tika.config.TikaConfig tikaConfig
protected final boolean removeContent
protected final java.text.DateFormat simpleDateFormat
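The simpleDateFormat field is described as producing dates like 2011-12-31. The following is a minimal sketch of such a formatter, assuming the pattern is `yyyy-MM-dd`; the actual pattern used by the service is not stated on this page, and the class name `DateAnnotationSketch` and method name `formatForAnnotation` are hypothetical.

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical illustration, not the WebLab source.
class DateAnnotationSketch {

    // Assumption: "yyyy-MM-dd" is the pattern behind the "2011-12-31" example.
    static String formatForAnnotation(Date date) {
        // Locale.ROOT keeps the digits and calendar independent of the default locale.
        DateFormat format = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        return format.format(date);
    }

    public static void main(String[] args) {
        Date date = new GregorianCalendar(2011, Calendar.DECEMBER, 31).getTime();
        System.out.println(formatForAnnotation(date)); // prints 2011-12-31
    }
}
```

The sketch creates a formatter per call because the JDK documents DateFormat instances as not synchronized; a shared instance would need external synchronization if used from several threads.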
| Constructor Detail |
|---|
public TikaExtractorService(TikaConfiguration conf)
throws org.apache.tika.exception.TikaException,
java.io.IOException
Parameters:
conf - The service configuration.
Throws:
java.io.IOException - If an error occurs accessing the Tika configuration or instantiating the content manager.
org.apache.tika.exception.TikaException - If an error occurs reading the Tika configuration.
| Method Detail |
|---|
public org.ow2.weblab.core.services.analyser.ProcessReturn process(org.ow2.weblab.core.services.analyser.ProcessArgs args)
throws org.ow2.weblab.core.services.InvalidParameterException,
org.ow2.weblab.core.services.ContentNotAvailableException,
org.ow2.weblab.core.services.UnexpectedException
Specified by:
process in interface org.ow2.weblab.core.services.Analyser
Throws:
org.ow2.weblab.core.services.InvalidParameterException
org.ow2.weblab.core.services.ContentNotAvailableException
org.ow2.weblab.core.services.UnexpectedException
protected org.ow2.weblab.core.model.Document checkArgs(org.ow2.weblab.core.services.analyser.ProcessArgs args)
throws org.ow2.weblab.core.services.InvalidParameterException
Gets the document inside the process args, or throws an InvalidParameterException if that is not possible.
Parameters:
args - The ProcessArgs of the process method.
Returns:
The Document that must be contained by args.
Throws:
org.ow2.weblab.core.services.InvalidParameterException - If the resource in args is null or not a Document.
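The validation described for checkArgs can be sketched with plain Java. All nested types below are simplified, hypothetical stand-ins for the WebLab model classes, not the real API, and the extra null check on `args` itself is an assumption that this page does not document.

```java
// Hypothetical illustration with stand-in types, not the WebLab source.
class CheckArgsSketch {

    static class Resource { }

    static class Document extends Resource { }

    static class ProcessArgs {
        final Resource resource;

        ProcessArgs(Resource resource) {
            this.resource = resource;
        }
    }

    static class InvalidParameterException extends Exception {
        InvalidParameterException(String message) {
            super(message);
        }
    }

    // Returns the Document held by args, or throws if there is none.
    static Document checkArgs(ProcessArgs args) throws InvalidParameterException {
        if (args == null) {
            // Assumption: a null args object is rejected as well.
            throw new InvalidParameterException("ProcessArgs was null.");
        }
        if (args.resource == null) {
            throw new InvalidParameterException("Resource in ProcessArgs was null.");
        }
        if (!(args.resource instanceof Document)) {
            throw new InvalidParameterException("Resource in ProcessArgs was not a Document.");
        }
        return (Document) args.resource;
    }

    public static void main(String[] argv) throws InvalidParameterException {
        Document document = new Document();
        System.out.println(checkArgs(new ProcessArgs(document)) == document); // prints true
    }
}
```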
public java.util.Map<java.lang.String,java.util.List<java.lang.String>> extractTextAndMetadata(org.ow2.weblab.core.model.Document document,
java.io.File contentFile,
boolean forceAutoDetectParser)
throws org.ow2.weblab.core.services.UnexpectedException,
org.ow2.weblab.core.services.ContentNotAvailableException
Parameters:
document - The document to be filled with MediaUnit units.
contentFile - The file to be parsed.
forceAutoDetectParser - Whether to let Tika guess the parser to use from the file content, or to use the existing MIME type on the document (dc:format) to select the appropriate parser.
Throws:
org.ow2.weblab.core.services.UnexpectedException - If the Tika parser fails.
org.ow2.weblab.core.services.ContentNotAvailableException - If the file is not reachable. (This should not occur, since its access has been checked beforehand.)
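The choice controlled by forceAutoDetectParser can be illustrated as a small decision function. This is only a sketch: the name `resolveMimeType`, the `Supplier` standing in for Tika's content-based detection, and the fallback to detection when no dc:format value is present are all assumptions, not documented behaviour of the service.

```java
import java.util.function.Supplier;

// Hypothetical illustration, not the WebLab source.
class ParserSelectionSketch {

    // Picks the MIME type that would drive parser selection.
    // declaredFormat models the dc:format value already on the document;
    // detector models detection from the file content.
    static String resolveMimeType(boolean forceAutoDetectParser,
                                  String declaredFormat,
                                  Supplier<String> detector) {
        // Assumption: a missing or empty declared format also falls back to detection.
        if (forceAutoDetectParser || declaredFormat == null || declaredFormat.isEmpty()) {
            return detector.get();
        }
        return declaredFormat;
    }

    public static void main(String[] args) {
        Supplier<String> detector = () -> "application/pdf";
        System.out.println(resolveMimeType(false, "text/plain", detector)); // prints text/plain
        System.out.println(resolveMimeType(true, "text/plain", detector));  // prints application/pdf
    }
}
```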
protected void annotate(org.ow2.weblab.core.model.Document document,
java.util.Map<java.lang.String,java.util.List<java.lang.String>> toAnnot)
Annotates document with the predicates and literals contained in toAnnot.
Parameters:
document - The Document to be annotated.
toAnnot - The Map of predicates and their literal values.
protected java.util.Map<java.lang.String,java.util.List<java.lang.String>> fillMapWithMetadata(org.apache.tika.metadata.Metadata metadata)
Converts the metadata extracted by Tika into a Map of predicates with their values that can be annotated.
Parameters:
metadata - The dirty map of metadata extracted by Tika.
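As a rough illustration of turning a "dirty" metadata map into something that can be annotated, the sketch below cleans a plain `Map<String, String[]>`. That input type is a stand-in for Tika's Metadata object (which exposes `names()` and `getValues(String)`), chosen so the example runs with the JDK alone. The cleaning rules shown (trimming, dropping blank values and empty entries) are assumptions, and the mapping from Tika property names to predicate URIs that the real method presumably performs is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not the WebLab source.
class MetadataMapSketch {

    // Builds a map of name -> non-blank, trimmed values, skipping names left without values.
    static Map<String, List<String>> toAnnotationMap(Map<String, String[]> rawMetadata) {
        Map<String, List<String>> cleaned = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> entry : rawMetadata.entrySet()) {
            List<String> values = new ArrayList<>();
            for (String value : entry.getValue()) {
                if (value != null && !value.trim().isEmpty()) {
                    values.add(value.trim());
                }
            }
            if (!values.isEmpty()) {
                cleaned.put(entry.getKey(), values);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        Map<String, String[]> raw = new LinkedHashMap<>();
        raw.put("title", new String[] {" Annual report ", ""});
        raw.put("author", new String[] {"   "});
        System.out.println(toAnnotationMap(raw)); // prints {title=[Annual report]}
    }
}
```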
protected static java.util.List<java.lang.String> addUnitOnValues(java.util.List<java.lang.String> values,
java.lang.String unit)
Adds the unit to each value of the list.
Parameters:
values - The List of values.
unit - The unit to add.
Returns:
The List of values with the unit added.
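A minimal sketch of what addUnitOnValues might look like, assuming the unit is appended after a single space. The separator, and whether the original returns a new list or modifies the input, are not stated on this page; the class name `UnitSuffixSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not the WebLab source.
class UnitSuffixSketch {

    // Returns a new list in which every value is followed by the unit.
    // Assumption: a single space separates the value from the unit.
    static List<String> addUnitOnValues(List<String> values, String unit) {
        List<String> result = new ArrayList<>(values.size());
        for (String value : values) {
            result.add(value + " " + unit);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(addUnitOnValues(List.of("1024", "768"), "pixels"));
        // prints [1024 pixels, 768 pixels]
    }
}
```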