java.lang.Object
  org.ow2.weblab.service.normaliser.tika.TikaExtractorService
public class TikaExtractorService
The Tika extractor is quite simple, since it does not handle the structure of documents (sheets in Excel, paragraphs in Word, etc.). That structure could have been represented as various MediaUnits.
| Field Summary | |
|---|---|
| protected org.ow2.weblab.content.api.ContentManager | contentManager: The ContentManager to use. |
| protected org.apache.commons.logging.Log | logger: The logger to be used inside this class. |
| protected MetadataWriter | metadataWriter: The MetadataWriter to use. |
| protected boolean | removeContent: Whether or not to remove content. |
| protected TikaConfiguration | serviceConfig: The configuration to be used for the service. |
| protected java.text.DateFormat | simpleDateFormat: The formatter used to annotate dates (like 2011-12-31). |
| protected TikaConfig | tikaConfig: The configuration of Tika itself. |
| Constructor Summary | |
|---|---|
| TikaExtractorService(TikaConfiguration conf) | The only constructor of this class; it requires a configuration. |
| Method Summary | |
|---|---|
| protected Document | checkArgs(ProcessArgs args): Get the document inside the process args or throw an InvalidParameterException if not possible. |
| protected Metadata | extractTextAndMetadata(Document document, java.io.File contentFile, boolean forceAutoDetectParser) |
| ProcessReturn | process(ProcessArgs args) |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
protected final org.apache.commons.logging.Log logger
protected final org.ow2.weblab.content.api.ContentManager contentManager
The ContentManager to use. Various implementations exist; they are defined through a configuration file.
protected final TikaConfiguration serviceConfig
protected final TikaConfig tikaConfig
protected final boolean removeContent
protected final java.text.DateFormat simpleDateFormat
protected MetadataWriter metadataWriter
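The date formatter above is documented only by its example output (2011-12-31). The following is a minimal sketch of a formatter that produces that shape. The pattern `yyyy-MM-dd`, the UTC time zone, and the class name `DateAnnotationSketch` are assumptions made for illustration; the page does not state how the service actually configures its formatter.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class DateAnnotationSketch {

    // Assumed pattern, inferred from the documented example "2011-12-31".
    static final String ASSUMED_PATTERN = "yyyy-MM-dd";

    // Builds a fresh formatter on every call, because SimpleDateFormat
    // instances are not safe to share between threads.
    static DateFormat newFormat() {
        SimpleDateFormat format = new SimpleDateFormat(ASSUMED_PATTERN, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: UTC
        format.setLenient(false);                        // reject dates such as 2011-13-45
        return format;
    }

    // Formats a date as year-month-day, for example 2011-12-31.
    static String format(Date date) {
        return newFormat().format(date);
    }

    public static void main(String[] args) throws ParseException {
        Date date = newFormat().parse("2011-12-31");
        System.out.println(format(date)); // prints 2011-12-31
    }
}
```

Fixing the locale and time zone keeps the output independent of the machine the code runs on, which matters when the formatted date is stored as an annotation.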
| Constructor Detail |
|---|
public TikaExtractorService(TikaConfiguration conf) throws TikaException, java.io.IOException
Parameters:
conf - The service configuration.
Throws:
java.io.IOException - If an error occurs accessing the Tika configuration or instantiating the content manager.
TikaException - If an error occurs reading the Tika configuration.

| Method Detail |
|---|
public ProcessReturn process(ProcessArgs args) throws InvalidParameterException, ContentNotAvailableException, UnexpectedException
Specified by:
process in interface Analyser
Throws:
InvalidParameterException
ContentNotAvailableException
UnexpectedException
protected Document checkArgs(ProcessArgs args) throws InvalidParameterException
Get the document inside the process args or throw an InvalidParameterException if not possible.
Parameters:
args - The ProcessArgs of the process method.
Returns:
The Document that must be contained by args.
Throws:
InvalidParameterException - If the resource in args is null or not a Document.
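The contract described for checkArgs (return the Document, or throw when the resource is null or of another type) can be sketched as follows. All nested types here are hypothetical stand-ins so that the example compiles on its own; they are not the WebLab classes, and the real method body is not shown on this page. The extra null check on `args` itself is also an assumption.

```java
final class CheckArgsSketch {

    // Hypothetical stand-ins for the WebLab model and exception types.
    static class Resource { }

    static class Document extends Resource { }

    static class ProcessArgs {
        private final Resource resource;

        ProcessArgs(Resource resource) {
            this.resource = resource;
        }

        Resource getResource() {
            return resource;
        }
    }

    static class InvalidParameterException extends Exception {
        InvalidParameterException(String message) {
            super(message);
        }
    }

    // Returns the Document carried by args, or throws if there is none.
    static Document checkArgs(ProcessArgs args) throws InvalidParameterException {
        if (args == null) { // assumption: a null ProcessArgs is rejected too
            throw new InvalidParameterException("ProcessArgs was null.");
        }
        Resource resource = args.getResource();
        if (resource == null) {
            throw new InvalidParameterException("Resource in ProcessArgs was null.");
        }
        if (!(resource instanceof Document)) {
            throw new InvalidParameterException("Resource in ProcessArgs was not a Document.");
        }
        return (Document) resource;
    }

    public static void main(String[] unused) throws InvalidParameterException {
        Document document = new Document();
        // The same instance that was placed in the args comes back.
        System.out.println(checkArgs(new ProcessArgs(document)) == document);
    }
}
```

Validating once at the entry point lets the rest of the processing code work with a typed Document and skip repeated null and type checks.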
protected Metadata extractTextAndMetadata(Document document, java.io.File contentFile, boolean forceAutoDetectParser) throws UnexpectedException, ContentNotAvailableException
Parameters:
document - The document to be filled with MediaUnit units.
contentFile - The file to be parsed.
forceAutoDetectParser - Whether to let Tika guess the parser to use from the file content, or to use the existing MIME type on the document (dc:format) to select the appropriate parser.
Throws:
UnexpectedException - If the Tika parser fails.
ContentNotAvailableException - If the file is not reachable. (This should not happen, since access to the file has been checked beforehand.)
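The effect of forceAutoDetectParser can be illustrated with a small decision function. This is only a sketch of the choice described above, not the service's code: the registry map, the parser names, and the `ParserSelectionSketch` class are hypothetical, and falling back to auto-detection when dc:format is missing or unknown is an assumption that this page does not confirm.

```java
import java.util.HashMap;
import java.util.Map;

final class ParserSelectionSketch {

    // Marker meaning "let the parser be guessed from the file content".
    static final String AUTO_DETECT = "auto-detect";

    // Picks a parser name from a hypothetical MIME type registry.
    // dcFormat is the MIME type already annotated on the document, if any.
    static String selectParser(String dcFormat,
                               boolean forceAutoDetectParser,
                               Map<String, String> parsersByMimeType) {
        if (forceAutoDetectParser || dcFormat == null) {
            return AUTO_DETECT;
        }
        String parser = parsersByMimeType.get(dcFormat);
        // Assumption: an unknown MIME type falls back to auto-detection.
        return parser != null ? parser : AUTO_DETECT;
    }

    public static void main(String[] args) {
        Map<String, String> registry = new HashMap<>();
        registry.put("application/pdf", "pdf-parser"); // hypothetical parser name

        System.out.println(selectParser("application/pdf", false, registry)); // prints pdf-parser
        System.out.println(selectParser("application/pdf", true, registry));  // prints auto-detect
    }
}
```

Trusting an existing dc:format avoids a second detection pass, while forcing auto-detection is the safer option when that annotation may be wrong.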