public class TikaHelperService extends Object
| Modifier and Type | Field and Description |
|---|---|
| static String | DEFAULT_ENCODING |
| static String | ENV_OCR_SERVICE_ENDPOINT |
| static String | ENV_OCR_SERVICE_MODE |
| static String | ENV_OCR_STRATEGY |
| static String | OCR_STRATEGY_AUTO |
| static String | OCR_STRATEGY_NO_OCR |
| static String | OCR_STRATEGY_OCR_AND_TEXT_EXTRACTION |
| static String | OCR_STRATEGY_OCR_ONLY |
| static String | PLUGIN_ERROR |

| Constructor and Description |
|---|
| TikaHelperService() |

| Modifier and Type | Method and Description |
|---|---|
| String | doORCProcessing(org.imixs.workflow.FileData fileData, List<String> options)<br>This method sends the content of a document to the Tika REST API for OCR processing. |
| String | extractText(org.imixs.workflow.ItemCollection snapshot, Pattern mlFilenamePattern, String _ocrStategy, List<String> options)<br>Extracts the textual information from document attachments. |

public static final String DEFAULT_ENCODING
public static final String PLUGIN_ERROR
public static final String ENV_OCR_SERVICE_ENDPOINT
public static final String ENV_OCR_SERVICE_MODE
public static final String ENV_OCR_STRATEGY
public static final String OCR_STRATEGY_NO_OCR
public static final String OCR_STRATEGY_OCR_AND_TEXT_EXTRACTION
public static final String OCR_STRATEGY_OCR_ONLY
public static final String OCR_STRATEGY_AUTO
public String extractText(org.imixs.workflow.ItemCollection snapshot, Pattern mlFilenamePattern, String _ocrStategy, List<String> options) throws org.imixs.workflow.exceptions.PluginException
The method extracts the textual content of each new file attachment of a given workitem. The extracted text is stored in the $file attribute 'text'.
For PDF files with textual content, the method calls 'extractTextFromPDF', which uses the PDFBox API. In all other cases, the method sends the content to the Tika server via a REST API for OCR processing.
The method also processes files already stored in a snapshot workitem. In this case, the method first tests whether the $file attribute 'text' already exists.
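The routing described above can be sketched as a small decision function. This is an illustrative sketch only, not the actual implementation: the class `ExtractionRouteSketch`, the `Route` enum, and the `pdfHasText` flag are hypothetical names, and the sketch assumes that a file whose 'text' attribute is already filled is skipped.

```java
import java.util.Map;

// Hypothetical sketch of the routing described above; not part of TikaHelperService.
class ExtractionRouteSketch {

    // Hypothetical labels for the three possible outcomes.
    enum Route { SKIP, PDFBOX_TEXT, TIKA_OCR }

    /**
     * Decides how a single attachment would be handled.
     *
     * @param contentType    MIME type of the attachment
     * @param pdfHasText     whether a PDF carries embedded text (assumed to be known)
     * @param fileAttributes the $file attributes of the attachment
     */
    static Route chooseRoute(String contentType, boolean pdfHasText,
                             Map<String, Object> fileAttributes) {
        // Assumption: an existing, non-empty 'text' attribute means nothing is left to do.
        Object text = fileAttributes.get("text");
        if (text != null && !text.toString().isEmpty()) {
            return Route.SKIP;
        }
        // PDF files with textual content are read directly (PDFBox in the real service).
        if ("application/pdf".equalsIgnoreCase(contentType) && pdfHasText) {
            return Route.PDFBOX_TEXT;
        }
        // Everything else is sent to the Tika server for OCR processing.
        return Route.TIKA_OCR;
    }

    public static void main(String[] args) {
        System.out.println(chooseRoute("application/pdf", true, Map.of()));
        System.out.println(chooseRoute("image/png", false, Map.of()));
        System.out.println(chooseRoute("application/pdf", true, Map.of("text", "already extracted")));
    }
}
```

The sketch leaves out the OCR strategy argument, since this page does not describe how each strategy value changes the routing.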
Parameters:
workitem - workitem with file attachments
pdf_mode - TEXT_ONLY, OCR_ONLY, TEXT_AND_OCR
options - optional Tika header params
Throws:
org.imixs.workflow.exceptions.PluginException

public String doORCProcessing(org.imixs.workflow.FileData fileData, List<String> options) throws IOException
If the content type is PDF, the following Tika-specific header is added:
X-Tika-PDFOcrStrategy: ocr_only
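A minimal sketch of how such request headers could be assembled is shown below. It is not the service's actual code: the class `TikaHeaderSketch`, the method `buildHeaders`, and the `Name=Value` format assumed for the entries of `options` are all hypothetical. Only the header name and value `X-Tika-PDFOcrStrategy: ocr_only` come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not part of TikaHelperService.
class TikaHeaderSketch {

    /**
     * Builds the header map for a request to the Tika server.
     *
     * @param contentType MIME type of the file to be processed
     * @param options     optional header params, assumed here to be "Name=Value" strings
     */
    static Map<String, String> buildHeaders(String contentType, List<String> options) {
        Map<String, String> headers = new LinkedHashMap<>();

        // For PDF content the Tika-specific OCR strategy header is added.
        if ("application/pdf".equalsIgnoreCase(contentType)) {
            headers.put("X-Tika-PDFOcrStrategy", "ocr_only");
        }

        // Assumption: each option is "Name=Value"; malformed or null entries are ignored.
        if (options != null) {
            for (String option : options) {
                if (option == null) {
                    continue;
                }
                int sep = option.indexOf('=');
                if (sep > 0) {
                    String name = option.substring(0, sep).trim();
                    String value = option.substring(sep + 1).trim();
                    headers.put(name, value); // a later entry replaces an earlier one
                }
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        System.out.println(buildHeaders("application/pdf", List.of()));
        System.out.println(buildHeaders("image/png", List.of("X-Tika-OCRLanguage=eng")));
    }
}
```

In this sketch an option with the same header name replaces the default value. Whether the real service lets options take precedence is not stated here.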
Parameters:
fileData - file content and metadata
Throws:
IOException

Copyright © 2020–2021 Imixs Software Solutions GmbH. All rights reserved.