public class OCRService extends Object
The text information is stored in the $file attribute 'text'.
For PDF files with textual content the PDFBox API is used. In all other cases, the content is sent via a REST API to the Tika server for OCR processing. The environment variable OCR_PDF_MODE defines how PDF files are scanned. Possible values are TEXT_ONLY, OCR_ONLY, and TEXT_AND_OCR (default).
For OCR processing the service expects a valid REST API endpoint defined by the environment variable 'TIKA_SERVICE_ENDPONT'. If TIKA_SERVICE_ENDPONT is not set, OCR processing is skipped.
The environment variable 'TIKA_SERVICE_MODE' must be set to 'auto' to enable the service.
See also the project: https://github.com/imixs/imixs-docker/tree/master/tika
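The configuration rules above can be sketched as follows. This is a minimal illustration, not the service's actual implementation: the class `OcrConfigSketch`, its methods, and the `PdfMode` enum are hypothetical names, and the sketch assumes that an empty endpoint value is treated the same as an unset one. Only the mode values and the 'auto' setting come from the description.

```java
import java.util.Locale;

// Hypothetical helper; not part of the OCRService API.
final class OcrConfigSketch {

    // The three PDF scan modes named in the description.
    enum PdfMode { TEXT_ONLY, OCR_ONLY, TEXT_AND_OCR }

    // OCR runs only if an endpoint is configured and the service mode is 'auto'.
    // Assumption: a blank endpoint counts as "not set".
    static boolean isOcrEnabled(String endpoint, String serviceMode) {
        return endpoint != null && !endpoint.isBlank() && "auto".equals(serviceMode);
    }

    // Falls back to TEXT_AND_OCR, the documented default, when no mode is given.
    // An unknown value raises IllegalArgumentException (a choice made for this sketch).
    static PdfMode resolvePdfMode(String value) {
        if (value == null || value.isBlank()) {
            return PdfMode.TEXT_AND_OCR;
        }
        return PdfMode.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}
```

A caller would typically pass in the values read with System.getenv, for example OcrConfigSketch.resolvePdfMode(System.getenv("OCR_PDF_MODE")).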
| Modifier and Type | Field and Description |
|---|---|
| static String | DEFAULT_ENCODING |
| static String | ENV_OCR_SERVICE_ENDPOINT |
| static String | ENV_OCR_SERVICE_MODE |
| static String | ENV_OCR_STRATEGY |
| static String | FILE_ATTRIBUTE_TEXT |
| static String | OCR_STRATEGY_AUTO |
| static String | OCR_STRATEGY_NO_OCR |
| static String | OCR_STRATEGY_OCR_AND_TEXT_EXTRACTION |
| static String | OCR_STRATEGY_OCR_ONLY |
| static String | PLUGIN_ERROR |
| Constructor and Description |
|---|
| OCRService() |
| Modifier and Type | Method and Description |
|---|---|
| String | doORCProcessing(org.imixs.workflow.FileData fileData, List<String> options): This method sends the content of a document to the Tika REST API for OCR processing. |
| void | extractText(org.imixs.workflow.ItemCollection workitem, org.imixs.workflow.ItemCollection snapshot): Extracts the textual information from document attachments. |
| void | extractText(org.imixs.workflow.ItemCollection workitem, org.imixs.workflow.ItemCollection snapshot, String _ocrStategy, List<String> options): Extracts the textual information from document attachments. |
public static final String FILE_ATTRIBUTE_TEXT
public static final String DEFAULT_ENCODING
public static final String PLUGIN_ERROR
public static final String ENV_OCR_SERVICE_ENDPOINT
public static final String ENV_OCR_SERVICE_MODE
public static final String ENV_OCR_STRATEGY
public static final String OCR_STRATEGY_NO_OCR
public static final String OCR_STRATEGY_OCR_AND_TEXT_EXTRACTION
public static final String OCR_STRATEGY_OCR_ONLY
public static final String OCR_STRATEGY_AUTO
public void extractText(org.imixs.workflow.ItemCollection workitem,
org.imixs.workflow.ItemCollection snapshot)
throws org.imixs.workflow.exceptions.PluginException
The method extracts the textual content for each new document of a given workitem. For PDF files with textual content the method calls the method 'extractTextFromPDF', which uses the PDFBox API. In all other cases, the method sends the content via a REST API to the Tika server for OCR processing.
The result is stored in the fileData attribute 'text'.
Parameters:
workitem
Throws:
org.imixs.workflow.exceptions.PluginException
public void extractText(org.imixs.workflow.ItemCollection workitem,
org.imixs.workflow.ItemCollection snapshot,
String _ocrStategy,
List<String> options)
throws org.imixs.workflow.exceptions.PluginException
The method extracts the textual content for each new file attachment of a given workitem. The text information is stored in the $file attribute 'text'.
For PDF files with textual content the method calls the method 'extractTextFromPDF', which uses the PDFBox API. In all other cases, the method sends the content via a REST API to the Tika server for OCR processing.
The method also extracts files already stored in a snapshot workitem. In this case the method tests if the $file attribute 'text' already exists.
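The per-file decision described above can be sketched as follows. This is an illustration under stated assumptions, not the method's actual code: `ExtractionFlowSketch`, `Action`, and `chooseAction` are hypothetical names, the file attributes are modelled as a plain map, PDF detection by the content type "application/pdf" is an assumption, and the OCR strategy argument is deliberately left out.

```java
import java.util.Map;

// Hypothetical sketch; not part of the OCRService API.
final class ExtractionFlowSketch {

    enum Action { SKIP, PDF_TEXT, OCR }

    // fileAttributes: the attributes of one file, keyed by name (assumed shape).
    // pdfHasText: whether the PDF already contains a text layer.
    static Action chooseAction(Map<String, Object> fileAttributes,
                               String contentType,
                               boolean pdfHasText) {
        // A file that already carries a 'text' attribute is not processed again.
        if (fileAttributes.get("text") != null) {
            return Action.SKIP;
        }
        // Assumption: PDF files are recognised by their content type.
        boolean isPdf = "application/pdf".equalsIgnoreCase(contentType);
        if (isPdf && pdfHasText) {
            return Action.PDF_TEXT; // text extraction with PDFBox
        }
        return Action.OCR; // all other cases go to the Tika server
    }
}
```

In this sketch the check for an existing 'text' attribute comes first, so files taken over from a snapshot are never sent to the Tika server a second time.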
Parameters:
workitem - workitem with file attachments
pdf_mode - TEXT_ONLY, OCR_ONLY, TEXT_AND_OCR
options - optional Tika header params
Throws:
org.imixs.workflow.exceptions.PluginException
public String doORCProcessing(org.imixs.workflow.FileData fileData,
List<String> options)
throws IOException
If the contentType is PDF, the following Tika-specific header is added:
X-Tika-PDFOcrStrategy: ocr_only
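A request carrying this header could be assembled roughly as shown below, using the JDK's java.net.http API (Java 11 or later). Several details are assumptions made for the sketch and are not confirmed by this page: the class name `TikaRequestSketch`, the use of HTTP PUT with a plain-text response, PDF detection by the content type "application/pdf", the "Name=value" format of the entries in options, and the rule that an option overrides the default header.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;

// Hypothetical sketch; not the actual doORCProcessing implementation.
final class TikaRequestSketch {

    static HttpRequest buildRequest(String endpoint, byte[] content,
                                    String contentType, List<String> options) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", contentType)
                .header("Accept", "text/plain") // assumption: plain text is requested
                .PUT(HttpRequest.BodyPublishers.ofByteArray(content));

        // The header named above is added for PDF content only.
        if ("application/pdf".equalsIgnoreCase(contentType)) {
            builder.setHeader("X-Tika-PDFOcrStrategy", "ocr_only");
        }

        // Assumption: each option has the form "Header-Name=value".
        // setHeader replaces an earlier value, so an option overrides the default.
        if (options != null) {
            for (String option : options) {
                int sep = option.indexOf('=');
                if (sep > 0) {
                    builder.setHeader(option.substring(0, sep).trim(),
                                      option.substring(sep + 1).trim());
                }
            }
        }
        return builder.build();
    }
}
```

The resulting request can be sent with java.net.http.HttpClient and HttpResponse.BodyHandlers.ofString() to obtain the extracted text.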
Parameters:
fileData - file content and metadata
Throws:
IOException
Copyright © 2016–2020 Imixs Software Solutions GmbH. All rights reserved.