public class OCRService extends Object
The text information is stored in the $file attribute 'text'.
For PDF files with textual content, the PDFBox API is used. In all other cases, the service sends the content via a REST API to the Tika server for OCR processing.
For OCR processing, the service expects a valid REST API endpoint defined by the environment parameter 'TIKA_SERVICE_ENDPOINT'. If TIKA_SERVICE_ENDPOINT is not set, the service is skipped.
The environment parameter 'TIKA_SERVICE_MODE' must be set to 'auto' to enable the service.
See also the project: https://github.com/imixs/imixs-docker/tree/master/tika
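
The two environment settings described above can be illustrated with a small check. This is only a sketch: the class `TikaEnvironment` and the method `isOcrEnabled` are hypothetical helpers and not part of OCRService, and it assumes the variables are named TIKA_SERVICE_ENDPOINT and TIKA_SERVICE_MODE and that the mode is compared as an exact match against 'auto'.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of OCRService.
final class TikaEnvironment {
    // Assumed names of the environment variables described in the class documentation.
    static final String ENV_TIKA_SERVICE_ENDPOINT = "TIKA_SERVICE_ENDPOINT";
    static final String ENV_TIKA_SERVICE_MODE = "TIKA_SERVICE_MODE";

    // Returns true only if an endpoint is configured and the mode is 'auto'.
    static boolean isOcrEnabled(Map<String, String> env) {
        String endpoint = env.get(ENV_TIKA_SERVICE_ENDPOINT);
        String mode = env.get(ENV_TIKA_SERVICE_MODE);
        boolean hasEndpoint = endpoint != null && !endpoint.trim().isEmpty();
        return hasEndpoint && "auto".equals(mode);
    }

    public static void main(String[] args) {
        // Evaluates the real process environment.
        System.out.println("OCR enabled: " + isOcrEnabled(System.getenv()));
    }
}
```

Passing the environment as a map keeps the check easy to exercise without changing real process variables.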
| Modifier and Type | Field and Description |
|---|---|
| static String | DEFAULT_ENCODING |
| static String | ENV_TIKA_OCR_MODE |
| static String | ENV_TIKA_SERVICE_ENDPOINT |
| static String | ENV_TIKA_SERVICE_MODE |
| static String | FILE_ATTRIBUTE_TEXT |
| static String | PLUGIN_ERROR |
| Constructor and Description |
|---|
| OCRService() |
| Modifier and Type | Method and Description |
|---|---|
| String | doORCProcessing(org.imixs.workflow.FileData fileData, List<String> options)<br>This method sends the content of a document to the Tika REST API for OCR processing. |
| String | doPDFTextExtraction(org.imixs.workflow.FileData fileData)<br>This method extracts the text from the given content of a PDF file. |
| void | extractText(org.imixs.workflow.ItemCollection workitem, org.imixs.workflow.ItemCollection snapshot)<br>Extracts the textual information from document attachments. |
| void | extractText(org.imixs.workflow.ItemCollection workitem, org.imixs.workflow.ItemCollection snapshot, String _ocrmode, List<String> options)<br>Extracts the textual information from document attachments. |
public static final String FILE_ATTRIBUTE_TEXT
public static final String DEFAULT_ENCODING
public static final String PLUGIN_ERROR
public static final String ENV_TIKA_SERVICE_ENDPOINT
public static final String ENV_TIKA_SERVICE_MODE
public static final String ENV_TIKA_OCR_MODE
public void extractText(org.imixs.workflow.ItemCollection workitem,
org.imixs.workflow.ItemCollection snapshot)
throws org.imixs.workflow.exceptions.PluginException
The method extracts the textual content of each new document of a given workitem. For PDF files with textual content, the method calls the method 'extractTextFromPDF' using the PDFBox API. In all other cases, the method sends the content via a REST API to the Tika server for OCR processing.
The result is stored in the fileData attribute 'text'.
Parameters:
workitem -
Throws:
org.imixs.workflow.exceptions.PluginException
public void extractText(org.imixs.workflow.ItemCollection workitem,
org.imixs.workflow.ItemCollection snapshot,
String _ocrmode,
List<String> options)
throws org.imixs.workflow.exceptions.PluginException
The method extracts the textual content for each new file attachment of a given workitem. The text information is stored in the $file attribute 'text'.
For PDF files with textual content, the method calls the method 'extractTextFromPDF' using the PDFBox API. In all other cases, the method sends the content via a REST API to the Tika server for OCR processing.
The method also extracts files already stored in a snapshot workitem. In this case the method tests if the $file attribute 'text' already exists.
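
The choice between PDFBox extraction and OCR can be sketched as a small dispatch over the OCR mode. This is a hedged illustration, not the service's actual code: the class `OcrModeSketch`, the enum `Mode` and the two extractor functions are hypothetical, and the semantics of MIXED (try PDF text extraction first, fall back to OCR when no text is found) are an assumption that the documentation does not spell out.

```java
import java.util.function.Function;

// Hypothetical sketch for illustration only; not part of OCRService.
final class OcrModeSketch {
    // Mode names taken from the _ocrmode parameter; modeled as an enum here for brevity.
    enum Mode { PDF_ONLY, OCR_ONLY, MIXED }

    // pdfExtractor and ocr are stand-ins for the PDFBox and Tika calls.
    static String extract(Mode mode, String contentType, byte[] content,
            Function<byte[], String> pdfExtractor, Function<byte[], String> ocr) {
        boolean isPdf = "application/pdf".equalsIgnoreCase(contentType);
        switch (mode) {
        case PDF_ONLY:
            // Assumption: non-PDF files are skipped in this mode.
            return isPdf ? pdfExtractor.apply(content) : null;
        case OCR_ONLY:
            // Every file is sent to OCR, including PDFs.
            return ocr.apply(content);
        default:
            // MIXED (assumed): prefer embedded PDF text, otherwise fall back to OCR.
            if (isPdf) {
                String text = pdfExtractor.apply(content);
                if (text != null && !text.trim().isEmpty()) {
                    return text;
                }
            }
            return ocr.apply(content);
        }
    }

    public static void main(String[] args) {
        String result = extract(Mode.MIXED, "application/pdf", new byte[0],
                bytes -> "", bytes -> "text from OCR");
        System.out.println(result);
    }
}
```

Injecting the two extractors as functions keeps the dispatch logic testable without a PDF parser or a running Tika server.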
Parameters:
workitem - workitem with file attachments
_ocrmode - PDF_ONLY, OCR_ONLY, MIXED
options - optional Tika header params
Throws:
org.imixs.workflow.exceptions.PluginException
public String doORCProcessing(org.imixs.workflow.FileData fileData,
List<String> options)
throws IOException
If the content type is PDF, the following Tika-specific header is added:
X-Tika-PDFOcrStrategy: ocr_only
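
How such a request might be assembled can be sketched with the JDK HTTP client (Java 11 or later). This is not the service's implementation: the class `TikaRequestSketch`, the use of PUT, the `Accept: text/plain` header and the `Name=value` format of the option strings are all assumptions made for illustration. The sketch only builds the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;

// Hypothetical sketch for illustration only; not part of OCRService.
final class TikaRequestSketch {

    // Builds (but does not send) a request carrying the file content to a Tika endpoint.
    static HttpRequest buildRequest(String endpoint, String contentType, byte[] content,
            List<String> options) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", contentType)
                .header("Accept", "text/plain") // assumption: plain text is requested
                .PUT(HttpRequest.BodyPublishers.ofByteArray(content)); // assumption: PUT
        // For PDF content, add the Tika-specific OCR strategy header from the documentation.
        if ("application/pdf".equalsIgnoreCase(contentType)) {
            builder.header("X-Tika-PDFOcrStrategy", "ocr_only");
        }
        // Assumption: each option has the form "Header-Name=value".
        for (String option : options) {
            int pos = option.indexOf('=');
            if (pos > 0) {
                builder.header(option.substring(0, pos).trim(),
                        option.substring(pos + 1).trim());
            }
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // The endpoint URL is a placeholder, not a value taken from the documentation.
        HttpRequest request = buildRequest("http://localhost:9998/tika",
                "application/pdf", new byte[0], List.of());
        System.out.println(request.headers().map());
    }
}
```

A caller would pass the built request to `java.net.http.HttpClient.send(...)` with a string body handler to obtain the extracted text.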
Parameters:
fileData - file content and metadata
Throws:
IOException
public String doPDFTextExtraction(org.imixs.workflow.FileData fileData)
Extracting text is one of the main features of the PDFBox library. Text is extracted using the getText() method of the PDFTextStripper class, which extracts all the text from the given PDF document.
Parameters:
content -
Copyright © 2016–2020 Imixs Software Solutions GmbH. All rights reserved.