public class TikaService extends Object
For text extraction, the service sends the content of a document to an instance of an Apache Tika server via the REST API. The environment variable OCR_STRATEGY defines how PDF files will be scanned. The supported strategies correspond to the constants OCR_STRATEGY_AUTO, OCR_STRATEGY_NO_OCR, OCR_STRATEGY_OCR_ONLY and OCR_STRATEGY_OCR_AND_TEXT_EXTRACTION.
The service expects a valid REST API endpoint of a Tika server instance, defined by the environment parameter 'TIKA_SERVICE_ENDPOINT'.
The environment parameter 'TIKA_SERVICE_MODE' must be set to 'auto' to enable the service.
See also the project: https://github.com/imixs/imixs-docker/tree/master/tika
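The configuration described above can be pictured as a small check. This is only a minimal sketch, not the service's actual code: the class `TikaConfigCheck` and the method `isEnabled` are hypothetical, and the variable names are taken from the description above, so their exact spelling should be verified against the deployment.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the Imixs API.
final class TikaConfigCheck {

    // Names as given in the class description (assumed spelling).
    static final String ENDPOINT_VAR = "TIKA_SERVICE_ENDPOINT";
    static final String MODE_VAR = "TIKA_SERVICE_MODE";

    /** Returns true if the mode is 'auto' and a non-empty endpoint is configured. */
    static boolean isEnabled(Map<String, String> env) {
        String mode = env.get(MODE_VAR);
        String endpoint = env.get(ENDPOINT_VAR);
        return "auto".equals(mode)
                && endpoint != null
                && !endpoint.trim().isEmpty();
    }

    public static void main(String[] args) {
        // System.getenv() exposes the process environment as a map.
        System.out.println("Tika text extraction enabled: " + isEnabled(System.getenv()));
    }
}
```

Passing the environment as a map keeps the check easy to exercise without changing real process variables.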
| Modifier and Type | Field and Description |
|---|---|
| static String | DEFAULT_ENCODING |
| static String | ENV_OCR_SERVICE_ENDPOINT |
| static String | ENV_OCR_SERVICE_MAXFILESIZE |
| static String | ENV_OCR_SERVICE_MODE |
| static String | ENV_OCR_STRATEGY |
| static String | FILE_ATTRIBUTE_TEXT |
| static String | OCR_STRATEGY_AUTO |
| static String | OCR_STRATEGY_NO_OCR |
| static String | OCR_STRATEGY_OCR_AND_TEXT_EXTRACTION |
| static String | OCR_STRATEGY_OCR_ONLY |
| static String | PLUGIN_ERROR |
| Constructor and Description |
|---|
| TikaService() |
| Modifier and Type | Method and Description |
|---|---|
| String | doORCProcessing(org.imixs.workflow.FileData fileData, List<String> options, int maxPdfPages): This method sends the content of a document to the Tika Rest API for OCR processing. |
| void | extractText(org.imixs.workflow.ItemCollection workitem, org.imixs.workflow.ItemCollection snapshot): Extracts the textual information from document attachments. |
| void | extractText(org.imixs.workflow.ItemCollection workitem, org.imixs.workflow.ItemCollection snapshot, String _ocrStategy, List<String> options, String filePatternRegex, int maxPdfPages): Extracts the textual information from document attachments. |
public static final String FILE_ATTRIBUTE_TEXT
public static final String DEFAULT_ENCODING
public static final String PLUGIN_ERROR
public static final String ENV_OCR_SERVICE_ENDPOINT
public static final String ENV_OCR_SERVICE_MODE
public static final String ENV_OCR_SERVICE_MAXFILESIZE
public static final String ENV_OCR_STRATEGY
public static final String OCR_STRATEGY_NO_OCR
public static final String OCR_STRATEGY_OCR_AND_TEXT_EXTRACTION
public static final String OCR_STRATEGY_OCR_ONLY
public static final String OCR_STRATEGY_AUTO
public void extractText(org.imixs.workflow.ItemCollection workitem,
org.imixs.workflow.ItemCollection snapshot)
throws org.imixs.workflow.exceptions.PluginException
The method extracts the textual content of each new document of a given workitem. For PDF files with textual content, the method calls 'extractTextFromPDF', which uses the PDFBox API. In all other cases, the method sends the content via the REST API to the Tika server for OCR processing.
The result is stored in the fileData attribute 'text'.
Parameters:
workitem -
Throws:
org.imixs.workflow.exceptions.PluginException
public void extractText(org.imixs.workflow.ItemCollection workitem,
org.imixs.workflow.ItemCollection snapshot,
String _ocrStategy,
List<String> options,
String filePatternRegex,
int maxPdfPages)
throws org.imixs.workflow.exceptions.PluginException
The method extracts the textual content for each new file attachment of a given workitem. The text information is stored in the $file attribute 'text'.
For PDF files with textual content, the method calls 'extractTextFromPDF', which uses the PDFBox API. In all other cases, the method sends the content via the REST API to the Tika server for OCR processing.
The method also extracts files already stored in a snapshot workitem. In this case the method tests if the $file attribute 'text' already exists.
The optional param 'filePatternRegex' can be provided to extract text only from attachments matching the given file pattern (regex).
The optional param 'maxPdfPages' can be provided to limit PDF documents to a maximum number of pages. This avoids blocking the Tika service with overly large documents. For example, only the first 5 pages can be scanned.
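The effect of the two optional parameters can be sketched in plain Java. This is an illustration under stated assumptions, not the service's implementation: the class `AttachmentFilter` and its methods are hypothetical, the pattern is applied as a partial match, and a page limit of zero or less is treated as "no limit".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the Imixs API.
final class AttachmentFilter {

    /** Selects the file names matching the regex; a null or empty pattern selects all. */
    static List<String> select(List<String> fileNames, String filePatternRegex) {
        if (filePatternRegex == null || filePatternRegex.isEmpty()) {
            return new ArrayList<>(fileNames);
        }
        Pattern pattern = Pattern.compile(filePatternRegex);
        List<String> result = new ArrayList<>();
        for (String name : fileNames) {
            // Assumption: a partial match (find) is sufficient.
            if (pattern.matcher(name).find()) {
                result.add(name);
            }
        }
        return result;
    }

    /** Number of pages to scan. Assumption: maxPdfPages <= 0 means no limit. */
    static int pagesToScan(int totalPages, int maxPdfPages) {
        return maxPdfPages > 0 ? Math.min(totalPages, maxPdfPages) : totalPages;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("invoice.pdf", "photo.png", "notes.txt");
        System.out.println(select(names, "\\.pdf$")); // prints [invoice.pdf]
        System.out.println(pagesToScan(120, 5));      // prints 5
    }
}
```

With the pattern `\.pdf$` only PDF attachments are selected, and a 120-page document is cut down to its first 5 pages before it is sent for OCR.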
Parameters:
workitem - workitem with file attachments
pdf_mode - TEXT_ONLY, OCR_ONLY, TEXT_AND_OCR
options - optional Tika header params
filePatternRegex - optional regular expression to match files
Throws:
org.imixs.workflow.exceptions.PluginException
public String doORCProcessing(org.imixs.workflow.FileData fileData,
List<String> options,
int maxPdfPages)
throws IOException
If the content type is PDF, the following Tika-specific header is added:
X-Tika-PDFOcrStrategy: ocr_only
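A request of this shape can be sketched with the JDK HTTP client (Java 11 or later). This is a minimal sketch, not the TikaService implementation: the class `TikaRequestSketch`, the method `buildRequest` and the local endpoint URL are assumptions made for illustration, and the request is only built, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper for illustration; not part of the Imixs API.
final class TikaRequestSketch {

    /** Builds a PUT request that asks a Tika server for the plain text of a document. */
    static HttpRequest buildRequest(String endpoint, byte[] content, String contentType) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", contentType)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(content));
        // For PDF content, add the Tika-specific OCR header described above.
        if ("application/pdf".equalsIgnoreCase(contentType)) {
            builder.header("X-Tika-PDFOcrStrategy", "ocr_only");
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // Assumed endpoint of a locally running Tika server.
        HttpRequest request = buildRequest(
                "http://localhost:9998/tika", new byte[] {1, 2, 3}, "application/pdf");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

Sending the request would then be a matter of passing it to `HttpClient.send` with a string body handler; the sketch stops before that step so it runs without a Tika server.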
Parameters:
fileData - file content and metadata
Throws:
IOException
Copyright © 2016–2023 Imixs Software Solutions GmbH. All rights reserved.