Class TikaService


  • public class TikaService
    extends Object
    This service extracts the textual information from the document attachments of a workitem and stores the data in the $file attribute 'text'.

    For text extraction, the service sends the content of a document to an instance of an Apache Tika server via the REST API. The environment variable OCR_STRATEGY defines how PDF files are scanned. Possible values are:

    • AUTO - The best OCR strategy is chosen by the Tika Server itself. This is the default setting.
    • NO_OCR - OCR processing is disabled and text is extracted only from PDF files that contain raw text. If a PDF file does not contain raw text data, no text will be extracted!
    • OCR_ONLY - PDF files are always OCR scanned, even if the PDF file contains text data.
    • OCR_AND_TEXT_EXTRACTION - Both OCR processing and raw text extraction are performed. Note: This may result in duplicated text, so this mode is not recommended.

    The service expects a valid REST API endpoint of a Tika server instance, defined by the environment parameter 'TIKA_SERVICE_ENDPONT'.

    The environment parameter 'TIKA_SERVICE_MODE' must be set to 'auto' to enable the service.

    See also the project: https://github.com/imixs/imixs-docker/tree/master/tika
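
    The handling of OCR_STRATEGY described above can be sketched as follows. This is a minimal illustration and not the actual TikaService code: the class OcrStrategySketch, the enum OcrStrategy, and the method resolve are hypothetical names, and case-insensitive matching of the value is an assumption.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of TikaService: resolves the OCR strategy
// from a map of environment variables.
class OcrStrategySketch {

    // The four values listed in the class description.
    enum OcrStrategy { AUTO, NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION }

    static OcrStrategy resolve(Map<String, String> env) {
        String raw = env.get("OCR_STRATEGY");
        if (raw == null || raw.trim().isEmpty()) {
            return OcrStrategy.AUTO; // documented default setting
        }
        // Assumption: the value is matched case-insensitively.
        // Unknown values raise an IllegalArgumentException.
        return OcrStrategy.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Reads the strategy from the real process environment.
        System.out.println(resolve(System.getenv()));
    }
}
```

    Passing the environment as a map keeps the sketch easy to exercise without changing the process environment.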

    Version:
    1.1
    Author:
    rsoika
    • Constructor Detail

      • TikaService

        public TikaService()
    • Method Detail

      • extractText

        public void extractText​(org.imixs.workflow.ItemCollection workitem,
                                org.imixs.workflow.ItemCollection snapshot)
                         throws org.imixs.workflow.exceptions.PluginException
        Extracts the textual information from document attachments.

        The method extracts the textual content of each new document of a given workitem. For PDF files with textual content, the method calls the method 'extractTextFromPDF' using the PDFBox API. In other cases, the method sends the content via the REST API to the Tika server for OCR processing.

        The result is stored in the fileData attribute 'text'.

        Parameters:
        workitem - workitem with file attachments
        snapshot - snapshot workitem
        Throws:
        org.imixs.workflow.exceptions.PluginException
      • extractText

        public void extractText​(org.imixs.workflow.ItemCollection workitem,
                                org.imixs.workflow.ItemCollection snapshot,
                                String _ocrStategy,
                                List<String> options,
                                String filePatternRegex,
                                int maxPdfPages)
                         throws org.imixs.workflow.exceptions.PluginException
        Extracts the textual information from document attachments.

        The method extracts the textual content for each new file attachment of a given workitem. The text information is stored in the $file attribute 'text'.

        For PDF files with textual content, the method calls the method 'extractTextFromPDF' using the PDFBox API. In other cases, the method sends the content via the REST API to the Tika server for OCR processing.

        The method also extracts files already stored in a snapshot workitem. In this case the method tests if the $file attribute 'text' already exists.

        The optional parameter 'filePatternRegex' can be provided to extract text only from attachments matching the given file pattern (regex).

        The optional parameter 'maxPdfPages' can be provided to limit PDF documents to a maximum number of pages. This avoids blocking the Tika service with overly large documents. For example, only the first 5 pages can be scanned.

        Parameters:
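        workitem - workitem with file attachments
        snapshot - snapshot workitem with files already stored
        _ocrStategy - OCR strategy: AUTO, NO_OCR, OCR_ONLY, or OCR_AND_TEXT_EXTRACTION
        options - optional Tika header params
        filePatternRegex - optional regular expression to match files
        maxPdfPages - optional maximum number of PDF pages to process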
        Throws:
        org.imixs.workflow.exceptions.PluginException
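
        The effect of the file pattern and the page limit can be illustrated with a small sketch. It does not reproduce the TikaService implementation: the class AttachmentFilterSketch and its methods are hypothetical, and it assumes that an empty pattern accepts every file, that the pattern must match the whole file name, and that a page limit of 0 or less means no limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not the TikaService implementation.
class AttachmentFilterSketch {

    // Returns the file names accepted by the optional pattern.
    // Assumption: a null or empty pattern accepts every file, and the
    // pattern must match the whole file name.
    static List<String> select(List<String> fileNames, String filePatternRegex) {
        if (filePatternRegex == null || filePatternRegex.isEmpty()) {
            return new ArrayList<>(fileNames);
        }
        Pattern pattern = Pattern.compile(filePatternRegex);
        List<String> result = new ArrayList<>();
        for (String name : fileNames) {
            if (pattern.matcher(name).matches()) {
                result.add(name);
            }
        }
        return result;
    }

    // Number of pages to process for a PDF with totalPages pages.
    // Assumption: maxPdfPages <= 0 means "no limit".
    static int pagesToScan(int totalPages, int maxPdfPages) {
        return maxPdfPages > 0 ? Math.min(totalPages, maxPdfPages) : totalPages;
    }
}
```

        With the pattern "(?i).+\\.(pdf|jpg)", for example, only PDF and JPG attachments would be selected, regardless of the case of the file extension.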
      • doORCProcessing

        public String doORCProcessing​(org.imixs.workflow.FileData fileData,
                                      List<String> options,
                                      int maxPdfPages)
                               throws IOException
        This method sends the content of a document to the Tika REST API for OCR processing.

        If the content type is PDF, the following Tika-specific header is added:

        X-Tika-PDFOcrStrategy: ocr_only

        Parameters:
        fileData - file content and metadata
        options - optional Tika header params
        maxPdfPages - maximum number of PDF pages to process
        Returns:
        text content
        Throws:
        IOException
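
        A request of the kind described above could be built as in the following sketch. It uses the Java 11 java.net.http client, which is an assumption and not necessarily what doORCProcessing uses internally. The class TikaRequestSketch and its methods are hypothetical, and the endpoint URL is passed in by the caller.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch, not the TikaService implementation.
class TikaRequestSketch {

    // Builds a PUT request carrying the document content.
    static HttpRequest buildRequest(String endpoint, byte[] content, String contentType) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", contentType)
                .header("Accept", "text/plain") // ask for plain text output
                .PUT(HttpRequest.BodyPublishers.ofByteArray(content));
        // For PDF content, add the Tika-specific OCR strategy header.
        if ("application/pdf".equalsIgnoreCase(contentType)) {
            builder.header("X-Tika-PDFOcrStrategy", "ocr_only");
        }
        return builder.build();
    }

    // Sends the request and returns the response body as text.
    static String send(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

        Against a locally running Tika server, the endpoint would typically look like http://localhost:9998/tika; the exact URL depends on the deployment.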