Class TikaTextExtractionFilter

java.lang.Object
org.dspace.app.mediafilter.MediaFilter
org.dspace.app.mediafilter.TikaTextExtractionFilter
All Implemented Interfaces:
FormatFilter

public class TikaTextExtractionFilter extends MediaFilter
Text extraction media filter that uses Apache Tika to extract text from a large number of file formats (including Microsoft Office formats, PDF, HTML, plain text, etc.). For a more complete list of file formats supported by Tika, see the Tika documentation: https://tika.apache.org/2.3.0/formats.html
  • Constructor Details

    • TikaTextExtractionFilter

      public TikaTextExtractionFilter()
  • Method Details

    • getFilteredName

      public String getFilteredName(String oldFilename)
      Description copied from interface: FormatFilter
      Get a filename for a newly created filtered bitstream
      Parameters:
      oldFilename - name of source bitstream
      Returns:
      filename generated by the filter - for example, document.pdf becomes document.pdf.txt
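      The naming rule in the example above can be sketched as follows. This is an illustration only, not the DSpace source: the class FilteredNameSketch and its filteredName helper are hypothetical, and the sketch assumes the filter simply appends ".txt" to the source name, as the documented example suggests.

```java
// Illustrative sketch only; not the DSpace implementation.
final class FilteredNameSketch {

    // Hypothetical helper: appends ".txt" to the source bitstream name,
    // mirroring the documented example (document.pdf -> document.pdf.txt).
    static String filteredName(String oldFilename) {
        return oldFilename + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(filteredName("document.pdf"));
    }
}
```

      Keeping the original extension in the generated name makes it easy to see which source bitstream a text file was derived from.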
    • getBundleName

      public String getBundleName()
      Returns:
      name of the bundle into which this filter will place its generated Bitstreams
    • getFormatString

      public String getFormatString()
      Returns:
      name of the bitstream format (for example, "HTML" or "Microsoft Word") returned by this filter. Look in the bitstream format registry or mediafilter.cfg for valid format strings.
    • getDescription

      public String getDescription()
      Returns:
      string describing the newly generated Bitstream; a note on how it was produced is a good idea
    • getDestinationStream

      public InputStream getDestinationStream(Item currentItem, InputStream source, boolean verbose) throws Exception
      Description copied from interface: FormatFilter
      Read the source stream and produce the filtered content.
      Parameters:
      currentItem - Item
      source - input stream
      verbose - verbosity flag
      Returns:
      result of filter's transformation as a byte stream.
      Throws:
      Exception - if an error occurs
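
      The stream-in, stream-out contract of getDestinationStream can be sketched with the JDK alone. This is a minimal illustration under stated assumptions, not the DSpace implementation: the class DestinationStreamSketch, the method toTextStream, and the pluggable extractor function are all hypothetical stand-ins, and the sketch omits the Item parameter. In the real filter, the text extraction step is performed by Apache Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Illustrative sketch only; not the DSpace implementation.
final class DestinationStreamSketch {

    // Hypothetical helper: reads the whole source stream, hands the bytes to
    // an extractor (a stand-in for the Tika extraction step), and returns the
    // extracted text as a UTF-8 byte stream.
    static InputStream toTextStream(InputStream source,
                                    Function<byte[], String> extractor,
                                    boolean verbose) throws IOException {
        byte[] raw = source.readAllBytes(); // requires Java 9 or later
        String text = extractor.apply(raw);
        if (verbose) {
            System.out.println("Extracted " + text.length() + " characters");
        }
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        InputStream source = new ByteArrayInputStream(
                "  plain text body  ".getBytes(StandardCharsets.UTF_8));
        // Stand-in extractor: decode as UTF-8 and trim surrounding whitespace.
        InputStream result = toTextStream(
                source, b -> new String(b, StandardCharsets.UTF_8).trim(), true);
        System.out.println(new String(result.readAllBytes(), StandardCharsets.UTF_8));
    }
}
```

      Buffering the whole source in memory keeps the sketch short; for large bitstreams, a real filter would more plausibly stream or spool the extracted text instead.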