Package org.dspace.app.mediafilter
Class TikaTextExtractionFilter
java.lang.Object
  org.dspace.app.mediafilter.MediaFilter
    org.dspace.app.mediafilter.TikaTextExtractionFilter
- All Implemented Interfaces:
FormatFilter
public class TikaTextExtractionFilter extends MediaFilter
Text extraction media filter that uses Apache Tika to extract text from a large number of file formats (including Microsoft Office formats, PDF, HTML, plain text, etc.). For a more complete list of the file formats supported by Tika, see the Tika documentation: https://tika.apache.org/2.3.0/formats.html
-
Constructor Summary
Constructor                   Description
TikaTextExtractionFilter()
-
Method Summary
Modifier and Type   Method
                        Description
String              getBundleName()
String              getDescription()
InputStream         getDestinationStream(Item currentItem, InputStream source, boolean verbose)
                        Read the source stream and produce the filtered content.
String              getFilteredName(String oldFilename)
                        Get a filename for a newly created filtered bitstream.
String              getFormatString()
-
Methods inherited from class org.dspace.app.mediafilter.MediaFilter
postProcessBitstream, preProcessBitstream
-
Method Detail
-
getFilteredName
public String getFilteredName(String oldFilename)
Description copied from interface: FormatFilter
Get a filename for a newly created filtered bitstream.
- Parameters:
oldFilename - name of source bitstream
- Returns:
- filename generated by the filter - for example, document.pdf becomes document.pdf.txt
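The naming rule in the example above can be sketched as follows. This is an illustration only: FilteredNameSketch and filteredName are hypothetical names, and the sketch assumes the filter simply appends ".txt" to the source name, which matches the documented example but is not confirmed here for every input.

```java
// Hypothetical stand-in for the naming rule; this is not DSpace code.
class FilteredNameSketch {

    // Assumption: the filtered name is the original name plus ".txt".
    static String filteredName(String oldFilename) {
        return oldFilename + ".txt";
    }

    public static void main(String[] args) {
        // Mirrors the documented example: document.pdf -> document.pdf.txt
        System.out.println(filteredName("document.pdf"));
    }
}
```

Keeping the original extension in the new name makes it easy to tell which source bitstream a text bitstream was derived from.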
-
getBundleName
public String getBundleName()
- Returns:
- name of the bundle in which this filter stores its generated Bitstreams
-
getFormatString
public String getFormatString()
- Returns:
- name of the bitstream format (for example, "HTML" or "Microsoft Word") returned by this filter. Look in the bitstream format registry or mediafilter.cfg for valid format strings.
-
getDescription
public String getDescription()
- Returns:
- string describing the newly generated Bitstream; stating how it was produced is a good idea
-
getDestinationStream
public InputStream getDestinationStream(Item currentItem, InputStream source, boolean verbose) throws Exception
Description copied from interface: FormatFilter
Read the source stream and produce the filtered content.
- Parameters:
currentItem - Item
source - input stream
verbose - verbosity flag
- Returns:
- result of the filter's transformation as a byte stream.
- Throws:
Exception - if an error occurs
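The stream-in, stream-out shape of this method can be sketched with the JDK alone. This is a stand-in under stated assumptions, not the filter's implementation: it does not call Tika or DSpace, the class StreamContractSketch and its methods are hypothetical, the Item and verbose arguments are omitted, and collapsing whitespace merely stands in for real text extraction.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in; this is neither DSpace nor Tika code.
class StreamContractSketch {

    // Consume the source stream fully and return a new stream holding the
    // transformed content. Whitespace collapsing is a placeholder for the
    // real extraction step.
    static InputStream filter(InputStream source) throws IOException {
        String text = new String(source.readAllBytes(), StandardCharsets.UTF_8);
        String normalized = text.trim().replaceAll("\\s+", " ");
        return new ByteArrayInputStream(normalized.getBytes(StandardCharsets.UTF_8));
    }

    // Convenience wrapper: run the filter on a String and read the result back.
    static String filterToString(String input) {
        try (InputStream in = new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8));
             InputStream out = filter(in)) {
            return new String(out.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(filterToString("Hello   world\n"));
    }
}
```

The caller owns the returned stream and is expected to read and close it, which is why the wrapper uses try-with-resources for both streams.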
-