Package org.dspace.app.mediafilter
Class TikaTextExtractionFilter
java.lang.Object
org.dspace.app.mediafilter.MediaFilter
org.dspace.app.mediafilter.TikaTextExtractionFilter
- All Implemented Interfaces:
FormatFilter
Text Extraction media filter which uses Apache Tika to extract text from a large number of file formats (including
all Microsoft formats, PDF, HTML, Text, etc.). For a more complete list of file formats supported by Tika, see the
Tika documentation: https://tika.apache.org/2.3.0/formats.html
-
Constructor Summary
Constructors
TikaTextExtractionFilter()
Method Summary
Modifier and Type | Method | Description
InputStream | getDestinationStream(Item currentItem, InputStream source, boolean verbose) | Read the source stream and produce the filtered content.
String | getFilteredName(String oldFilename) | Get a filename for a newly created filtered bitstream.
Methods inherited from class org.dspace.app.mediafilter.MediaFilter
postProcessBitstream, preProcessBitstream
-
Constructor Details
-
TikaTextExtractionFilter
public TikaTextExtractionFilter()
-
-
Method Details
-
getFilteredName
Description copied from interface: FormatFilter
Get a filename for a newly created filtered bitstream.
- Parameters:
oldFilename - name of source bitstream
- Returns:
- filename generated by the filter - for example, document.pdf becomes document.pdf.txt
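The naming convention in the example above can be sketched in plain Java. This is a minimal illustration that assumes the filter simply appends a ".txt" suffix, as the document.pdf example suggests; the class name FilteredNameSketch is hypothetical and this is not the DSpace implementation itself.

```java
// Hypothetical standalone sketch; not part of DSpace.
class FilteredNameSketch {

    // Assumption: the filtered name is the original name plus ".txt",
    // matching the documented example document.pdf -> document.pdf.txt.
    static String getFilteredName(String oldFilename) {
        return oldFilename + ".txt";
    }

    public static void main(String[] args) {
        // Prints the derived name for the documented example.
        System.out.println(getFilteredName("document.pdf"));
    }
}
```

Keeping the original extension in the derived name makes it easy to tell which source bitstream an extracted-text bitstream came from.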
-
getBundleName
- Returns:
- name of the bundle in which this filter will place its generated Bitstreams
-
getFormatString
- Returns:
- name of the bitstream format (say "HTML" or "Microsoft Word") returned by this filter. Look in the bitstream format registry or mediafilter.cfg for valid format strings.
-
getDescription
- Returns:
- string describing the newly generated Bitstream; stating how it was produced is a good idea
-
getDestinationStream
public InputStream getDestinationStream(Item currentItem, InputStream source, boolean verbose) throws Exception
Description copied from interface: FormatFilter
Read the source stream and produce the filtered content.
- Parameters:
currentItem - Item
source - input stream
verbose - verbosity flag
- Returns:
- result of filter's transformation as a byte stream.
- Throws:
Exception - if an error occurs
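The stream-in, stream-out contract described above can be illustrated with a small self-contained sketch. It is not the DSpace code: the class DestinationStreamSketch and the helper extractText are hypothetical, the Item parameter is omitted because DSpace types are not available here, and the extraction step is a placeholder that decodes the input as UTF-8, where the real filter would hand the stream to Apache Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone sketch; not part of DSpace.
class DestinationStreamSketch {

    // Placeholder for the extraction step (assumption): the real filter
    // would delegate to Apache Tika instead of decoding the bytes as UTF-8.
    static String extractText(InputStream source) throws IOException {
        return new String(source.readAllBytes(), StandardCharsets.UTF_8);
    }

    // Reads the source stream and returns the extracted text as a byte stream.
    static InputStream getDestinationStream(InputStream source, boolean verbose) throws Exception {
        String text = extractText(source);
        if (verbose) {
            // The verbosity flag only controls diagnostic output in this sketch.
            System.out.println("Extracted " + text.length() + " characters");
        }
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }
}
```

This sketch buffers the whole result in memory, which keeps it short; whether the actual filter buffers or uses a temporary file for large documents is not stated in this page.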
-