public class Extractor extends Object
| Modifier and Type | Class and Description |
|---|---|
| static class | Extractor.EmbedHandling |
| static class | Extractor.OutputFormat |
| Constructor and Description |
|---|
| Extractor()<br>Create a new extractor, which will OCR images by default if Tesseract is available locally, extract inline images from PDF files and OCR them, and use PDFBox's non-sequential PDF parser. |
| Modifier and Type | Method and Description |
|---|---|
| Extractor | configure(Options<String> options) |
| void | disableOcr()<br>Disable OCR. |
| Reader | extract(TikaDocument tikaDocument)<br>This method will wrap the given TikaDocument in a TikaInputStream and return a Reader which can be used to initiate extraction on demand. |
| void | extract(TikaDocument tikaDocument, Spewer spewer)<br>Extract and spew content from a document. |
| void | extract(TikaDocument tikaDocument, Spewer spewer, Reporter reporter)<br>Extract and spew content from a document. |
| protected Reader | extract(TikaDocument tikaDocument, org.apache.tika.io.TikaInputStream input)<br>Create a pull-parser from the given TikaInputStream. |
| Extractor.EmbedHandling | getEmbedHandling()<br>Get the embed handling mode. |
| Path | getEmbedOutputPath()<br>Get the output directory path for embed files. |
| Extractor.OutputFormat | getOutputFormat()<br>Get the extraction output format. |
| void | setDigestAlgorithms(org.apache.tika.parser.utils.CommonsDigester.DigestAlgorithm... digestAlgorithms) |
| void | setEmbedHandling(Extractor.EmbedHandling embedHandling)<br>Set the embed handling mode. |
| void | setEmbedOutputPath(Path embedOutput)<br>Set the output directory path for embed files. |
| void | setOcrLanguage(String ocrLanguage)<br>Set the languages used by Tesseract. |
| void | setOcrTimeout(java.time.Duration duration)<br>Instructs Tesseract to attempt OCR for no longer than the given duration. |
| void | setOutputFormat(Extractor.OutputFormat outputFormat)<br>Set the output format. |
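
The summary for extract(TikaDocument) says the returned Reader initiates extraction on demand, so the caller drives the work by reading. The following is a minimal sketch of that consuming side, using only java.io. The StringReader is a stand-in for the Reader the extractor would return (an assumption, so the example runs without the library), and readAll is a hypothetical helper, not part of this API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderDrain {

    // Hypothetical helper: pulls characters from the reader in fixed-size
    // chunks until it reports end of stream, then returns everything read.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            text.append(buffer, 0, read);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for: Reader reader = extractor.extract(tikaDocument);
        try (Reader reader = new StringReader("extracted text")) {
            System.out.println(readAll(reader));
        }
    }
}
```

Closing the Reader with try-with-resources is a reasonable default here, since the real Reader is documented as wrapping a TikaInputStream.
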
public Extractor()

public void setOutputFormat(Extractor.OutputFormat outputFormat)
Parameters:
outputFormat - the output format

public Extractor.OutputFormat getOutputFormat()

public void setEmbedHandling(Extractor.EmbedHandling embedHandling)
Parameters:
embedHandling - the embed handling mode

public Extractor.EmbedHandling getEmbedHandling()

public void setEmbedOutputPath(Path embedOutput)
Parameters:
embedOutput - the embed output path

public Path getEmbedOutputPath()

public void setOcrLanguage(String ocrLanguage)
Parameters:
ocrLanguage - the languages to use, for example "eng" or "ita+spa"

public void setOcrTimeout(java.time.Duration duration)
Parameters:
duration - the duration before timeout

public void setDigestAlgorithms(org.apache.tika.parser.utils.CommonsDigester.DigestAlgorithm... digestAlgorithms)

public void disableOcr()
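
setOcrLanguage takes a single string such as "eng" or "ita+spa", and setOcrTimeout takes a java.time.Duration. The sketch below shows one way to build both values with the JDK alone. joinLanguages is a hypothetical helper, the two-minute timeout is an illustrative value rather than a recommendation, and the calls on the extractor are shown only as comments so the example compiles without the library.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

public class OcrSettings {

    // Hypothetical helper: joins Tesseract language codes with '+',
    // the separator used in the documented "ita+spa" example.
    static String joinLanguages(List<String> codes) {
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        String languages = joinLanguages(Arrays.asList("ita", "spa"));
        Duration timeout = Duration.ofMinutes(2); // illustrative value

        // With the library on the classpath, these would be passed as:
        //   extractor.setOcrLanguage(languages);
        //   extractor.setOcrTimeout(timeout);
        System.out.println(languages + " " + timeout.getSeconds() + "s"); // prints "ita+spa 120s"
    }
}
```
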
public Reader extract(TikaDocument tikaDocument) throws IOException
This method will wrap the given TikaDocument in a TikaInputStream and return a Reader which can be used to initiate extraction on demand.
Internally, this method uses TikaInputStream.get(java.io.InputStream, org.apache.tika.io.TemporaryResources), which ensures that the resource name and content length metadata properties are set automatically.
Parameters:
tikaDocument - the file to extract from
Returns:
a Reader that can be used to read extracted text on demand
Throws:
IOException

public void extract(TikaDocument tikaDocument, Spewer spewer) throws IOException
Extract and spew content from a document. As with extract(TikaDocument), this method creates a TikaInputStream from the path of the given document.
Parameters:
tikaDocument - document to extract from
spewer - endpoint to write to
Throws:
IOException - if there was an error reading or writing the document

public void extract(TikaDocument tikaDocument, Spewer spewer, Reporter reporter)
Extract and spew content from a document. This method is the same as extract(TikaDocument, Spewer) with the exception that the document will be skipped if the reporter returns false for a call to Reporter.skip(TikaDocument).
If the document is not skipped, then the result of the extraction is passed to the reporter in a call to Reporter.save(TikaDocument, ExtractionStatus, Exception).
Parameters:
tikaDocument - document to extract from
spewer - endpoint to write to
reporter - used to check whether the document should be skipped and save extraction status

protected Reader extract(TikaDocument tikaDocument, org.apache.tika.io.TikaInputStream input) throws IOException
Create a pull-parser from the given TikaInputStream.
Parameters:
input - the stream to extract from
tikaDocument - file that is being extracted from
Throws:
IOException

Copyright © 2018 The International Consortium of Investigative Journalists. All rights reserved.