public class Extractor extends Object
| Modifier and Type | Class and Description |
|---|---|
| `static class` | `Extractor.EmbedHandling` |
| `static class` | `Extractor.OutputFormat` |

| Constructor and Description |
|---|
| `Extractor()` |
| `Extractor(DocumentFactory factory)` Create a new extractor which, by default, OCRs images if Tesseract is available locally, extracts inline images from PDF files and OCRs them, and uses PDFBox's non-sequential PDF parser. |

| Modifier and Type | Method and Description |
|---|---|
| `Extractor` | `configure(Options<String> options)` |
| `void` | `disableOcr()` Disable OCR. |
| `TikaDocument` | `extract(Path path)` Create a pull-parser from the given TikaInputStream. |
| `void` | `extract(Path path, Spewer spewer)` Extract and spew content from a document. |
| `void` | `extract(Path path, Spewer spewer, Reporter reporter)` Extract and spew content from a document. |
| `Extractor.EmbedHandling` | `getEmbedHandling()` Get the embed handling mode. |
| `Path` | `getEmbedOutputPath()` Get the output directory path for embed files. |
| `Extractor.OutputFormat` | `getOutputFormat()` Get the extraction output format. |
| `void` | `setDigestAlgorithm(String digestAlgorithm)` |
| `void` | `setDigester(org.apache.tika.parser.DigestingParser.Digester digester)` |
| `void` | `setEmbedHandling(Extractor.EmbedHandling embedHandling)` Set the embed handling mode. |
| `void` | `setEmbedOutputPath(Path embedOutput)` Set the output directory path for embed files. |
| `void` | `setOcrLanguage(String ocrLanguage)` Set the languages used by Tesseract. |
| `void` | `setOcrTimeout(java.time.Duration duration)` Instructs Tesseract to attempt OCR for no longer than the given duration. |
| `void` | `setOutputFormat(Extractor.OutputFormat outputFormat)` Set the output format. |

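
The summary describes `extract(Path path)` as creating a pull-parser, meaning the caller decides when to read parsed text rather than having it pushed to a `Spewer`. This page does not show how `TikaDocument` exposes that text, so the sketch below assumes a `java.io.Reader`-style handle and uses a `StringReader` as a stand-in. The names `PullReadSketch` and `drain` are hypothetical and not part of the library.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class PullReadSketch {
    // Pull characters from a Reader in fixed-size chunks until it is exhausted.
    static String drain(Reader reader, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[chunkSize];
        try {
            int read;
            // read() returns -1 once the underlying source has no more text.
            while ((read = reader.read(buffer)) != -1) {
                out.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: StringReader stands in for the Reader a parsed document might expose.
        try (Reader reader = new StringReader("text produced lazily by the parser")) {
            System.out.println(drain(reader, 8));
        }
    }
}
```

With a pull-style reader, text is only produced as the loop asks for it, so a caller can stop early or process a large document chunk by chunk.
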
public Extractor(DocumentFactory factory)

public Extractor()

public void setOutputFormat(Extractor.OutputFormat outputFormat)
outputFormat - the output format

public Extractor.OutputFormat getOutputFormat()

public void setEmbedHandling(Extractor.EmbedHandling embedHandling)
embedHandling - the embed handling mode

public Extractor.EmbedHandling getEmbedHandling()

public void setEmbedOutputPath(Path embedOutput)
embedOutput - the embed output path

public Path getEmbedOutputPath()

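
`setEmbedOutputPath` takes a directory path for embed files. This page does not say whether the library creates that directory itself, so a caller may prefer to create it up front. The following is a minimal sketch using only the JDK. The class `EmbedOutputDir` and its `prepare` method are hypothetical helpers, and the call to `setEmbedOutputPath` appears only as a comment because the library is not on the classpath here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class EmbedOutputDir {
    // Create the directory (and any missing parents) and return its absolute path.
    static Path prepare(Path dir) {
        try {
            return Files.createDirectories(dir).toAbsolutePath();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("extract-demo");
        Path embeds = prepare(base.resolve("embeds"));
        System.out.println(embeds);
        // With the library available, the prepared path would then be handed over:
        //   extractor.setEmbedOutputPath(embeds);
    }
}
```
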
public void setOcrLanguage(String ocrLanguage)
ocrLanguage - the languages to use, for example "eng" or "ita+spa"

public void setOcrTimeout(java.time.Duration duration)
duration - the duration before timeout

public void setDigestAlgorithm(String digestAlgorithm)

public void setDigester(org.apache.tika.parser.DigestingParser.Digester digester)

public void disableOcr()

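
The OCR setters take a language string such as "eng" or "ita+spa" and a `java.time.Duration`. As a small sketch of preparing those two values with the JDK alone, the helper below joins language codes with `+`, matching the format of the example above. `OcrSettings` and `joinLanguages` are hypothetical names, the two-minute timeout is an arbitrary illustration, and the setter calls are shown as comments since the library is not on the classpath.

```java
import java.time.Duration;
import java.util.List;

final class OcrSettings {
    // Join Tesseract language codes with '+', the form used in the "ita+spa" example.
    static String joinLanguages(List<String> codes) {
        if (codes.isEmpty()) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        String languages = joinLanguages(List.of("ita", "spa"));
        Duration timeout = Duration.ofMinutes(2); // arbitrary value for illustration
        System.out.println(languages + " with timeout " + timeout);
        // With the library available, the values would be passed on as:
        //   extractor.setOcrLanguage(languages);
        //   extractor.setOcrTimeout(timeout);
        // or OCR could be switched off entirely:
        //   extractor.disableOcr();
    }
}
```
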
public void extract(Path path, Spewer spewer) throws IOException
Extract and spew content from a document. Like extract(Path), this method creates a TikaInputStream from the path of the given document.
path - document to extract from
spewer - endpoint to write to
IOException - if there was an error reading or writing the document

public void extract(Path path, Spewer spewer, Reporter reporter)
Extract and spew content from a document. This method is the same as extract(Path, Spewer), with the exception that the document will be skipped if the reporter returns false for a call to Reporter.skip(Path).
If the document is not skipped, then the result of the extraction is passed to the reporter in a call to Reporter.save(Path, ExtractionStatus, Exception).
path - document to extract from
spewer - endpoint to write to
reporter - used to check whether the document should be skipped and save extraction status

public TikaDocument extract(Path path) throws IOException
Create a pull-parser from the given TikaInputStream.
path - the stream to extract from
IOException

Copyright © 2019 The International Consortium of Investigative Journalists. All rights reserved.