- EmbedBlocker - Class in org.icij.extract.extractor
-
A custom extractor that prevents Tika from parsing any embedded documents.
- EmbedBlocker() - Constructor for class org.icij.extract.extractor.EmbedBlocker
-
- EmbeddedTikaDocument - Class in org.icij.extract.document
-
- EmbeddingHTMLParsingReader - Class in org.icij.extract.parser
-
Example:
final String uuid = UUID.randomUUID().toString();
final String open = uuid + "/";
final String close = "/" + uuid;
context.set(Parser.class, EmptyParser.INSTANCE);
context.set(EmbeddedDocumentExtractor.class, new EmbedLinker(document, tmp, open, close));
reader = new EmbeddingHTMLParsingReader(document, open, close, parser, input, metadata, context);
- EmbeddingHTMLParsingReader(TikaDocument, String, String, Parser, TikaInputStream, Metadata, ParseContext) - Constructor for class org.icij.extract.parser.EmbeddingHTMLParsingReader
-
- EmbedLinker - Class in org.icij.extract.extractor
-
A custom extractor that saves all embeds to temporary files and records the new paths.
- EmbedParser - Class in org.icij.extract.extractor
-
A custom extractor that is an almost exact copy of Tika's default extractor for embedded documents.
- EmbedSpawner - Class in org.icij.extract.extractor
-
- encode(Object) - Method in class org.icij.extract.redis.DocumentEncoder
-
- encode(Object) - Method in class org.icij.extract.redis.ResultEncoder
-
- endDocument() - Method in class org.icij.extract.parser.HTML5Serializer
-
Must be called last.
- endElement(String, String, String) - Method in class org.icij.extract.parser.HTML5Serializer
-
Writes an end tag if the element is an XHTML element and is not an empty
element in HTML 4.01 Strict.
- endPrefixMapping(String) - Method in class org.icij.extract.parser.HTML5Serializer
-
This method does nothing.
- equals(Object) - Method in class org.icij.extract.document.TikaDocument
-
- equals(Object) - Method in class org.icij.task.Option
-
- equals(Object) - Method in class org.icij.task.Options
-
- exclude(String) - Method in class org.icij.extract.Scanner
-
Add a glob pattern for excluding files and directories.
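Example: a minimal sketch of glob-based exclusion using the JDK's own `PathMatcher`. `GlobExcludeSketch` and `isExcluded` are hypothetical names used for illustration; how `Scanner` applies its patterns internally is not shown in this index and may differ.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of org.icij.extract: collects glob patterns
// and reports whether a path matches any of them.
class GlobExcludeSketch {
    private final List<PathMatcher> excludes = new ArrayList<>();

    // Register a glob pattern such as "**/*.tmp".
    void exclude(String pattern) {
        excludes.add(FileSystems.getDefault().getPathMatcher("glob:" + pattern));
    }

    // True if any registered pattern matches the whole path.
    boolean isExcluded(Path path) {
        for (PathMatcher matcher : excludes) {
            if (matcher.matches(path)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        GlobExcludeSketch sketch = new GlobExcludeSketch();
        sketch.exclude("**/*.tmp");
        System.out.println(sketch.isExcluded(Paths.get("docs/cache/a.tmp")));
        System.out.println(sketch.isExcluded(Paths.get("docs/report.pdf")));
    }
}
```

In JDK glob syntax, `**` crosses directory boundaries while `*` does not, so a pattern has to be written against the full path being tested.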
- execute(Runnable) - Method in class org.icij.concurrent.BlockingThreadPoolExecutor
-
Before calling super's version of this method, a permit is acquired in order to queue the task for execution.
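Example: the permit-before-queue idea described above can be sketched with a `Semaphore` around a standard `ThreadPoolExecutor`. `PermitExecutorSketch` is a hypothetical class written for illustration, and it assumes one permit per pool thread; it is not the implementation of `BlockingThreadPoolExecutor`.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the library's class: execute() blocks the caller
// until a permit is free, so at most `threads` tasks are queued or running.
class PermitExecutorSketch extends ThreadPoolExecutor {
    private final Semaphore permits;

    PermitExecutorSketch(int threads) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        this.permits = new Semaphore(threads);
    }

    @Override
    public void execute(Runnable task) {
        try {
            permits.acquire(); // blocks while all permits are taken
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RejectedExecutionException("Interrupted while waiting for a permit", e);
        }
        try {
            super.execute(task);
        } catch (RuntimeException e) {
            permits.release(); // the task was never queued, so return the permit
            throw e;
        }
    }

    @Override
    protected void afterExecute(Runnable task, Throwable thrown) {
        super.afterExecute(task, thrown);
        permits.release(); // a finished task frees its permit
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Blocking the submitting thread gives natural back-pressure: a producer cannot fill the queue faster than the workers drain it.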
- executor - Variable in class org.icij.concurrent.ExecutorProxy
-
The executor proxied by the implementing class.
- ExecutorProxy - Class in org.icij.concurrent
-
A class of traits used by implementing classes that proxy an executor.
- ExecutorProxy(ExecutorService) - Constructor for class org.icij.concurrent.ExecutorProxy
-
Instantiate a proxy for the given executor.
- extract(TikaDocument) - Method in class org.icij.extract.extractor.Extractor
-
This method will wrap the given TikaDocument in a TikaInputStream and return a Reader which can be used to initiate extraction on demand.
- extract(TikaDocument, Spewer) - Method in class org.icij.extract.extractor.Extractor
-
Extract and spew content from a document.
- extract(TikaDocument, Spewer, Reporter) - Method in class org.icij.extract.extractor.Extractor
-
Extract and spew content from a document.
- extract(TikaDocument, TikaInputStream) - Method in class org.icij.extract.extractor.Extractor
-
Create a pull-parser from the given TikaInputStream.
- ExtractionStatus - Enum in org.icij.extract.extractor
-
Status for the extraction result of a file.
- extractor - Variable in class org.icij.extract.extractor.DocumentConsumer
-
- Extractor - Class in org.icij.extract.extractor
-
A reusable class that sets up Tika parsers based on runtime options.
- Extractor() - Constructor for class org.icij.extract.extractor.Extractor
-
Create a new extractor, which will OCR images by default if Tesseract is available locally, extract inline images from PDF files and OCR them, and use PDFBox's non-sequential PDF parser.
- Extractor.EmbedHandling - Enum in org.icij.extract.extractor
-
- Extractor.OutputFormat - Enum in org.icij.extract.extractor
-
- save(TikaDocument, Report) - Method in class org.icij.extract.report.Reporter
-
Save the extraction report for the given document.
- save(TikaDocument, ExtractionStatus, Exception) - Method in class org.icij.extract.report.Reporter
-
Save the extraction status and optional exception for the given document.
- save(TikaDocument, ExtractionStatus) - Method in class org.icij.extract.report.Reporter
-
Save the extraction status for the given document.
- scan(Path) - Method in class org.icij.extract.Scanner
-
Queue a scanning job.
- scan(Path[]) - Method in class org.icij.extract.Scanner
-
Submit all of the given paths to the scanner for execution, returning a list of Future objects representing those tasks.
- scan(String[]) - Method in class org.icij.extract.Scanner
-
- Scanner - Class in org.icij.extract
-
Scanner for scanning the directory tree starting at a given path.
- Scanner(DocumentFactory, BlockingQueue<TikaDocument>) - Constructor for class org.icij.extract.Scanner
-
- Scanner(DocumentFactory, BlockingQueue<TikaDocument>, SealableLatch) - Constructor for class org.icij.extract.Scanner
-
- Scanner(DocumentFactory, BlockingQueue<TikaDocument>, SealableLatch, Notifiable) - Constructor for class org.icij.extract.Scanner
-
Creates a Scanner that sends all results straight to the underlying BlockingQueue on a single thread.
- ScannerVisitor - Class in org.icij.extract
-
- ScannerVisitor(Path, BlockingQueue<TikaDocument>, DocumentFactory, Options<String>) - Constructor for class org.icij.extract.ScannerVisitor
-
Instantiate a new task for scanning the given path.
- seal() - Method in class org.icij.concurrent.BooleanSealableLatch
-
- seal() - Method in interface org.icij.concurrent.SealableLatch
-
- SealableLatch - Interface in org.icij.concurrent
-
- setDigestAlgorithms(CommonsDigester.DigestAlgorithm...) - Method in class org.icij.extract.extractor.Extractor
-
- setDocumentLocator(Locator) - Method in class org.icij.extract.parser.HTML5Serializer
-
This method does nothing.
- setEmbedHandling(Extractor.EmbedHandling) - Method in class org.icij.extract.extractor.Extractor
-
Set the embed handling mode.
- setEmbedOutputPath(Path) - Method in class org.icij.extract.extractor.Extractor
-
Set the output directory path for embed files.
- setForeignId(String) - Method in class org.icij.extract.document.TikaDocument
-
- setLatch(SealableLatch) - Method in class org.icij.extract.queue.DocumentQueueDrainer
-
If given, the latch should be used to signal that the queue should be polled.
- setMaxDepth(int) - Method in class org.icij.extract.Scanner
-
Set the maximum depth to recurse when scanning.
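Example: a rough illustration of what a depth limit means when walking a tree, using the JDK's `Files.walk(Path, int)`. `MaxDepthSketch` and `listFiles` are hypothetical names; whether `Scanner` counts depth in exactly this way is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch, not part of org.icij.extract.
class MaxDepthSketch {
    // List regular files at most maxDepth levels below root
    // (depth 1 means direct children of root only).
    static List<Path> listFiles(Path root, int maxDepth) throws IOException {
        try (Stream<Path> paths = Files.walk(root, maxDepth)) {
            return paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("scan");
        Files.createFile(root.resolve("top.txt"));
        Path nested = Files.createDirectories(root.resolve("a").resolve("b"));
        Files.createFile(nested.resolve("deep.txt"));

        System.out.println(listFiles(root, 1).size()); // only top.txt is within reach
        System.out.println(listFiles(root, 3).size()); // deep.txt sits at depth 3
    }
}
```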
- setMaximumPoolSize(int) - Method in class org.icij.concurrent.BlockingThreadPoolExecutor
-
Increases or decreases the maximum pool size by adjusting the number of permits accordingly.
- setOcrLanguage(String) - Method in class org.icij.extract.extractor.Extractor
-
Set the languages used by Tesseract.
- setOcrTimeout(Duration) - Method in class org.icij.extract.extractor.Extractor
-
Instructs Tesseract to attempt OCR for no longer than the given duration.
- setOutputDirectory(Path) - Method in class org.icij.spewer.FileSpewer
-
- setOutputEncoding(Charset) - Method in class org.icij.spewer.Spewer
-
- setOutputFormat(Extractor.OutputFormat) - Method in class org.icij.extract.extractor.Extractor
-
Set the output format.
- setPollTimeout(Duration) - Method in class org.icij.extract.queue.DocumentQueueDrainer
-
Set the amount of time to wait until an item becomes available.
- setReader(Reader) - Method in class org.icij.extract.document.TikaDocument
-
- setReader(TikaDocument.ReaderGenerator) - Method in class org.icij.extract.document.TikaDocument
-
- setReporter(Reporter) - Method in class org.icij.extract.extractor.DocumentConsumer
-
Set the reporter.
- setTags(Map<String, String>) - Method in class org.icij.spewer.Spewer
-
- setVerifyHostname(String) - Method in class org.icij.spewer.http.PinnedHttpClientBuilder
-
- shouldParseEmbedded(Metadata) - Method in class org.icij.extract.extractor.EmbedBlocker
-
- shouldParseEmbedded(Metadata) - Method in class org.icij.extract.extractor.EmbedLinker
-
Always returns true.
- shutdown() - Method in class org.icij.concurrent.ExecutorProxy
-
Shuts down the executor.
- shutdown() - Method in interface org.icij.concurrent.Shutdownable
-
- Shutdownable - Interface in org.icij.concurrent
-
- shutdownNow() - Method in class org.icij.concurrent.ExecutorProxy
-
Shut down the executor immediately, halting running tasks and discarding waiting tasks.
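Example: assuming the proxy delegates to `ExecutorService.shutdownNow()`, the sketch below shows the JDK behaviour that description relies on. Running tasks are interrupted rather than forcibly killed, and tasks still waiting in the queue are returned to the caller without being run. `ShutdownNowSketch` is a hypothetical name.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not part of org.icij.concurrent.
class ShutdownNowSketch {
    // Starts one long-running task, queues `waiting` more behind it, then calls
    // shutdownNow(). Returns how many queued tasks were discarded unrun.
    static int discardedAfterShutdownNow(int waiting) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);

        executor.execute(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000); // stands in for long-running work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shutdownNow() interrupts the worker
            }
        });
        started.await(); // make sure the first task is running, not queued

        for (int i = 0; i < waiting; i++) {
            executor.execute(() -> { });
        }

        List<Runnable> discarded = executor.shutdownNow();
        executor.awaitTermination(5, TimeUnit.SECONDS);
        return discarded.size();
    }
}
```

A task that ignores interruption keeps running after `shutdownNow()`, so long-running work should check for and honour the interrupt.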
- signal() - Method in class org.icij.concurrent.BooleanSealableLatch
-
- signal() - Method in interface org.icij.concurrent.SealableLatch
-
- skip(long) - Method in class org.icij.extract.io.TokenReplacingReader
-
- skip(TikaDocument) - Method in class org.icij.extract.report.Reporter
-
Check whether a path should be skipped.
- skippedEntity(String) - Method in class org.icij.extract.parser.HTML5Serializer
-
This method does nothing.
- spewer - Variable in class org.icij.extract.extractor.DocumentConsumer
-
- Spewer - Class in org.icij.spewer
-
Base class for Spewer subclasses that write text output from a ParsingReader to specific endpoints.
- Spewer(FieldNames) - Constructor for class org.icij.spewer.Spewer
-
- startDocument() - Method in class org.icij.extract.parser.HTML5Serializer
-
Must be called first.
- startElement(String, String, String, Attributes) - Method in class org.icij.extract.parser.HTML5Serializer
-
Writes a start tag if the element is an XHTML element.
- startPrefixMapping(String, String) - Method in class org.icij.extract.parser.HTML5Serializer
-
This method does nothing.
- StringOptionParser - Class in org.icij.task
-
- StringOptionParser(Option<String>) - Constructor for class org.icij.task.StringOptionParser
-
- SystemFileMatcher - Class in org.icij.extract.io.file
-
Create a PathMatcher that matches operating-system-generated files.
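Example: one way such a matcher could look, implementing the JDK `PathMatcher` interface over a fixed set of file names. The class name and the names in the set are illustrative assumptions; the list actually used by `SystemFileMatcher` is not given in this index.

```java
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: the names below are illustrative assumptions, not the
// list used by org.icij.extract.io.file.SystemFileMatcher.
class SystemFileMatcherSketch implements PathMatcher {
    private static final Set<String> NAMES =
        new HashSet<>(Arrays.asList(".DS_Store", "Thumbs.db", "desktop.ini"));

    @Override
    public boolean matches(Path path) {
        Path name = path.getFileName(); // null for a root path
        return name != null && NAMES.contains(name.toString());
    }
}
```

Because it is a plain `PathMatcher`, a matcher like this can be combined with glob-based matchers wherever a list of exclusions is checked.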
- SystemFileMatcher() - Constructor for class org.icij.extract.io.file.SystemFileMatcher
-
- value() - Method in class org.icij.task.Option
-
- value(Function<V, R>) - Method in class org.icij.task.Option
-
- valueIfPresent(String) - Method in class org.icij.task.Options
-
- valueOf(String) - Static method in enum org.icij.extract.extractor.ExtractionStatus
-
Returns the enum constant of this type with the specified name.
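Example: `valueOf(String)` is case-sensitive and throws `IllegalArgumentException` for a name that matches no constant, so option strings are often normalized first. The sketch below uses a hypothetical `Status` enum, not the library's constants, and a hypothetical `parse` helper.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch, not part of org.icij.extract.
class EnumLookupSketch {
    // Illustrative enum; these are not the constants of ExtractionStatus.
    enum Status { SUCCESS, FAILURE }

    // Normalize the input, then turn valueOf's exception for unknown
    // names into an empty Optional.
    static Optional<Status> parse(String name) {
        try {
            return Optional.of(Status.valueOf(name.trim().toUpperCase(Locale.ROOT)));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }
}
```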
- valueOf(String) - Static method in enum org.icij.extract.extractor.Extractor.EmbedHandling
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.icij.extract.extractor.Extractor.OutputFormat
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.icij.extract.IndexType
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.icij.extract.OutputType
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.icij.extract.queue.DocumentQueueType
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.icij.extract.report.ReportMapType
-
Returns the enum constant of this type with the specified name.
- values() - Static method in enum org.icij.extract.extractor.ExtractionStatus
-
Returns an array containing the constants of this enum type, in
the order they are declared.
- values() - Static method in enum org.icij.extract.extractor.Extractor.EmbedHandling
-
Returns an array containing the constants of this enum type, in
the order they are declared.
- values() - Static method in enum org.icij.extract.extractor.Extractor.OutputFormat
-
Returns an array containing the constants of this enum type, in
the order they are declared.
- values() - Static method in enum org.icij.extract.IndexType
-
Returns an array containing the constants of this enum type, in
the order they are declared.
- values() - Static method in enum org.icij.extract.OutputType
-
Returns an array containing the constants of this enum type, in
the order they are declared.
- values() - Static method in enum org.icij.extract.queue.DocumentQueueType
-
Returns an array containing the constants of this enum type, in
the order they are declared.
- values() - Static method in enum org.icij.extract.report.ReportMapType
-
Returns an array containing the constants of this enum type, in
the order they are declared.
- values - Variable in class org.icij.task.Option
-
- values() - Method in class org.icij.task.Option
-
- values(Function<V, R>) - Method in class org.icij.task.Option
-
- verify(String, SSLSession) - Method in class org.icij.spewer.http.PinnedHttpClientBuilder.BodgeHostnameVerifier
-
- visitFile(Path, BasicFileAttributes) - Method in class org.icij.extract.ScannerVisitor
-
- visitFileFailed(Path, IOException) - Method in class org.icij.extract.ScannerVisitor
-