Package org.pipecraft.pipes.utils.multi
Class LocalMultiFileReaderConfig<T>
- java.lang.Object
-
- org.pipecraft.pipes.utils.multi.LocalMultiFileReaderConfig<T>
-
- Type Parameters:
T - The data type of the items to read
public class LocalMultiFileReaderConfig<T> extends Object
A builder and configuration for pipes reading multiple local files. Suitable for both sync and async pipes. Supports:
- Filtering files
- Automatic file sharding (either by file count or by file volume)
- Reading from multiple paths, recursively or not
- Setting the parallelism level (async pipes only)
- Defining the file reading order (sync pipes only)
- Author:
- Eyal Schneider
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | LocalMultiFileReaderConfig.Builder<T> |
-
Method Summary
Modifier and Type | Method | Description
static <T> LocalMultiFileReaderConfig.Builder<T> | builder(DecoderFactory<T> decoderFactory) | Creates a builder set up with the given file data decoder and additional defaults.
static <T> LocalMultiFileReaderConfig.Builder<T> | builder(DecoderFactory<T> decoderFactory, FileReadOptions readOptions) | Creates a builder set up with the given file data decoder and additional defaults.
static <T> LocalMultiFileReaderConfig.Builder<T> | builder(PipeReaderSupplier<T,File> supplier) | Creates a builder set up with the given pipe supplier and additional defaults.
Predicate<File> | getFileFilter() |
Comparator<File> | getFileOrder() |
Collection<String> | getPaths() |
PipeReaderSupplier<T,File> | getPipeSupplier() |
ShardSpecifier | getShardSpecifier() |
int | getThreadNum() |
boolean | isBalancedSharding() | Relevant only when automatic sharding is enabled.
boolean | isRecursivePaths() |
-
-
-
Method Detail
-
builder
public static <T> LocalMultiFileReaderConfig.Builder<T> builder(PipeReaderSupplier<T,File> supplier)
Creates a builder set up with the given pipe supplier and additional defaults.
- Parameters:
supplier - Produces a pipe ready to read items from a given file. Setting a supplier as a file handler gives the caller maximal control. The supplier receives the file's input stream plus the file metadata, and builds a pipe reading items from it. As a simpler alternative, a DecoderFactory can be passed as a file handler.
- Returns:
- The new builder, initialized with the given supplier
-
builder
public static <T> LocalMultiFileReaderConfig.Builder<T> builder(DecoderFactory<T> decoderFactory, FileReadOptions readOptions)
Creates a builder set up with the given file data decoder and additional defaults.
- Parameters:
decoderFactory - Defines how items should be decoded from files. When more control is needed, a PipeReaderSupplier can be passed instead.
readOptions - Defines the decompression and read buffer size to apply before decoding.
-
builder
public static <T> LocalMultiFileReaderConfig.Builder<T> builder(DecoderFactory<T> decoderFactory)
Creates a builder set up with the given file data decoder and additional defaults.
- Parameters:
decoderFactory - Defines how items should be decoded from files. When more control is needed, a PipeReaderSupplier can be passed instead.
-
getFileFilter
public Predicate<File> getFileFilter()
- Returns:
- The file predicate resulting from ANDing all the specified filters. Given a file object, determines whether the file should be read. By default, the predicate accepts all files.
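The ANDing of filters described above can be illustrated with plain JDK predicates. This is a minimal sketch, not the library's implementation: the class FileFilterSketch and the method andAll are hypothetical names, and only java.util.function.Predicate is assumed.

```java
import java.io.File;
import java.util.function.Predicate;

// Hypothetical helper, for illustration only (not part of pipecraft).
final class FileFilterSketch {

    // Combines any number of filters with logical AND.
    // With no filters at all, the result accepts every file,
    // which mirrors the documented default.
    @SafeVarargs
    static Predicate<File> andAll(Predicate<File>... filters) {
        Predicate<File> combined = file -> true;
        for (Predicate<File> filter : filters) {
            combined = combined.and(filter);
        }
        return combined;
    }
}
```

For example, andAll(f -> f.getName().endsWith(".csv"), f -> !f.getName().startsWith(".")) accepts a file only when both conditions hold.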
-
getShardSpecifier
public ShardSpecifier getShardSpecifier()
- Returns:
- The identity of the shard to read. Null when automatic sharding is off. When enabled, each file passing the filter conditions is automatically assigned a shard, and the assigned shard is checked against the specified one. Sharding is based on hashing of the file paths when balancing is turned off, and on file sizes when balancing is turned on.
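To make the unbalanced (hash-based) mode concrete, the sketch below assigns a shard from a hash of the file path. The hash function used by the library is not specified here, so String.hashCode is an assumption, as are the names HashShardingSketch, shardOf and belongsToShard. The shard is modeled as a plain index and count rather than a ShardSpecifier.

```java
import java.io.File;

// Hypothetical helper, for illustration only (not part of pipecraft).
final class HashShardingSketch {

    // Maps a path to a shard index in [0, shardCount).
    // floorMod keeps the result non-negative even for negative hash codes.
    static int shardOf(String path, int shardCount) {
        return Math.floorMod(path.hashCode(), shardCount);
    }

    // True when the file's assigned shard is the one this reader handles.
    static boolean belongsToShard(File file, int shardIndex, int shardCount) {
        return shardOf(file.getPath(), shardCount) == shardIndex;
    }
}
```

Because the assignment depends only on the path, every worker computes the same shard for the same file without coordination, and each file matches exactly one shard index.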
-
isBalancedSharding
public boolean isBalancedSharding()
Relevant only when automatic sharding is enabled.
- Returns:
- true when balanced sharding is enabled. When enabled, sharding tries to balance the total byte volume across shards. Otherwise, shards are roughly balanced by file count. By default this parameter is false. This option has some caveats to be aware of:
1. It consumes more memory, since it keeps all file references in memory. If you are working with millions of files, this may require careful memory settings.
2. When used in a distributed system, it is the user's responsibility to guarantee that no file is added or changed once the workers start. Failing to do so results in severe silent problems, such as files handled by multiple instances or files not handled at all.
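The balancing algorithm is not specified in this page. As one plausible illustration, the sketch below uses a greedy largest-first heuristic: files are sorted by size, and each is placed on the currently lightest shard. The class BalancedShardingSketch and the method assign are hypothetical names, and the heuristic itself is an assumption rather than the library's method.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, for illustration only (not part of pipecraft).
final class BalancedShardingSketch {

    // Returns, for each file index, the shard it is assigned to.
    // All sizes must be known up front, which is why a balanced mode
    // needs the full file listing in memory.
    static int[] assign(long[] sizes, int shardCount) {
        Integer[] order = new Integer[sizes.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Largest files first; the sort is stable, so ties keep index order.
        Arrays.sort(order, Comparator.comparingLong((Integer i) -> sizes[i]).reversed());

        long[] load = new long[shardCount];
        int[] shardOf = new int[sizes.length];
        for (int idx : order) {
            int lightest = 0;
            for (int s = 1; s < shardCount; s++) {
                if (load[s] < load[lightest]) {
                    lightest = s;
                }
            }
            shardOf[idx] = lightest;
            load[lightest] += sizes[idx];
        }
        return shardOf;
    }
}
```

The result depends on the complete set of files and their sizes. If a file is added or resized between two workers' listings, the workers can compute different assignments, which is the failure mode the second caveat warns about.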
-
getPaths
public Collection<String> getPaths()
- Returns:
- The set of paths (full local paths) of folders to read files from.
-
isRecursivePaths
public boolean isRecursivePaths()
- Returns:
- true if and only if the files to fetch should be detected recursively under the given paths. By default set to false.
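The difference between recursive and non-recursive detection can be sketched with java.nio.file.Files.walk, where a maximum depth of 1 restricts the listing to the folder's direct children. This is an illustration under assumptions: the class RecursiveListingSketch and the method listFiles are hypothetical, and the library may discover files differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only (not part of pipecraft).
final class RecursiveListingSketch {

    // Lists regular files under root. When recursive is false,
    // only files directly inside root are returned.
    static List<Path> listFiles(Path root, boolean recursive) {
        int maxDepth = recursive ? Integer.MAX_VALUE : 1;
        try (Stream<Path> entries = Files.walk(root, maxDepth)) {
            return entries
                    .filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```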
-
getPipeSupplier
public PipeReaderSupplier<T,File> getPipeSupplier()
- Returns:
- The pipe supplier specifying how items are extracted from files.
Produces a pipe ready to read items from a given file.
Setting a supplier as a file handler provides maximal control for the caller.
The supplier receives the file's input stream plus the file object, and builds
a pipe reading items from it. As a simple alternative, a
DecoderFactory can be passed as a file handler.
-
getThreadNum
public int getThreadNum()
- Returns:
- The number of threads to use for reading files when used by the async pipe. For sync pipes this configuration has no effect. By default, the number of machine cores is used.
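The role of the thread count in async reading can be sketched with a fixed-size thread pool that handles one file per task. This is an assumption-laden illustration, not how the async pipe is implemented: ParallelReadSketch, defaultThreadNum and readAll are hypothetical names, and the per-file reader is modeled as a plain Function.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, for illustration only (not part of pipecraft).
final class ParallelReadSketch {

    // Mirrors the documented default: one thread per available core.
    static int defaultThreadNum() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Applies the reader to every file using at most threadNum threads.
    // Results are returned in the order of the input list.
    static <R> List<R> readAll(List<File> files, Function<File, R> reader, int threadNum) {
        ExecutorService pool = Executors.newFixedThreadPool(threadNum);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (File file : files) {
                Callable<R> task = () -> reader.apply(file);
                pending.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```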
-
getFileOrder
public Comparator<File> getFileOrder()
- Returns:
- The order in which files should be read. The order is defined as a comparator on file objects. This applies only to sync reading. By default, the order is lexicographic on the full path.
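A comparator matching the documented default (lexicographic on the full path) might look like the sketch below. It assumes that "full path" corresponds to File.getAbsolutePath; the class FileOrderSketch and its members are hypothetical names used only for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, for illustration only (not part of pipecraft).
final class FileOrderSketch {

    // Plain string comparison of absolute paths.
    static final Comparator<File> BY_FULL_PATH =
            Comparator.comparing(File::getAbsolutePath);

    // Returns a sorted copy, leaving the input list untouched.
    static List<File> inReadingOrder(List<File> files, Comparator<File> order) {
        List<File> sorted = new ArrayList<>(files);
        sorted.sort(order);
        return sorted;
    }
}
```

Note that string order is not directory order: with this comparator, /data/a.txt sorts before /data/a/z.txt because '.' precedes '/' in character order. Any other Comparator<File>, for instance one based on File.lastModified, could be substituted in the same way.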
-
-