Package org.pipecraft.pipes.utils.multi
Class StorageMultiFileReaderConfig<T,B>
- java.lang.Object
-
- org.pipecraft.pipes.utils.multi.StorageMultiFileReaderConfig<T,B>
-
- Type Parameters:
T - The required data type of the items to read
B - The file metadata type used by the storage implementation
public class StorageMultiFileReaderConfig<T,B> extends Object
A builder + configuration for pipes reading multiple remote files from storage. Suitable for sync/async pipes. Supports:
- Filtering files
- Automatic file sharding (either by file count or file volume)
- Reading from multiple remote paths, recursively or not
- Setting parallelism level
- Choosing whether to stream data or to download and then read locally
- Defining the file reading order (sync pipes only)
- Author:
- Eyal Schneider
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | StorageMultiFileReaderConfig.Builder<T,B> |
-
Method Summary
Modifier and Type | Method | Description
static <T,B> StorageMultiFileReaderConfig.Builder<T,B> | builder(DecoderFactory<T> decoderFactory) | Creates a builder set up with the given file data decoder and additional defaults.
static <T,B> StorageMultiFileReaderConfig.Builder<T,B> | builder(DecoderFactory<T> decoderFactory, FileReadOptions readOptions) | Creates a builder set up with the given file data decoder and additional defaults.
static <T,B> StorageMultiFileReaderConfig.Builder<T,B> | builder(PipeReaderSupplier<T,B> supplier) | Creates a builder set up with the given pipe supplier and additional defaults.
Bucket<B> | getBucket() |
Predicate<B> | getFileFilter() |
Comparator<B> | getFileOrder() |
Collection<String> | getPaths() |
PipeReaderSupplier<T,B> | getPipeSupplier() |
ShardSpecifier | getShardSpecifier() |
int | getThreadNum() |
File | getTmpFolder() |
boolean | isBalancedSharding() | Relevant only when automatic sharding is enabled.
boolean | isDownloadFirst() |
boolean | isRecursivePaths() |
-
-
-
Method Detail
-
builder
public static <T,B> StorageMultiFileReaderConfig.Builder<T,B> builder(PipeReaderSupplier<T,B> supplier)
Creates a builder set up with the given pipe supplier and additional defaults.
- Parameters:
supplier - Produces a pipe ready to read items from a given file. Setting a supplier as a file handler provides maximal control for the caller. The supplier receives the file's input stream plus the file metadata, and builds a pipe reading items from it. As a simple alternative, a DecoderFactory can be passed as a file handler.
- Returns:
- The new builder, initialized with the given supplier
-
builder
public static <T,B> StorageMultiFileReaderConfig.Builder<T,B> builder(DecoderFactory<T> decoderFactory, FileReadOptions readOptions)
Creates a builder set up with the given file data decoder and additional defaults.
- Parameters:
decoderFactory - Defines how items should be decoded from files. As an alternative when more control is needed, a PipeReaderSupplier can be passed instead.
readOptions - Allows defining decompression and read buffer size to apply before decoding.
-
builder
public static <T,B> StorageMultiFileReaderConfig.Builder<T,B> builder(DecoderFactory<T> decoderFactory)
Creates a builder set up with the given file data decoder and additional defaults.
- Parameters:
decoderFactory - Defines how items should be decoded from files. As an alternative when more control is needed, a PipeReaderSupplier can be passed instead.
-
getFileFilter
public Predicate<B> getFileFilter()
- Returns:
- The file predicate resulting from ANDing all the specified filters. Given a file metadata object, determines whether the file should be read. By default, the predicate accepts all files.
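The ANDing behavior described above can be illustrated with plain `java.util.function.Predicate` composition. This is a minimal sketch, not the library's code: `FileMeta` is a hypothetical stand-in for the storage-specific metadata type `B`, and `andAll` and `acceptedPaths` are illustrative helper names.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch only: shows how several file filters collapse into one predicate.
final class FileFilterSketch {

    // Hypothetical metadata type; real storage implementations define their own B.
    static final class FileMeta {
        final String path;
        final long sizeBytes;

        FileMeta(String path, long sizeBytes) {
            this.path = path;
            this.sizeBytes = sizeBytes;
        }
    }

    // ANDs all given filters. With no filters, the result accepts every file.
    static <B> Predicate<B> andAll(List<Predicate<B>> filters) {
        Predicate<B> combined = file -> true;
        for (Predicate<B> filter : filters) {
            combined = combined.and(filter);
        }
        return combined;
    }

    // Returns the paths of the files accepted by the combined filter.
    static List<String> acceptedPaths(List<FileMeta> files, List<Predicate<FileMeta>> filters) {
        Predicate<FileMeta> combined = andAll(filters);
        return files.stream()
                .filter(combined)
                .map(f -> f.path)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<FileMeta> files = List.of(
                new FileMeta("logs/a.csv", 10),
                new FileMeta("logs/b.tmp", 20),
                new FileMeta("logs/c.csv", 0));
        List<Predicate<FileMeta>> filters = List.of(
                f -> f.path.endsWith(".csv"),
                f -> f.sizeBytes > 0);
        System.out.println(acceptedPaths(files, filters)); // prints [logs/a.csv]
    }
}
```

Starting from an accept-all predicate mirrors the documented default: with no filters specified, every file is read.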
-
getShardSpecifier
public ShardSpecifier getShardSpecifier()
- Returns:
- The identity of the shard to read. Null when automatic sharding is off. When enabled, every file passing the filter conditions is automatically assigned a shard, and is read only if that shard matches the specified one. Sharding is based on hashing of the file paths when balancing is turned off, and on file sizes when balancing is turned on.
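The path-hashing mode can be pictured as follows. This is a sketch under stated assumptions: `String.hashCode()` is only a stand-in, since the hash function the library actually uses is not documented here, and `shardOf` and `pathsForShard` are hypothetical helper names.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: assigns files to shards by hashing their paths.
final class HashShardingSketch {

    // Maps a path to a shard index in [0, shardCount).
    // String.hashCode() is an assumption made for illustration.
    static int shardOf(String path, int shardCount) {
        return Math.floorMod(path.hashCode(), shardCount);
    }

    // Keeps only the paths that belong to the given shard.
    static List<String> pathsForShard(List<String> paths, int shardIndex, int shardCount) {
        return paths.stream()
                .filter(p -> shardOf(p, shardCount) == shardIndex)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> paths = List.of("data/part-0", "data/part-1", "data/part-2", "data/part-3");
        int shardCount = 2;
        for (int shard = 0; shard < shardCount; shard++) {
            System.out.println("shard " + shard + " -> " + pathsForShard(paths, shard, shardCount));
        }
    }
}
```

Because the shard depends only on the path, each worker can decide independently whether a file is its own, without seeing the full file listing. The cost is that shards are balanced only roughly, and by file count rather than by bytes.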
-
isBalancedSharding
public boolean isBalancedSharding()
Relevant only when automatic sharding is enabled.
- Returns:
- true when balanced sharding is enabled. When enabled, sharding tries to balance the total byte volume across shards. Otherwise, shards are more or less balanced by file count. By default this parameter is false. This option has some caveats to be aware of:
1. It consumes more memory, since it keeps all remote file references in memory. If you are working with millions of remote files, this may require careful memory settings.
2. When used in a distributed system, it is the responsibility of the user to guarantee that no file is added or changed once the workers start. Failing to do so will result in severe silent problems, such as files handled by multiple instances or files not handled at all.
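Both caveats follow from how size-based balancing has to work: the assignment needs the complete file listing, and every worker must compute it from the same listing. The sketch below uses a greedy "largest file to lightest shard" heuristic, which is an assumption chosen for illustration and not necessarily the algorithm the library implements. `assign` and `bytesPerShard` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: balances total bytes per shard with a greedy heuristic.
final class BalancedShardingSketch {

    // Returns path -> shard index. Needs the sizes of ALL files up front.
    static Map<String, Integer> assign(Map<String, Long> sizesByPath, int shardCount) {
        // Deterministic order (largest first, ties by path), so every worker
        // derives the same assignment from the same listing.
        List<Map.Entry<String, Long>> files = new ArrayList<>(sizesByPath.entrySet());
        files.sort((a, b) -> {
            int bySize = Long.compare(b.getValue(), a.getValue());
            return bySize != 0 ? bySize : a.getKey().compareTo(b.getKey());
        });

        long[] shardBytes = new long[shardCount];
        Map<String, Integer> shardByPath = new TreeMap<>();
        for (Map.Entry<String, Long> file : files) {
            // Pick the shard with the fewest bytes so far (lowest index wins ties).
            int lightest = 0;
            for (int s = 1; s < shardCount; s++) {
                if (shardBytes[s] < shardBytes[lightest]) {
                    lightest = s;
                }
            }
            shardBytes[lightest] += file.getValue();
            shardByPath.put(file.getKey(), lightest);
        }
        return shardByPath;
    }

    // Sums the bytes assigned to each shard.
    static long[] bytesPerShard(Map<String, Long> sizesByPath, Map<String, Integer> shardByPath, int shardCount) {
        long[] totals = new long[shardCount];
        for (Map.Entry<String, Integer> e : shardByPath.entrySet()) {
            totals[e.getValue()] += sizesByPath.get(e.getKey());
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = Map.of("a", 100L, "b", 60L, "c", 50L, "d", 10L);
        Map<String, Integer> shards = assign(sizes, 2);
        System.out.println(Arrays.toString(bytesPerShard(sizes, shards, 2))); // prints [110, 110]
    }
}
```

If one worker listed the files before a new file appeared and another listed them after, their sort orders and running totals would differ, which is how files can end up handled twice or not at all.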
-
getPaths
public Collection<String> getPaths()
- Returns:
- The set of paths of folders to read files from. Should be relative to the bucket.
-
isRecursivePaths
public boolean isRecursivePaths()
- Returns:
- true if and only if the files to fetch should be detected recursively under the given paths. By default set to false.
-
getPipeSupplier
public PipeReaderSupplier<T,B> getPipeSupplier()
- Returns:
- The pipe supplier specifying how items are extracted from files.
Produces a pipe ready to read items from a given file.
Setting a supplier as a file handler provides maximal control for the caller.
The supplier receives the file's input stream plus the file metadata, and builds
a pipe reading items from it. As a simple alternative, a
DecoderFactory can be passed as a file handler.
-
getThreadNum
public int getThreadNum()
- Returns:
- The number of threads to use for reading files. Refers to file download/streaming parallelism, and when used by an async pipe and the download flag is set, refers also to the number of threads used to read the local files after they are all downloaded. By default the thread count is set to the number of cores in the machine.
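As a rough picture of what the thread count controls, the sketch below runs one task per file on a fixed-size pool whose default size is the number of cores visible to the JVM. It uses only `java.util.concurrent`. The class and method names are hypothetical, and "reading" a file is replaced by a trivial stand-in computation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: bounded parallelism over a list of files.
final class ThreadNumSketch {

    // Default parallelism: the number of cores available to the JVM.
    static int defaultThreadNum() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Runs one task per path on threadNum threads; results keep the input order.
    static List<Integer> readAll(List<String> paths, int threadNum) {
        ExecutorService pool = Executors.newFixedThreadPool(threadNum);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String path : paths) {
                // Stand-in for reading a file: report the length of its path.
                Callable<Integer> task = () -> path.length();
                pending.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a read task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll(List.of("a", "bb", "ccc"), defaultThreadNum())); // prints [1, 2, 3]
    }
}
```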
-
isDownloadFirst
public boolean isDownloadFirst()
- Returns:
- When true, indicates that instead of streaming (the default behavior), we first download all files (in an efficient manner), and then read them locally. This approach can become relevant when any of the following applies:
1) There are very few files to fetch
2) The files differ significantly in their sizes
3) Local disk is SSD
-
getTmpFolder
public File getTmpFolder()
- Returns:
- The temp folder to use when download flag is turned on. Null when download flag is turned off.
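The difference between streaming and download-first can be sketched with the standard library alone: copy the whole remote stream into the temp folder, then read the local copy. This is an illustration only. The "remote" stream is simulated with an in-memory byte array, and `downloadThenRead` is a hypothetical helper, not part of the library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Sketch only: download a file fully into a temp folder, then read it locally.
final class DownloadFirstSketch {

    static List<String> downloadThenRead(InputStream remote, Path tmpFolder, String fileName) {
        try {
            Path local = tmpFolder.resolve(fileName);
            Files.copy(remote, local); // download phase: fails if the target already exists
            try {
                return Files.readAllLines(local, StandardCharsets.UTF_8); // local read phase
            } finally {
                Files.deleteIfExists(local); // clean up the temp copy
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path tmpFolder = Paths.get(System.getProperty("java.io.tmpdir"));
        InputStream remote = new ByteArrayInputStream("a\nb\nc\n".getBytes(StandardCharsets.UTF_8));
        String name = "download-first-sketch-" + System.nanoTime() + ".txt";
        System.out.println(downloadThenRead(remote, tmpFolder, name)); // prints [a, b, c]
    }
}
```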
-
getFileOrder
public Comparator<B> getFileOrder()
- Returns:
- The order in which files should be read. The order is defined as a comparator on file metadata objects. This applies to sync readers only. By default, the order is lexicographic on the full path.
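A file order is an ordinary `Comparator` over the metadata type. The sketch below contrasts the documented default (lexicographic on the full path) with a custom largest-first order. `FileMeta` is a hypothetical stand-in for `B`, and `BY_PATH`, `LARGEST_FIRST` and `sortedPaths` are illustrative names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: two possible file orders expressed as comparators.
final class FileOrderSketch {

    // Hypothetical metadata type; real storage implementations define their own B.
    static final class FileMeta {
        final String path;
        final long sizeBytes;

        FileMeta(String path, long sizeBytes) {
            this.path = path;
            this.sizeBytes = sizeBytes;
        }
    }

    // The default order described above: lexicographic on the full path.
    static final Comparator<FileMeta> BY_PATH =
            Comparator.comparing((FileMeta f) -> f.path);

    // A custom alternative: largest files first.
    static final Comparator<FileMeta> LARGEST_FIRST =
            Comparator.comparingLong((FileMeta f) -> f.sizeBytes).reversed();

    // Returns the file paths in the order the given comparator defines.
    static List<String> sortedPaths(List<FileMeta> files, Comparator<FileMeta> order) {
        return files.stream()
                .sorted(order)
                .map(f -> f.path)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<FileMeta> files = List.of(
                new FileMeta("b/2.csv", 5),
                new FileMeta("a/10.csv", 7),
                new FileMeta("a/2.csv", 9));
        System.out.println(sortedPaths(files, BY_PATH));       // prints [a/10.csv, a/2.csv, b/2.csv]
        System.out.println(sortedPaths(files, LARGEST_FIRST)); // prints [a/2.csv, a/10.csv, b/2.csv]
    }
}
```

Note that lexicographic order places `a/10.csv` before `a/2.csv`. If numeric ordering of file names matters, zero-padded names or a custom comparator are the usual remedies.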
-
-