Package org.pipecraft.pipes.utils.multi
Class StorageMultiFileReaderConfig.Builder<T,B>
- java.lang.Object
- org.pipecraft.pipes.utils.multi.StorageMultiFileReaderConfig.Builder<T,B>
- Enclosing class:
- StorageMultiFileReaderConfig<T,B>
public static class StorageMultiFileReaderConfig.Builder<T,B> extends Object
Method Detail
-
andFilter
public StorageMultiFileReaderConfig.Builder<T,B> andFilter(Predicate<B> fileFilter)
- Parameters:
fileFilter - The file predicate to AND with existing ones. Given a file metadata object, it determines whether the file should be read. By default, the filter accepts all files.
- Returns:
- this builder
-
andPathFilter
public StorageMultiFileReaderConfig.Builder<T,B> andPathFilter(Predicate<String> filePathFilter)
- Parameters:
filePathFilter - The file predicate to AND with existing ones. Given a file path (relative to the bucket), it determines whether the file should be read. By default, the filter accepts all files.
- Returns:
- this builder
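The AND semantics of the filter methods can be illustrated with plain java.util.function.Predicate composition. This is a minimal sketch and not the builder's actual implementation; the class name PathFilterSketch and the method combine are hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class PathFilterSketch {
    // Starts from an accept-all filter and ANDs every added predicate onto it,
    // mirroring the documented default ("accepts all files").
    static Predicate<String> combine(List<Predicate<String>> filters) {
        Predicate<String> combined = path -> true;
        for (Predicate<String> filter : filters) {
            combined = combined.and(filter);
        }
        return combined;
    }

    public static void main(String[] args) {
        Predicate<String> inLogs = p -> p.startsWith("logs/");
        Predicate<String> isCsv = p -> p.endsWith(".csv");
        Predicate<String> filter = combine(List.of(inLogs, isCsv));

        List<String> accepted = List.of("logs/a.csv", "logs/b.txt", "data/c.csv").stream()
            .filter(filter)
            .collect(Collectors.toList());
        System.out.println(accepted); // prints [logs/a.csv]
    }
}
```

Each additional filter can only narrow the set of accepted files, so the order in which filters are added does not change the result.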
-
andFilter
public StorageMultiFileReaderConfig.Builder<T,B> andFilter(String fileRegex)
- Parameters:
fileRegex - A regex to AND with existing file filters. The regex is applied to the path relative to the bucket.
- Returns:
- this builder
- Throws:
PatternSyntaxException - If the regex is invalid
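The description above does not say whether the regex must match the whole relative path or only part of it. The sketch below uses only java.util.regex to show how the two interpretations differ, and how an invalid pattern surfaces as a PatternSyntaxException. The class and method names are hypothetical, and which semantics the builder applies is an assumption to verify against the library.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexFilterSketch {
    // Full-match semantics: the entire relative path must match the regex.
    static boolean matchesWhole(String regex, String relativePath) {
        return Pattern.compile(regex).matcher(relativePath).matches();
    }

    // Substring semantics: any part of the relative path may match the regex.
    static boolean matchesAnywhere(String regex, String relativePath) {
        return Pattern.compile(regex).matcher(relativePath).find();
    }

    // Pattern.compile throws PatternSyntaxException for an invalid regex.
    static boolean isLegal(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String path = "logs/2024/part-1.csv";
        System.out.println(matchesWhole("\\.csv$", path));    // false: the regex covers only the suffix
        System.out.println(matchesAnywhere("\\.csv$", path)); // true: the suffix is found inside the path
        System.out.println(matchesWhole(".*\\.csv", path));   // true: the regex covers the whole path
        System.out.println(isLegal("(unclosed"));             // false: unclosed group
    }
}
```

A pattern written as `.*\.csv` behaves the same under both interpretations, which makes it a safe choice when the matching mode is unknown.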
-
shard
public StorageMultiFileReaderConfig.Builder<T,B> shard(ShardSpecifier shardSpecifier, boolean isBalanced)
- Parameters:
shardSpecifier - Identifies the shard to read. When this method is used, every file that passes the filters is automatically assigned to a shard. Sharding is based on a hash of the file path when balancing is turned off, and on file sizes when balancing is turned on.
isBalanced - Indicates whether sharding should be based on file sizes, in order to achieve a semi-balanced partition of the data into shards. This option has some caveats to be aware of:
1. It consumes more memory, since all remote file references are held in memory. If you are working with millions of remote files, this may require careful memory settings.
2. In a distributed system, it is the user's responsibility to guarantee that no file is added or changed once the workers start. Failing to do so will result in severe silent problems, such as files handled by multiple instances or files not handled at all.
- Returns:
- this builder
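The second caveat is easier to see with a concrete size-based assignment. The sketch below uses a greedy "largest file to the lightest shard" strategy, which is an assumption chosen for illustration; the source does not describe the library's actual balancing algorithm. The class name, the method filesForShard, and the shardIndex/shardCount parameters are hypothetical stand-ins for whatever ShardSpecifier carries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class BalancedShardingSketch {
    // Returns the paths assigned to shardIndex when files are distributed
    // greedily by size over shardCount shards.
    static List<String> filesForShard(Map<String, Long> sizeByPath, int shardIndex, int shardCount) {
        List<Map.Entry<String, Long>> files = new ArrayList<>(sizeByPath.entrySet());
        // Largest files first; ties broken by path so every worker sees the same order.
        files.sort((a, b) -> {
            int bySize = Long.compare(b.getValue(), a.getValue());
            return bySize != 0 ? bySize : a.getKey().compareTo(b.getKey());
        });

        long[] load = new long[shardCount];
        List<String> mine = new ArrayList<>();
        for (Map.Entry<String, Long> file : files) {
            // Pick the shard with the smallest total size so far (lowest index wins ties).
            int lightest = 0;
            for (int s = 1; s < shardCount; s++) {
                if (load[s] < load[lightest]) {
                    lightest = s;
                }
            }
            load[lightest] += file.getValue();
            if (lightest == shardIndex) {
                mine.add(file.getKey());
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = Map.of("a", 100L, "b", 60L, "c", 50L, "d", 10L);
        System.out.println(filesForShard(sizes, 0, 2)); // prints [a, d]
        System.out.println(filesForShard(sizes, 1, 2)); // prints [b, c]
    }
}
```

Every worker computes the full assignment from the complete file list. If one worker lists the bucket before a file is added and another lists it afterwards, their assignments can disagree, which is how files end up handled twice or not at all.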
-
shard
public StorageMultiFileReaderConfig.Builder<T,B> shard(ShardSpecifier shardSpecifier)
- Parameters:
shardSpecifier - Indicates that automatic data sharding is requested. Every file that passes the filter conditions is automatically assigned to a shard. Sharding is based on a hash of the file path.
- Returns:
- this builder
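Path-hash sharding needs no knowledge of the other files, so each worker can decide on its own whether a file belongs to it. The following is a minimal sketch under the assumption that a shard is identified by an index and a total count; the hash function the library actually uses is not stated in the source, and String.hashCode is used here only because its formula is fixed by the Java API documentation and therefore stable across JVMs.

```java
class HashShardingSketch {
    // Maps a path to a shard in [0, shardCount). floorMod keeps the result
    // non-negative even when hashCode() is negative.
    static int shardOf(String relativePath, int shardCount) {
        return Math.floorMod(relativePath.hashCode(), shardCount);
    }

    // True if the worker reading shardIndex should read this file.
    static boolean belongsTo(String relativePath, int shardIndex, int shardCount) {
        return shardOf(relativePath, shardCount) == shardIndex;
    }

    public static void main(String[] args) {
        String[] paths = {"logs/a.csv", "logs/b.csv", "logs/c.csv"};
        for (String path : paths) {
            System.out.println(path + " -> shard " + shardOf(path, 4));
        }
    }
}
```

Unlike size-based balancing, this assignment does not change for existing files when new files appear, but shards may end up with very different amounts of data.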
-
bucket
public StorageMultiFileReaderConfig.Builder<T,B> bucket(Storage<?,B> storage, String bucketName)
- Parameters:
storage - The storage implementation to use
bucketName - The name of the bucket containing the files to read
- Returns:
- this builder
-
bucket
public StorageMultiFileReaderConfig.Builder<T,B> bucket(Bucket<B> bucket)
- Parameters:
bucket - The bucket containing the files to read
- Returns:
- this builder
-
paths
public StorageMultiFileReaderConfig.Builder<T,B> paths(Collection<String> paths, boolean isRecursive)
- Parameters:
paths - The set of folder paths to read files from. Paths should be relative to the bucket.
isRecursive - Indicates whether files should be fetched from the paths recursively or not.
- Returns:
- this builder
-
paths
public StorageMultiFileReaderConfig.Builder<T,B> paths(Collection<String> paths)
- Parameters:
paths - The set of folder paths to read files from, in a non-recursive manner. Paths should be relative to the bucket.
- Returns:
- this builder
-
paths
public StorageMultiFileReaderConfig.Builder<T,B> paths(String... paths)
- Parameters:
paths - The set of folder paths to read files from, in a non-recursive manner. Paths should be relative to the bucket.
- Returns:
- this builder
-
paths
public StorageMultiFileReaderConfig.Builder<T,B> paths(String path, boolean isRecursive)
- Parameters:
path - The folder path to read files from. The path should be relative to the bucket.
isRecursive - Indicates whether files should be fetched from the path recursively or not.
- Returns:
- this builder
-
paths
public StorageMultiFileReaderConfig.Builder<T,B> paths(String path)
- Parameters:
path - The folder path to read files from, in a non-recursive manner. The path should be relative to the bucket.
- Returns:
- this builder
-
threadNum
public StorageMultiFileReaderConfig.Builder<T,B> threadNum(int threadNum)
- Parameters:
threadNum - The number of threads to use for reading files. This controls file download/streaming parallelism. When used by an async pipe with the download flag set, it also controls the number of threads used to read the local files after they have all been downloaded. By default, the thread count is set to the number of cores in the machine.
- Returns:
- this builder
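As a rough illustration of what the thread count controls, the sketch below reads a list of files on a fixed-size pool. It is not the library's implementation: the class and method names are hypothetical, the "read" is simulated by taking the path length, and deriving the default from Runtime.availableProcessors() is an assumption about how "number of cores" is determined.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ThreadNumSketch {
    // Assumed default: one thread per available core.
    static int defaultThreadNum() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Processes every path on a pool of threadNum threads and sums the results.
    // The per-file work is a placeholder for downloading or streaming a file.
    static int readAll(List<String> paths, int threadNum) {
        ExecutorService pool = Executors.newFixedThreadPool(threadNum);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (String path : paths) {
                Callable<Integer> task = () -> path.length();
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a read task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> paths = List.of("logs/a.csv", "logs/b.csv");
        System.out.println(readAll(paths, defaultThreadNum()));
    }
}
```

For remote storage the work is usually I/O bound, so a thread count above the core count may still help; this depends on the storage backend and network.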
-
downloadFirst
public StorageMultiFileReaderConfig.Builder<T,B> downloadFirst(File tmpFolder)
Indicates that instead of streaming (the default behavior), all files are first downloaded (in an efficient manner) and then read locally.
- Parameters:
tmpFolder - The temporary folder to download to. All files inside this folder will be marked as temporary (meaning that they will be deleted on JVM termination), but for timely disposal of these resources, it is strongly recommended that the caller explicitly delete the folder as soon as the pipeline ends. This approach can become relevant when any of the following applies:
1) There are very few files to fetch
2) The files differ significantly in their sizes
3) Local disk is SSD
- Returns:
- this builder
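The recommended explicit cleanup can be done with java.nio.file alone. The sketch below creates a dedicated temp folder and deletes it in a finally block once the work is done; the class and method names are hypothetical, and the pipeline itself is left as a comment because its construction is outside the scope of this page.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

class TempFolderCleanupSketch {
    // Creates a fresh folder under the system temp directory.
    static Path createTempFolder(String prefix) {
        try {
            return Files.createTempDirectory(prefix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deletes the folder and everything under it. Reverse order visits
    // children before their parent folders, so each delete sees an empty folder.
    static void deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path tmp = createTempFolder("multi-reader-");
        try {
            // Build the config with downloadFirst(tmp.toFile()) and run the pipeline here.
            System.out.println("Downloading to " + tmp);
        } finally {
            deleteRecursively(tmp); // do not wait for JVM exit
        }
    }
}
```

Deleting in a finally block releases the disk space even when the pipeline fails, which matters for long-running JVMs that execute many download-based pipelines.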
-
downloadFirst
public StorageMultiFileReaderConfig.Builder<T,B> downloadFirst()
Indicates that instead of streaming (the default behavior), all files are first downloaded (in an efficient manner) and then read locally.
Note: When calling this method, the system's default temp folder is used, and files inside it are created as temporary (deleted on JVM exit). This is usually fine, but when download operations are expected to run multiple times in the same JVM run, it is recommended to pass a temp folder as a parameter, so that the caller can explicitly delete it when the pipeline ends.
This approach can become relevant when any of the following applies:
1) There are very few files to fetch
2) The files differ significantly in their sizes
3) Local disk is SSD
- Returns:
- this builder
-
fileOrder
public StorageMultiFileReaderConfig.Builder<T,B> fileOrder(Comparator<B> fileOrder)
- Parameters:
fileOrder - Forces the order in which files are read. The order is defined as a comparator on file metadata objects. This applies only to sync reading. By default, the order is lexicographic on the full path.
- Returns:
- this builder
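A comparator on file metadata might look like the sketch below. The FileMeta class is a hypothetical stand-in for the metadata type B, whose real accessors depend on the storage implementation; only java.util.Comparator is used.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class FileOrderSketch {
    // Hypothetical stand-in for a file metadata object.
    static final class FileMeta {
        final String path;
        final long size;

        FileMeta(String path, long size) {
            this.path = path;
            this.size = size;
        }
    }

    // Lexicographic on the full path, matching the documented default.
    static final Comparator<FileMeta> BY_PATH = Comparator.comparing((FileMeta m) -> m.path);

    // Largest files first, with the path as a tie-breaker for a stable order.
    static final Comparator<FileMeta> LARGEST_FIRST =
        Comparator.comparingLong((FileMeta m) -> m.size).reversed().thenComparing(BY_PATH);

    // Returns the paths in the order the given comparator would read them.
    static List<String> readOrder(List<FileMeta> files, Comparator<FileMeta> order) {
        List<FileMeta> sorted = new ArrayList<>(files);
        sorted.sort(order);
        List<String> paths = new ArrayList<>();
        for (FileMeta meta : sorted) {
            paths.add(meta.path);
        }
        return paths;
    }

    public static void main(String[] args) {
        List<FileMeta> files = List.of(
            new FileMeta("b", 10), new FileMeta("a", 10), new FileMeta("c", 99));
        System.out.println(readOrder(files, BY_PATH));       // prints [a, b, c]
        System.out.println(readOrder(files, LARGEST_FIRST)); // prints [c, a, b]
    }
}
```

Adding a tie-breaker such as the path keeps the read order deterministic when several files compare as equal on the primary key.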
-
build
public StorageMultiFileReaderConfig<T,B> build()
- Returns:
- A new multi reader config based on current builder state.