Class StorageMultiFileReaderConfig.Builder<T,​B>

    • Method Detail

      • andFilter

        public StorageMultiFileReaderConfig.Builder<T,​B> andFilter​(Predicate<B> fileFilter)
        Parameters:
        fileFilter - The file predicate to AND with existing ones. Given a file metadata object, determines whether the file should be read. By default, the filter accepts all files.
        Returns:
        this builder
      • andPathFilter

        public StorageMultiFileReaderConfig.Builder<T,​B> andPathFilter​(Predicate<String> filePathFilter)
        Parameters:
        filePathFilter - The file path predicate to AND with existing ones. Given a file path (relative to the bucket), determines whether the file should be read. By default, the filter accepts all files.
        Returns:
        this builder
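        The AND semantics described for andFilter and andPathFilter can be illustrated with plain java.util.function.Predicate composition. This is only a sketch under assumptions, not the builder's implementation: the class PathFilterSketch, the helpers andAll and select, and the sample paths are hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class PathFilterSketch {
    // Starts from an accept-all filter (the documented default) and ANDs
    // every additional predicate onto it, one after the other.
    @SafeVarargs
    static Predicate<String> andAll(Predicate<String>... filters) {
        Predicate<String> combined = path -> true;
        for (Predicate<String> f : filters) {
            combined = combined.and(f);
        }
        return combined;
    }

    // Keeps only the paths accepted by the combined filter.
    static List<String> select(List<String> paths, Predicate<String> filter) {
        return paths.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> paths = List.of("logs/2020/a.csv", "logs/2020/b.tmp", "other/c.csv");
        Predicate<String> filter = andAll(p -> p.startsWith("logs/"), p -> p.endsWith(".csv"));
        System.out.println(select(paths, filter)); // prints [logs/2020/a.csv]
    }
}
```

        A file is read only if every predicate added so far accepts it, so each additional call can only narrow the selection.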
      • shard

        public StorageMultiFileReaderConfig.Builder<T,​B> shard​(ShardSpecifier shardSpecifier,
                                                                     boolean isBalanced)
        Parameters:
        shardSpecifier - Identifies the shard to read. When using this method, all files passing the filters are automatically assigned a shard. Sharding is based on hashing of their paths when balancing is turned off, and based on file sizes when balancing is turned on.
        isBalanced - Indicates whether the sharding should be based on file sizes, in order to achieve a semi-balanced partition of the data into shards. This option has two caveats: 1. It consumes more memory, since all remote file references are held in memory. If you are working with millions of remote files, this may require careful memory settings. 2. When used in a distributed system, it is the user's responsibility to guarantee that no file is added or changed once the workers start. Failing to do so results in severe silent problems, such as files handled by multiple instances or files not handled at all.
        Returns:
        this builder
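        To make the size-based option and its second caveat more concrete, here is one possible way to balance files across shards: a greedy assignment that visits files from largest to smallest and always picks the lightest shard. The source does not state which algorithm the library uses, so this is an assumption; BalancedShardSketch and assign are hypothetical names. The point of the sketch is that the result depends on the complete file listing, so every worker must see exactly the same files to derive the same assignment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BalancedShardSketch {
    // Greedy balancing: largest files first, each one goes to the shard with
    // the smallest total size so far. Ties are broken by path so that the
    // outcome is deterministic for a given listing.
    static Map<String, Integer> assign(Map<String, Long> sizesByPath, int shardCount) {
        List<Map.Entry<String, Long>> files = new ArrayList<>(sizesByPath.entrySet());
        files.sort((a, b) -> {
            int bySize = Long.compare(b.getValue(), a.getValue()); // larger first
            return bySize != 0 ? bySize : a.getKey().compareTo(b.getKey());
        });
        long[] load = new long[shardCount];
        Map<String, Integer> shardByPath = new LinkedHashMap<>();
        for (Map.Entry<String, Long> file : files) {
            int lightest = 0;
            for (int i = 1; i < shardCount; i++) {
                if (load[i] < load[lightest]) {
                    lightest = i;
                }
            }
            load[lightest] += file.getValue();
            shardByPath.put(file.getKey(), lightest);
        }
        return shardByPath;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = Map.of("a", 100L, "b", 60L, "c", 50L, "d", 10L);
        System.out.println(assign(sizes, 2)); // prints {a=0, b=1, c=1, d=0}
    }
}
```

        Both shards end up with 110 bytes here. If a file were added after one worker had computed its assignment, another worker could compute a different mapping, which is the kind of silent overlap or omission the caveat warns about.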
      • shard

        public StorageMultiFileReaderConfig.Builder<T,​B> shard​(ShardSpecifier shardSpecifier)
        Parameters:
        shardSpecifier - Indicates that automatic data sharding is requested. All files passing the filter conditions are automatically assigned a shard. Sharding is based on hashing of their paths.
        Returns:
        this builder
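        Path-hash sharding can be pictured as mapping each path to a shard index with a hash taken modulo the shard count. The sketch below assumes String.hashCode as the hash function, which the source does not confirm; HashShardSketch, shardOf and filesForShard are hypothetical names, and shardIndex/shardCount stand in for whatever the ShardSpecifier carries.

```java
import java.util.List;
import java.util.stream.Collectors;

public class HashShardSketch {
    // Maps a path to a shard index in [0, shardCount).
    // floorMod keeps the result non-negative even for negative hash codes.
    static int shardOf(String path, int shardCount) {
        return Math.floorMod(path.hashCode(), shardCount);
    }

    // Returns the files that a worker responsible for shardIndex would read.
    static List<String> filesForShard(List<String> paths, int shardIndex, int shardCount) {
        return paths.stream()
                .filter(p -> shardOf(p, shardCount) == shardIndex)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> paths = List.of("in/a.csv", "in/b.csv", "in/c.csv", "in/d.csv", "in/e.csv");
        int shardCount = 3;
        int total = 0;
        for (int i = 0; i < shardCount; i++) {
            List<String> mine = filesForShard(paths, i, shardCount);
            total += mine.size();
            System.out.println("shard " + i + ": " + mine);
        }
        // Every file lands in exactly one shard.
        System.out.println(total == paths.size()); // prints true
    }
}
```

        Because a file's shard depends only on its own path, workers need no shared state, but shard sizes may be uneven when file sizes vary.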
      • bucket

        public StorageMultiFileReaderConfig.Builder<T,​B> bucket​(Storage<?,​B> storage,
                                                                      String bucketName)
        Parameters:
        storage - The storage implementation to use
        bucketName - The name of the bucket containing the files to read
        Returns:
        this builder
      • paths

        public StorageMultiFileReaderConfig.Builder<T,​B> paths​(Collection<String> paths,
                                                                     boolean isRecursive)
        Parameters:
        paths - The set of folder paths to read files from. Paths should be relative to the bucket.
        isRecursive - Indicates whether files should be fetched from the paths recursively or not.
        Returns:
        this builder
      • paths

        public StorageMultiFileReaderConfig.Builder<T,​B> paths​(String... paths)
        Parameters:
        paths - The set of folder paths to read files from, in a non-recursive manner. Paths should be relative to the bucket.
        Returns:
        this builder
      • paths

        public StorageMultiFileReaderConfig.Builder<T,​B> paths​(String path,
                                                                     boolean isRecursive)
        Parameters:
        path - The folder path to read files from. Path should be relative to the bucket.
        isRecursive - Indicates whether files should be fetched from the path recursively or not.
        Returns:
        this builder
      • paths

        public StorageMultiFileReaderConfig.Builder<T,​B> paths​(String path)
        Parameters:
        path - The folder path to read files from, in a non-recursive manner. Path should be relative to the bucket.
        Returns:
        this builder
      • threadNum

        public StorageMultiFileReaderConfig.Builder<T,​B> threadNum​(int threadNum)
        Parameters:
        threadNum - The number of threads to use for reading files. This controls file download/streaming parallelism. When used by an async pipe with the download flag set, it also controls the number of threads used to read the local files after they have all been downloaded. By default, the thread count is set to the number of cores in the machine.
        Returns:
        this builder
      • downloadFirst

        public StorageMultiFileReaderConfig.Builder<T,​B> downloadFirst​(File tmpFolder)
        Indicates that instead of streaming (the default behavior), we first download all files (in an efficient manner), and then read them locally.
        Parameters:
        tmpFolder - The temporary folder to download to. All files inside this folder will be marked as temporary (meaning that they will be deleted on JVM termination), but for timely disposal of these resources, it is strongly recommended that the caller explicitly delete the folder as soon as the pipeline ends. This approach can become relevant when any of the following applies: 1) There are very few files to fetch. 2) The files differ significantly in size. 3) The local disk is an SSD.
        Returns:
        this builder
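        The recommended explicit cleanup of the temporary folder could look roughly like the sketch below, which uses only java.nio.file. TempFolderSketch, newTempFolder and deleteRecursively are hypothetical helper names, and the file written inside the try block is merely a stand-in for a pipeline that was configured with downloadFirst(tmp.toFile()).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class TempFolderSketch {
    // Creates a fresh temporary folder dedicated to a single pipeline run.
    static Path newTempFolder(String prefix) {
        try {
            return Files.createTempDirectory(prefix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deletes the folder and everything below it. Reverse path order
    // guarantees that children are removed before their parent folders.
    static void deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = newTempFolder("download-first-");
        try {
            // Stand-in for the pipeline run that downloads into tmp.
            Files.write(tmp.resolve("part-0.csv"), "a,b\n".getBytes(StandardCharsets.UTF_8));
        } finally {
            deleteRecursively(tmp); // do not wait for JVM termination
        }
        System.out.println(Files.exists(tmp)); // prints false
    }
}
```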
      • downloadFirst

        public StorageMultiFileReaderConfig.Builder<T,​B> downloadFirst()
        Indicates that instead of streaming (the default behavior), we first download all files (in an efficient manner), and then read them locally. Note: When calling this method, the system's default temp folder is used, and files inside it are created as temporary (deleted on JVM exit). This is usually fine, but when download operations are expected to run multiple times in the same JVM run, it is recommended to pass a temp folder as a parameter, so that the caller can explicitly delete it when the pipeline ends. This approach can become relevant when any of the following applies: 1) There are very few files to fetch. 2) The files differ significantly in size. 3) The local disk is an SSD.
        Returns:
        this builder
      • fileOrder

        public StorageMultiFileReaderConfig.Builder<T,​B> fileOrder​(Comparator<B> fileOrder)
        Parameters:
        fileOrder - Forces an order in which files should be read. The order is defined as a comparator on file metadata objects. This applies only to sync reading. By default, the order is lexicographic on the full path.
        Returns:
        this builder
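        A comparator of the kind fileOrder expects can be built with the standard Comparator factory methods. In the sketch below, FileMeta is a hypothetical stand-in for the metadata type B (the real type depends on the storage implementation), and its getPath/getSize accessors are assumptions. BY_PATH mirrors the documented default order, while LARGEST_FIRST shows a custom alternative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class FileOrderSketch {
    // Hypothetical stand-in for the file metadata type B.
    static final class FileMeta {
        private final String path;
        private final long size;

        FileMeta(String path, long size) {
            this.path = path;
            this.size = size;
        }

        String getPath() { return path; }
        long getSize() { return size; }
    }

    // Lexicographic order on the full path, as in the documented default.
    static final Comparator<FileMeta> BY_PATH = Comparator.comparing(FileMeta::getPath);

    // Custom order: largest files first, ties broken by path.
    static final Comparator<FileMeta> LARGEST_FIRST =
            Comparator.<FileMeta>comparingLong(FileMeta::getSize)
                    .reversed()
                    .thenComparing(FileMeta::getPath);

    // Returns the paths in the order in which a sync reader would visit them.
    static List<String> sortedPaths(List<FileMeta> files, Comparator<FileMeta> order) {
        List<FileMeta> copy = new ArrayList<>(files);
        copy.sort(order);
        return copy.stream().map(FileMeta::getPath).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<FileMeta> files = List.of(
                new FileMeta("b.csv", 10), new FileMeta("a.csv", 30), new FileMeta("c.csv", 30));
        System.out.println(sortedPaths(files, BY_PATH));       // prints [a.csv, b.csv, c.csv]
        System.out.println(sortedPaths(files, LARGEST_FIRST)); // prints [a.csv, c.csv, b.csv]
    }
}
```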