Class StorageMultiFileReaderConfig<T,B>

  • Type Parameters:
    T - The required data type of the items to read
    B - The file metadata type used by the storage implementation

    public class StorageMultiFileReaderConfig<T,B>
    extends Object
    Builder and configuration for pipes that read multiple remote files from storage. Suitable for both sync and async pipes. Supports:
      - Filtering files
      - Automatic file sharding (either by file count or by file volume)
      - Reading from multiple remote paths, recursively or not
      - Setting the parallelism level
      - Choosing whether to stream data or to download it and then read it locally
      - Defining the file reading order (sync pipes only)
    Author:
    Eyal Schneider
    • Method Detail

      • builder

        public static <T,B> StorageMultiFileReaderConfig.Builder<T,B> builder(PipeReaderSupplier<T,B> supplier)
        Creates a builder set up with the given pipe supplier and additional defaults.
        Parameters:
        supplier - Produces a pipe ready to read items from a given file. Setting a supplier as a file handler provides maximal control for the caller. The supplier receives the file's input stream plus the file metadata, and builds a pipe reading items from it. As a simple alternative, a DecoderFactory can be passed as a file handler.
        Returns:
        The new builder, initialized with the given supplier
      • builder

        public static <T,B> StorageMultiFileReaderConfig.Builder<T,B> builder(DecoderFactory<T> decoderFactory,
                                                                                          FileReadOptions readOptions)
        Creates a builder set up with the given file data decoder and additional defaults.
        Parameters:
        decoderFactory - Defines how items should be decoded from files. As an alternative when more control is needed, a PipeReaderSupplier can be passed instead.
        readOptions - Allows defining decompression and read buffer size to apply before decoding.
      • builder

        public static <T,B> StorageMultiFileReaderConfig.Builder<T,B> builder(DecoderFactory<T> decoderFactory)
        Creates a builder set up with the given file data decoder and additional defaults.
        Parameters:
        decoderFactory - Defines how items should be decoded from files. As an alternative when more control is needed, a PipeReaderSupplier can be passed instead.
      • getFileFilter

        public Predicate<B> getFileFilter()
        Returns:
        The file predicate resulting from ANDing all the specified filters. Given a file metadata object, determines whether the file should be read. By default, the predicate accepts all files.
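        The ANDing of filters described above can be illustrated with plain java.util.function.Predicate composition. This is a minimal sketch and not the library's code: FileMeta is a hypothetical stand-in for the metadata type B, and andAll is a hypothetical helper.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class FileFilterSketch {
    // Hypothetical stand-in for the storage-specific metadata type B.
    static final class FileMeta {
        final String path;
        final long sizeBytes;

        FileMeta(String path, long sizeBytes) {
            this.path = path;
            this.sizeBytes = sizeBytes;
        }
    }

    // Combines all filters with logical AND.
    // With no filters at all, the result accepts every file.
    static <B> Predicate<B> andAll(List<Predicate<B>> filters) {
        Predicate<B> combined = meta -> true;
        for (Predicate<B> filter : filters) {
            combined = combined.and(filter);
        }
        return combined;
    }

    public static void main(String[] args) {
        Predicate<FileMeta> isCsv = m -> m.path.endsWith(".csv");
        Predicate<FileMeta> nonEmpty = m -> m.sizeBytes > 0;
        Predicate<FileMeta> filter = andAll(Arrays.asList(isCsv, nonEmpty));

        System.out.println(filter.test(new FileMeta("logs/a.csv", 120))); // true
        System.out.println(filter.test(new FileMeta("logs/b.csv", 0)));   // false
        System.out.println(filter.test(new FileMeta("logs/c.json", 50))); // false
    }
}
```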
      • getShardSpecifier

        public ShardSpecifier getShardSpecifier()
        Returns:
        The identity of the shard to read, or null when automatic sharding is off. When sharding is enabled, every file that passes the filter conditions is automatically assigned to a shard, and only files assigned to the specified shard are read. Shard assignment is based on a hash of the file path when balancing is turned off, and on file sizes when balancing is turned on.
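        Path-hash sharding (the non-balanced mode) can be sketched as follows. The hash function actually used by the library is not stated here, so String.hashCode() is only an illustrative assumption, and shardOf and filesForShard are hypothetical helpers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HashShardingSketch {
    // Maps a file path to a shard index in [0, shardCount).
    // String.hashCode() is an illustrative choice, not necessarily the library's hash.
    static int shardOf(String path, int shardCount) {
        return Math.floorMod(path.hashCode(), shardCount);
    }

    // Keeps only the paths that belong to the given shard.
    static List<String> filesForShard(List<String> paths, int shardIndex, int shardCount) {
        List<String> result = new ArrayList<>();
        for (String path : paths) {
            if (shardOf(path, shardCount) == shardIndex) {
                result.add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("in/a.csv", "in/b.csv", "in/c.csv", "in/d.csv");
        int shardCount = 3;
        for (int shard = 0; shard < shardCount; shard++) {
            System.out.println("shard " + shard + " -> " + filesForShard(paths, shard, shardCount));
        }
    }
}
```

        Because each file's shard depends only on its own path, a worker can decide whether a file is its own without knowing about any other file. The price is that shards are balanced by file count only on average, not by volume.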
      • isBalancedSharding

        public boolean isBalancedSharding()
        Relevant only when automatic sharding is enabled.
        Returns:
        true when balanced sharding is enabled. When enabled, sharding tries to balance the total byte volume across shards. Otherwise, shards are more or less balanced by file count. By default this parameter is false. This option has some caveats to be aware of:
          1. It consumes more memory, since all remote file references are held in memory. If you are working with millions of remote files, this may require careful memory settings.
          2. In a distributed system, it is the user's responsibility to guarantee that no file is added or changed once the workers start. Failing to do so results in severe silent problems, such as files handled by multiple instances or files not handled at all.
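        To see why balanced sharding needs the full file listing up front, consider one common way to balance by volume: a greedy assignment that places the largest files first, each into the currently lightest shard. This is an assumed technique chosen for illustration, not necessarily the algorithm the library uses. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BalancedShardingSketch {
    // Greedy assignment: process files from largest to smallest and put each
    // one into the shard with the smallest total size so far.
    // Ties on size are broken by path so that the result is deterministic.
    static Map<String, Integer> assign(Map<String, Long> sizesByPath, int shardCount) {
        List<Map.Entry<String, Long>> files = new ArrayList<>(sizesByPath.entrySet());
        files.sort((a, b) -> {
            int bySize = Long.compare(b.getValue(), a.getValue()); // descending size
            return bySize != 0 ? bySize : a.getKey().compareTo(b.getKey());
        });

        long[] load = new long[shardCount];
        Map<String, Integer> shardByPath = new LinkedHashMap<>();
        for (Map.Entry<String, Long> file : files) {
            int lightest = 0;
            for (int s = 1; s < shardCount; s++) {
                if (load[s] < load[lightest]) {
                    lightest = s;
                }
            }
            load[lightest] += file.getValue();
            shardByPath.put(file.getKey(), lightest);
        }
        return shardByPath;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("in/a.bin", 100L);
        sizes.put("in/b.bin", 60L);
        sizes.put("in/c.bin", 50L);
        sizes.put("in/d.bin", 10L);
        // Both shards end up with 110 bytes.
        System.out.println(assign(sizes, 2)); // {in/a.bin=0, in/b.bin=1, in/c.bin=1, in/d.bin=0}
    }
}
```

        In a scheme like this, the shard of one file depends on every other file in the listing. If two workers see different listings, they compute different assignments, which is how files end up handled twice or not at all.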
      • getBucket

        public Bucket<B> getBucket()
        Returns:
        The bucket containing the files to read
      • getPaths

        public Collection<String> getPaths()
        Returns:
        The folder paths to read files from. Paths should be relative to the bucket.
      • isRecursivePaths

        public boolean isRecursivePaths()
        Returns:
        true if and only if the files to fetch should be detected recursively under the given paths. By default set to false.
      • getPipeSupplier

        public PipeReaderSupplier<T,B> getPipeSupplier()
        Returns:
        The pipe supplier specifying how items are extracted from files. Produces a pipe ready to read items from a given file. Setting a supplier as a file handler provides maximal control for the caller. The supplier receives the file's input stream plus the file metadata, and builds a pipe reading items from it. As a simple alternative, a DecoderFactory can be passed as a file handler.
      • getThreadNum

        public int getThreadNum()
        Returns:
        The number of threads to use for reading files. This controls the file download/streaming parallelism. When used by an async pipe with the download flag set, it also controls the number of threads used to read the local files once they have all been downloaded. By default the thread count is set to the number of cores on the machine.
      • isDownloadFirst

        public boolean isDownloadFirst()
        Returns:
        true when all files are first downloaded (in an efficient manner) and then read locally, instead of being streamed (the default behavior). Downloading first can become relevant when any of the following applies:
          1. There are very few files to fetch
          2. The files differ significantly in size
          3. The local disk is an SSD
      • getTmpFolder

        public File getTmpFolder()
        Returns:
        The temp folder to use when the download flag is turned on. Null when the download flag is turned off.
      • getFileOrder

        public Comparator<B> getFileOrder()
        Returns:
        The order in which files should be read, defined as a comparator on file metadata objects. This applies to sync readers only. By default, the order is lexicographic on the full path.
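        A file order is an ordinary java.util.Comparator over metadata objects. The sketch below contrasts a lexicographic path order with a custom smallest-file-first order. FileMeta is a hypothetical stand-in for the metadata type B, and the comparator and helper names are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FileOrderSketch {
    // Hypothetical stand-in for the storage-specific metadata type B.
    static final class FileMeta {
        final String path;
        final long sizeBytes;

        FileMeta(String path, long sizeBytes) {
            this.path = path;
            this.sizeBytes = sizeBytes;
        }
    }

    // Lexicographic order on the full path.
    static final Comparator<FileMeta> BY_PATH =
            Comparator.comparing((FileMeta m) -> m.path);

    // Custom order: smallest files first, path as a tie-breaker.
    static final Comparator<FileMeta> BY_SIZE_THEN_PATH =
            Comparator.comparingLong((FileMeta m) -> m.sizeBytes).thenComparing(BY_PATH);

    // Returns the file paths in the order a sequential reader would visit them.
    static List<String> orderedPaths(List<FileMeta> files, Comparator<FileMeta> order) {
        List<FileMeta> sorted = new ArrayList<>(files);
        sorted.sort(order);
        List<String> paths = new ArrayList<>();
        for (FileMeta m : sorted) {
            paths.add(m.path);
        }
        return paths;
    }

    public static void main(String[] args) {
        List<FileMeta> files = Arrays.asList(
                new FileMeta("data/b.csv", 10),
                new FileMeta("data/a.csv", 300),
                new FileMeta("data/c.csv", 20));

        System.out.println(orderedPaths(files, BY_PATH));           // [data/a.csv, data/b.csv, data/c.csv]
        System.out.println(orderedPaths(files, BY_SIZE_THEN_PATH)); // [data/b.csv, data/c.csv, data/a.csv]
    }
}
```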