Class AsyncSharderPipe<T>

  • Type Parameters:
    T - The input items' data type
    All Implemented Interfaces:
    Closeable, AutoCloseable, BasePipe
    Direct Known Subclasses:
    AsyncSharderByHashPipe

    public class AsyncSharderPipe<T>
    extends TerminalPipe
    A terminal pipe that takes an async pipe as input and splits its contents into multiple files according to a sharding criterion evaluated per item. The async input enables high throughput through parallel writes to different files; writing is performed by the threads provided by the input pipe. close() may be called by any thread once start() has been invoked. Note that this implementation keeps all shard files open at the same time, so make sure the system can handle that number of open files.
    Author:
    Eyal Schneider
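The sharding behavior described above can be sketched without the library, using plain java.nio. This is only an illustration under stated assumptions, not the class's actual implementation: the names ShardingSketch and shard are hypothetical, a parallel stream stands in for the threads of the async input pipe, and items are written as text lines rather than through an EncoderFactory. As in the description, every shard file stays open until all items have been written.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration only; not part of the library.
class ShardingSketch {

    // Writes each item as one line into <folder>/<shardId> and returns the item count per shard.
    static Map<String, Integer> shard(List<String> items,
                                      Function<String, String> shardSelector,
                                      Path folder) throws IOException {
        Map<String, BufferedWriter> writers = new ConcurrentHashMap<>();
        Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();
        try {
            // Assumption: a parallel stream stands in for the async input pipe's threads.
            items.parallelStream().forEach(item -> {
                String shardId = shardSelector.apply(item);
                // One writer per shard, opened lazily and kept open until the end.
                BufferedWriter w = writers.computeIfAbsent(shardId, id -> open(folder.resolve(id)));
                synchronized (w) { // keep the item and its line break together
                    try {
                        w.write(item);
                        w.newLine();
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }
                counts.computeIfAbsent(shardId, id -> new AtomicInteger()).incrementAndGet();
            });
        } finally {
            for (BufferedWriter w : writers.values()) {
                w.close();
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        counts.forEach((id, c) -> result.put(id, c.get()));
        return result;
    }

    private static BufferedWriter open(Path file) {
        try {
            return Files.newBufferedWriter(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because one writer is held per distinct shard id, the number of open file handles grows with the number of shards, which is why the open-file limit mentioned above matters.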
    • Constructor Detail

      • AsyncSharderPipe

        public AsyncSharderPipe(AsyncPipe<T> input,
                                EncoderFactory<? super T> encoderFactory,
                                Function<? super T,String> shardSelectorFunction,
                                File folder,
                                FileWriteOptions writeOptions)
        Constructor
        Parameters:
        input - The input pipe
        encoderFactory - The encoder factory to use for writing items into the different shards
        shardSelectorFunction - Given an item, selects the corresponding shard id, which is also used as the shard's file name. Must not return null for any non-null input.
        folder - The folder where to place all shards. Must exist.
        writeOptions - Specify how the shard files should be written
      • AsyncSharderPipe

        public AsyncSharderPipe(AsyncPipe<T> input,
                                EncoderFactory<? super T> encoderFactory,
                                Function<? super T,String> shardSelectorFunction,
                                File folder)
        Constructor. Uses default file write options.
        Parameters:
        input - The input pipe
        encoderFactory - The encoder factory to use for writing items into the different shards
        shardSelectorFunction - Given an item, selects the corresponding shard id, which is also used as the shard's file name. Must not return null for any non-null input.
        folder - The folder where to place all shards. Must exist.
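As an illustration of the shardSelectorFunction contract, here is a minimal sketch of a selector that never returns null for a non-null item and always yields a name that is safe to use as a file name. The class ShardSelectors, the method byHash, and the "shard_<n>" naming scheme are hypothetical and not part of the library; this is also not a description of how AsyncSharderByHashPipe works.

```java
import java.util.function.Function;

// Hypothetical helper; not part of the library.
class ShardSelectors {

    // Maps each item to one of shardCount shard ids named "shard_0" .. "shard_<shardCount-1>".
    static <T> Function<T, String> byHash(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        // floorMod keeps the bucket non-negative even when hashCode() is negative.
        return item -> "shard_" + Math.floorMod(item.hashCode(), shardCount);
    }
}
```

A selector like this bounds the number of shard files by shardCount, which also bounds the number of files held open at once.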
    • Method Detail

      • start

        public void start()
                   throws PipeException,
                          InterruptedException
        Description copied from interface: BasePipe
        Performs pre-processing prior to item flow through the pipe. Implementations must call the same method on all of their input pipes before accessing their items; this is typically done here.
        Throws:
        PipeException - In case of errors in this pipe or somewhere upstream.
        InterruptedException - If the operation has been interrupted by another thread.
      • getShardSizes

        public Map<String,Integer> getShardSizes()
        Returns:
        The counts of items written to each shard. Call this method only after start() has been called and completed successfully.
      • getProgress

        public float getProgress()
        Specified by:
        getProgress in interface BasePipe
        Overrides:
        getProgress in class TerminalPipe
        Returns:
        The pipe flow progress, as a floating-point number between 0.0 and 1.0. Important implementation rules:
        1) Calling this method before the start() call has completed isn't allowed and has undefined behavior.
        2) Implementations should make a best effort to estimate the progress this pipe has made (0.0 - 1.0).
        3) When the pipe is fully consumed, getProgress() should return 1.0.
        4) Results must be monotonic, i.e. the results of consecutive calls must never decrease.
        5) Thread safety: progress may be updated by some threads and monitored by others, so implementations must be thread-safe.
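The progress rules above (range 0.0 to 1.0, monotonic, exactly 1.0 when done, thread-safe) can be met with a single atomic counter. The following is a minimal sketch under the assumption that the total number of items is known up front; ProgressTracker and itemProcessed are hypothetical names, and this is not how AsyncSharderPipe itself computes progress.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of a monotonic, thread-safe progress estimate.
class ProgressTracker {
    private final long total;
    private final AtomicLong done = new AtomicLong();

    ProgressTracker(long total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        this.total = total;
    }

    // Called by worker threads after each item; the counter only ever grows.
    void itemProcessed() {
        done.incrementAndGet();
    }

    // Safe to call from a monitoring thread; never decreases and is capped at 1.0.
    float getProgress() {
        return Math.min(1.0f, (float) done.get() / total);
    }
}
```

Since the counter never decreases and the result is capped, consecutive calls can only return equal or larger values, and the value is exactly 1.0 once all items are counted.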