Class SharderByItemPipe<T>

  • Type Parameters:
    T - The input items' data type
    All Implemented Interfaces:
    Closeable, AutoCloseable, BasePipe
    Direct Known Subclasses:
    SharderByHashPipe

    public class SharderByItemPipe<T>
    extends TerminalPipe
    A terminal pipe that splits the contents of the input pipe into multiple files, according to a sharding criterion computed from each item. The original item order is preserved within each shard. Note that this implementation keeps all shard files open at the same time, so make sure the system's open-file limit can accommodate the number of shards.
    Author:
    Eyal Schneider
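
The behavior described above (one file per shard id, order preserved within each shard, all shard files open at once) can be illustrated with a minimal standalone sketch. This is not the library's implementation: it uses only java.io and java.nio, treats items as plain strings written one per line, and the names ShardingSketch and shardByItem are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of item-based sharding; not part of the library.
public class ShardingSketch {

    // Writes each item as one line into the file named after its shard id,
    // and returns the number of items written to each shard.
    public static Map<String, Integer> shardByItem(List<String> items,
                                                   Function<String, String> shardSelector,
                                                   Path folder) throws IOException {
        // One writer per shard; all of them stay open until the input is exhausted.
        Map<String, BufferedWriter> writers = new HashMap<>();
        Map<String, Integer> counts = new LinkedHashMap<>();
        try {
            for (String item : items) {
                String shardId = shardSelector.apply(item);
                if (shardId == null) {
                    throw new IllegalArgumentException("Selector returned null for: " + item);
                }
                BufferedWriter writer = writers.get(shardId);
                if (writer == null) {
                    // First item of this shard: create its file (the folder must exist).
                    writer = Files.newBufferedWriter(folder.resolve(shardId), StandardCharsets.UTF_8);
                    writers.put(shardId, writer);
                }
                // Items are appended in arrival order, so per-shard order is preserved.
                writer.write(item);
                writer.newLine();
                counts.merge(shardId, 1, Integer::sum);
            }
        } finally {
            // Close every writer, remembering the first failure and suppressing the rest.
            IOException closeFailure = null;
            for (BufferedWriter writer : writers.values()) {
                try {
                    writer.close();
                } catch (IOException e) {
                    if (closeFailure == null) {
                        closeFailure = e;
                    } else {
                        closeFailure.addSuppressed(e);
                    }
                }
            }
            if (closeFailure != null) {
                throw closeFailure;
            }
        }
        return counts;
    }
}
```

In this sketch the number of simultaneously open files equals the number of distinct shard ids, which is why the open-file limit matters when the selector can produce many ids.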
    • Constructor Detail

      • SharderByItemPipe

        public SharderByItemPipe(Pipe<T> input,
                                 EncoderFactory<? super T> encoderFactory,
                                 FailableFunction<? super T, String, PipeException> shardSelectorFunction,
                                 File folder,
                                 FileWriteOptions writeOptions)
        Constructor
        Parameters:
        input - The input pipe
        encoderFactory - The encoder factory to use for writing items into the different shards
        shardSelectorFunction - Given an item, selects the corresponding shard id. Each shard file is named after its id. Must not return null for any non-null input.
        folder - The folder in which to place all shards. Must exist.
        writeOptions - Specifies how the shard files should be written
      • SharderByItemPipe

        public SharderByItemPipe(Pipe<T> input,
                                 EncoderFactory<? super T> encoderFactory,
                                 FailableFunction<? super T, String, PipeException> shardSelectorFunction,
                                 File folder)
        Constructor
        Parameters:
        input - The input pipe
        encoderFactory - The encoder factory to use for writing items into the different shards
        shardSelectorFunction - Given an item, selects the corresponding shard id. Each shard file is named after its id. Must not return null for any non-null input.
        folder - The folder in which to place all shards. Must exist.
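
Because the shard id returned by shardSelectorFunction doubles as a file name and must never be null, a selector typically normalizes its key. The following is a hedged sketch of such a selector for text lines. It uses java.util.function.Function as a stand-in for the library's FailableFunction, and the class name ShardSelectorSketch, the constant BY_FIRST_FIELD, and the "_empty" fallback id are all hypothetical.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical shard selector; java.util.function.Function stands in for FailableFunction.
public class ShardSelectorSketch {

    // Derives a shard id from the first comma-separated field of a line.
    // The result is never null and contains only characters that are safe in file names.
    public static final Function<String, String> BY_FIRST_FIELD = line -> {
        int comma = line.indexOf(',');
        String key = (comma >= 0 ? line.substring(0, comma) : line)
                .trim()
                .toLowerCase(Locale.ROOT);
        // Replace anything outside [a-z0-9_-] so the id can be used as a file name.
        String safe = key.replaceAll("[^a-z0-9_-]", "_");
        // Fall back to a fixed id rather than returning an empty name.
        return safe.isEmpty() ? "_empty" : safe;
    };
}
```

With this selector, the lines "US,New York" and "us,Boston" both land in the shard named "us", and a line with an empty first field goes to "_empty".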
    • Method Detail

      • start

        public void start()
                   throws PipeException,
                          InterruptedException
        Description copied from interface: BasePipe
        Performs pre-processing prior to item flow through the pipe. Implementations must call the same method for all their input pipes before accessing their items. This is typically done here.
        Throws:
        PipeException - In case of pipe errors in this pipe or somewhere upstream.
        InterruptedException - If the operation has been interrupted by another thread.
      • getShardSizes

        public Map<String, Integer> getShardSizes()
        Returns:
        The counts of items written to each shard. Call this method only after start() has been called and completed successfully.
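
The returned map can be summarized with ordinary collection code. The sketch below works on any Map<String, Integer> shaped like the result of getShardSizes(); the class ShardSizeReport and its methods are hypothetical helpers, not part of the library.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helpers for inspecting a shard-id-to-item-count map.
public class ShardSizeReport {

    // Total number of items across all shards.
    public static long totalItems(Map<String, Integer> shardSizes) {
        long total = 0;
        for (int count : shardSizes.values()) {
            total += count;
        }
        return total;
    }

    // Id of the shard holding the most items, or empty if there are no shards.
    public static Optional<String> largestShard(Map<String, Integer> shardSizes) {
        return shardSizes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

A large gap between the largest shard and the average shard size is a quick sign that the shard selector distributes items unevenly.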