Class SharderBySeqPipe<T>

  • Type Parameters:
    T - The input items' data type
    All Implemented Interfaces:
    Closeable, AutoCloseable, BasePipe

    public class SharderBySeqPipe<T>
    extends CompoundTerminalPipe
    A terminal pipe that splits the contents of the input pipe into shards, using a criterion that partitions the input into disjoint, contiguous sequences. Unlike other sharder pipes, this implementation assumes that input items are already grouped by target shard, so it can write one file at a time and avoid keeping many files open simultaneously. Note that if a sequence maps to a shard that has already been processed, that shard's file is overwritten.
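    The file-by-file behavior and the overwrite caveat can be illustrated with a minimal, JDK-only sketch. This is not the source of SharderBySeqPipe: the class name SeqShardSketch, the method shard, and the use of plain strings written one per line are assumptions made only for illustration, and java.util.function.Function stands in for the library's FailableFunction.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;
import java.util.function.Function;

// Illustrative stand-in only; this is NOT the SharderBySeqPipe implementation.
final class SeqShardSketch {

    // Writes each contiguous run of items sharing a shard id to dir/<shardId>,
    // keeping at most one file open at any time.
    static void shard(Iterable<String> items,
                      Function<String, String> shardSelector,
                      Path dir) throws IOException {
        String currentShard = null;
        BufferedWriter writer = null;
        try {
            for (String item : items) {
                String shard = Objects.requireNonNull(
                        shardSelector.apply(item), "shard id must not be null");
                if (!shard.equals(currentShard)) {
                    // A new sequence starts: close the previous shard file first.
                    if (writer != null) {
                        writer.close();
                    }
                    // With no explicit options the file is created or truncated,
                    // so a shard id that reappears later overwrites its earlier file.
                    writer = Files.newBufferedWriter(dir.resolve(shard));
                    currentShard = shard;
                }
                writer.write(item);
                writer.newLine();
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
    }
}
```

    With the input a1, a2, b1, a3 and the first character as the shard id, this sketch leaves file a containing only a3, because the second "a" sequence truncates the file written by the first. Input that is fully grouped by shard avoids this.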
    Author:
    Eyal Schneider
    • Constructor Detail

      • SharderBySeqPipe

        public SharderBySeqPipe(Pipe<T> input,
                                EncoderFactory<? super T> encoderFactory,
                                FailableFunction<? super T, String, PipeException> shardSelectorFunction,
                                File folder,
                                FileWriteOptions fileWriteOptions)
        Constructor
        Parameters:
        input - The input pipe
        encoderFactory - The encoder factory to use for writing items into the different shards
        shardSelectorFunction - Given an item, selects the corresponding shard id. Files will use this id as a name. Must not return null for any non-null input!
        folder - The folder in which to place all shards. Must exist.
        fileWriteOptions - Defines how files should be written
      • SharderBySeqPipe

        public SharderBySeqPipe(Pipe<T> input,
                                EncoderFactory<? super T> encoderFactory,
                                FailableFunction<? super T, String, PipeException> shardSelectorFunction,
                                File folder)
        Constructor. Uses the default file write options.
        Parameters:
        input - The input pipe
        encoderFactory - The encoder factory to use for writing items into the different shards
        shardSelectorFunction - Given an item, selects the corresponding shard id. Files will use this id as a name. May be stateful. Must not return null for any non-null input!
        folder - The folder in which to place all shards. Must exist.
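
        Because the shard selector may be stateful, one possible use is to cut the input into consecutive, fixed-size shards. The sketch below is a hypothetical helper, not part of the library: the class name FixedSizeShardSelector and the "prefix-index" naming scheme are assumptions, and it implements java.util.function.Function as a stand-in for FailableFunction. It also assumes the selector is invoked exactly once per item, in input order, from a single thread.

```java
import java.util.function.Function;

// Hypothetical helper (not a library class): maps consecutive items to
// shard ids <prefix>-0, <prefix>-1, ... with itemsPerShard items each.
final class FixedSizeShardSelector<T> implements Function<T, String> {
    private final String prefix;
    private final int itemsPerShard;
    private long seen = 0; // number of items already assigned to a shard

    FixedSizeShardSelector(String prefix, int itemsPerShard) {
        if (itemsPerShard <= 0) {
            throw new IllegalArgumentException("itemsPerShard must be positive");
        }
        this.prefix = prefix;
        this.itemsPerShard = itemsPerShard;
    }

    // Never returns null, and shard ids only ever move forward, so every
    // shard corresponds to exactly one contiguous sequence of the input.
    @Override
    public String apply(T item) {
        long shardIndex = seen / itemsPerShard;
        seen++;
        return prefix + "-" + shardIndex;
    }
}
```

        Since the index never decreases, no shard id reappears after its sequence has ended, so no shard file is overwritten within a single run.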