Class SharderByHashPipe<T>

  • Type Parameters:
    T - The input items' data type
    All Implemented Interfaces:
    Closeable, AutoCloseable, BasePipe

    public class SharderByHashPipe<T>
    extends SharderByItemPipe<T>
    A terminal pipe that splits the contents of the input pipe into multiple files, according to a hash of some feature of each item. The original order is preserved within each shard. Note that this implementation keeps all shard files open at the same time, so make sure the system can handle that many open files.
    Author:
    Eyal Schneider
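
The hash-based splitting described above can be illustrated with a small in-memory sketch. This is not the library's implementation: the class HashShardingSketch, its methods, and the choice of Object.hashCode() with Math.floorMod are assumptions made for illustration only, and the actual hash function used by SharderByHashPipe is not specified here. Lists stand in for shard files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; not part of the library.
public class HashShardingSketch {

    // Maps a feature to a shard id in [0, shardCount).
    // Assumption: hashCode() is the hash; floorMod keeps the result
    // non-negative even when the hash code is negative.
    public static int shardFor(Object feature, int shardCount) {
        return Math.floorMod(feature.hashCode(), shardCount);
    }

    // Distributes items over shardCount lists. Items are appended in
    // input order, so the relative order inside each shard is preserved.
    public static <T> List<List<T>> shard(List<T> items,
                                          Function<? super T, ?> featureSelector,
                                          int shardCount) {
        List<List<T>> shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
        for (T item : items) {
            Object feature = featureSelector.apply(item);
            shards.get(shardFor(feature, shardCount)).add(item);
        }
        return shards;
    }

    public static void main(String[] args) {
        List<Integer> items = List.of(5, 2, 9, 6, 1, 13);
        List<List<Integer>> shards = shard(items, x -> x, 4);
        for (int i = 0; i < shards.size(); i++) {
            System.out.println(i + ": " + shards.get(i));
        }
    }
}
```

Because items with equal features always map to the same shard id, all items sharing a feature value end up in the same shard, in their original relative order.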
    • Constructor Detail

      • SharderByHashPipe

        public SharderByHashPipe(Pipe<T> input,
                                 EncoderFactory<T> encoderFactory,
                                 FailableFunction<? super T,?,PipeException> featureSelectorFunction,
                                 Function<Integer,String> fileNameFunction,
                                 int shardCount,
                                 File folder,
                                 FileWriteOptions writeOptions)
        Constructor
        Parameters:
        input - The input pipe
        encoderFactory - The encoder factory to use for writing items into the different shards
        featureSelectorFunction - Given an item, selects some feature from it to be hashed and used for shard selection. Must not return null for any non-null input!
        fileNameFunction - Given a shard id, returns the corresponding file name
        shardCount - The required number of shards.
        folder - The folder where to place all shards. Must exist. The files will be named according to fileNameFunction.
        writeOptions - Specify how the shard files should be written
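
A fileNameFunction is an ordinary java.util.function.Function<Integer,String> from shard id to file name. The sketch below shows two possible functions; the class ShardFileNames and its members are hypothetical names introduced for illustration and are not part of the library. PLAIN mimics the "0", "1", "2" naming described for the constructor without a fileNameFunction, and padded is one possible custom scheme.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of the library.
public class ShardFileNames {

    // Names each shard by its decimal id: "0", "1", "2", ...
    public static final Function<Integer, String> PLAIN = id -> Integer.toString(id);

    // Zero-pads the shard id to five digits so that lexicographic and
    // numeric orderings of the file names agree (for ids below 100000).
    public static Function<Integer, String> padded(String prefix, String suffix) {
        return id -> String.format(Locale.ROOT, "%s%05d%s", prefix, id, suffix);
    }

    public static void main(String[] args) {
        Function<Integer, String> naming = padded("part-", ".csv");
        for (int shardId = 0; shardId < 3; shardId++) {
            System.out.println(PLAIN.apply(shardId) + " -> " + naming.apply(shardId));
        }
    }
}
```

Whatever scheme is chosen, the function should return a distinct name for every shard id in the range 0 to shardCount-1, since two shards mapped to the same name would presumably target the same file.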
      • SharderByHashPipe

        public SharderByHashPipe(Pipe<T> input,
                                 EncoderFactory<T> encoderFactory,
                                 FailableFunction<? super T,?,PipeException> featureSelectorFunction,
                                 int shardCount,
                                 File folder,
                                 FileWriteOptions writeOptions)
        Constructor
        Parameters:
        input - The input pipe
        encoderFactory - The encoder factory to use for writing items into the different shards
        featureSelectorFunction - Given an item, selects some feature from it to be hashed and used for shard selection. Must not return null for any non-null input!
        shardCount - The required number of shards.
        folder - The folder where to place all shards. Must exist. The files will be named "0", "1", "2", ..., up to shardCount-1.
        writeOptions - Specify how the shard files should be written