Class GrouperPipe<T>

  • Type Parameters:
    T - The data type of items in the input pipe
    All Implemented Interfaces:
    Closeable, AutoCloseable, BasePipe, Pipe<T>

    public class GrouperPipe<T>
    extends CompoundPipe<T>
    Splits the input items into families and emits the items of each family consecutively. The order between families is arbitrary, but the order of items within each family is preserved as in the input. This pipe uses temporary disk space. Configuring partitionCount is critical for bounding memory usage: the more partitions are used, the less memory is required. It is recommended to set this number to {estimated total input data volume} / {max memory allowed for this pipe to use}.
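
The sizing recommendation above can be sketched as a small helper. This is only an illustration: the class name PartitionSizing and the method estimatePartitionCount are hypothetical and not part of this library, and rounding the ratio up (with a minimum of one partition) is an assumption, since the recommendation only gives the ratio itself.

```java
public final class PartitionSizing {
    private PartitionSizing() {}

    /**
     * Hypothetical helper: estimated input volume divided by the memory budget,
     * rounded up (an assumption) and clamped to the range [1, Integer.MAX_VALUE].
     */
    public static int estimatePartitionCount(long totalInputBytes, long maxMemoryBytes) {
        if (totalInputBytes < 0) {
            throw new IllegalArgumentException("totalInputBytes must be >= 0");
        }
        if (maxMemoryBytes <= 0) {
            throw new IllegalArgumentException("maxMemoryBytes must be > 0");
        }
        // Ceiling division written without addition, so it cannot overflow.
        long count = totalInputBytes / maxMemoryBytes
                + (totalInputBytes % maxMemoryBytes == 0 ? 0 : 1);
        return (int) Math.min(Integer.MAX_VALUE, Math.max(1L, count));
    }
}
```

For example, an estimated 10 GiB of input with a 512 MiB memory budget yields 20 partitions. Since one file is kept open per partition, the result may still need to be capped by the process's open-file limit.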
    Author:
    Eyal Schneider
    • Constructor Detail

      • GrouperPipe

        public GrouperPipe(Pipe<T> input,
                           FailableFunction<T,?,PipeException> discriminator,
                           CodecFactory<T> inputCodec,
                           int partitionCount,
                           File tmpFolder)
        Creates a pipe that groups the items of the input pipe into families.
        Parameters:
        input - The input pipe to wrap
        discriminator - The function that identifies which "family" an item belongs to. Important: the value returned by the discriminator must have an equals() implementation that is consistent with the family partitioning and with its hashCode() implementation.
        inputCodec - A codec allowing writing/reading input records
        partitionCount - The number of partitions to split the input into. Assuming a good hash function on item keys, and assuming that the families defined by the discriminator are similar in size, the caller can expect the partitions to be roughly balanced in size. This number determines the amount of memory used, so it should be chosen with care: the more partitions are used, the less total memory is required. Note, however, that the class keeps one open file on disk for each partition.
        tmpFolder - The folder where to store temporary data
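
To illustrate the equals()/hashCode() requirement on the discriminator's return value, here is a minimal sketch of a composite family key. The class FamilyKey and its fields are hypothetical and not part of this library. The partitionOf method only shows how a hash-based scheme could map a key to a partition; it is an assumption, not a description of how GrouperPipe works internally.

```java
import java.util.Objects;

/** Hypothetical family key: items with the same country and year form one family. */
public final class FamilyKey {
    private final String country;
    private final int year;

    public FamilyKey(String country, int year) {
        this.country = Objects.requireNonNull(country, "country");
        this.year = year;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof FamilyKey)) {
            return false;
        }
        FamilyKey that = (FamilyKey) other;
        // Equality uses exactly the fields that define the family.
        return year == that.year && country.equals(that.country);
    }

    @Override
    public int hashCode() {
        // Built from the same fields as equals(), so equal keys share a hash code.
        return Objects.hash(country, year);
    }

    /** Illustrative assumption: one way a hash-based scheme could pick a partition. */
    public static int partitionOf(Object key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }
}
```

A key type that relied on the default identity-based equals() would treat every returned object as its own family, so items that logically belong together would not be grouped.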