Class HashJoinPipe<K,​L,​R>

  • Type Parameters:
    K - The type of the key used for matching records
    L - The type of left side records
    R - The type of right side records
    All Implemented Interfaces:
    Closeable, AutoCloseable, BasePipe, Pipe<JoinRecord<K,​L,​R>>

    public class HashJoinPipe<K,​L,​R>
    extends CompoundPipe<JoinRecord<K,​L,​R>>
    A pipe that joins a 'left' pipe of type L with a list of 'right' pipes of type R. Unlike SortedJoinPipe, it does not require the left and right pipes to be ordered. The pipe uses a grace hash join approach, so the caller must choose the partitioning (see partitionCount in the constructors) carefully to prevent OOM errors. Duplicates are allowed. The output type of this pipe is JoinRecord, which consists of the key, the left matches and the right matches. The join can work in LEFT, INNER, FULL_INNER or OUTER mode; see JoinMode for details.
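    The partition-then-join idea behind a grace hash join can be sketched as follows. This is a minimal in-memory illustration, not this class's implementation: a real grace hash join writes the partitions to temporary storage, while the sketch keeps them in lists. The names GraceHashJoinSketch, Match, partition, groupByKey and innerJoin are hypothetical, and only an inner join with a single right input is shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: an in-memory stand-in for the grace hash join idea.
// None of these names belong to the library.
class GraceHashJoinSketch {

    /** Hypothetical result holder: a key with its left and right matches. */
    static final class Match<K, L, R> {
        final K key;
        final List<L> lefts;
        final List<R> rights;

        Match(K key, List<L> lefts, List<R> rights) {
            this.key = key;
            this.lefts = lefts;
            this.rights = rights;
        }
    }

    /** Splits items into partitionCount buckets by the hash of their key. */
    static <T, K> List<List<T>> partition(List<T> items, Function<T, K> keyOf, int partitionCount) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
        for (T item : items) {
            // floorMod keeps the index non-negative for negative hash codes.
            int index = Math.floorMod(keyOf.apply(item).hashCode(), partitionCount);
            partitions.get(index).add(item);
        }
        return partitions;
    }

    /** Groups the items of one partition by key, preserving encounter order per key. */
    static <T, K> Map<K, List<T>> groupByKey(List<T> items, Function<T, K> keyOf) {
        Map<K, List<T>> byKey = new HashMap<>();
        for (T item : items) {
            byKey.computeIfAbsent(keyOf.apply(item), k -> new ArrayList<>()).add(item);
        }
        return byKey;
    }

    /** Inner join of two unordered inputs, processed one partition index at a time. */
    static <K, L, R> List<Match<K, L, R>> innerJoin(List<L> left, Function<L, K> leftKey,
                                                    List<R> right, Function<R, K> rightKey,
                                                    int partitionCount) {
        List<List<L>> leftParts = partition(left, leftKey, partitionCount);
        List<List<R>> rightParts = partition(right, rightKey, partitionCount);
        List<Match<K, L, R>> out = new ArrayList<>();
        for (int p = 0; p < partitionCount; p++) {
            // Equal keys always hash to the same partition index, so partition p
            // of the left input only needs to be matched against partition p of the right.
            Map<K, List<L>> leftByKey = groupByKey(leftParts.get(p), leftKey);
            Map<K, List<R>> rightByKey = groupByKey(rightParts.get(p), rightKey);
            for (Map.Entry<K, List<L>> entry : leftByKey.entrySet()) {
                List<R> rights = rightByKey.get(entry.getKey());
                if (rights != null) {
                    out.add(new Match<>(entry.getKey(), entry.getValue(), rights));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a1", "b1", "a2");
        List<String> right = Arrays.asList("a9", "c9");
        Function<String, String> firstChar = s -> s.substring(0, 1);
        for (Match<String, String, String> m : innerJoin(left, firstChar, right, firstChar, 4)) {
            System.out.println(m.key + " left=" + m.lefts + " right=" + m.rights);
        }
    }
}
```

    Because only one partition index is grouped in memory at a time, the peak memory in this sketch depends on the size of the largest partition rather than on the size of the whole input.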
    Author:
    Eyal Schneider
    • Constructor Detail

      • HashJoinPipe

        public HashJoinPipe​(Pipe<L> leftPipe,
                            FailableFunction<L,​K,​PipeException> leftKeyExtractor,
                            List<? extends Pipe<R>> rightPipes,
                            FailableFunction<R,​K,​PipeException> rightKeyExtractor,
                            JoinMode joinMode,
                            int partitionCount,
                            CodecFactory<L> leftCodec,
                            CodecFactory<R> rightCodec,
                            File tmpFolder)
        Constructor
        Parameters:
        leftPipe - The left side pipe in the join operation
        leftKeyExtractor - Extracts the key from items of the left pipe
        rightPipes - The list of right side pipes. The order matters: it determines the ids given to the pipes in the iteration outputs (see JoinRecord).
        rightKeyExtractor - Extracts the key from items of the right pipes
        joinMode - The policy for performing the join. See JoinMode.
        partitionCount - The number of partitions each pipe is split into. Assuming a good hash function on item keys, the partitions can be expected to be roughly equal in size. In the worst case, the same partition of all pipes is loaded into memory at the same time, so this number determines the amount of memory used and should be chosen with caution.
        leftCodec - An encoder/decoder factory for items of the left pipe. Used for the intermediate storage needed by the hash join.
        rightCodec - An encoder/decoder factory for items of the right pipes. Used for the intermediate storage needed by the hash join.
        tmpFolder - The folder to use for temporary storage of pipe contents
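        The memory trade-off described for partitionCount can be estimated with simple arithmetic. The sketch below is a rough heuristic, not part of this library: PartitionCountEstimate and its estimate method are hypothetical, and the inputs (the combined in-memory size of all pipes and the memory budget) are values the caller would have to estimate. It assumes evenly sized partitions and the worst case stated above, in which the same partition of every pipe is in memory at once.

```java
// Illustrative only: a back-of-the-envelope helper, not a library API.
class PartitionCountEstimate {

    /**
     * Returns the smallest partition count for which one partition of every pipe
     * fits into the memory budget, assuming the partitions are evenly sized.
     *
     * @param totalBytesAllPipes estimated combined in-memory size of all joined pipes
     * @param memoryBudgetBytes  memory the caller is willing to spend on the join
     */
    static int estimate(long totalBytesAllPipes, long memoryBudgetBytes) {
        if (totalBytesAllPipes < 0 || memoryBudgetBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and the budget positive");
        }
        // Ceiling division written without the overflow risk of (a + b - 1) / b.
        long count = totalBytesAllPipes / memoryBudgetBytes
                + (totalBytesAllPipes % memoryBudgetBytes == 0 ? 0 : 1);
        // At least one partition is always needed; clamp to the int range.
        return (int) Math.max(1, Math.min(count, Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long tenGiB = 10L * 1024 * 1024 * 1024;
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(estimate(tenGiB, oneGiB));
    }
}
```

        Hash partitions are rarely perfectly even, so a caller using such an estimate would likely multiply the result by a safety factor.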
      • HashJoinPipe

        public HashJoinPipe​(Pipe<L> leftPipe,
                            FailableFunction<L,​K,​PipeException> leftKeyExtractor,
                            Pipe<R> rightPipe,
                            FailableFunction<R,​K,​PipeException> rightKeyExtractor,
                            JoinMode joinMode,
                            int partitionCount,
                            CodecFactory<L> leftCodec,
                            CodecFactory<R> rightCodec,
                            File tmpFolder)
        Constructor to be used when there is a single right pipe.
        Parameters:
        leftPipe - The left side pipe in the join operation
        leftKeyExtractor - Extracts the key from items of the left pipe
        rightPipe - The right side pipe
        rightKeyExtractor - Extracts the key from items of the right pipe
        joinMode - The policy for performing the join. See JoinMode.
        partitionCount - The number of partitions each pipe is split into. Assuming a good hash function on item keys, the partitions can be expected to be roughly equal in size. In the worst case, the same partition of all pipes is loaded into memory at the same time, so this number determines the amount of memory used and should be chosen with caution.
        leftCodec - An encoder/decoder factory for items of the left pipe. Used for the intermediate storage needed by the hash join.
        rightCodec - An encoder/decoder factory for items of the right pipe. Used for the intermediate storage needed by the hash join.
        tmpFolder - The folder to use for temporary storage of pipe contents
      • HashJoinPipe

        public HashJoinPipe​(List<? extends Pipe<R>> rightPipes,
                            FailableFunction<R,​K,​PipeException> rightKeyExtractor,
                            int partitionCount,
                            CodecFactory<R> rightCodec,
                            File tmpFolder)
        Constructor for the case of no left pipe. Assumes join mode OUTER among the right pipes.
        Parameters:
        rightPipes - The list of right side pipes. The order matters: it determines the ids given to the pipes in the iteration outputs (see JoinRecord).
        rightKeyExtractor - Extracts the key from items of the right pipes
        partitionCount - The number of partitions each pipe is split into. Assuming a good hash function on item keys, the partitions can be expected to be roughly equal in size. In the worst case, the same partition of all pipes is loaded into memory at the same time, so this number determines the amount of memory used and should be chosen with caution.
        rightCodec - An encoder/decoder factory for items of the right pipes. Used for the intermediate storage needed by the hash join.
        tmpFolder - The folder to use for temporary storage of pipe contents
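        An outer join among several right inputs, with ids derived from list order, can be sketched as below. This is an in-memory illustration only: OuterJoinAmongRightsSketch and outerJoin are hypothetical names, partitioning is omitted for brevity, and using the zero-based list index as the pipe id is an assumption. The way JoinRecord actually exposes the matches per right pipe is not shown in this section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Illustrative only: none of these names belong to the library.
class OuterJoinAmongRightsSketch {

    /**
     * Groups the items of all inputs by key. For each key, the inner map goes from
     * the id of an input (assumed here to be its index in the list) to that input's
     * items for the key. A key found in any single input yields an entry, which is
     * the outer join behaviour.
     */
    static <K, R> Map<K, Map<Integer, List<R>>> outerJoin(List<List<R>> rightInputs,
                                                         Function<R, K> keyOf) {
        Map<K, Map<Integer, List<R>>> byKey = new LinkedHashMap<>();
        for (int id = 0; id < rightInputs.size(); id++) {
            for (R item : rightInputs.get(id)) {
                byKey.computeIfAbsent(keyOf.apply(item), k -> new TreeMap<>())
                     .computeIfAbsent(id, i -> new ArrayList<>())
                     .add(item);
            }
        }
        return byKey;
    }

    public static void main(String[] args) {
        List<List<String>> inputs = Arrays.asList(
                Arrays.asList("a1", "b1"),   // id 0
                Arrays.asList("a2"),         // id 1
                Arrays.asList("c3"));        // id 2
        Map<String, Map<Integer, List<String>>> joined =
                outerJoin(inputs, s -> s.substring(0, 1));
        System.out.println(joined);
    }
}
```

        Reordering the input list changes which id each input's items appear under, which is why the order of rightPipes matters.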