Package org.pipecraft.pipes.sync.inter
Class DedupPipe<T>
- java.lang.Object
-
- org.pipecraft.pipes.sync.inter.CompoundPipe<O>
-
- org.pipecraft.pipes.sync.inter.reduct.HashReductorPipe<T,T>
-
- org.pipecraft.pipes.sync.inter.DedupPipe<T>
-
- Type Parameters:
T- The data type of items in the input pipe
- All Implemented Interfaces:
Closeable, AutoCloseable, BasePipe, Pipe<T>
public class DedupPipe<T> extends HashReductorPipe<T,T>
Deduplicates the items of the input pipe based on item equality (the equals() method). For each set of equal items, exactly one arbitrary instance is produced in the output. The implementation relies on equals() and on equals() being consistent with hashCode(). This pipe uses temporary disk space. The partitionCount setting is critical for bounding memory usage: the more partitions are used, the less memory is required. The recommended value is {estimated total input data volume} / {max memory allowed for this pipe to use}.
- Author:
- Eyal Schneider
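The partitionCount recommendation in the class description is a simple division, and a minimal sketch may make it concrete. The class and method names below (PartitionCountEstimator, estimate) are hypothetical helpers and not part of the pipecraft API. Rounding up and enforcing a minimum of one partition are assumptions added here, not rules stated by the documentation.

```java
// Hypothetical helper, not part of pipecraft: applies the recommendation
// partitionCount = {estimated input volume} / {max memory for this pipe}.
final class PartitionCountEstimator {
    private PartitionCountEstimator() {}

    /**
     * Returns the estimated input size divided by the memory budget,
     * rounded up (assumption), never below 1 and capped at Integer.MAX_VALUE.
     */
    static int estimate(long estimatedInputBytes, long maxMemoryBytes) {
        if (estimatedInputBytes < 0 || maxMemoryBytes <= 0) {
            throw new IllegalArgumentException(
                "input size must be >= 0 and memory budget must be > 0");
        }
        // Ceiling division written so that it cannot overflow.
        long partitions = estimatedInputBytes / maxMemoryBytes;
        if (estimatedInputBytes % maxMemoryBytes != 0) {
            partitions++;
        }
        return (int) Math.min(Integer.MAX_VALUE, Math.max(1L, partitions));
    }

    public static void main(String[] args) {
        long tenGiB = 10L * 1024 * 1024 * 1024;
        long halfGiB = 512L * 1024 * 1024;
        // 10 GiB of input with a 512 MiB budget gives 20 partitions.
        System.out.println(estimate(tenGiB, halfGiB));
    }
}
```

The result would be passed as the partitionCount constructor argument. Since each partition keeps a file open on disk, a very large estimate may need to be weighed against the process's open-file limit.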
Method Summary
-
Methods inherited from class org.pipecraft.pipes.sync.inter.reduct.HashReductorPipe
close, createPipeline
-
Methods inherited from class org.pipecraft.pipes.sync.inter.CompoundPipe
getProgress, next, peek, start
Constructor Detail
-
DedupPipe
public DedupPipe(Pipe<T> input, CodecFactory<T> inputCodec, int partitionCount, File tmpFolder)
Constructor
- Parameters:
input- The input pipe to wrap
inputCodec- A codec for writing and reading input records
partitionCount- The number of partitions to split the input into. Assuming a good hash function on item keys, and assuming that the families defined by the discriminator are even in size, the caller can assume the partitions are more or less balanced in size. This number determines the amount of memory used, so it should be chosen with caution. The more partitions are used, the less total memory is required. However, note that the class maintains an open file on disk for each partition.
tmpFolder- The folder in which to store temporary data
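The role of partitionCount can be illustrated with a small, self-contained sketch of hash-partitioned deduplication. This is not the pipecraft implementation: the class name HashPartitionedDedupSketch is hypothetical, in-memory lists stand in for the per-partition temporary files, and no codec is involved. It only shows why equal items must have equal hash codes (so they land in the same partition) and why more partitions mean less data handled at once.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; in-memory lists replace the temporary files
// that a disk-backed implementation would use for each partition.
final class HashPartitionedDedupSketch {
    private HashPartitionedDedupSketch() {}

    static <T> List<T> dedup(Iterable<T> input, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        List<List<T>> partitions = new ArrayList<>(partitionCount);
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
        // Phase 1: route each item by its hash code. Equal items share a
        // hash code (equals/hashCode contract), hence share a partition.
        for (T item : input) {
            int p = Math.floorMod(item.hashCode(), partitionCount);
            partitions.get(p).add(item);
        }
        // Phase 2: dedup one partition at a time, so only a single
        // partition's distinct items are held in the set at any moment.
        List<T> output = new ArrayList<>();
        for (List<T> partition : partitions) {
            Set<T> seen = new LinkedHashSet<>(partition);
            output.addAll(seen);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "a", "c", "b");
        // Prints the three distinct items; their order depends on hashing.
        System.out.println(dedup(items, 3));
    }
}
```

With a well-distributed hash, each partition holds roughly 1/partitionCount of the input, which is the balance assumption the parameter description refers to.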