Performs a region join between two RDDs (shuffle join).
This implementation is shuffle-based, so it does not require collecting one side into memory as BroadcastRegionJoin does. In effect, it performs a global sort of each RDD by genome position and then does a sort-merge join, similar to the chromsweep implementation in bedtools. More specifically, it first defines a set of bins across the genome, then assigns each object in the RDDs to every bin that it overlaps (replicating the object if necessary), performs the shuffle, and sorts the objects in each bin. Finally, each bin independently performs a chromsweep sort-merge join.
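The binning and per-bin sweep described above can be illustrated with a small single-machine sketch in plain Python. This is not the library's code: the function names (`bins_for`, `sweep_join`, `shuffle_region_join`), the half-open `(contig, start, end)` tuple representation, and the rule for suppressing duplicate pairs from replicated regions are all assumptions made for illustration. In a real Spark job the dictionary grouping would be replaced by the shuffle.

```python
from collections import defaultdict

def bins_for(region, bin_size):
    """Return every (contig, bin index) that a half-open region overlaps."""
    contig, start, end = region
    return [(contig, i) for i in range(start // bin_size, (end - 1) // bin_size + 1)]

def sweep_join(left_sorted, right_sorted):
    """Sort-merge join of two start-sorted region lists from the same contig."""
    pairs, active, j = [], [], 0
    for x in left_sorted:
        # Pull in every right region that starts before x ends.
        while j < len(right_sorted) and right_sorted[j][1] < x[2]:
            active.append(right_sorted[j])
            j += 1
        # Right regions ending at or before x starts cannot match x or any
        # later left region, because left regions are sorted by start.
        active = [y for y in active if y[2] > x[1]]
        pairs.extend((x, y) for y in active if y[1] < x[2])
    return pairs

def shuffle_region_join(left, right, bin_size):
    """Assign regions to bins, sort each bin, and sweep each bin independently."""
    binned_left, binned_right = defaultdict(list), defaultdict(list)
    for region in left:
        for key in bins_for(region, bin_size):
            binned_left[key].append(region)   # replicated once per overlapped bin
    for region in right:
        for key in bins_for(region, bin_size):
            binned_right[key].append(region)
    results = []
    for key in sorted(binned_left):
        if key not in binned_right:
            continue
        ls = sorted(binned_left[key], key=lambda r: r[1])
        rs = sorted(binned_right[key], key=lambda r: r[1])
        for x, y in sweep_join(ls, rs):
            # Assumed de-duplication rule: emit a pair only in the bin that
            # contains the first position of the overlap.
            if max(x[1], y[1]) // bin_size == key[1]:
                results.append((x, y))
    return results
```

Because a region spanning several bins is copied into each of them, some rule like the one in the last loop is needed so that a pair sharing more than one bin is reported once; the source text does not say how the actual implementation handles this.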
type of leftRDD
type of rightRDD
A SparkContext for the cluster that will perform the join
The 'left' side of the join, a set of values which correspond (through an implicit ReferenceMapping) to regions on the genome.
The 'right' side of the join, a set of values which correspond (through an implicit ReferenceMapping) to regions on the genome.
A SequenceDictionary -- every region corresponding to either the leftRDD or rightRDD values must be mapped to a chromosome with an entry in this dictionary.
The size of the genome bin in nucleotides. Controls the parallelism of the join.
implicit reference mapping for leftRDD regions
implicit reference mapping for rightRDD regions
implicit type of leftRDD
implicit type of rightRDD
An RDD of pairs (x, y), where x is from leftRDD, y is from rightRDD, and the region corresponding to x overlaps the region corresponding to y.
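To see how the bin size parameter described above relates to parallelism, the following sketch counts the bins produced for a given set of sequence lengths. It assumes that bins never span two sequences and that each bin is an independent unit of work; `count_bins` and the dictionary of sequence lengths are hypothetical stand-ins, not part of the documented API.

```python
def count_bins(sequence_lengths, bin_size):
    """Total number of bins when each sequence is cut into bin_size pieces."""
    # Ceiling division: a partial bin at the end of a sequence still counts.
    return sum(-(-length // bin_size) for length in sequence_lengths.values())

# Hypothetical dictionary: sequence name -> length in nucleotides.
lengths = {"chr1": 250, "chr2": 100}
small_bins = count_bins(lengths, 50)    # more, smaller units of work
large_bins = count_bins(lengths, 100)   # fewer, larger units of work
```

Under these assumptions, a smaller bin size yields more bins and therefore more independent sweeps, at the cost of replicating more regions across bin boundaries.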