CombinedFilter is (really) a hack -- what we _need_ is a way to translate 'filters' on records
that form an RDD into three roughly equivalent forms:
1. an UnboundRecordFilter that can be used in interactions with a Parquet file directly,
2. a predicate (RecordType => Boolean) function that can be used as an argument to RDD.filter(), and
3. an IndexEntryPredicate, which selects only those row groups of a Parquet file which (according to
an index) _could_ contain records that satisfy either of the other two forms.
These three should be "equivalent," in the sense that (1) and (2) should accept exactly the same
records, and that (3) should accept every index entry that indexes a row group containing a record
satisfying (1) or (2).
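The requirement on form (3) runs in one direction only: the index predicate may accept row groups
that turn out to hold no matching record, but it must never reject a row group that holds one. A
minimal sketch of that check, assuming hypothetical stand-in types (`Rec`, `RowGroupEntry`) rather
than the real Parquet or index classes:

```scala
// Hypothetical stand-ins for illustration; these are not the real record or index entry types.
final case class Rec(position: Long)
final case class RowGroupEntry(minPos: Long, maxPos: Long)

object EquivalenceCheck {
  // Form (2): a plain predicate over records, here "position in [100, 200)".
  val recordPredicate: Rec => Boolean =
    r => r.position >= 100L && r.position < 200L

  // Form (3): accepts any row group whose position range could overlap [100, 200).
  val indexPredicate: RowGroupEntry => Boolean =
    e => e.maxPos >= 100L && e.minPos < 200L

  // True when no row group holding a matching record is rejected by the index predicate.
  // Row groups with no matching record may be accepted or rejected; both are allowed.
  def indexIsSafe(groups: Map[RowGroupEntry, Seq[Rec]]): Boolean =
    groups.forall { case (entry, recs) =>
      !recs.exists(recordPredicate) || indexPredicate(entry)
    }
}
```

A check like this can only be run against sample data; it does not prove that the predicates agree
in general.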
The "right" way to do this, in the long run, is to have some kind of filtering DSL, which can be
translated into any of these three forms (or any fourth form that we decide we need in the
future) as necessary.
In the meantime, however, we're simply providing an implementation as a FilterTuple below -- this
is just a triple of corresponding filters/predicates, and we leave it up to the user to ensure that
the semantics of the three arguments match the requirements outlined above.
RecordType
The type of the record in the Parquet file or RDD to be filtered.
IndexEntryType
The type of the entry in the index which can be filtered.
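The triple described above might look roughly like the following. This is a sketch under stated
assumptions, not the actual definition: `UnboundRecordFilterStub` and `IndexEntryPredicateStub` are
hypothetical placeholders for the Parquet filter and index predicate types so that the example
compiles without Parquet or Spark, and the `Sketch` suffixes mark names invented for illustration.

```scala
// Hypothetical placeholder for Parquet's UnboundRecordFilter (no Parquet dependency here).
trait UnboundRecordFilterStub

// Hypothetical placeholder for the index entry predicate type.
trait IndexEntryPredicateStub[E] {
  def accepts(entry: E): Boolean
}

// The three corresponding forms of one logical filter.
trait CombinedFilterSketch[RecordType, IndexEntryType] {
  def recordFilter: UnboundRecordFilterStub
  def predicate: RecordType => Boolean
  def indexPredicate: IndexEntryPredicateStub[IndexEntryType]
}

// A plain triple; nothing here verifies that the three arguments agree with each other.
final case class FilterTupleSketch[RecordType, IndexEntryType](
  recordFilter: UnboundRecordFilterStub,
  predicate: RecordType => Boolean,
  indexPredicate: IndexEntryPredicateStub[IndexEntryType]
) extends CombinedFilterSketch[RecordType, IndexEntryType]

object FilterTupleDemo {
  // Records are Ints; an index entry is an inclusive (min, max) range of the Ints in a row group.
  val evens: FilterTupleSketch[Int, (Int, Int)] = FilterTupleSketch(
    new UnboundRecordFilterStub {},
    (x: Int) => x % 2 == 0,
    new IndexEntryPredicateStub[(Int, Int)] {
      // A range of two or more consecutive Ints always spans an even value;
      // a single-value range does so only when that value is even.
      def accepts(entry: (Int, Int)): Boolean =
        entry._1 < entry._2 || entry._1 % 2 == 0
    }
  )
}
```

Because the case class only stores its arguments, keeping the three forms consistent remains the
caller's responsibility, as the description above notes.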