Materializes _all_ the row groups for a particular Parquet file, given as a root locator (the parent directory, for instance) and a relative path (the location of the file).
The particular form of this method and its arguments is driven by two concerns: 1. recursive listing of Parquet files in the local filesystem (see the materialize(fullPath : String) method above); 2. when reading files from S3, we often store the locations of our files in terms of a root path (bucket + root key) and relative offsets (which are the names read from the index itself).
A base locator; relativePath, resolved against it, must point to a valid Parquet file
A relative path to a Parquet file
An iterator over T values from the named Parquet file
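The root-plus-relative-path convention described above can be sketched as follows. This is only an illustration under stated assumptions: `Locator`, `LocalLocator`, `S3Locator` and `resolve` are hypothetical names invented for this sketch and are not taken from the actual class; they only show how a single relative path can be resolved against either a parent directory or a bucket and root key.

```scala
import java.nio.file.Paths

object LocatorSketch {
  // Hypothetical abstraction: a root against which relative paths are resolved.
  trait Locator {
    def resolve(relativePath: String): String
  }

  // Local case: the root is a parent directory.
  final case class LocalLocator(rootDir: String) extends Locator {
    def resolve(relativePath: String): String =
      Paths.get(rootDir, relativePath).toString
  }

  // S3 case: the root is a bucket plus a root key; the relative path is
  // the name read from the index.
  final case class S3Locator(bucket: String, rootKey: String) extends Locator {
    def resolve(relativePath: String): String = {
      val key =
        if (rootKey.isEmpty) relativePath
        else s"${rootKey.stripSuffix("/")}/$relativePath"
      s"s3://$bucket/$key"
    }
  }
}
```

Keeping the root separate from the relative path means the same index entries can be reused unchanged when a set of files moves from one directory or bucket to another.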
Given a full path on the local filesystem, checks whether the path is a file or a directory. If it is a file, it materializes the row groups from that file. If it is a directory, it lists the row groups of all the files within the directory.
The path on the local filesystem, corresponding either to a Parquet file or to a directory filled with Parquet files.
An iterator over the values requested from the file (or, if fullPath is a directory, from _all_ the files within it, in some arbitrary order)
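The file-or-directory dispatch described above can be sketched as below. This is a minimal illustration, not the class's actual code: `readFile` is a hypothetical stand-in for the real row-group reader (here each file yields only its own name, so the sketch runs without a Parquet dependency), and the recursion into subdirectories is an assumption based on the "recursive listing" mentioned in the companion method's documentation.

```scala
import java.io.File

object MaterializeSketch {
  // Hypothetical stand-in for reading the records of one Parquet file.
  def readFile(file: File): Iterator[String] = Iterator(file.getName)

  // A file is read directly; a directory lazily chains the results of
  // every entry inside it, recursing into subdirectories.
  def materialize(fullPath: String): Iterator[String] = {
    val f = new File(fullPath)
    if (f.isFile) {
      readFile(f)
    } else {
      // listFiles returns null when the path is not a readable directory.
      val children = Option(f.listFiles).getOrElse(Array.empty[File])
      children.iterator.flatMap(child => materialize(child.getPath))
    }
  }
}
```

Because `listFiles` makes no ordering guarantee, the records from different files arrive in an arbitrary order, which matches the caveat in the return-value description.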
ParquetLister materializes the records within a Parquet row group as an Iterator[T].
This is used for two purposes at the moment: 1. we've added a 'PrintParquet' command (which works only for ADAMFlatGenotype Parquet files, at the moment), whose purpose is to print out a number of entries from within a Parquet file for debugging. PrintParquet uses ParquetLister to get those entries and materialize them. 2. The index generators (for the Range and IDRange indices) need to read through a Parquet file's entries in order to index it, unless the index is built at the time the file is written. So they have to materialize the records in the file, and they use ParquetLister to do that.
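Both uses consume the same Iterator[T], which the following sketch tries to illustrate. Everything in it is hypothetical: `Rec`, `firstEntries` and `positionRange` are invented for this example, and the (min, max) computation is only a simplified stand-in for whatever the Range index generators actually compute.

```scala
object ListerUsesSketch {
  // Hypothetical record type standing in for T.
  final case class Rec(id: String, position: Long)

  // Debug-printing use: take only the first n records. The iterator is
  // lazy, so the remaining records are never read.
  def firstEntries[T](records: Iterator[T], n: Int): List[T] =
    records.take(n).toList

  // Indexing use: one full pass over the records, here computing the
  // (min, max) of `position`. Returns None for an empty iterator.
  def positionRange(records: Iterator[Rec]): Option[(Long, Long)] =
    records.foldLeft(Option.empty[(Long, Long)]) {
      case (None, r) => Some((r.position, r.position))
      case (Some((lo, hi)), r) =>
        Some((math.min(lo, r.position), math.max(hi, r.position)))
    }
}
```

The printing use benefits from laziness, since it stops after a few records, while the indexing use necessarily reads every record once.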
Materializing records from a Parquet file turns out to be a somewhat-complicated operation, with a few parameters floating around. The original version was written as part of the materialize methods in the two Parquet RDDs, but we factored it out into this class when we realized we needed it for the two additional purposes (listed above).
ParquetLister doesn't yet support UnboundRecordFilter (see the comment below) but it should be adapted to do so. When it does, we should replace the original materialize methods in the ParquetRDDs with use of this class instead, so that there's only one implementation of the materialize code floating around.
The type of the record to be read.