package read

Type Members

  1. trait Batch extends AnyRef

    A physical representation of a data source scan for batch queries. This interface is used to provide physical information, like how many partitions the scanned data has, and how to read records from the partitions.

    Since

    3.0.0

  2. trait InputPartition extends Serializable

    A serializable representation of an input partition returned by Batch#planInputPartitions() and the corresponding ones in streaming.

    Note that InputPartition will be serialized and sent to executors, then PartitionReader will be created by PartitionReaderFactory#createReader(InputPartition) or PartitionReaderFactory#createColumnarReader(InputPartition) on executors to do the actual reading. So InputPartition must be serializable while PartitionReader doesn't need to be.
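
    The split between a serializable partition descriptor and a non-serializable reader can be sketched as follows. RangePartition and PartitionRoundTripDemo are hypothetical names, and plain Java serialization stands in for whatever mechanism ships the partition to executors. The sketch avoids a Spark dependency, so RangePartition implements Serializable directly rather than InputPartition.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical partition descriptor. A real connector would implement
// org.apache.spark.sql.connector.read.InputPartition, which extends Serializable.
final class RangePartition implements Serializable {
    private static final long serialVersionUID = 1L;
    final int start; // inclusive
    final int end;   // exclusive

    RangePartition(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class PartitionRoundTripDemo {
    // Stand-in for the driver-to-executor hop, using plain Java serialization.
    static RangePartition roundTrip(RangePartition partition) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(partition);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (RangePartition) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("partition could not be shipped", e);
        }
    }

    // "Executor side": the reading state is derived from the deserialized
    // descriptor and never has to be serialized itself.
    static int sumRange(RangePartition partition) {
        int sum = 0;
        for (int i = partition.start; i < partition.end; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        RangePartition shipped = roundTrip(new RangePartition(0, 5));
        System.out.println(sumRange(shipped)); // 0 + 1 + 2 + 3 + 4
    }
}
```

    Keeping only plain, small fields in the descriptor means open files, sockets, or buffers can stay in the reader, which is created after deserialization.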

    Since

    3.0.0

  3. trait PartitionReader[T] extends Closeable

    A partition reader returned by PartitionReaderFactory#createReader(InputPartition) or PartitionReaderFactory#createColumnarReader(InputPartition). It's responsible for outputting data for an RDD partition.

    Note that currently the type T can only be org.apache.spark.sql.catalyst.InternalRow for normal data sources, or org.apache.spark.sql.vectorized.ColumnarBatch for columnar data sources (those whose PartitionReaderFactory#supportColumnarReads(InputPartition) returns true).
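
    A reader is consumed by calling next() until it returns false, reading the current record with get(), and closing the reader afterwards. The sketch below shows that loop. SimpleReader is a simplified stand-in for PartitionReader (it drops the checked IOException), and CountingReader and ReaderLoopDemo are hypothetical names. It yields Integer records instead of InternalRow so that it compiles without Spark.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for PartitionReader<T>: same method names, but without
// the checked IOException so the sketch stays short.
interface SimpleReader<T> extends Closeable {
    boolean next(); // advance to the next record; false when exhausted
    T get();        // current record, valid only after next() returned true
    @Override
    void close();   // narrowed: no checked exception in this sketch
}

// Hypothetical reader that emits the integers in [start, end).
final class CountingReader implements SimpleReader<Integer> {
    private final int end;
    private int current;
    boolean closed = false;

    CountingReader(int start, int end) {
        this.current = start - 1; // positioned before the first record
        this.end = end;
    }

    @Override
    public boolean next() {
        current++;
        return current < end;
    }

    @Override
    public Integer get() {
        return current;
    }

    @Override
    public void close() {
        closed = true;
    }
}

final class ReaderLoopDemo {
    // Drains a reader the way a caller is expected to: next(), get(), close().
    static <T> List<T> drain(SimpleReader<T> reader) {
        List<T> records = new ArrayList<>();
        try (SimpleReader<T> r = reader) {
            while (r.next()) {
                records.add(r.get());
            }
        }
        return records;
    }
}
```

    The try-with-resources block closes the reader even if next() or get() throws, which is why the reader extends Closeable.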

    Since

    3.0.0

  4. trait PartitionReaderFactory extends Serializable

    A factory used to create PartitionReader instances.

    If Spark fails to execute any method in an implementation of this interface or in the returned PartitionReader (because the method throws an exception), the corresponding Spark task fails and is retried until the maximum number of retries is reached.

    Since

    3.0.0

  5. trait Scan extends AnyRef

    A logical representation of a data source scan. This interface is used to provide logical information, like what the actual read schema is.

    This logical representation is shared among batch scans, micro-batch streaming scans, and continuous streaming scans. Data sources must implement the corresponding methods in this interface to match what the table promises to support. For example, Scan#toBatch() must be implemented if the Table that creates this Scan reports TableCapability#BATCH_READ in its Table#capabilities().
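
    The contract between advertised capabilities and implemented methods can be illustrated with a small check. Everything below is a stand-in: SketchCapability, SketchBatch, SketchScan, and ScanContractCheck are hypothetical names, and the default toBatch() is assumed to throw UnsupportedOperationException when a scan does not override it.

```java
import java.util.Set;

// Stand-ins for TableCapability, Batch and Scan; only the parts needed here.
enum SketchCapability { BATCH_READ, MICRO_BATCH_READ }

interface SketchBatch {
    int numPartitions();
}

interface SketchScan {
    // Assumed default: a read mode is unsupported unless the scan overrides it.
    default SketchBatch toBatch() {
        throw new UnsupportedOperationException("batch scan is not supported");
    }
}

final class ScanContractCheck {
    // Returns true when the scan honours the batch mode the table advertises.
    // Only BATCH_READ is checked in this sketch.
    static boolean honoursCapabilities(Set<SketchCapability> capabilities, SketchScan scan) {
        if (capabilities.contains(SketchCapability.BATCH_READ)) {
            try {
                scan.toBatch();
            } catch (UnsupportedOperationException e) {
                return false; // the table promised batch reads, the scan cannot deliver
            }
        }
        return true;
    }
}
```

    A check of this kind is only useful in a connector's own tests; it is not something the interface itself provides.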

    Since

    3.0.0

  6. trait ScanBuilder extends AnyRef

    An interface for building the Scan. Implementations can mix in SupportsPushDownXYZ interfaces to do operator pushdown, and keep the operator pushdown result in the returned Scan.

    Since

    3.0.0

  7. trait Statistics extends AnyRef

    An interface to represent statistics for a data source, which is returned by SupportsReportStatistics#estimateStatistics().

    Since

    3.0.0

  8. trait SupportsPushDownFilters extends ScanBuilder

    A mix-in interface for ScanBuilder. Data sources can implement this interface to push down filters to the data source and reduce the size of the data to be read.
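
    Filter pushdown splits the incoming filters into those the source can evaluate itself and those that must still be applied after the scan. The sketch below shows that split. SketchFilterPushdown is a hypothetical class, filters are plain strings rather than org.apache.spark.sql.sources.Filter objects, and the rule "only predicates on the id column are supported" is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder that mirrors the shape of pushFilters/pushedFilters.
// Filters are plain strings such as "id > 5" to keep the sketch Spark-free.
final class SketchFilterPushdown {
    private final List<String> pushed = new ArrayList<>();

    // Assumption: this toy source can only evaluate predicates on "id".
    private static boolean canEvaluate(String filter) {
        return filter.startsWith("id ");
    }

    // Keeps the filters the source can handle and returns the rest, which
    // still have to be evaluated after the scan.
    List<String> pushFilters(List<String> filters) {
        List<String> postScan = new ArrayList<>();
        for (String filter : filters) {
            if (canEvaluate(filter)) {
                pushed.add(filter);
            } else {
                postScan.add(filter);
            }
        }
        return postScan;
    }

    // The filters accepted so far, as a defensive copy.
    List<String> pushedFilters() {
        return new ArrayList<>(pushed);
    }
}
```

    Returning the unsupported filters rather than dropping them keeps query results correct even when the source can push down only part of the predicate.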

    Since

    3.0.0

  9. trait SupportsPushDownRequiredColumns extends ScanBuilder

    A mix-in interface for ScanBuilder. Data sources can implement this interface to push down required columns to the data source and only read these columns during scan to reduce the size of the data to be read.
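
    Column pruning can be sketched as narrowing the read schema to the requested columns and projecting each row accordingly. SketchColumnPruning is a hypothetical class, and column names in a list stand in for the StructType that the real pruneColumns method receives.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder. The schema is a list of column names here; the real
// interface passes the required schema as a StructType.
final class SketchColumnPruning {
    private final List<String> fullSchema;
    private List<String> readSchema;

    SketchColumnPruning(List<String> fullSchema) {
        this.fullSchema = new ArrayList<>(fullSchema);
        this.readSchema = new ArrayList<>(fullSchema);
    }

    // Keeps only the required columns, preserving the source's column order.
    void pruneColumns(List<String> requiredColumns) {
        List<String> kept = new ArrayList<>();
        for (String column : fullSchema) {
            if (requiredColumns.contains(column)) {
                kept.add(column);
            }
        }
        readSchema = kept;
    }

    List<String> readSchema() {
        return new ArrayList<>(readSchema);
    }

    // Projects one full-width row down to the pruned schema.
    List<Object> project(List<?> fullRow) {
        List<Object> projected = new ArrayList<>();
        for (String column : readSchema) {
            projected.add(fullRow.get(fullSchema.indexOf(column)));
        }
        return projected;
    }
}
```

    In a real source the saving comes from never reading the dropped columns at all, for example by skipping column chunks in a columnar file, rather than from projecting rows after they are loaded.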

    Since

    3.0.0

  10. trait SupportsReportPartitioning extends Scan

    A mix-in interface for Scan. Data sources can implement this interface to report data partitioning and try to avoid a shuffle on the Spark side.

    Note that when a Scan implementation creates exactly one InputPartition, Spark may avoid adding a shuffle even if the reader does not implement this interface.

    Since

    3.0.0

  11. trait SupportsReportStatistics extends Scan

    A mix-in interface for Scan. Data sources can implement this interface to report statistics to Spark.

    As of Spark 3.0, statistics are reported to the optimizer after operators are pushed to the data source. Implementations may return more accurate statistics based on the pushed operators, which may improve query performance by giving the optimizer better information.
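
    Because statistics are requested after pushdown, an estimate can reflect what the scan will actually read. The sketch below assumes a source with fixed-width columns and a known row count. SketchStatistics and SketchStatsScan are hypothetical names; SketchStatistics mirrors the two optional values (size in bytes and row count) exposed by Statistics.

```java
import java.util.OptionalLong;

// Stand-in for the Statistics interface: two optional values.
final class SketchStatistics {
    private final OptionalLong sizeInBytes;
    private final OptionalLong numRows;

    SketchStatistics(OptionalLong sizeInBytes, OptionalLong numRows) {
        this.sizeInBytes = sizeInBytes;
        this.numRows = numRows;
    }

    OptionalLong sizeInBytes() {
        return sizeInBytes;
    }

    OptionalLong numRows() {
        return numRows;
    }
}

// Hypothetical scan over fixed-width columns with a known row count.
final class SketchStatsScan {
    private final long rowCount;
    private final int bytesPerColumn;
    private final int columnsRead; // columns left after pruning

    SketchStatsScan(long rowCount, int bytesPerColumn, int columnsRead) {
        this.rowCount = rowCount;
        this.bytesPerColumn = bytesPerColumn;
        this.columnsRead = columnsRead;
    }

    // The size estimate shrinks with the number of columns actually read.
    SketchStatistics estimateStatistics() {
        long size = rowCount * bytesPerColumn * columnsRead;
        return new SketchStatistics(OptionalLong.of(size), OptionalLong.of(rowCount));
    }
}
```

    With 1000 rows of 8-byte columns, reading 2 of 4 columns halves the reported size from 32000 to 16000 bytes. A source that cannot estimate a value could return OptionalLong.empty() for it instead.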

    Since

    3.0.0
