package write
Type Members
-
trait
BatchWrite extends AnyRef
An interface that defines how to write the data to data source for batch processing.
An interface that defines how to write the data to data source for batch processing.
The writing procedure is:
- Create a writer factory by
#createBatchWriterFactory(PhysicalWriteInfo), serialize and send it to all the partitions of the input data(RDD). 2. For each partition, create the data writer, and write the data of the partition with this writer. If all the data are written successfully, callDataWriter#commit(). If exception happens during the writing, callDataWriter#abort(). 3. If all writers are successfully committed, call#commit(WriterCommitMessage[]). If some writers are aborted, or the job failed with an unknown reason, call#abort(WriterCommitMessage[]).
While Spark will retry failed writing tasks, Spark won't retry failed writing jobs. Users should do it manually in their Spark applications if they want to retry.
Please refer to the documentation of commit/abort methods for detailed specifications.
- Since
3.0.0
- Create a writer factory by
-
trait
DataWriter[T] extends Closeable
A data writer returned by
long)and is responsible for writing data for an input RDD partition.A data writer returned by
long)and is responsible for writing data for an input RDD partition.One Spark task has one exclusive data writer, so there is no thread-safe concern.
#write(Object)is called for each record in the input RDD partition. If one record fails the#write(Object),#abort()is called afterwards and the remaining records will not be processed. If all records are successfully written,#commit()is called.Once a data writer returns successfully from
#commit()or#abort(), Spark will call#close()to let DataWriter doing resource cleanup. After calling#close(), its lifecycle is over and Spark will not use it again.If this data writer succeeds(all records are successfully written and
#commit()succeeds), aWriterCommitMessagewill be sent to the driver side and pass toBatchWrite#commit(WriterCommitMessage[])with commit messages from other data writers. If this data writer fails(one record fails to write or#commit()fails), an exception will be sent to the driver side, and Spark may retry this writing task a few times. In each retry,long)will receive a differenttaskId. Spark will callBatchWrite#abort(WriterCommitMessage[])when the configured number of retries is exhausted.Besides the retry mechanism, Spark may launch speculative tasks if the existing writing task takes too long to finish. Different from retried tasks, which are launched one by one after the previous one fails, speculative tasks are running simultaneously. It's possible that one input RDD partition has multiple data writers with different
taskIdrunning at the same time, and data sources should guarantee that these data writers don't conflict and can work together. Implementations can coordinate with driver during#commit()to make sure only one of these data writers can commit successfully. Or implementations can allow all of them to commit successfully, and have a way to revert committed data writers without the commit message, because Spark only accepts the commit message that arrives first and ignore others.Note that, Currently the type
Tcan only beorg.apache.spark.sql.catalyst.InternalRow.- Since
3.0.0
-
trait
DataWriterFactory extends Serializable
A factory of
DataWriterreturned byBatchWrite#createBatchWriterFactory(PhysicalWriteInfo), which is responsible for creating and initializing the actual data writer at executor side.A factory of
DataWriterreturned byBatchWrite#createBatchWriterFactory(PhysicalWriteInfo), which is responsible for creating and initializing the actual data writer at executor side.Note that, the writer factory will be serialized and sent to executors, then the data writer will be created on executors and do the actual writing. So this interface must be serializable and
DataWriterdoesn't need to be.- Since
3.0.0
-
trait
LogicalWriteInfo extends AnyRef
This interface contains logical write information that data sources can use when generating a
WriteBuilder.This interface contains logical write information that data sources can use when generating a
WriteBuilder.- Since
3.0.0
-
trait
PhysicalWriteInfo extends AnyRef
This interface contains physical write information that data sources can use when generating a
DataWriterFactoryor aStreamingDataWriterFactory.This interface contains physical write information that data sources can use when generating a
DataWriterFactoryor aStreamingDataWriterFactory.- Since
3.0.0
-
trait
SupportsDynamicOverwrite extends WriteBuilder
Write builder trait for tables that support dynamic partition overwrite.
Write builder trait for tables that support dynamic partition overwrite.
A write that dynamically overwrites partitions removes all existing data in each logical partition for which the write will commit new data. Any existing logical partition for which the write does not contain data will remain unchanged.
This is provided to implement SQL compatible with Hive table operations but is not recommended. Instead, use the
overwrite by filter APIto explicitly replace data.- Since
3.0.0
-
trait
SupportsOverwrite extends WriteBuilder with SupportsTruncate
Write builder trait for tables that support overwrite by filter.
Write builder trait for tables that support overwrite by filter.
Overwriting data by filter will delete any data that matches the filter and replace it with data that is committed in the write.
- Since
3.0.0
-
trait
SupportsTruncate extends WriteBuilder
Write builder trait for tables that support truncation.
Write builder trait for tables that support truncation.
Truncation removes all data in a table and replaces it with data that is committed in the write.
- Since
3.0.0
-
trait
WriteBuilder extends AnyRef
An interface for building the
BatchWrite.An interface for building the
BatchWrite. Implementations can mix in some interfaces to support different ways to write data to data sources.Unless modified by a mixin interface, the
BatchWriteconfigured by this builder is to append data without affecting existing data.- Since
3.0.0
-
trait
WriterCommitMessage extends Serializable
A commit message returned by
DataWriter#commit()and will be sent back to the driver side as the input parameter ofBatchWrite#commit(WriterCommitMessage[])orWriterCommitMessage[]).A commit message returned by
DataWriter#commit()and will be sent back to the driver side as the input parameter ofBatchWrite#commit(WriterCommitMessage[])orWriterCommitMessage[]).This is an empty interface, data sources should define their own message class and use it when generating messages at executor side and handling the messages at driver side.
- Since
3.0.0