package physical
Type Members
-
case class
BroadcastDistribution(mode: BroadcastMode) extends Distribution with Product with Serializable
Represents data where tuples are broadcasted to every node. It is quite common that the entire set of tuples is transformed into a different data structure.
-
trait
BroadcastMode extends AnyRef
Marker trait to identify the shape in which tuples are broadcasted. Typical examples are identity (tuples remain unchanged) and hashed (tuples are converted into some hash index).
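The difference between the identity and hashed shapes can be illustrated with a minimal sketch in plain Scala. The names `Mode`, `Identity`, `HashedByKey`, and `Row` are hypothetical stand-ins for illustration, not Spark classes, and the hashed shape is modeled here simply as a lookup map keyed by the first field.

```scala
object BroadcastModeSketch {
  // Hypothetical row type for illustration: (key, payload).
  type Row = (Int, String)

  // A mode decides the shape R that the collected rows take before broadcast.
  sealed trait Mode[R] {
    def transform(rows: Seq[Row]): R
  }

  // Identity: rows are handed over unchanged.
  case object Identity extends Mode[Seq[Row]] {
    def transform(rows: Seq[Row]): Seq[Row] = rows
  }

  // Hashed: rows are converted into an index from key to matching rows.
  case object HashedByKey extends Mode[Map[Int, Seq[Row]]] {
    def transform(rows: Seq[Row]): Map[Int, Seq[Row]] = rows.groupBy(_._1)
  }
}
```

A consumer of the hashed shape can look up all rows for a key directly, whereas a consumer of the identity shape has to scan the rows.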
-
case class
BroadcastPartitioning(mode: BroadcastMode) extends Partitioning with Product with Serializable
Represents a partitioning where rows are collected, transformed and broadcasted to each node in the cluster.
-
case class
ClusteredDistribution(clustering: Seq[Expression], requiredNumPartitions: Option[Int] = None) extends Distribution with Product with Serializable
Represents data where tuples that share the same values for the clustering Expressions will be co-located in the same partition.
-
sealed
trait
Distribution extends AnyRef
Specifies how tuples that share common expressions will be distributed when a query is executed in parallel on many machines.
Distribution here refers to inter-node partitioning of data. That is, it describes how tuples are partitioned across physical machines in a cluster. Knowing this property allows some operators (e.g., Aggregate) to perform partition local operations instead of global ones.
-
case class
HashClusteredDistribution(expressions: Seq[Expression], requiredNumPartitions: Option[Int] = None) extends Distribution with Product with Serializable
Represents data where tuples have been clustered according to the hash of the given expressions. The hash function is defined as HashPartitioning.partitionIdExpression, so only HashPartitioning can satisfy this distribution.
This is a strictly stronger guarantee than ClusteredDistribution. Given a tuple and the number of partitions, this distribution strictly requires which partition the tuple should be in.
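To make "given a tuple and the number of partitions, the partition is fixed" concrete, here is a minimal sketch in plain Scala. It assumes the partition id is a non-negative modulo of a hash of the key values. `partitionId` and `pmod` are hypothetical helpers, and `MurmurHash3` from the Scala standard library is only a stand-in, so the ids it produces will not match those Spark computes.

```scala
import scala.util.hashing.MurmurHash3

object HashPartitionSketch {
  // Non-negative modulo: unlike %, the result is always in [0, n) for n > 0.
  def pmod(a: Int, n: Int): Int = {
    val r = a % n
    if (r < 0) r + n else r
  }

  // Hypothetical helper: hash the key values, then map the hash to a
  // partition id. The hash function is a stand-in, not Spark's own.
  def partitionId(keyValues: Seq[Any], numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    pmod(MurmurHash3.seqHash(keyValues), numPartitions)
  }
}
```

Because the id depends only on the key values and `numPartitions`, two datasets partitioned this way with the same partition count place equal keys in the same partition index, which is the stronger property this distribution asks for.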
-
case class
HashPartitioning(expressions: Seq[Expression], numPartitions: Int) extends Expression with Partitioning with Unevaluable with Product with Serializable
Represents a partitioning where rows are split up across partitions based on the hash of expressions. All rows where expressions evaluate to the same values are guaranteed to be in the same partition.
-
case class
OrderedDistribution(ordering: Seq[SortOrder]) extends Distribution with Product with Serializable
Represents data where tuples have been ordered according to the ordering Expressions. Its requirement is defined as follows:
- Given any 2 adjacent partitions, all the rows of the second partition must be larger than or equal to any row in the first partition, according to the ordering expressions.
In other words, this distribution requires the rows to be ordered across partitions, but not necessarily within a partition.
-
trait
Partitioning extends AnyRef
Describes how an operator's output is split across partitions. It has 2 major properties:
1. number of partitions.
2. if it can satisfy a given distribution.
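The two properties can be sketched as a small model in plain Scala. Everything here (`Dist`, `Part`, `HashPart`, and the matching rules) is a simplified, hypothetical stand-in rather than Spark's implementation. In particular, the rule that a hash partitioning satisfies a clustered requirement when its keys are a subset of the required keys is an assumption of this sketch.

```scala
object PartitioningSketch {
  // Simplified stand-ins for required distributions.
  sealed trait Dist
  case object Unspecified extends Dist          // no co-location promise needed
  case object Single extends Dist               // all tuples in one partition
  final case class Clustered(keys: Set[String]) extends Dist

  trait Part {
    // Property 1: the number of partitions.
    def numPartitions: Int

    // Property 2: whether this partitioning satisfies a required distribution.
    def satisfies(required: Dist): Boolean = required match {
      case Unspecified => true
      case Single      => numPartitions == 1
      case _           => false
    }
  }

  final case class HashPart(keys: Set[String], numPartitions: Int) extends Part {
    override def satisfies(required: Dist): Boolean = required match {
      // Rows equal on all wanted keys are also equal on any subset of them,
      // so hashing on a subset still co-locates them.
      case Clustered(wanted) => keys.nonEmpty && keys.subsetOf(wanted)
      case other             => super.satisfies(other)
    }
  }
}
```

A planner-like caller would check `satisfies` on a child's output partitioning and only add a shuffle when it returns false.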
-
case class
PartitioningCollection(partitionings: Seq[Partitioning]) extends Expression with Partitioning with Unevaluable with Product with Serializable
A collection of Partitionings that can be used to describe the partitioning scheme of the output of a physical operator. It is usually used for an operator that has multiple children. In this case, a Partitioning in this collection describes how this operator's output is partitioned based on expressions from a child. For example, for a Join operator on two tables A and B with a join condition A.key1 = B.key2, assuming we use the HashPartitioning scheme, there are two Partitionings that can be used to describe how the output of this Join operator is partitioned: HashPartitioning(A.key1) and HashPartitioning(B.key2). It is also worth noting that the partitionings in this collection do not need to be equivalent, which is useful for Outer Join operators.
-
case class
RangePartitioning(ordering: Seq[SortOrder], numPartitions: Int) extends Expression with Partitioning with Unevaluable with Product with Serializable
Represents a partitioning where rows are split across partitions based on some total ordering of the expressions specified in ordering. When data is partitioned in this manner, it guarantees: Given any 2 adjacent partitions, all the rows of the second partition must be larger than any row in the first partition, according to the ordering expressions.
This is a strictly stronger guarantee than what OrderedDistribution(ordering) requires, as there is no overlap between partitions.
This class extends expression primarily so that transformations over expression will descend into its child.
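The difference between "larger than or equal to" (OrderedDistribution) and "larger than" (the guarantee described here) can be checked with a small sketch in plain Scala. It models partitions as sequences of integers in ascending partition order. `orderedAcross` and `strictlyRanged` are hypothetical helper names, and skipping empty partitions is a simplification of this sketch.

```scala
object OrderingSketch {
  // Each inner Seq is one partition. Compare the largest row of each
  // non-empty partition with the smallest row of the next non-empty one.
  private def adjacentOk(parts: Seq[Seq[Int]])(ok: (Int, Int) => Boolean): Boolean = {
    val nonEmpty = parts.filter(_.nonEmpty)
    nonEmpty.zip(nonEmpty.drop(1)).forall {
      case (first, second) => ok(first.max, second.min)
    }
  }

  // OrderedDistribution-style: rows of the second partition are >= rows of the first.
  def orderedAcross(parts: Seq[Seq[Int]]): Boolean = adjacentOk(parts)(_ <= _)

  // Range-style: rows of the second partition are strictly larger, so no overlap.
  def strictlyRanged(parts: Seq[Seq[Int]]): Boolean = adjacentOk(parts)(_ < _)
}
```

For example, partitions `Seq(3, 1, 2)` and `Seq(3, 5)` pass `orderedAcross` (the value 3 appears on both sides of the boundary, and rows inside a partition are unsorted) but fail `strictlyRanged`.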
-
case class
RoundRobinPartitioning(numPartitions: Int) extends Partitioning with Product with Serializable
Represents a partitioning where rows are distributed evenly across output partitions by starting from a random target partition number and distributing rows in a round-robin fashion. This partitioning is used when implementing the DataFrame.repartition() operator.
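The round-robin scheme described above can be sketched in plain Scala as follows. `assign` is a hypothetical helper that works on a single in-memory sequence. It illustrates only the idea of a random starting partition followed by cyclic assignment, not how Spark shuffles data.

```scala
import scala.util.Random

object RoundRobinSketch {
  // Pair each row with a target partition id, cycling through the
  // partitions starting from a randomly chosen one.
  def assign[T](rows: Seq[T], numPartitions: Int, rng: Random = new Random()): Seq[(Int, T)] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val start = rng.nextInt(numPartitions)
    rows.zipWithIndex.map { case (row, i) =>
      ((start + i) % numPartitions, row)
    }
  }
}
```

Because consecutive rows go to consecutive partitions, the partition sizes differ by at most one row regardless of the starting point.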
- case class UnknownPartitioning(numPartitions: Int) extends Partitioning with Product with Serializable
Value Members
-
object
AllTuples extends Distribution with Product with Serializable
Represents a distribution that only has a single partition and all tuples of the dataset are co-located.
-
object
IdentityBroadcastMode extends BroadcastMode with Product with Serializable
IdentityBroadcastMode requires that rows are broadcasted in their original form.
- object SinglePartition extends Partitioning with Product with Serializable
-
object
UnspecifiedDistribution extends Distribution with Product with Serializable
Represents a distribution where no promises are made about co-location of data.