package optimizer
Type Members
-
case class
Cost(card: BigInt, size: BigInt) extends Product with Serializable
This class defines the cost model for a plan.
- card
Cardinality (number of rows).
- size
Size in bytes.
-
case class
GetCurrentDatabase(catalogManager: CatalogManager) extends Rule[LogicalPlan] with Product with Serializable
Replaces the expression of CurrentDatabase with the current database name.
-
case class
JoinGraphInfo(starJoins: Set[Int], nonStarJoins: Set[Int]) extends Product with Serializable
Helper class that keeps information about the join graph as sets of item/plan ids. It currently stores the star/non-star plans. It can be extended with the set of connected/unconnected plans.
- case class NormalizeNaNAndZero(child: Expression) extends UnaryExpression with ExpectsInputTypes with Product with Serializable
-
abstract
class
Optimizer extends RuleExecutor[LogicalPlan]
Abstract class that all optimizers should inherit from; contains the standard batches (extending Optimizers can override this).
-
case class
OrderedJoin(left: LogicalPlan, right: LogicalPlan, joinType: JoinType, condition: Option[Expression]) extends BinaryNode with Product with Serializable
This is a mimic class for a join node that has been ordered.
- class SimpleTestOptimizer extends Optimizer
Value Members
-
object
BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper
Simplifies boolean expressions:
1. Simplifies expressions whose answer can be determined without evaluating both sides.
2. Eliminates / extracts common factors.
3. Merges identical expressions.
4. Removes the Not operator.
-
object
CheckCartesianProducts extends Rule[LogicalPlan] with PredicateHelper
Checks if there are any cartesian products between joins of any type in the optimized plan tree. Throws an error if a cartesian product is found without an explicit cross join specified. This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true.
This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be collected before checking if it is a cartesian product. If you have SELECT * from R, S where R.r = S.s, the join between R and S is not a cartesian product and therefore should be allowed. The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule.
This rule must be run AFTER the batch "LocalRelation", since a join with empty relation should not be a cartesian product.
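The check described above reduces to: a join is treated as a cartesian product iff none of its conditions reference attributes from both sides. A minimal sketch in plain Scala (the names and shapes are illustrative, not Catalyst's API):

```scala
// Hypothetical sketch: a join counts as a cartesian product when no
// condition references attributes from both the left and right side.
object CartesianCheck {
  def isCartesian(leftAttrs: Set[String],
                  rightAttrs: Set[String],
                  conditionRefs: Seq[Set[String]]): Boolean =
    !conditionRefs.exists(refs =>
      refs.exists(leftAttrs.contains) && refs.exists(rightAttrs.contains))
}
```

For SELECT * FROM R, S WHERE R.r = S.s, the predicate references both sides, so the join is not flagged; this is why the rule must run after ReorderJoin has collected the conditions.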
-
object
CollapseProject extends Rule[LogicalPlan]
Combines two Project operators into one and performs alias substitution, merging the expressions into a single expression for the following cases:
1. When two Project operators are adjacent.
2. When two Project operators have a LocalLimit/Sample/Repartition operator between them and the upper project consists of the same number of columns, which is equal or aliasing. The GlobalLimit(LocalLimit) pattern is also considered.
-
object
CollapseRepartition extends Rule[LogicalPlan]
Combines adjacent RepartitionOperation operators.
-
object
CollapseWindow extends Rule[LogicalPlan]
Collapses adjacent Window expressions: if the partition specs and order specs are the same, and the window expressions are independent and of the same window function type, collapse into the parent.
-
object
ColumnPruning extends Rule[LogicalPlan]
Attempts to eliminate the reading of unneeded columns from the query plan.
Since adding Project before Filter conflicts with PushPredicatesThroughProject, this rule will remove the Project p2 in the following pattern:
p1 @ Project(_, Filter(_, p2 @ Project(_, child))) if p2.outputSet.subsetOf(p2.inputSet)
p2 is usually inserted by this rule and useless, p1 could prune the columns anyway.
-
object
CombineConcats extends Rule[LogicalPlan]
Combine nested Concat expressions.
-
object
CombineFilters extends Rule[LogicalPlan] with PredicateHelper
Combines two adjacent Filter operators into one, merging the non-redundant conditions into one conjunctive predicate.
-
object
CombineLimits extends Rule[LogicalPlan]
Combines two adjacent Limit operators into one, merging the expressions into one single expression.
-
object
CombineTypedFilters extends Rule[LogicalPlan]
Combines two adjacent TypedFilters that operate on the same type of object in their conditions into one, merging the filter functions into one conjunctive function.
-
object
CombineUnions extends Rule[LogicalPlan]
Combines all adjacent Union operators into a single Union.
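The flattening can be sketched on a toy plan tree (these are illustrative classes, not Catalyst's):

```scala
// Toy plan tree showing how adjacent Union nodes are spliced into a
// single n-ary Union, bottom-up.
object CombineUnionsSketch {
  sealed trait Plan
  case class Leaf(name: String) extends Plan
  case class Union(children: Seq[Plan]) extends Plan

  def combineUnions(p: Plan): Plan = p match {
    case Union(children) =>
      Union(children.map(combineUnions).flatMap {
        case Union(grandchildren) => grandchildren // splice nested unions
        case other                => Seq(other)
      })
    case other => other
  }
}
```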
-
object
ComputeCurrentTime extends Rule[LogicalPlan]
Computes the current date and time to make sure we return the same result in a single query.
-
object
ConstantFolding extends Rule[LogicalPlan]
Replaces Expressions that can be statically evaluated with equivalent Literal values.
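The idea can be sketched on a minimal expression tree (illustrative classes, not Catalyst's): fold children first, then collapse any subtree whose operands are all literals.

```scala
// Illustrative constant folding: statically evaluable subtrees
// collapse into literals bottom-up.
object ConstantFoldingSketch {
  sealed trait Expr
  case class Lit(v: Int) extends Expr
  case class Attr(name: String) extends Expr
  case class Add(l: Expr, r: Expr) extends Expr

  def fold(e: Expr): Expr = e match {
    case Add(l, r) =>
      (fold(l), fold(r)) match {
        case (Lit(a), Lit(b)) => Lit(a + b) // both sides known: evaluate now
        case (fl, fr)         => Add(fl, fr)
      }
    case other => other
  }
}
```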
-
object
ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper
Substitutes Attributes which can be statically evaluated with their corresponding value in conjunctive Expressions, e.g.:
SELECT * FROM table WHERE i = 5 AND j = i + 3 ==> SELECT * FROM table WHERE i = 5 AND j = 8
Approach used:
- Populate a mapping of attribute => constant value by looking at all the equals predicates.
- Using this mapping, replace occurrences of the attributes with the corresponding constant values in the AND node.
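The two-phase approach can be sketched over toy predicate shapes (EqConst: attr = const, EqPlus: attr = ref + const; these are hypothetical classes, not Catalyst's):

```scala
// Sketch of constant propagation over a conjunction of toy predicates.
object ConstantPropagationSketch {
  sealed trait Pred
  case class EqConst(attr: String, v: Int) extends Pred             // e.g. i = 5
  case class EqPlus(attr: String, ref: String, k: Int) extends Pred // e.g. j = i + 3

  def propagate(conjuncts: Seq[Pred]): Seq[Pred] = {
    // Phase 1: map attribute -> constant from the equals predicates.
    val consts = conjuncts.collect { case EqConst(a, v) => a -> v }.toMap
    // Phase 2: substitute the constants into the remaining conjuncts.
    conjuncts.map {
      case EqPlus(a, ref, k) if consts.contains(ref) => EqConst(a, consts(ref) + k)
      case p => p
    }
  }
}
```

This mirrors the example above: i = 5 AND j = i + 3 becomes i = 5 AND j = 8.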
-
object
ConvertToLocalRelation extends Rule[LogicalPlan]
Converts local operations (i.e. ones that don't require data exchange) on a LocalRelation to another LocalRelation.
-
object
CostBasedJoinReorder extends Rule[LogicalPlan] with PredicateHelper
Cost-based join reorder. We may have several join reorder algorithms in the future. This class is the entry of these algorithms, and chooses which one to use.
-
object
DecimalAggregates extends Rule[LogicalPlan]
Speeds up aggregates on fixed-precision decimals by executing them on unscaled Long values.
This uses the same rules for increasing the precision and scale of the output as org.apache.spark.sql.catalyst.analysis.DecimalPrecision.
-
object
EliminateDistinct extends Rule[LogicalPlan]
Remove useless DISTINCT for MAX and MIN. This rule should be applied before RewriteDistinctAggregates.
-
object
EliminateMapObjects extends Rule[LogicalPlan]
Removes MapObjects when the following conditions are satisfied:
1. MapObjects(... LambdaVariable(..., false) ...), which means the types of the input and output are primitive and non-nullable.
2. No custom collection class is specified as the representation of the data item.
-
object
EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper
Elimination of outer joins, if the predicates can restrict the result sets so that all null-supplying rows are eliminated:
- full outer -> inner if both sides have such predicates
- left outer -> inner if the right side has such predicates
- right outer -> inner if the left side has such predicates
- full outer -> left outer if only the left side has such predicates
- full outer -> right outer if only the right side has such predicates
This rule should be executed before pushing down the Filter
-
object
EliminateResolvedHint extends Rule[LogicalPlan]
Removes ResolvedHint operators from the plan, moving the HintInfo to the associated Join operators, or discarding it if no Join operator is matched.
-
object
EliminateSerialization extends Rule[LogicalPlan]
Removes cases where we are unnecessarily going between the object and serialized (InternalRow) representations of a data item, for example back-to-back map operations.
-
object
EliminateSorts extends Rule[LogicalPlan]
Removes Sort operations. This can happen:
1) if the sort order is empty or the sort order does not have any reference
2) if the child is already sorted
3) if there is another Sort operator separated by 0...n Project/Filter operators
4) if the Sort operator is within a Join separated by 0...n Project/Filter operators only, and the Join condition is deterministic
5) if the Sort operator is within a GroupBy separated by 0...n Project/Filter operators only, and the aggregate function is order irrelevant
-
object
ExtractPythonUDFFromJoinCondition extends Rule[LogicalPlan] with PredicateHelper
A PythonUDF in a join condition can't be evaluated if it refers to attributes from both join sides. See ExtractPythonUDFs for details. This rule detects such un-evaluable PythonUDFs and pulls them out of the join condition.
-
object
FoldablePropagation extends Rule[LogicalPlan]
Replace attributes with aliases of the original foldable expressions if possible. Other optimizations will take advantage of the propagated foldable expressions. For example, this rule can optimize
SELECT 1.0 x, 'abc' y, Now() z ORDER BY x, y, 3
to
SELECT 1.0 x, 'abc' y, Now() z ORDER BY 1.0, 'abc', Now()
and other rules can further optimize it and remove the ORDER BY operator.
-
object
InferFiltersFromConstraints extends Rule[LogicalPlan] with PredicateHelper with ConstraintHelper
Generates a list of additional filters from an operator's existing constraints, removing those that are either already part of the operator's condition or part of the operator's child constraints. These filters are currently inserted into the existing conditions of the Filter operators and on either side of Join operators.
Note: While this optimization is applicable to a lot of types of join, it primarily benefits Inner and LeftSemi joins.
-
object
JoinReorderDP extends PredicateHelper with Logging
Reorder the joins using a dynamic programming algorithm. This implementation is based on the paper: Access Path Selection in a Relational Database Management System. http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf
First we put all items (basic joined nodes) into level 0, then we build all two-way joins at level 1 from plans at level 0 (single items), then build all 3-way joins from plans at previous levels (two-way joins and single items), then 4-way joins ... etc, until we build all n-way joins and pick the best plan among them.
When building m-way joins, we only keep the best plan (with the lowest cost) for the same set of m items. E.g., for 3-way joins, we keep only the best plan for items {A, B, C} among the plans (A J B) J C, (A J C) J B and (B J C) J A. We also prune cartesian product candidates when building a new plan if there exists no join condition involving references from both left and right. This pruning strategy significantly reduces the search space. E.g., given A J B J C J D with join conditions A.k1 = B.k1, B.k2 = C.k2 and C.k3 = D.k3, the plans maintained at each level are:
level 0: p({A}), p({B}), p({C}), p({D})
level 1: p({A, B}), p({B, C}), p({C, D})
level 2: p({A, B, C}), p({B, C, D})
level 3: p({A, B, C, D})
where p({A, B, C, D}) is the final output plan.
For cost evaluation, since physical costs for operators are not available currently, we use cardinalities and sizes to compute costs.
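The level-by-level dynamic program above can be sketched in plain Scala. This is a rough left-deep variant under stated assumptions: `card` gives per-relation cardinalities, `connected(s, i)` says whether some join condition links relation `i` to the already-joined set `s` (the cartesian-product pruning), and the cost is a toy stand-in (sum of intermediate cardinalities, each estimated as the product of its inputs' cardinalities) rather than Catalyst's cardinality/size cost model.

```scala
// Left-deep DP over join sets: keep only the cheapest plan per item set.
object JoinReorderSketch {
  def bestCost(card: Map[String, Long],
               connected: (Set[String], String) => Boolean): Long = {
    val items = card.keySet
    // best(s) = lowest cost found so far for joining exactly the set s
    val best = scala.collection.mutable.Map[Set[String], Long](
      items.toSeq.map(i => (Set(i), 0L)): _*)
    for (level <- 2 to items.size;
         (s, c) <- best.toSeq if s.size == level - 1;
         i <- items -- s if connected(s, i)) { // prune cartesian products
      val merged = s + i
      val cost = c + merged.toSeq.map(card).product // toy intermediate size
      if (best.get(merged).forall(cost < _)) best(merged) = cost
    }
    best(items) // assumes the join graph is connected
  }
}
```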
-
object
JoinReorderDPFilters extends PredicateHelper
Implements optional filters to reduce the search space for join enumeration.
1) Star-join filters: Plan star-joins together, since they are assumed to have an optimal execution based on their RI relationship.
2) Cartesian products: Defer their planning later in the graph to avoid large intermediate results (expanding joins, in general).
3) Composite inners: Don't generate "bushy tree" plans to avoid materializing intermediate results.
Filters (2) and (3) are not implemented.
-
object
LikeSimplification extends Rule[LogicalPlan]
Simplifies LIKE expressions that do not need full regular expressions to evaluate the condition. For example, when the expression is just checking to see if a string starts with a given pattern.
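The idea can be sketched in plain Scala. This is a hedged sketch, not Spark's implementation: recognize patterns with a single leading and/or trailing '%' and answer them with cheap string operations, falling back to a regex otherwise.

```scala
// Toy LIKE compiler: 'abc%' => startsWith, '%abc' => endsWith,
// '%abc%' => contains, anything else => translated regex.
object LikeSketch {
  private val wildcards = Set('%', '_')
  private def plain(s: String) = !s.exists(wildcards.contains)

  def compile(pattern: String): String => Boolean = pattern match {
    case p if p.endsWith("%") && plain(p.dropRight(1)) =>
      s => s.startsWith(p.dropRight(1))
    case p if p.startsWith("%") && plain(p.drop(1)) =>
      s => s.endsWith(p.drop(1))
    case p if p.length >= 2 && p.startsWith("%") && p.endsWith("%") &&
              plain(p.drop(1).dropRight(1)) =>
      s => s.contains(p.drop(1).dropRight(1))
    case p => // general case: translate to a regular expression
      val regex = p.flatMap {
        case '%' => ".*"
        case '_' => "."
        case c   => java.util.regex.Pattern.quote(c.toString)
      }
      s => s.matches(regex)
  }
}
```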
-
object
LimitPushDown extends Rule[LogicalPlan]
Pushes down LocalLimit beneath UNION ALL and beneath the streamed inputs of outer joins.
-
object
NestedColumnAliasing
This aims to handle a nested column aliasing pattern inside the ColumnPruning optimizer rule. If a project or its child references nested fields, and not all the fields in a nested attribute are used, we can substitute them with alias attributes; then a project of the nested fields as aliases on the children of the child will be created.
-
object
NormalizeFloatingNumbers extends Rule[LogicalPlan]
We need to take care of special floating-point numbers (NaN and -0.0) in several places:
1. When comparing values, different NaNs should be treated as the same, and -0.0 and 0.0 should be treated as the same.
2. In aggregate grouping keys, different NaNs should belong to the same group, and -0.0 and 0.0 should belong to the same group.
3. In join keys, different NaNs should be treated as the same, and -0.0 and 0.0 should be treated as the same.
4. In window partition keys, different NaNs should belong to the same partition, and -0.0 and 0.0 should belong to the same partition.
Case 1 is fine, as we handle NaN and -0.0 well during comparison. For complex types, we recursively compare the fields/elements, so it's also fine.
Cases 2, 3 and 4 are problematic, as Spark SQL turns grouping/join/window partition keys into binary UnsafeRows and compares the binary data directly. Different NaNs have different binary representations, and the same holds for -0.0 and 0.0.
This rule normalizes NaN and -0.0 in window partition keys, join keys and aggregate grouping keys.
Ideally we should do the normalization in the physical operators that compare the binary UnsafeRows directly; the normalization would not be needed if the Spark SQL execution engine were not optimized to run on binary data. This rule exists to simplify the implementation, so that we have a single place to do normalization, which is more maintainable.
Note that this rule must be executed at the end of the optimizer, because the optimizer may create new joins (the subquery rewrite) and new join conditions (the join reorder).
-
object
NullPropagation extends Rule[LogicalPlan]
Replaces Expressions that can be statically evaluated with equivalent Literal values. This rule is more specific with Null value propagation from bottom to top of the expression tree.
-
object
ObjectSerializerPruning extends Rule[LogicalPlan]
Prunes unnecessary object serializers from the query plan. This rule prunes both individual serializers and nested fields in serializers.
-
object
OptimizeIn extends Rule[LogicalPlan]
Optimizes IN predicates:
1. Converts the predicate to false when the list is empty and the value is not nullable.
2. Removes literal repetitions.
3. Replaces (value, seq[Literal]) with an optimized version (value, HashSet[Literal]), which is much faster.
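The three optimizations can be sketched over plain integer literals (illustrative only; Catalyst works on expression trees):

```scala
// Toy IN compiler for a non-nullable value.
object OptimizeInSketch {
  def compile(literals: Seq[Int]): Int => Boolean =
    if (literals.isEmpty) {
      _ => false                 // 1. empty list + non-nullable value => false
    } else {
      val set = literals.toSet   // 2. removes repetitions,
      v => set.contains(v)       // 3. hash-set lookup instead of a linear scan
    }
}
```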
-
object
OptimizeLimitZero extends Rule[LogicalPlan]
Replaces GlobalLimit 0 and LocalLimit 0 nodes (subtree) with empty Local Relation, as they don't return any rows.
-
object
PropagateEmptyRelation extends Rule[LogicalPlan] with PredicateHelper with CastSupport
Collapses plans consisting of empty local relations generated by PruneFilters:
1. Binary (or higher)-node logical plans
- Union with all empty children.
- Join with one or two empty children (including Intersect/Except).
2. Unary-node logical plans
- Project/Filter/Sample/Join/Limit/Repartition with all empty children.
- Aggregate with all empty children and at least one grouping expression.
- Generate(Explode) with all empty children. Others like Hive UDTF may return results.
-
object
PruneFilters extends Rule[LogicalPlan] with PredicateHelper
Removes filters that can be evaluated trivially. This can be done in the following ways:
1) by eliding the filter for cases where it will always evaluate to true
2) by substituting a dummy empty relation when the filter will always evaluate to false
3) by eliminating always-true conditions given the constraints on the child's output.
-
object
PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
Pull out all (outer) correlated predicates from a given subquery. This method removes the correlated predicates from subquery Filters and adds the references of these predicates to all intermediate Project and Aggregate clauses (if they are missing) in order to be able to evaluate the predicates at the top level.
TODO: Look to merge this rule with RewritePredicateSubquery.
-
object
PushDownLeftSemiAntiJoin extends Rule[LogicalPlan] with PredicateHelper
This rule is a variant of PushPredicateThroughNonJoin which can push down Left Semi and Left Anti joins below the following operators: 1) Project 2) Window 3) Union 4) Aggregate 5) other permissible unary operators (see PushPredicateThroughNonJoin.canPushThrough).
-
object
PushDownPredicates extends Rule[LogicalPlan] with PredicateHelper
The unified version for predicate pushdown of normal operators and joins. This rule improves performance of predicate pushdown for cascading joins such as: Filter-Join-Join-Join. Most predicates can be pushed down in a single pass.
-
object
PushLeftSemiLeftAntiThroughJoin extends Rule[LogicalPlan] with PredicateHelper
This rule is a variant of PushPredicateThroughJoin which can push down Left Semi and Left Anti joins below a join operator. The allowable join types are: 1) Inner 2) Cross 3) LeftOuter 4) RightOuter
TODO: Currently this rule can push down a left semi or left anti join to either the left or the right leg of the child join. This matches the behaviour of PushPredicateThroughJoin when the left semi or left anti join is in expression form. We need to explore the possibility of pushing the left semi/anti join to both legs of the join if the join condition refers to both the left and right legs of the child join.
-
object
PushPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper
Pushes down Filter operators where the condition can be evaluated using only the attributes of the left or right side of a join. Other Filter conditions are moved into the condition of the Join. Also pushes down the join filter, where the condition can be evaluated using only the attributes of the left or right side of the subquery when applicable.
Check https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior for more details.
-
object
PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelper
Pushes Filter operators through many operators iff: 1) the operator is deterministic 2) the predicate is deterministic and the operator will not change any rows.
This heuristic is valid assuming the expression evaluation cost is minimal.
-
object
PushProjectionThroughUnion extends Rule[LogicalPlan] with PredicateHelper
Pushes Project operators to both sides of a Union operator. Operations that are safe to push down are listed as follows. Union: right now, Union means UNION ALL, which does not de-duplicate rows, so it is safe to push down Filters and Projections through it. Filter pushdown is handled by another rule, PushDownPredicates. Once we add UNION DISTINCT, we will not be able to push down Projections.
-
object
ReassignLambdaVariableID extends Rule[LogicalPlan]
Reassigns per-query unique IDs to LambdaVariables, whose original IDs are globally unique. This can help Spark hit the codegen cache more often and improve performance.
-
object
RemoveDispensableExpressions extends Rule[LogicalPlan]
Removes nodes that are not necessary.
-
object
RemoveLiteralFromGroupExpressions extends Rule[LogicalPlan]
Removes literals from group expressions in Aggregate, as they have no effect on the result and only make the grouping key bigger.
-
object
RemoveNoopOperators extends Rule[LogicalPlan]
Remove no-op operators from the query plan that do not make any modifications.
-
object
RemoveRedundantAliases extends Rule[LogicalPlan]
Remove redundant aliases from a query plan. A redundant alias is an alias that does not change the name or metadata of a column, and does not deduplicate it.
-
object
RemoveRepetitionFromGroupExpressions extends Rule[LogicalPlan]
Removes repetition from group expressions in Aggregate, as repeated expressions have no effect on the result and only make the grouping key bigger.
-
object
ReorderAssociativeOperator extends Rule[LogicalPlan]
Reorder associative integral-type operators and fold all constants into one.
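The reordering can be sketched over a flattened operand list, e.g. (a + 1) + (b + 2) => a + b + 3 (illustrative shapes only; Left is a column reference, Right a literal):

```scala
// Toy version: gather the operands of a chain of integer additions and
// fold every literal into a single constant.
object ReorderAssociativeSketch {
  def foldConstants(operands: Seq[Either[String, Int]]): (Seq[String], Int) = {
    val (attrs, lits) = operands.partitionMap(identity)
    (attrs, lits.sum)
  }
}
```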
-
object
ReorderJoin extends Rule[LogicalPlan] with PredicateHelper
Reorders the joins and pushes all the conditions into the joins, so that the bottom ones have at least one condition.
The order of joins will not be changed if all of them already have at least one condition.
If star schema detection is enabled, reorder the star join plans based on heuristics.
-
object
ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan]
Replaces logical Deduplicate operator with an Aggregate operator.
-
object
ReplaceDistinctWithAggregate extends Rule[LogicalPlan]
Replaces the logical Distinct operator with an Aggregate operator.
SELECT DISTINCT f1, f2 FROM t ==> SELECT f1, f2 FROM t GROUP BY f1, f2
-
object
ReplaceExceptWithAntiJoin extends Rule[LogicalPlan]
Replaces logical Except operator with a left-anti Join operator.
SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2 ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
Note: 1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL. 2. This rule has to be done after de-duplicating the attributes; otherwise, the generated join conditions will be incorrect.
-
object
ReplaceExceptWithFilter extends Rule[LogicalPlan]
If one or both of the datasets in the logical Except operator are purely transformed using Filter, this rule will replace logical Except operator with a Filter operator by flipping the filter condition of the right child.
SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5 ==> SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND (a1 is null OR a1 <> 5)
Note: Before flipping the filter condition of the right node, we should: 1. Combine all of its Filters; 2. Update the attribute references to the left node; 3. Add a Coalesce(condition, False) (to account for NULL values in the condition).
-
object
ReplaceExpressions extends Rule[LogicalPlan]
Finds all unevaluable expressions and replaces/rewrites them with semantically equivalent expressions that can be evaluated. Currently we replace two kinds of expressions: 1) RuntimeReplaceable expressions 2) UnevaluableAggregate expressions such as Every, Some, Any, CountIf. This is mainly used to provide compatibility with other databases. A few examples: we use this to support "nvl" by replacing it with "coalesce", and we use this to replace Every and Any with Min and Max respectively.
TODO: In future, explore an option to replace aggregate functions similar to how RuntimeReplaceable does.
-
object
ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan]
Replaces logical Intersect operator with a left-semi Join operator.
SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
Note: 1. This rule is only applicable to INTERSECT DISTINCT. Do not use it for INTERSECT ALL. 2. This rule has to be done after de-duplicating the attributes; otherwise, the generated join conditions will be incorrect.
-
object
ReplaceNullWithFalseInPredicate extends Rule[LogicalPlan]
A rule that replaces Literal(null, BooleanType) with FalseLiteral, if possible, in the search condition of the WHERE/HAVING/ON(JOIN) clauses, which contain an implicit Boolean operator "(search condition) = TRUE". The replacement is only valid when Literal(null, BooleanType) is semantically equivalent to FalseLiteral when evaluating the whole search condition.
Please note that FALSE and NULL are not exchangeable in most cases, e.g. when the search condition contains NOT and NULL-tolerant expressions. Thus, the rule is very conservative and applicable in very limited cases.
For example, Filter(Literal(null, BooleanType)) is equal to Filter(FalseLiteral).
Another example containing branches is Filter(If(cond, FalseLiteral, Literal(null, _))); this can be optimized to Filter(If(cond, FalseLiteral, FalseLiteral)), and eventually Filter(FalseLiteral).
Moreover, this rule also transforms predicates in all If expressions as well as branch conditions in all CaseWhen expressions, even if they are not part of the search conditions.
For example, Project(If(And(cond, Literal(null)), Literal(1), Literal(2))) can be simplified into Project(Literal(2)).
-
object
RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan]
This rule rewrites correlated ScalarSubquery expressions into LEFT OUTER joins.
-
object
RewriteDistinctAggregates extends Rule[LogicalPlan]
This rule rewrites an aggregate query with distinct aggregations into an expanded double aggregation in which the regular aggregation expressions and every distinct clause is aggregated in a separate group. The results are then combined in a second aggregate.
First example: query without filter clauses (in scala):
val data = Seq(
  ("a", "ca1", "cb1", 10),
  ("a", "ca1", "cb2", 5),
  ("b", "ca1", "cb1", 13))
  .toDF("key", "cat1", "cat2", "value")
data.createOrReplaceTempView("data")
val agg = data.groupBy($"key")
  .agg(
    countDistinct($"cat1").as("cat1_cnt"),
    countDistinct($"cat2").as("cat2_cnt"),
    sum($"value").as("total"))
This translates to the following (pseudo) logical plan:
Aggregate(
  key = ['key]
  functions = [COUNT(DISTINCT 'cat1), COUNT(DISTINCT 'cat2), sum('value)]
  output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
  LocalTableScan [...]
This rule rewrites this logical plan to the following (pseudo) logical plan:
Aggregate(
  key = ['key]
  functions = [count(if (('gid = 1)) 'cat1 else null),
               count(if (('gid = 2)) 'cat2 else null),
               first(if (('gid = 0)) 'total else null) ignore nulls]
  output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
  Aggregate(
    key = ['key, 'cat1, 'cat2, 'gid]
    functions = [sum('value)]
    output = ['key, 'cat1, 'cat2, 'gid, 'total])
    Expand(
      projections = [('key, null, null, 0, cast('value as bigint)),
                     ('key, 'cat1, null, 1, null),
                     ('key, null, 'cat2, 2, null)]
      output = ['key, 'cat1, 'cat2, 'gid, 'value])
      LocalTableScan [...]
Second example: aggregate function without distinct and with filter clauses (in sql):
SELECT
  COUNT(DISTINCT cat1) AS cat1_cnt,
  COUNT(DISTINCT cat2) AS cat2_cnt,
  SUM(value) FILTER (WHERE id > 1) AS total
FROM data
GROUP BY key
This translates to the following (pseudo) logical plan:
Aggregate(
  key = ['key]
  functions = [COUNT(DISTINCT 'cat1), COUNT(DISTINCT 'cat2), sum('value) with FILTER('id > 1)]
  output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
  LocalTableScan [...]
This rule rewrites this logical plan to the following (pseudo) logical plan:
Aggregate(
  key = ['key]
  functions = [count(if (('gid = 1)) 'cat1 else null),
               count(if (('gid = 2)) 'cat2 else null),
               first(if (('gid = 0)) 'total else null) ignore nulls]
  output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
  Aggregate(
    key = ['key, 'cat1, 'cat2, 'gid]
    functions = [sum('value) with FILTER('id > 1)]
    output = ['key, 'cat1, 'cat2, 'gid, 'total])
    Expand(
      projections = [('key, null, null, 0, cast('value as bigint), 'id),
                     ('key, 'cat1, null, 1, null, null),
                     ('key, null, 'cat2, 2, null, null)]
      output = ['key, 'cat1, 'cat2, 'gid, 'value, 'id])
      LocalTableScan [...]
The rule does the following things here:
1. Expand the data. There are three aggregation groups in this query: i. the non-distinct group; ii. the distinct 'cat1 group; iii. the distinct 'cat2 group. An Expand operator is inserted to expand the child data for each group. The expand will null out all unused columns for the given group; this must be done in order to ensure correctness later on. Groups can be identified by a group id (gid) column added by the expand operator.
2. De-duplicate the distinct paths and aggregate the non-distinct path. The group-by clause of this aggregate consists of the original group-by clause, all the requested distinct columns and the group id. Both the de-duplication of distinct columns and the aggregation of the non-distinct group take advantage of the fact that we group by the group id (gid) and that we have nulled out all non-relevant columns for the given group.
3. Aggregate the distinct groups and combine this with the results of the non-distinct aggregation. In this step we use the group id to filter the inputs for the aggregate functions. The results of the non-distinct group are 'aggregated' by using the first operator; it might be more elegant to use the native UDAF merge mechanism for this in the future.
This rule duplicates the input data two or more times (# distinct groups + an optional non-distinct group). This puts quite a bit of memory pressure on the aggregate and exchange operators used. Keeping the number of distinct groups as low as possible should therefore be a priority; we could improve on this in the current rule by applying more advanced expression canonicalization techniques.
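The expand-plus-double-aggregation scheme described above can be sketched over plain Scala collections, with no Catalyst involved. The `Row` shape, the function name, and the omission of the `FILTER('id > 1)` clause are simplifications for illustration only, not Spark's implementation:

```scala
// A minimal sketch of RewriteDistinctAggregates' expand + double aggregation,
// computing count(distinct cat1), count(distinct cat2) and sum(value) per key.
case class Row(key: String, cat1: String, cat2: String, value: Long)

def distinctAggViaExpand(rows: Seq[Row]): Map[String, (Int, Int, Long)] = {
  // Step 1: Expand. gid 0 = non-distinct group, gid 1 = 'cat1, gid 2 = 'cat2.
  // Unused columns are nulled out for each group.
  val expanded = rows.flatMap { r =>
    Seq(
      (r.key, null: String, null: String, 0, Some(r.value)), // non-distinct
      (r.key, r.cat1,       null: String, 1, None),          // distinct cat1
      (r.key, null: String, r.cat2,       2, None)           // distinct cat2
    )
  }
  // Step 2: first aggregate, grouped by (key, cat1, cat2, gid); this
  // de-duplicates the distinct paths and sums the non-distinct path.
  val firstAgg = expanded
    .groupBy { case (k, c1, c2, gid, _) => (k, c1, c2, gid) }
    .map { case ((k, c1, c2, gid), g) => (k, c1, c2, gid, g.flatMap(_._5).sum) }
  // Step 3: second aggregate, grouped by key; gid filters the inputs.
  firstAgg.groupBy(_._1).map { case (k, g) =>
    val cat1Cnt = g.count(t => t._4 == 1 && t._2 != null)
    val cat2Cnt = g.count(t => t._4 == 2 && t._3 != null)
    val total   = g.collect { case (_, _, _, 0, s) => s }.sum
    k -> (cat1Cnt, cat2Cnt, total)
  }
}
```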
-
object
RewriteExceptAll extends Rule[LogicalPlan]
Replaces the logical Except operator with a combination of Union, Aggregate and Generate operators.
Replaces the logical Except operator with a combination of Union, Aggregate and Generate operators.
Input Query :
SELECT c1 FROM ut1 EXCEPT ALL SELECT c1 FROM ut2
Rewritten Query:
SELECT c1
FROM (
  SELECT replicate_rows(sum_val, c1)
  FROM (
    SELECT c1, sum_val
    FROM (
      SELECT c1, sum(vcol) AS sum_val
      FROM (
        SELECT 1L as vcol, c1 FROM ut1
        UNION ALL
        SELECT -1L as vcol, c1 FROM ut2
      ) AS union_all
      GROUP BY union_all.c1
    )
    WHERE sum_val > 0
  )
)
-
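The same +1/-1 tagging, sum, and replicate pipeline can be sketched over plain Scala collections; `exceptAll` here is a hypothetical helper mirroring the rewritten query, not Spark's implementation:

```scala
// EXCEPT ALL via the rewrite's technique: tag left rows +1 and right rows -1,
// sum the tags per value, keep values with a positive sum, and replicate each
// surviving value sum_val times (the replicate_rows step).
def exceptAll[A](left: Seq[A], right: Seq[A]): Seq[A] = {
  val tagged = left.map(c1 => (1L, c1)) ++ right.map(c1 => (-1L, c1)) // vcol
  tagged
    .groupBy(_._2)                                  // GROUP BY c1
    .map { case (c1, g) => (c1, g.map(_._1).sum) }  // sum(vcol) AS sum_val
    .filter { case (_, sumVal) => sumVal > 0 }      // WHERE sum_val > 0
    .toSeq
    .flatMap { case (c1, sumVal) => Seq.fill(sumVal.toInt)(c1) } // replicate
}
```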
object
RewriteIntersectAll extends Rule[LogicalPlan]
Replaces the logical Intersect operator with a combination of Union, Aggregate and Generate operators.
Replaces the logical Intersect operator with a combination of Union, Aggregate and Generate operators.
Input Query :
SELECT c1 FROM ut1 INTERSECT ALL SELECT c1 FROM ut2
Rewritten Query:
SELECT c1
FROM (
  SELECT replicate_row(min_count, c1)
  FROM (
    SELECT c1, If (vcol1_cnt > vcol2_cnt, vcol2_cnt, vcol1_cnt) AS min_count
    FROM (
      SELECT c1, count(vcol1) as vcol1_cnt, count(vcol2) as vcol2_cnt
      FROM (
        SELECT true as vcol1, null as vcol2, c1 FROM ut1
        UNION ALL
        SELECT null as vcol1, true as vcol2, c1 FROM ut2
      ) AS union_all
      GROUP BY c1
      HAVING vcol1_cnt >= 1 AND vcol2_cnt >= 1
    )
  )
)
-
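The min-count technique behind this rewrite can likewise be sketched over plain Scala collections; `intersectAll` is an illustrative helper, not Spark's implementation:

```scala
// INTERSECT ALL via the rewrite's technique: mark each row with a non-null
// vcol1 or vcol2 depending on which side it came from, count the marks per
// value, keep values present on both sides, and replicate min_count times.
def intersectAll[A](left: Seq[A], right: Seq[A]): Seq[A] = {
  val tagged = left.map(c1 => (Option(true), Option.empty[Boolean], c1)) ++
               right.map(c1 => (Option.empty[Boolean], Option(true), c1))
  tagged
    .groupBy(_._3)                                      // GROUP BY c1
    .map { case (c1, g) =>
      (c1, g.count(_._1.isDefined), g.count(_._2.isDefined)) } // count(vcol1/2)
    .filter { case (_, n1, n2) => n1 >= 1 && n2 >= 1 }  // HAVING clause
    .toSeq
    .flatMap { case (c1, n1, n2) => Seq.fill(math.min(n1, n2))(c1) } // min_count
}
```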
object
RewriteNonCorrelatedExists extends Rule[LogicalPlan]
Rewrite a non-correlated exists subquery to use ScalarSubquery:

WHERE EXISTS (SELECT A FROM TABLE B WHERE COL1 > 10)

will be rewritten to

WHERE (SELECT 1 FROM (SELECT A FROM TABLE B WHERE COL1 > 10) LIMIT 1) IS NOT NULL
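The equivalence this rewrite relies on can be illustrated in plain Scala: EXISTS over a subquery's rows is the same question as whether a LIMIT 1 of those rows produces anything (the helper names are illustrative):

```scala
// EXISTS(subquery)  <=>  (SELECT 1 FROM subquery LIMIT 1) IS NOT NULL,
// sketched over a Seq of rows.
def existsDirect[A](subqueryRows: Seq[A]): Boolean = subqueryRows.nonEmpty

def existsViaLimitOne[A](subqueryRows: Seq[A]): Boolean =
  subqueryRows.take(1).headOption.isDefined // the ScalarSubquery form
```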
-
object
RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper
This rule rewrites predicate sub-queries into left semi/anti joins.
This rule rewrites predicate sub-queries into left semi/anti joins. The following predicates are supported:
a. EXISTS/NOT EXISTS will be rewritten as a semi/anti join; unresolved conditions in the Filter will be pulled out as the join conditions.
b. IN/NOT IN will be rewritten as a semi/anti join; unresolved conditions in the Filter will be pulled out as join conditions, and value = selected column will also be used as a join condition.
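The target semi/anti join semantics can be sketched over plain Scala collections; the key extractor stands in for the pulled-out join condition. Note that this sketch deliberately ignores NOT IN's three-valued null semantics, which the real rule must handle:

```scala
// Left semi join keeps left rows that have a match on the right (EXISTS / IN);
// left anti join keeps left rows with no match (NOT EXISTS / NOT IN, nulls aside).
def leftSemiJoin[A, K](left: Seq[A], right: Seq[K], key: A => K): Seq[A] = {
  val rightKeys = right.toSet
  left.filter(r => rightKeys.contains(key(r)))
}

def leftAntiJoin[A, K](left: Seq[A], right: Seq[K], key: A => K): Seq[A] = {
  val rightKeys = right.toSet
  left.filterNot(r => rightKeys.contains(key(r)))
}
```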
-
object
SimpleTestOptimizer extends SimpleTestOptimizer
An optimizer used in test code.
An optimizer used in test code.
To ensure extendability, we leave the standard rules in the abstract Optimizer class, while specific rules go into its subclasses.
-
object
SimplifyBinaryComparison extends Rule[LogicalPlan] with PredicateHelper with ConstraintHelper
Simplifies binary comparisons with semantically-equal expressions: 1) Replace '<=>' with 'true' literal.
Simplifies binary comparisons with semantically-equal expressions: 1) Replace '<=>' with 'true' literal. 2) Replace '=', '<=', and '>=' with 'true' literal if both operands are non-nullable. 3) Replace '<' and '>' with 'false' literal if both operands are non-nullable.
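The three simplifications can be sketched with a toy expression tree; structural equality of the case classes stands in for Catalyst's semantic-equality check, and the whole shape is illustrative rather than Catalyst's actual API:

```scala
// A toy expression tree and one-step simplifier for binary comparisons whose
// two operands are the same expression.
sealed trait Expr { def nullable: Boolean }
case class Attr(name: String, nullable: Boolean) extends Expr
case class Cmp(op: String, left: Expr, right: Expr) extends Expr {
  val nullable = false
}
case class Lit(value: Boolean) extends Expr { val nullable = false }

def simplifyComparison(e: Expr): Expr = e match {
  case Cmp("<=>", l, r) if l == r => Lit(true)            // rule 1
  case Cmp(op, l, r) if l == r && !l.nullable &&
    Set("=", "<=", ">=").contains(op) => Lit(true)        // rule 2
  case Cmp(op, l, r) if l == r && !l.nullable &&
    Set("<", ">").contains(op) => Lit(false)              // rule 3
  case other => other                                     // e.g. nullable '='
}
```

Rule 1 applies even to nullable operands because '<=>' (null-safe equality) yields true when both sides are null; rules 2 and 3 need non-nullable operands, since `a = a` is null (not true) when `a` is null.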
-
object
SimplifyCaseConversionExpressions extends Rule[LogicalPlan]
Removes the inner case conversion expressions that are unnecessary because the inner conversion is overwritten by the outer one.
-
object
SimplifyCasts extends Rule[LogicalPlan]
Removes Casts that are unnecessary because the input is already the correct type.
-
object
SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper
Simplifies conditional expressions (if / case).
-
object
SimplifyExtractValueOps extends Rule[LogicalPlan]
Simplify redundant CreateNamedStruct, CreateArray and CreateMap expressions.
-
object
StarSchemaDetection extends PredicateHelper
Encapsulates star-schema detection logic.
-
object
TransposeWindow extends Rule[LogicalPlan]
Transpose Adjacent Window Expressions.
Transpose Adjacent Window Expressions. - If the partition spec of the parent Window expression is compatible with the partition spec of the child window expression, transpose them.