What join pruning does
Consider a large table joined against a smaller, filtered one:customers is the build side and orders is the probe side. Without join pruning, the engine scans orders in full and discards the rows whose o_custkey does not match a customer in EMEA. With join pruning, the optimizer recognizes that o_custkey is a column the orders scan can prune on, so at run time it collects the set of c_custkey values that pass c_region = 'EMEA' and pushes that set into the orders scan as an IN predicate. The set then feeds the normal scan pipeline: it prunes whole tablets and granule ranges through the primary index or partitioning, exactly as a literal WHERE o_custkey IN (...) would, except the values are discovered during execution rather than written in the query.
Pushing the set into the scan reuses the same machinery as predicate pushdown, so applying it is cheap.
Where it applies
Join pruning needs the probe side to lead down to a table scan on a column the scan can prune on: a primary index or partition column of the scanned table. The join key does not have to sit directly on top of that scan. The optimizer traces it down through the operators in between, including filters, projections and other computed columns, aggregations, and even further joins, as long as the key still maps to a prunable column of the scan it reaches. The build side can be any subplan, since all it has to do is produce the join-key values. The join must also keep only matching rows, so that dropping non-matching probe rows is safe. Join pruning applies to inner, right, and semi joins. It does not apply to anti joins or to the outer side of an outer join, where non-matching rows must be preserved.How it works
Join pruning is planned statically, but the filter is built at run time and the engine decides as the query runs whether to keep it. That run-time decision is what makes it safe to apply broadly. The filter is produced by an aggregate,make_pruning_set, that consumes the build side and accumulates a bounded set of its distinct join-key values:
- While the set stays under its limit, it holds the build side’s key values, and the scan uses them as the
INmembership test that drives tablet and primary-index pruning. - If the build side exceeds the limit, the set cancels: it discards the keys it has collected and produces nothing, so the scan runs unfiltered. The build side is large in this case, so scanning the probe side in full is the cheaper option anyway.
MaybeCache node in the plans below), so a join whose build side is cached is not re-run just to rebuild it.
Distributed execution
When the build side is replicated, for example a dimension table or a broadcast join, every node already holds the whole build side, builds the pruning set locally, and applies it to its own probe-side scan. When the join is hash-shuffled across nodes, no single node holds the whole build side, so the set is built like any distributed aggregate, in two phases:- A pre-aggregation on each node builds a partial set from that node’s share of the build side. In the plan this is the inner
make_pruning_set. - An aggregate merge broadcasts the partial sets and combines them, so every node ends up with the full pruning set for its scan. In the plan this is the outer
make_pruning_setmergeabove aBroadcastshuffle.
Limits
The pruning set is capped by the number of distinct keys it holds. The cap depends on whether the set has to cross the network, because that is what makes a large set expensive:- A semi join with a local build side (the
WHERE key IN (SELECT ...)shape against a replicated or broadcast build side) builds and uses the set on each node with no cross-stage broadcast, so it allows a large set: up to 5 million keys. - Everything else uses a much smaller cap, 100,000 keys: inner and right joins, and any join whose build side is hash-shuffled across nodes, including a hash-shuffled semi join. Here the set is broadcast across a stage boundary (see Distributed execution), so the lower cap bounds that network cost.
make_pruning_set.
How to observe whether join pruning was applied
RunEXPLAIN (PHYSICAL) and look for two things: a make_pruning_set aggregate that builds the set from the build-side key, and the set pushed into the probe-side scan as read_tablets(..., <key> in (SUBQUERY{N})). Both plans below are the output for the query above, lightly trimmed (per-node type and affinity annotations removed). On a single node the build side feeds one make_pruning_set; across nodes the set is built in the two phases from Distributed execution: a per-node make_pruning_set pre-aggregation, a Broadcast shuffle, and a make_pruning_setmerge.
make_pruning_set aggregate and the in (SUBQUERY{N}) on the probe scan, the probe side does not reach a prunable scan on the join key, or the join type is not eligible (Where it applies). A plan can also include join pruning while the run-time set cancels and prunes nothing, which is the intended, no-penalty behavior; the cancellation is visible only in the run-time metrics, not the plan shape.