Join pruning

In a join, Firebolt builds a hash table from one input (the build side) and streams the other input (the probe side) against it. Join pruning uses the build side to prune the probe side’s scan: it collects the join-key values from the build side at run time and pushes them into the probe-side table scan, so that scan skips data that cannot match the join. This is a form of sideways information passing, also known as semi-join reduction.

What join pruning does

Consider a large table joined against a smaller, filtered one:

SELECT o.*
FROM orders o
JOIN customers c ON o.o_custkey = c.c_custkey
WHERE c.c_region = 'EMEA';

Here customers is the build side and orders is the probe side. Without join pruning, the engine scans orders in full and discards the rows whose o_custkey does not match a customer in EMEA. With join pruning, the optimizer recognizes that o_custkey is a column the orders scan can prune on, so at run time it collects the set of c_custkey values that pass c_region = 'EMEA' and pushes that set into the orders scan as an IN predicate. The set then feeds the normal scan pipeline: it prunes whole tablets and granule ranges through the primary index or partitioning, exactly as a literal WHERE o_custkey IN (...) would, except the values are discovered during execution rather than written in the query. Pushing the set into the scan reuses the same machinery as predicate pushdown, so applying it is cheap.

Where it applies

Join pruning needs the probe side to lead down to a table scan on a column the scan can prune on: a primary index or partition column of the scanned table. The join key does not have to sit directly on top of that scan. The optimizer traces it down through the operators in between, including filters, projections and other computed columns, aggregations, and even further joins, as long as the key still maps to a prunable column of the scan it reaches. The build side can be any subplan, since all it has to do is produce the join-key values. The join must also keep only matching rows, so that dropping non-matching probe rows is safe. Join pruning applies to inner, right, and semi joins. It does not apply to anti joins or to the outer side of an outer join, where non-matching rows must be preserved.

How it works

Join pruning is planned statically, but the filter is built at run time and the engine decides as the query runs whether to keep it. That run-time decision is what makes it safe to apply broadly. The filter is produced by an aggregate, make_pruning_set, that consumes the build side and accumulates a bounded set of its distinct join-key values:

While the set stays under its limit, it holds the build side’s key values, and the scan uses them as the IN membership test that drives tablet and primary-index pruning.
If the build side exceeds the limit, the set cancels: it discards the keys it has collected and produces nothing, so the scan runs unfiltered. The build side is large in this case, so scanning the probe side in full is the cheaper option anyway.

Because the set is built and judged while the query runs, the optimizer never has to predict whether a join is selective. It plans join pruning wherever the key is prunable, and the run-time set decides whether it actually prunes. Building the set is also integrated with subresult reuse: the set is wrapped in a cache (the MaybeCache node in the plans below), so a join whose build side is cached is not re-run just to rebuild it.

Distributed execution

When the build side is replicated, for example a dimension table or a broadcast join, every node already holds the whole build side, builds the pruning set locally, and applies it to its own probe-side scan. When the join is hash-shuffled across nodes, no single node holds the whole build side, so the set is built like any distributed aggregate, in two phases:

A pre-aggregation on each node builds a partial set from that node’s share of the build side. In the plan this is the inner make_pruning_set.
An aggregate merge broadcasts the partial sets and combines them, so every node ends up with the full pruning set for its scan. In the plan this is the outer make_pruning_setmerge above a Broadcast shuffle.

The pre-aggregation is what keeps this affordable: each node ships at most its capped partial set, or, if its partial set already cancelled, just the signal that pruning was abandoned. A large build side therefore never floods the network. This bounded broadcast is the reason the limit is lower for shuffled joins than for the replicated case.

Limits

The pruning set is capped by the number of distinct keys it holds. The cap depends on whether the set has to cross the network, because that is what makes a large set expensive:

A semi join with a local build side (the WHERE key IN (SELECT ...) shape against a replicated or broadcast build side) builds and uses the set on each node with no cross-stage broadcast, so it allows a large set: up to 5 million keys.
Everything else uses a much smaller cap, 100,000 keys: inner and right joins, and any join whose build side is hash-shuffled across nodes, including a hash-shuffled semi join. Here the set is broadcast across a stage boundary (see Distributed execution), so the lower cap bounds that network cost.

In other words, a semi join gets the larger cap only when its build side is local; once the build side is shuffled it uses the 100,000-key cap like any other distributed join. These are built-in limits, not knobs you set. When the build side exceeds the cap, the set cancels and the scan runs unfiltered, so the limits bound the cost of pruning without ever changing a query’s result. The cap is visible in the plan as the second argument to make_pruning_set.

How to observe whether join pruning was applied

Run EXPLAIN (PHYSICAL) and look for two things: a make_pruning_set aggregate that builds the set from the build-side key, and the set pushed into the probe-side scan as read_tablets(..., <key> in (SUBQUERY{N})). Both plans below are the output for the query above, lightly trimmed (per-node type and affinity annotations removed). On a single node the build side feeds one make_pruning_set; across nodes the set is built in the two phases from Distributed execution: a per-node make_pruning_set pre-aggregation, a Broadcast shuffle, and a make_pruning_setmerge.

# One make_pruning_set feeds the orders scan: it reads only tablets/granules whose o_custkey is in the set.
[0] [Projection] orders.o_orderkey, orders.o_custkey, orders.o_totalprice, orders.o_orderdate
 \_[1] [MaybeCache]
    \_[2] [Join] Mode: Inner [(orders.o_custkey = customers.c_custkey)]
       \_[3] [TableFuncScan] orders.o_orderkey: $0.o_orderkey, orders.o_custkey: $0.o_custkey, orders.o_totalprice: $0.o_totalprice, orders.o_orderdate: $0.o_orderdate
       | |   $0 = read_tablets(table_name => orders, tablet, orders.o_custkey in (SUBQUERY{0}))
       |  \_[4] [TableFuncScan] tablet: $0.tablet
       |        $0 = list_tablets(table_name => orders)
       \_[5] [Shuffle] Loopback with disjoint readers
          \_[6] [Projection] customers.c_custkey
             \_[7] [Filter] (customers.c_region = 'EMEA')
                \_[8] [TableFuncScan] customers.c_custkey: $0.c_custkey, customers.c_region: $0.c_region
                  |   $0 = read_tablets(table_name => customers, tablet)
                   \_[9] [Projection] tablet
                      \_[10] [Filter] (((min_c_region > 'EMEA') or (max_c_region < 'EMEA')) IS DISTINCT FROM TRUE)
                         \_[11] [TableFuncScan] tablet: $0.tablet, min_c_region: $0.min_c_region, max_c_region: $0.max_c_region
                               $0 = list_tablets(table_name => customers)

SUBQUERY{0}:
[12] [MaybeCache]
 \_[13] [AggregateState] GroupBy: [] Aggregates: [make_pruning_set_0: make_pruning_set(customers.c_custkey, 100000)]
    \_[14] [Projection] customers.c_custkey
       \_Recurring Node --> [5]

# Per-node make_pruning_set -> Broadcast -> make_pruning_setmerge, then pushed into the orders scan.
[0] [MaybeCache]
 \_[1] [Shuffle] Gather
    \_[2] [Projection] orders.o_orderkey, orders.o_custkey, orders.o_totalprice, orders.o_orderdate
       \_[3] [Join] Mode: Inner [(orders.o_custkey = customers.c_custkey)]
          \_[4] [Shuffle] Hash by [orders.o_custkey]
          |  \_[5] [TableFuncScan] orders.o_orderkey: $0.o_orderkey, orders.o_custkey: $0.o_custkey, orders.o_totalprice: $0.o_totalprice, orders.o_orderdate: $0.o_orderdate
          |    |   $0 = read_tablets(table_name => orders, tablet, orders.o_custkey in (SUBQUERY{0}))
          |     \_[6] [TableFuncScan] tablet: $0.tablet
          |           $0 = list_tablets(table_name => orders)
          \_[7] [Shuffle] Hash by [customers.c_custkey]
             \_[8] [Shuffle] Loopback with disjoint readers
                \_[9] [Projection] customers.c_custkey
                   \_[10] [Filter] (customers.c_region = 'EMEA')
                      \_[11] [TableFuncScan] customers.c_custkey: $0.c_custkey, customers.c_region: $0.c_region
                        |   $0 = read_tablets(table_name => customers, tablet)
                         \_[12] [Projection] tablet
                            \_[13] [Filter] (((min_c_region > 'EMEA') or (max_c_region < 'EMEA')) IS DISTINCT FROM TRUE)
                               \_[14] [TableFuncScan] tablet: $0.tablet, min_c_region: $0.min_c_region, max_c_region: $0.max_c_region
                                     $0 = list_tablets(table_name => customers)

SUBQUERY{0}:
[15] [MaybeCache]
 \_[16] [AggregateState] GroupBy: [] Aggregates: [make_pruning_set_1: make_pruning_setmerge(make_pruning_set_0, 100000)]
    \_[17] [Shuffle] Broadcast
       \_[18] [AggregateState] GroupBy: [] Aggregates: [make_pruning_set_0: make_pruning_set(customers.c_custkey, 100000)]
          \_[19] [Projection] customers.c_custkey
             \_Recurring Node --> [8]

If you do not see the make_pruning_set aggregate and the in (SUBQUERY{N}) on the probe scan, the probe side does not reach a prunable scan on the join key, or the join type is not eligible (Where it applies). A plan can also include join pruning while the run-time set cancels and prunes nothing, which is the intended, no-penalty behavior; the cancellation is visible only in the run-time metrics, not the plan shape.

Overview

Performance and Observability

Security

Self-Managed

Managed service

Guides

SQL reference

Release notes

API reference

Legal

What join pruning does

Where it applies

How it works

Distributed execution

Limits

How to observe whether join pruning was applied

​What join pruning does

​Where it applies

​How it works

​Distributed execution

​Limits

​How to observe whether join pruning was applied

What join pruning does

Where it applies

How it works

Distributed execution

Limits

How to observe whether join pruning was applied