> ## Documentation Index
> Fetch the complete documentation index at: https://docs.firebolt.io/llms.txt
> Use this file to discover all available pages before exploring further.

> Use one side of a join to prune the table scan of the other at run time.

# Join pruning

In a join, Firebolt builds a hash table from one input (the **build side**) and streams the other input (the **probe side**) against it.
**Join pruning** uses the build side to prune the probe side's scan: it collects the join-key values from the build side at run time and pushes them into the probe-side table scan, so that scan skips data that cannot match the join.
This is a form of sideways information passing, also known as semi-join reduction.

## What join pruning does

Consider a large table joined against a smaller, filtered one:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
SELECT o.*
FROM orders o
JOIN customers c ON o.o_custkey = c.c_custkey
WHERE c.c_region = 'EMEA';
```

Here `customers` is the build side and `orders` is the probe side. Without join pruning, the engine scans `orders` in full and discards the rows whose `o_custkey` does not match a customer in EMEA. With join pruning, the optimizer recognizes that `o_custkey` is a column the `orders` scan can prune on, so at run time it collects the set of `c_custkey` values that pass `c_region = 'EMEA'` and pushes that set into the `orders` scan as an `IN` predicate. The set then feeds the normal [scan pipeline](/performance-and-observability/storage-and-indexing#the-scan-pipeline): it prunes whole tablets and granule ranges through the [primary index](/performance-and-observability/storage-and-indexing/primary-index) or [partitioning](/reference-sql/commands/data-definition/create-fact-dimension-table#partition-by), exactly as a literal `WHERE o_custkey IN (...)` would, except the values are discovered during execution rather than written in the query.

Pushing the set into the scan reuses the same machinery as predicate pushdown, so applying it is cheap.

## Where it applies

Join pruning needs the **probe side to lead down to a table scan** on a column the scan can prune on: a [primary index](/performance-and-observability/storage-and-indexing/primary-index) or [partition](/reference-sql/commands/data-definition/create-fact-dimension-table#partition-by) column of the scanned table.

The join key does not have to sit directly on top of that scan. The optimizer traces it down through the operators in between, including filters, projections and other computed columns, aggregations, and even further joins, as long as the key still maps to a prunable column of the scan it reaches. The build side can be any subplan, since all it has to do is produce the join-key values.

The join must also keep only matching rows, so that dropping non-matching probe rows is safe. Join pruning applies to **inner**, **right**, and **semi** joins. It does not apply to anti joins or to the outer side of an outer join, where non-matching rows must be preserved.

## How it works

Join pruning is planned statically, but the filter is built at run time and the engine decides as the query runs whether to keep it. That run-time decision is what makes it safe to apply broadly.

The filter is produced by an aggregate, `make_pruning_set`, that consumes the build side and accumulates a **bounded set** of its distinct join-key values:

* While the set stays under its [limit](#limits), it holds the build side's key values, and the scan uses them as the `IN` membership test that drives tablet and primary-index pruning.
* If the build side exceeds the limit, the set **cancels**: it discards the keys it has collected and produces nothing, so the scan runs unfiltered. The build side is large in this case, so scanning the probe side in full is the cheaper option anyway.

Because the set is built and judged while the query runs, the optimizer never has to predict whether a join is selective. It plans join pruning wherever the key is prunable, and the run-time set decides whether it actually prunes. Building the set is also integrated with [subresult reuse](/performance-and-observability/runtime/understand-query-performance-subresult): the set is wrapped in a cache (the `MaybeCache` node in the plans below), so a join whose build side is cached is not re-run just to rebuild it.

## Distributed execution

When the build side is replicated, for example a dimension table or a broadcast join, every node already holds the whole build side, builds the pruning set locally, and applies it to its own probe-side scan.

When the join is hash-shuffled across nodes, no single node holds the whole build side, so the set is built like any distributed aggregate, in two phases:

* A **pre-aggregation** on each node builds a partial set from that node's share of the build side. In the plan this is the inner `make_pruning_set`.
* An **aggregate merge** broadcasts the partial sets and combines them, so every node ends up with the full pruning set for its scan. In the plan this is the outer `make_pruning_setmerge` above a `Broadcast` shuffle.

The pre-aggregation is what keeps this affordable: each node ships at most its capped partial set, or, if its partial set already cancelled, just the signal that pruning was abandoned. A large build side therefore never floods the network. This bounded broadcast is the reason the [limit](#limits) is lower for shuffled joins than for the replicated case.

## Limits

The pruning set is capped by the number of distinct keys it holds. The cap depends on whether the set has to cross the network, because that is what makes a large set expensive:

* A **semi join with a local build side** (the `WHERE key IN (SELECT ...)` shape against a replicated or broadcast build side) builds and uses the set on each node with no cross-stage broadcast, so it allows a large set: up to **5 million** keys.
* **Everything else** uses a much smaller cap, **100,000** keys: inner and right joins, and any join whose build side is **hash-shuffled across nodes**, including a hash-shuffled semi join. Here the set is broadcast across a stage boundary (see [Distributed execution](#distributed-execution)), so the lower cap bounds that network cost.

In other words, a semi join gets the larger cap only when its build side is local; once the build side is shuffled it uses the 100,000-key cap like any other distributed join. These are built-in limits, not knobs you set. When the build side exceeds the cap, the set [cancels](#how-it-works) and the scan runs unfiltered, so the limits bound the cost of pruning without ever changing a query's result. The cap is visible in the plan as the second argument to `make_pruning_set`.

## How to observe whether join pruning was applied

Run `EXPLAIN (PHYSICAL)` and look for two things: a `make_pruning_set` aggregate that builds the set from the build-side key, and the set pushed into the probe-side scan as `read_tablets(..., <key> in (SUBQUERY{N}))`. Both plans below are the output for the query above, lightly trimmed (per-node type and affinity annotations removed). On a single node the build side feeds one `make_pruning_set`; across nodes the set is built in the two phases from [Distributed execution](#distributed-execution): a per-node `make_pruning_set` pre-aggregation, a `Broadcast` shuffle, and a `make_pruning_setmerge`.

<CodeGroup>
  ```text Single node theme={"theme":{"light":"css-variables","dark":"css-variables"}}
  # One make_pruning_set feeds the orders scan: it reads only tablets/granules whose o_custkey is in the set.
  [0] [Projection] orders.o_orderkey, orders.o_custkey, orders.o_totalprice, orders.o_orderdate
   \_[1] [MaybeCache]
      \_[2] [Join] Mode: Inner [(orders.o_custkey = customers.c_custkey)]
         \_[3] [TableFuncScan] orders.o_orderkey: $0.o_orderkey, orders.o_custkey: $0.o_custkey, orders.o_totalprice: $0.o_totalprice, orders.o_orderdate: $0.o_orderdate
         | |   $0 = read_tablets(table_name => orders, tablet, orders.o_custkey in (SUBQUERY{0}))
         |  \_[4] [TableFuncScan] tablet: $0.tablet
         |        $0 = list_tablets(table_name => orders)
         \_[5] [Shuffle] Loopback with disjoint readers
            \_[6] [Projection] customers.c_custkey
               \_[7] [Filter] (customers.c_region = 'EMEA')
                  \_[8] [TableFuncScan] customers.c_custkey: $0.c_custkey, customers.c_region: $0.c_region
                    |   $0 = read_tablets(table_name => customers, tablet)
                     \_[9] [Projection] tablet
                        \_[10] [Filter] (((min_c_region > 'EMEA') or (max_c_region < 'EMEA')) IS DISTINCT FROM TRUE)
                           \_[11] [TableFuncScan] tablet: $0.tablet, min_c_region: $0.min_c_region, max_c_region: $0.max_c_region
                                 $0 = list_tablets(table_name => customers)

  SUBQUERY{0}:
  [12] [MaybeCache]
   \_[13] [AggregateState] GroupBy: [] Aggregates: [make_pruning_set_0: make_pruning_set(customers.c_custkey, 100000)]
      \_[14] [Projection] customers.c_custkey
         \_Recurring Node --> [5]
  ```

  ```text Multiple nodes theme={"theme":{"light":"css-variables","dark":"css-variables"}}
  # Per-node make_pruning_set -> Broadcast -> make_pruning_setmerge, then pushed into the orders scan.
  [0] [MaybeCache]
   \_[1] [Shuffle] Gather
      \_[2] [Projection] orders.o_orderkey, orders.o_custkey, orders.o_totalprice, orders.o_orderdate
         \_[3] [Join] Mode: Inner [(orders.o_custkey = customers.c_custkey)]
            \_[4] [Shuffle] Hash by [orders.o_custkey]
            |  \_[5] [TableFuncScan] orders.o_orderkey: $0.o_orderkey, orders.o_custkey: $0.o_custkey, orders.o_totalprice: $0.o_totalprice, orders.o_orderdate: $0.o_orderdate
            |    |   $0 = read_tablets(table_name => orders, tablet, orders.o_custkey in (SUBQUERY{0}))
            |     \_[6] [TableFuncScan] tablet: $0.tablet
            |           $0 = list_tablets(table_name => orders)
            \_[7] [Shuffle] Hash by [customers.c_custkey]
               \_[8] [Shuffle] Loopback with disjoint readers
                  \_[9] [Projection] customers.c_custkey
                     \_[10] [Filter] (customers.c_region = 'EMEA')
                        \_[11] [TableFuncScan] customers.c_custkey: $0.c_custkey, customers.c_region: $0.c_region
                          |   $0 = read_tablets(table_name => customers, tablet)
                           \_[12] [Projection] tablet
                              \_[13] [Filter] (((min_c_region > 'EMEA') or (max_c_region < 'EMEA')) IS DISTINCT FROM TRUE)
                                 \_[14] [TableFuncScan] tablet: $0.tablet, min_c_region: $0.min_c_region, max_c_region: $0.max_c_region
                                       $0 = list_tablets(table_name => customers)

  SUBQUERY{0}:
  [15] [MaybeCache]
   \_[16] [AggregateState] GroupBy: [] Aggregates: [make_pruning_set_1: make_pruning_setmerge(make_pruning_set_0, 100000)]
      \_[17] [Shuffle] Broadcast
         \_[18] [AggregateState] GroupBy: [] Aggregates: [make_pruning_set_0: make_pruning_set(customers.c_custkey, 100000)]
            \_[19] [Projection] customers.c_custkey
               \_Recurring Node --> [8]
  ```
</CodeGroup>

If you do not see the `make_pruning_set` aggregate and the `in (SUBQUERY{N})` on the probe scan, the probe side does not reach a prunable scan on the join key, or the join type is not eligible ([Where it applies](#where-it-applies)). A plan can also include join pruning while the run-time set [cancels](#how-it-works) and prunes nothing, which is the intended, no-penalty behavior; the cancellation is visible only in the run-time metrics, not the plan shape.
