Understand how the UNIQUE constraint affects query performance.
UNIQUE constraints can be declared in CREATE TABLE statements. However, the system does not enforce data uniqueness. The query optimizer uses these hints to make better decisions during query planning and optimization.
Data uniqueness is a crucial piece of information for the query optimizer. In Firebolt, it mainly affects query plan quality through two aspects: rewriting queries to avoid redundant computation, and estimating cardinalities to choose good join orders.
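For example, unique keys might be declared like the following sketch. The standard-SQL column and table constraint syntax shown here is an assumption; check the CREATE TABLE reference for the exact form the engine accepts.

```sql
-- Hypothetical table with two unique keys: {a} and {b, c}.
-- The constraints are hints for the optimizer; the system does not verify them.
CREATE TABLE test_table (
    a BIGINT UNIQUE,
    b BIGINT,
    c BIGINT,
    UNIQUE (b, c)
);
```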
Here, test_table has two unique keys: {a} and {b, c}.
The query optimizer tracks unique keys for every plan fragment while optimizing a query. Unique constraints defined on tables are known to the optimizer directly.
The query optimizer also derives unique keys from query semantics. For example, if a query groups by columns c, d, e, the optimizer knows that the result of this aggregation has {c, d, e} as a unique key.
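As a sketch (the table and column names are placeholders):

```sql
-- The GROUP BY guarantees at most one output row per (c, d, e) combination,
-- so the optimizer can treat {c, d, e} as a unique key of the result.
SELECT c, d, e, COUNT(*) AS row_count
FROM some_table
GROUP BY c, d, e;
```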
Consider a query in which a correlated subquery in the WHERE clause is executed for each row of the join result between customer and orders in the FROM clause, as in the sketch below.
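A hypothetical TPC-H-style query with this shape could look like the following; the exact query under discussion may differ in its select list and filters.

```sql
-- Keep only orders that are more expensive than the average order price
-- of the same customer. The subquery is correlated on c_custkey.
SELECT c.c_custkey, o.o_orderkey, o.o_totalprice
FROM customer c, orders o
WHERE o.o_custkey = c.c_custkey
  AND o.o_totalprice > (
      SELECT AVG(o2.o_totalprice)
      FROM orders o2
      WHERE o2.o_custkey = c.c_custkey
  );
```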
This approach may compute the subquery more often than necessary. A customer can have many orders, so the result of joining customer and orders can produce many rows per customer. However, the average order price for a customer is the same for all of these rows; recomputing it for every row is wasteful.
To avoid the redundant computation, one idea is to join customer with orders o2 first and compute the average price grouping by c_custkey, as in the sketch below.
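Expressed as plain SQL, this rewrite looks roughly like the following sketch (hypothetical TPC-H-style names as before; the optimizer applies the idea internally rather than as a user-visible rewrite):

```sql
-- Compute each customer's average order price once by joining customer
-- with orders o2 and grouping by c_custkey, then join the result back in.
WITH avg_per_cust AS (
    SELECT c2.c_custkey, AVG(o2.o_totalprice) AS avg_price
    FROM customer c2
    JOIN orders o2 ON o2.o_custkey = c2.c_custkey
    GROUP BY c2.c_custkey
)
SELECT c.c_custkey, o.o_orderkey, o.o_totalprice
FROM customer c
JOIN orders o ON o.o_custkey = c.c_custkey
JOIN avg_per_cust a ON a.c_custkey = c.c_custkey
WHERE o.o_totalprice > a.avg_price;
```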
This rewrite is only safe if c_custkey is unique. If there are duplicate entries in c_custkey, the join effectively puts all joined rows for every duplicate entry of the same value into one group when computing the aggregate function.

For example, assume a customer 123 has 10 orders and c_custkey contains two entries of 123. The first execution strategy runs the subquery for each row of c_custkey: a single 123 row is joined with 10 rows, the aggregate function is computed over these 10 rows, and this process is repeated twice. Using the join execution strategy, each 123 row is joined with 10 rows, and grouping by c_custkey puts all 20 rows for customer 123 into the same group for the aggregate function. Depending on the semantics of the aggregate function, this may lead to a wrong result.
What if c_custkey is not declared as unique? The query optimizer can only play it safe. The general solution introduces an extra step:

1. First, aggregate customer by c_custkey so the result is unique on c_custkey.
2. Join this deduplicated result with orders o2 and compute the average price grouping by c_custkey.
3. Join the result back with the original customer table to restore the original duplications.

If c_custkey is actually unique, the safe execution strategy above can lead to significant overhead: the first aggregation on c_custkey and the final restoring join are both a waste of time.
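As a conceptual SQL sketch, the safe strategy adds a deduplication before the aggregation and a restoring join afterwards (hypothetical names; the optimizer performs this internally):

```sql
WITH dedup_cust AS (
    -- Step 1: aggregate customer by c_custkey so the key is unique.
    SELECT DISTINCT c_custkey FROM customer
),
avg_per_cust AS (
    -- Step 2: compute the average over the deduplicated keys.
    SELECT d.c_custkey, AVG(o2.o_totalprice) AS avg_price
    FROM dedup_cust d
    JOIN orders o2 ON o2.o_custkey = d.c_custkey
    GROUP BY d.c_custkey
)
-- Step 3: join back to the original customer table to restore duplications.
SELECT c.c_custkey, o.o_orderkey, o.o_totalprice
FROM customer c
JOIN orders o ON o.o_custkey = c.c_custkey
JOIN avg_per_cust a ON a.c_custkey = c.c_custkey
WHERE o.o_totalprice > a.avg_price;
```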
For this specific query, though, another execution strategy exists. The query optimizer understands that the average order price is computed per customer, so it can simply compute the average on the orders table per o_custkey alone. Using this strategy, the optimizer no longer cares whether c_custkey is unique. However, this does not mean the optimizer produces the same plan with and without the unique constraint. There is more to consider.
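In SQL terms, the per-customer average under this strategy is computed on orders alone (a sketch):

```sql
-- The per-customer average never touches the customer table,
-- so duplicates in c_custkey cannot distort the aggregate.
SELECT o_custkey, AVG(o_totalprice) AS avg_price
FROM orders
GROUP BY o_custkey;
```

The rest of the query then joins this result back on the customer key, as in the earlier sketches.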
The remaining joins on c_custkey still need cardinality estimates. If c_custkey is not declared as unique, the optimizer produces a plan in which the estimated output of the first join between customer and orders is much larger than the orders table. In the next join at node [1], the optimizer follows these numbers and puts the smaller input on the build side of the join.
We know this is a bad join order on the TPC-H schema. The first join between customer and orders cannot produce more rows than orders; after all, the query selects some orders from all orders. But without knowing that c_custkey is unique, the optimizer cannot figure this out.
By declaring c_custkey unique, the query optimizer fixes the join order immediately. The constraint makes the optimizer treat the c_custkey column as fully distinct. With that, it knows that each of the 12248 rows on the other side of the join (node [6]) will join with at most one row. The join output row count is then estimated to be around 12248, a lot smaller than before.
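To check the effect on your own schema, you can compare the plan before and after declaring the constraint with EXPLAIN. The query below is only a sketch, and the plan output format depends on the engine version.

```sql
-- Inspect the chosen join order for the example join.
EXPLAIN
SELECT c.c_custkey, o.o_orderkey, o.o_totalprice
FROM customer c
JOIN orders o ON o.o_custkey = c.c_custkey;
```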