HLL_COUNT_DISTINCT
Counts the approximate number of unique or not NULL values, to the precision specified. HLL_COUNT_DISTINCT
uses the HLL++ algorithm and allows you to control the sketch size set precision.
HLL_COUNT_DISTINCT
requires less memory than exact aggregation functions, but also introduces statistical uncertainty. The default precision is 12, with a maximum of 20.
Higher precision comes at a memory and performance cost.
Syntax
HLL_COUNT_DISTINCT ( <expression> [, <precision> ] )
Parameter | Description | Supported input types |
---|---|---|
<expression> | Valid values for the expression include column names or functions that return a column name. | Any type |
<precision> | Optional integer value to set precision. If not included, the default precision is 12. Precision range: 12-20. | INTEGER , BIGINT |
Return Type
BIGINT
APPROX_COUNT_DISTINCT(expression)
andHLL_COUNT_DISTINCT(expression, 17)
return the same results, asAPPROX_COUNT_DISTINCT
uses the HLL algorithm with the default parameter to control the sketch size set to 17.
Return Type
NUMERIC
Example
To understand the difference between COUNT(DISTINCT pk)
with exact precision enabled, APPROX_COUNT_DISTNCT(pk)
, and HLL_COUNT_DISTINCT(pk, <precision>)
, consider a table, count_test
with 8,388,608 unique pk
values.
SELECT
COUNT(DISTINCT pk) as count_distinct,
APPROX_COUNT_DISTINCT(pk) as approx_count
HLL_COUNT_DISTINCT(pk, 12) as hll12_count,
HLL_COUNT_DISTINCT(pk, 20) as hll20_count
FROM
count_test;
Returns:
Assuming 8,388,608 unique pk values, we will see results like:
' +----------------+--------------+-------------+-------------+
' | count_distinct | approx_count | hll12_count | hll20_count |
' +----------------+--------------+-------------+-------------+
' | 8,388,608 | 8,427,387 | 8,667,274 | 8,377,014 |
' +----------------+--------------+-------------+-------------+
where approx_count is using precision 17, hll12_count is using precision 12, and hll20_count is using precision 20, the most precise.