Reference material for APACHE_DATASKETCHES_HLL_BUILD
The APACHE_DATASKETCHES_HLL_BUILD function creates a new HLL (HyperLogLog) sketch from a dataset.
This is particularly useful for large datasets where exact counting is computationally expensive.
Multiple sketches can be merged into a single sketch using the aggregate function APACHE_DATASKETCHES_HLL_MERGE.
To estimate the final distinct count from a sketch, use the scalar function APACHE_DATASKETCHES_HLL_ESTIMATE.
APACHE_DATASKETCHES_HLL_BUILD requires less memory than exact count distinct aggregation, but also introduces statistical error. For more information, see the Apache HyperLogLog sketch docs.
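
To give a rough feel for the size of that statistical error, the sketch below evaluates the textbook HyperLogLog standard error, 1.04 / sqrt(m), for m = 2^precision buckets. This is the classic bound from the HyperLogLog literature and is used here as an assumption; it is not a documented guarantee of this function, and the Apache DataSketches implementation may behave differently. The helper name is hypothetical.

```python
import math


def textbook_hll_relative_error(precision: int) -> float:
    """Classic HyperLogLog standard error: 1.04 / sqrt(m), m = 2**precision.

    Assumption: textbook formula only, not a guarantee of the
    APACHE_DATASKETCHES_HLL_BUILD implementation.
    """
    buckets = 2 ** precision
    return 1.04 / math.sqrt(buckets)


# Compare a few precision values inside the documented 12-20 range.
for precision in (12, 16, 20):
    error = textbook_hll_relative_error(precision)
    print(f"precision={precision} approx_relative_error={error:.4%}")
```

Under this assumption, each increase of the precision by 2 halves the expected relative error while quadrupling the number of buckets.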
Parameter | Description | Supported input types |
---|---|---|
<expression> | Any column name or function that returns a column name. | INTEGER , BIGINT , REAL , DOUBLE , TEXT |
<hll_precision> | Optional. Literal integer value for precision. If not set explicitly, a precision of 12 is used. Range: 12-20. This value is the base-2 logarithm of the size of the sketch. For more information, see Apache HyperLogLog sketch. | INTEGER |
<hll_type> | Optional. Literal text value that sets the HLL type. If not included, the default is 'HLL_4'. Valid values are: 'HLL_4', 'HLL_6', 'HLL_8'. These values represent different levels of compression of the final HLL array, where 4, 6 and 8 refer to the number of bits each bucket of the HLL array is compressed down to. HLL_4 is the most compressed variant but is generally slightly slower than the other two, especially during merge operations. | TEXT |
<text_utf16_little_endian> | Optional. Literal boolean value that specifies the text encoding when <expression> is of the TEXT data type. The default, false, means UTF-8 encoding; true means UTF-16 little-endian encoding. | BOOLEAN |
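
The interaction between <hll_precision> and <hll_type> can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the HLL array holds 2^precision buckets and that each bucket takes exactly the number of bits named by the type. It ignores any header bytes or auxiliary structures the real sketch may carry, so treat the numbers as a lower-bound approximation; the function and constant names are hypothetical.

```python
# Bits per bucket for each documented <hll_type> value.
BITS_PER_BUCKET = {"HLL_4": 4, "HLL_6": 6, "HLL_8": 8}


def approx_hll_array_bytes(hll_precision: int = 12, hll_type: str = "HLL_4") -> int:
    """Approximate size in bytes of the bucket array alone.

    Assumption: 2**hll_precision buckets, no header or exception storage.
    """
    if not 12 <= hll_precision <= 20:
        raise ValueError("hll_precision must be in the documented range 12-20")
    if hll_type not in BITS_PER_BUCKET:
        raise ValueError(f"unknown hll_type: {hll_type!r}")
    buckets = 2 ** hll_precision
    return buckets * BITS_PER_BUCKET[hll_type] // 8


# Print the approximate array size for the smallest and largest precision.
for precision in (12, 20):
    for hll_type in BITS_PER_BUCKET:
        size = approx_hll_array_bytes(precision, hll_type)
        print(f"precision={precision} type={hll_type} approx_bytes={size}")
```

With the defaults (precision 12, 'HLL_4') this approximation gives about 2 KiB for the bucket array, and about 1 MiB at precision 20 with 'HLL_8'.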
Return type: BYTEA

accurate_count (BIGINT) |
---|
3333334 |

accurate_count (BIGINT) |
---|
5000001 |

estimate (BIGINT) | sketch (BYTEA) |
---|---|
3333526 | \x0a0107120a180102… |
5001149 | \x0a0107120a180102… |

estimate (BIGINT) |
---|
6673219 |
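
The result tables above allow a quick check of how far the sketch estimates are from the exact counts. The snippet below is a small sketch of that arithmetic. It assumes that the first estimate (3333526) corresponds to the first exact count (3333334) and the second estimate (5001149) to the second exact count (5000001); that pairing is inferred from the order of the tables. The merged estimate is left out because no exact count for the merged data is shown.

```python
def relative_error(estimate: int, exact: int) -> float:
    """Absolute difference between estimate and exact count, relative to the exact count."""
    return abs(estimate - exact) / exact


# (estimate, exact) pairs taken from the result tables; the pairing is an assumption.
pairs = [(3333526, 3333334), (5001149, 5000001)]

for estimate, exact in pairs:
    error = relative_error(estimate, exact)
    print(f"estimate={estimate} exact={exact} relative_error={error:.4%}")
```

Under that pairing, both estimates fall within 0.03% of the exact counts.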