Learn how to work with collations in Firebolt.
Firebolt provides the `ICU_NORMALIZE` function. While we don't support a native `COLLATE` clause yet, you can use `ICU_NORMALIZE` to get the same behaviour.
This guide teaches you how to use this function to perform collation-aware aggregations, joins, and sorting.
We also provide best practices around maximizing performance when working with collations.
Throughout this guide, we use the `Any-Upper` collation as it's easy to understand. In real queries, use the `UPPER(<x>)` function instead of `ICU_NORMALIZE(<x>, 'Any-Upper')`. It does the same thing but is much more efficient.
When you write `<col_a> = <col_b>` on two text columns, Firebolt checks whether the underlying byte sequences are the same.
For some workloads, you might want to perform more sophisticated string comparisons.
You might want to perform a case-insensitive comparison, or treat special characters in a specific way (e.g. having `Ü` and `U` compare the same). This can be done using collations, which define more complex equivalence classes on `TEXT` values.
When using a collation, there can be many strings with different byte sequences that compare equal.
`ICU_NORMALIZE` takes any text value from a collation's equivalence class and maps it to the same representative. After normalizing, you can perform regular byte-based text comparison: byte-based comparison of the normalized text values behaves the same as collation-based comparison of the original text values.
Let's look at this in a simple example. We use a case-insensitive collation.
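The original code sample is not shown here; a minimal sketch of such a query, with made-up sample values, could look like this:

```sql
-- Compare pairs of text values byte-based and collation-aware.
-- The sample values are illustrative, not from the original guide.
SELECT
    a,
    b,
    a = b AS byte_eq,
    ICU_NORMALIZE(a, 'Any-Upper') = ICU_NORMALIZE(b, 'Any-Upper') AS collation_eq
FROM (
    SELECT 'Firebolt' AS a, 'firebolt' AS b
    UNION ALL
    SELECT 'Firebolt', 'Firebolt'
    UNION ALL
    SELECT 'Firebolt', 'Databend'
) t;
```

For the pair `'Firebolt'` / `'firebolt'`, `byte_eq` is false, but both values normalize to `'FIREBOLT'`, so `collation_eq` is true.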
In the result set, `collation_eq` is true in more cases than `byte_eq`. This is because we compare the normalized text values, which is equivalent to a case-insensitive comparison.
`ICU_NORMALIZE` is your basic building block for collation-aware processing. You can use it to perform more complex operations.
You can use `ICU_NORMALIZE` as a `GROUP BY` key to perform collation-aware aggregations.
In the above example, you might want to know the minimum and maximum `id` for every equivalence class, which you can express directly in SQL.
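The original query is not shown here; a sketch of such an aggregation, assuming a table `test_examples(id INT, t TEXT)` like the one in the earlier example, could look like this:

```sql
-- Group rows by their normalized text value, so that all members of
-- a case-insensitive equivalence class land in the same group.
SELECT
    ICU_NORMALIZE(t, 'Any-Upper') AS normalized_t,
    MIN(id) AS min_id,
    MAX(id) AS max_id
FROM test_examples
GROUP BY ICU_NORMALIZE(t, 'Any-Upper');
```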
Comparing the `min`/`max` values against the input data confirms that every result group contains the correct two input rows.
You can also use `ICU_NORMALIZE` to perform collation-aware joins.
For each row in `test_examples`, let's count how many rows in the table compare equal in a case-insensitive way. We can compute this with a single query.
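The original query is not shown here; a sketch of such a self-join, assuming the same `test_examples(id INT, t TEXT)` table, could look like this:

```sql
-- Self-join on the normalized values: two rows match when their
-- text compares equal case-insensitively (each row also matches itself).
SELECT
    ex1.id,
    ex1.t,
    COUNT(*) AS case_insensitive_matches
FROM test_examples ex1
JOIN test_examples ex2
  ON ICU_NORMALIZE(ex1.t, 'Any-Upper') = ICU_NORMALIZE(ex2.t, 'Any-Upper')
GROUP BY ex1.id, ex1.t;
```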
Note that we return the original text value `ex1.t` while using the normalized version in the join condition.
This is a common pattern when working with `ICU_NORMALIZE`: you'll often want to use the normalized values for comparisons and joins, but return the original text in your results.
You can also use `ICU_NORMALIZE` to perform collation-aware sorting.
As with equality comparisons, sorting on text values happens by comparing the UTF-8 byte sequences. This means that, for example, `'B'` is considered less than `'a'`.
The following examples perform byte-based and case-insensitive sorting.
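The original queries are not shown here; a sketch of the two orderings, assuming the same `test_examples(id INT, t TEXT)` table, could look like this:

```sql
-- Byte-based sorting: uppercase ASCII letters sort before lowercase ones,
-- so 'B' comes before 'a'.
SELECT t
FROM test_examples
ORDER BY t;

-- Case-insensitive sorting: order by the normalized value, using the
-- original text as a tie-breaker to keep the ordering deterministic.
SELECT t
FROM test_examples
ORDER BY ICU_NORMALIZE(t, 'Any-Upper'), t;
```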
Note that `ICU_NORMALIZE` is expensive.
The following computes the `Latin-ASCII` collation on ten million short strings.
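The original benchmark query is not shown here; a sketch of such a measurement, assuming a `GENERATE_SERIES` table-valued function and `::TEXT` casts are available, could look like this:

```sql
-- Normalize ten million short strings; the aggregation ensures the
-- normalization work cannot be skipped.
SELECT COUNT(DISTINCT ICU_NORMALIZE(n::TEXT, 'Latin-ASCII'))
FROM GENERATE_SERIES(1, 10000000) AS s(n);
```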
Looking at the output of `EXPLAIN (ANALYZE)` for the above query (results can vary on different hardware), most of the time is spent in `ICU_NORMALIZE`. This means we only have about 300k rows/s of throughput for this function.
If you know that you mostly care about specific collations on your data, store a normalized version on ingest.
Depending on your use case, you might have to store the original data as well, as normalization cannot be reversed.
If you use multiple collations, you can store multiple columns with different normalizations.
When updating your source data, make sure to update the normalized values as well.
Let's look at this in an example.
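The original example is not shown here; a sketch of normalizing on ingest, using hypothetical `products` and `staging_products` tables, could look like this:

```sql
-- Store both the original text and its normalized form; the
-- normalized column is populated once at write time.
CREATE TABLE products (
    id        INT,
    name      TEXT,
    name_norm TEXT  -- holds ICU_NORMALIZE(name, 'Any-Upper'), maintained on write
);

INSERT INTO products
SELECT id, name, ICU_NORMALIZE(name, 'Any-Upper')
FROM staging_products;

-- Queries can then compare against the precomputed column directly:
SELECT id, name
FROM products
WHERE name_norm = ICU_NORMALIZE('firebolt', 'Any-Upper');
```

Remember to recompute `name_norm` whenever `name` changes, as noted above.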
With the normalized values stored in the table, queries can compare, join, and sort on the precomputed column instead of calling `ICU_NORMALIZE` on read.