Collations
Learn how to work with collations in Firebolt.
Firebolt supports using collations through the `ICU_NORMALIZE` function. While we don’t support a native `COLLATE` clause yet, you can use `ICU_NORMALIZE` to get the same behavior.
This guide teaches you how to use this function to perform collation-aware aggregations, joins, and sorting. We also provide best practices for maximizing performance when working with collations.
Throughout this guide we use the `Any-Upper` collation because it’s easy to understand. In real queries, use the `UPPER(<x>)` function instead of `ICU_NORMALIZE(<x>, 'Any-Upper')`; it does the same thing but is much more efficient.
Basics on collations and ICU_NORMALIZE
When you calculate `<col_a> = <col_b>` on two text columns, Firebolt checks whether the underlying byte sequences are the same. For some workloads, you might want to perform more sophisticated string comparisons: a case-insensitive comparison, for example, or one that treats special characters in a specific way (such as having `Ü` and `U` compare equal). This can be done using collations, which define more complex equivalence classes on `TEXT` values. Under a collation, many strings with different byte sequences can compare equal.
`ICU_NORMALIZE` takes any text value from a collation’s equivalence class and maps it to the same representative. After normalizing, you can perform regular byte-based text comparison: comparing the normalized text values byte by byte behaves the same as comparing the original text values with the collation.
Let’s look at this in a simple example. We use a case-insensitive collation:
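Here is a minimal sketch of such an example. The exact schema and rows of `test_examples` are assumptions, chosen so that every word appears in exactly two case variants:

```sql
-- Sample table: every word appears in exactly two case variants.
CREATE TABLE test_examples (id INT, t TEXT);
INSERT INTO test_examples VALUES
    (1, 'firebolt'), (2, 'FireBolt'),
    (3, 'data'),     (4, 'DATA');

-- Compare every pair of rows, byte-based and collation-based.
SELECT
    ex1.t AS t1,
    ex2.t AS t2,
    ex1.t = ex2.t AS byte_eq,
    ICU_NORMALIZE(ex1.t, 'Any-Upper') = ICU_NORMALIZE(ex2.t, 'Any-Upper') AS collation_eq
FROM test_examples ex1
CROSS JOIN test_examples ex2
WHERE ex1.id < ex2.id
ORDER BY ex1.id, ex2.id;
```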
With the sample rows above, you can see that `collation_eq` is true in more cases than `byte_eq`. This is because we now compare the normalized text values, which is equivalent to case-insensitive comparison.
`ICU_NORMALIZE` is your basic building block for collation-aware processing. You can use it to perform more complex operations.
Collation-aware aggregations
You can use `ICU_NORMALIZE` as a `GROUP BY` key to perform collation-aware aggregations. In the above example, you might want to know the minimum and maximum `id` for every equivalence class. You can express this simply in SQL:
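A sketch of such an aggregation, assuming the `test_examples` table from the sketch above:

```sql
SELECT
    ICU_NORMALIZE(t, 'Any-Upper') AS normalized_t,
    MIN(id) AS min_id,
    MAX(id) AS max_id
FROM test_examples
GROUP BY ICU_NORMALIZE(t, 'Any-Upper');
```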
This query returns one result row for every equivalence class. We can tell from the `min_id`/`max_id` values that every result group contains the correct two input rows.
Collation-aware joins
You can also use `ICU_NORMALIZE` to perform collation-aware joins. For each row in `test_examples`, let’s count how many rows in the table compare equal in a case-insensitive way. We can write the following query to compute this:
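A sketch of such a join, again against the assumed `test_examples` table:

```sql
SELECT
    ex1.t,
    COUNT(*) AS equal_rows
FROM test_examples ex1
JOIN test_examples ex2
    ON ICU_NORMALIZE(ex1.t, 'Any-Upper') = ICU_NORMALIZE(ex2.t, 'Any-Upper')
GROUP BY ex1.t;
```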
As expected, every row has one other row (and itself) that compares equal under the collation, so `equal_rows` is 2 for every row of the sample table.
In the query above, we select the original text `ex1.t` while using the normalized version in the join condition. This is a common pattern when working with `ICU_NORMALIZE`: you’ll often want to use the normalized values for comparisons and joins, but return the original text in your results.
Collation-aware sorting
You can also use `ICU_NORMALIZE` to perform collation-aware sorting. As with equality comparisons, sorting on text values happens by comparing the UTF-8 byte sequences. This means that, for example, `'B'` is considered less than `'a'`. The following examples perform byte-based and case-insensitive sorting:
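A sketch of both orderings, again on the assumed sample table:

```sql
-- Byte-based sorting: uppercase letters sort before lowercase ones,
-- because 'B' (0x42) is smaller than 'a' (0x61) in UTF-8.
SELECT t FROM test_examples ORDER BY t;

-- Case-insensitive sorting: order by the normalized value instead.
-- The tie-breaker on t keeps rows within an equivalence class in a stable order.
SELECT t FROM test_examples ORDER BY ICU_NORMALIZE(t, 'Any-Upper'), t;
```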
Making collations fast
Computing `ICU_NORMALIZE` is expensive. The following computes the `Latin-ASCII` collation on ten million short strings:
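A sketch of such a benchmark, under the assumption that the `GENERATE_SERIES` table-valued function is available to generate the input; the aggregation keeps the result set small, so the query mostly measures the normalization itself:

```sql
-- Normalize ten million generated short strings and aggregate the result.
SELECT COUNT(DISTINCT ICU_NORMALIZE(CAST(x AS TEXT), 'Latin-ASCII'))
FROM (SELECT * FROM GENERATE_SERIES(1, 10000000)) AS series(x);
```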
Let’s look at the telemetry from running `EXPLAIN (ANALYZE)` for such a query; note that results can vary on different hardware. We spend about 30 seconds of CPU time computing the scalar ICU normalization and hash. Hashing is cheap, so most of that time is spent in `ICU_NORMALIZE`. This means we only get about 300k rows/s of throughput for this function.
If you know that you mostly care about specific collations on your data, store a normalized version on ingest. Depending on your use case, you might have to store the original data as well, as normalization cannot be reversed. If you use multiple collations, you can store multiple columns with different normalizations. When updating your source data, make sure to update the normalized values as well.
Let’s look at this in an example:
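A sketch of this pattern; the `events` table, the `t_upper` column, and the `staging_events` source are hypothetical names for illustration:

```sql
-- Store the normalized value alongside the original text.
CREATE TABLE events (
    id      INT,
    t       TEXT,
    t_upper TEXT  -- holds ICU_NORMALIZE(t, 'Any-Upper'), maintained on ingest
);

-- Ingest time: compute the normalization once, when the data is written.
INSERT INTO events
SELECT id, t, ICU_NORMALIZE(t, 'Any-Upper')
FROM staging_events;  -- hypothetical staging table

-- Query time: a cheap byte-based comparison on the materialized column
-- replaces an expensive ICU_NORMALIZE call on every row.
SELECT COUNT(*)
FROM events
WHERE t_upper = ICU_NORMALIZE('Firebolt', 'Any-Upper');  -- constant normalized once
```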
By storing the normalized values in your base table, you move the overhead from query time to ingest time. In our measurements, querying the materialized collations is ~300 times faster than computing `ICU_NORMALIZE` on read.