Using primary indexes
- How you create a primary index
- Primary indexes can’t be modified
- How to choose primary index columns
- Using partitions with primary indexes
- Primary index examples
Firebolt uses primary indexes to physically sort data into the Firebolt File Format (F3). The index also colocates similar values, which allows data to be pruned at query runtime. When you query a table, rather than scanning the whole data set, Firebolt uses the table’s index to prune the data. Unnecessary ranges of data are never loaded from disk. Firebolt reads only the relevant ranges of data to produce query results.
Primary indexes in Firebolt are a type of sparse index. Unlike a dense index that maps every search key value in a file, a sparse index is a smaller construct that holds only one entry per data block (a compressed range of rows). By using the primary index to read a much smaller and highly compressed range of data from F3 into the engine cache at query runtime, Firebolt produces query results much faster with less disk I/O.
The video below explains sparse indexing. Eldad Farkash is the CEO of Firebolt.
How you create a primary index
To define a primary index, you use the
PRIMARY INDEX clause within a
CREATE TABLE statement. Although they are optional, we strongly recommend them.
The basic syntax of a
PRIMARY INDEX clause within a
CREATE TABLE statement is shown in the example below.
CREATE [FACT|DIMENSION] TABLE <table_name> ( <colname_1> <datatype>, <colname_2> <datatype>, <colname_3> <datatype>, ... ) PRIMARY INDEX <colname_1> [, <...colname_N>];
Primary indexes can’t be modified
After you create a table, you can’t modify the primary index. To change the index, you must drop the table and recreate it.
How to choose primary index columns
The columns that you choose for the primary index and the order in which you specify them are important. Use the following recommendations to guide your choices. To see these guidelines in action, see Primary index examples.
Include columns used in WHERE clauses
Include all columns that are used in query
WHERE clauses to filter query results.
Consider including columns used in GROUP BY clauses
Consider adding columns that you use in
GROUP BY statements with aggregate functions.
Order columns in the index definition by cardinality
Specify columns in order of how frequently they’re used in
WHERE clauses and in descending order of cardinality. In other words, in the first position (
<colname_1> in the syntax above) specify the column that filters results the most. Then specify remaining columns in descending order of how much they filter.
Avoid specifying a column of the highest cardinality—that is, a column that has truly unique values or the primary key—unless you use that column in query
WHERE clauses. Also avoid specifying columns of low cardinality that won’t adequately filter results.
Include as many columns as you need
The number of columns that you specify in the index won’t negatively affect query performance. Additional columns might slow down ingestion very slightly, but the benefit for flexibility and performance of analytics queries will almost certainly outweigh any impact to ingestion performance.
Consider how you alter values in WHERE clauses
The primary index isn’t effective if Firebolt can’t determine the values in the index column. If the
WHERE clause in your query contains a function that transforms the column values, Firebolt can’t use the index. Consider a table with the primary index definition shown below, where
asset_id is a
TEXT data type in a table named
PRIMARY INDEX asset_id
In the example analytics query over the
events_log table, Firebolt can’t use the primary index with the
WHERE clause. This is because the function with
asset_id is on the left side of the comparison. To satisfy the conditions of comparison, Firebolt must read all values of
asset_id to apply the
SELECT asset_id FROM events_log WHERE UPPER(asset_id) LIKE ‘AA%’;
In contrast, Firebolt can use the primary index in the following example:
SELECT asset_id FROM events_log WHERE asset_id LIKE ‘AAA%’;
If you know that you will use a function in a predicate ahead of time, consider creating a virtual column to store the result of the function. You can then use that virtual column in your index and queries. This is particularly useful for hashing columns.
With a star schema, include join key columns in the fact table index
If you have a star schema with a fact table referring to many dimension tables, include the join keys (the foreign key columns) in the primary index of the fact table. This helps accelerate queries because the Firebolt query planner uses join keys as a predicate.
Conversely, on the dimension table side, there is no benefit to including the join key in the dimension table primary index unless you use it as a filter on the dimension table itself.
Using partitions with primary indexes
In most cases, partitioning isn’t necessary because of the efficiency of primary indexes (and aggregating indexes). If you use partitions, the partition column is the first stage of sorting. Firebolt divides the table data into file segments according to the
PARTITION BY definition. Then, within each of those segments, Firebolt applies the primary index to prune and sort the data into even smaller data ranges as described above.
For more information, see Working with partitions.
Primary index examples
This section demonstrates different primary indexes created on a fact table,
site_sales, created with the DDL and sample values shown below.
Example fact table
The examples in this section are based on the fact table below. The table is a web log with hundreds of millions of rows. Each record stores an
event_count of each
event_type for HTML elements identified by
asset_id and the
customer_id of the visitor that initiated the event. The table is denormalized. No single column has unique values.
CREATE FACT TABLE events_log ( visit_date DATE, asset_id TEXT, customer_id TEXT NOT NULL, event_type TEXT, event_count INTEGER NOT NULL ) PRIMARY INDEX <see examples below>;
Table contents (excerpt)
+------------+--------------------------------------+-------------+------------+-------------+ | visit_date | asset_id | customer_id | event_type | event_count | +------------+--------------------------------------+-------------+------------+-------------+ | 2018-05-30 | a974ff70-3367-4460-bd2e-e26de9439469 | 78152 | click | 137 | | 2020-11-13 | 3d58b0a0-f838-428b-8f1c-2ff30aa9b9ea | 57328 | mouseover | 104 | | 2020-07-11 | e8c533a4-b039-44df-b3a1-61fdc3d6c21d | 44963 | mouseout | 111 | | 2019-09-06 | 02333518-5a39-4c11-a1a7-ed5e4163cb04 | 70147 | click | 49 | | 2019-05-04 | 83f3a7bc-f6ca-4511-a8fb-74683e2b25cc | 58458 | mouseover | 127 | | 2021-03-19 | 1664be68-d09e-4beb-a5b7-1889ebd06cfb | 40360 | mouseout | 43 | | 2018-05-01 | 37858981-bbea-46bf-8d85-0e9a73062137 | 47880 | mouseover | 101 | | 2018-06-19 | c5e7882b-3639-42ee-93f0-8bf5c6e99b14 | 74728 | mouseover | 141 | | 2018-02-28 | 97ac6f85-3bbc-4894-811f-7584304c84f9 | 84802 | mouseout | 15 | | ... | | | | | +------------+--------------------------------------+-------------+------------+-------------+
Cardinality of columns
COUNT DISTINCT query on each column returns the following. A higher number indicates higher cardinality.
+----------------+-----------------+--------------------+-----------------+ | distinct_dates | distinct_assets | distinct_customers | distinct_events | +----------------+-----------------+--------------------+-----------------+ | 1461 | 300 | 89664 | 3 | +----------------+-----------------+--------------------+-----------------+
Example query pattern—date-based queries
Consider the two example queries below that return values with date-based filters.
SELECT * FROM events_log WHERE visit_date BETWEEN '2020-01-01' AND '2020-01-02' AND customer_id = "11386" AND event_type = 'click'
SELECT count(*), visit_date FROM events_log WHERE EXTRACT(YEAR FROM visit_date) = ‘2021’
For both queries, the best primary index is:
PRIMARY INDEX (visit_date, customer_id, event_type)
visit_datein the first position in the primary index, Firebolt sorts and compresses records most efficiently for these date-based queries.
- The addition of
customer_idin the second position and
event_typein the third position further compresses data and accelerates query response.
customer_idis in the second position because it has higher cardinality than
event_type, which has only three possible values.
asset_idis not used in query filters, so it is omitted.
For query 2, you can improve performance further by partitioning the table according to year as shown in the query excerpt below.
PRIMARY INDEX (visit_date, customer_id, event_type) PARTITION BY (EXTRACT (YEAR FROM visit_date))
Without the partition, Firebolt likely must scan across file segments to return results for the year 2021. With the partition, segments exist for each year, and Firebolt can read all results from a single segment. If the query runs on a multi-node engine, the benefit may be greater. Firebolt can avoid pulling data from multiple engine nodes for results.
Example query pattern—customer-based query
Consider the example query below that returns the sum of
click values for a particular customer.
SELECT asset_id, customer_id, event_type, SUM(event_value) FROM events WHERE customer_id = "14493" AND event_type = 'click' AND event_value > 0 GROUP BY 1, 2, 3;
For this query, the best primary index is:
PRIMARY INDEX (customer_id, asset_id, event_type)
customer_idis in the first position because it has the highest cardinality and sorts and prunes data most efficiently for this query.
- The addition of
asset_idwon’t accelerate this particular query, but adding it is not detrimental.
event_typehas low cardinality, because it’s contained in the
WHEREclause, adding it to the primary index has some benefit.
Example—using virtual columns
Virtual columns are most often used in a primary index to:
- Accommodate functions that alter column values.
- Calculate hash values for columns that contain long strings.
A virtual-column example for a function that transforms a column value is shown below.
Step 1—create the fact table with the virtual column in the index
The example DDL below creates a fact table similar to the one earlier in this section. However, it adds the
upper_customer_id column. The table creates this virtual column to store the result of an
UPPER function that upper-cases
customer_id values during the
INSERT INTO operation.
PRIMARY INDEX clause uses the
upper_customer_id column because that column is used in analytics queries.
CREATE FACT TABLE events_log ( visit_date DATE, asset_id TEXT, customer_id TEXT NOT NULL, event_type TEXT, event_count INTEGER NOT NULL, uppder_customer_id TEXT NOT NULL ) PRIMARY INDEX visit_date, upper _customer_id;
Step 2—use the function during ingestion (
INSERT INTO statement)
INSERT INTO events_log SELECT visit_date, asset_id, customer_id, event_type, event_count, UPPER(customer_id) AS upper_customer_id FROM ext_tbl_events;
Step 3—query using the virtual column in predicates
SELECT query below uses the virtual column to produce query results and benefits from the index.
SELECT customer_id FROM events_log WHERE upper_customer_id LIKE ‘AAA%’;