Primary indexes

‌Firebolt uses primary indexes to physically sort data into the Firebolt File Format (F3). The index colocates similar values, which allows data to be pruned at query runtime. When you query a table, rather than scanning the whole data set, Firebolt uses the table’s index to prune the data. Unnecessary ranges of data are never loaded from disk. Firebolt reads only the relevant ranges of data to produce query results.

Primary indexes in Firebolt are a type of sparse index. Unlike a dense index that maps every search key value in a file, a sparse index is a smaller construct that holds only one entry per data block (a compressed range of rows). By using the primary index to read a much smaller and highly compressed range of data from F3 into the engine cache at query runtime, Firebolt produces query results much faster with less disk I/O.

The video below explains sparse indexing. Eldad Farkash is the CEO of Firebolt.

How you create a primary index

To define a primary index, you use the PRIMARY INDEX clause within a CREATE TABLE statement. Although they are optional, we strongly recommend them.

The basic syntax of a PRIMARY INDEX clause within a CREATE TABLE statement is shown in the example below.

CREATE [FACT|DIMENSION] TABLE <table_name> (
  <column1> <datatype>,
  <column2> <datatype>,
  <column3> <datatype>,
     ...
)
PRIMARY INDEX <column1> [, <...column2>];

Primary indexes can’t be modified

After you create a table, you can’t modify the primary index. To change the index, you must drop the table and recreate it.

How to choose primary index columns

The columns that you choose for the primary index and the order in which you specify them are important. By choosing a good a primary index, scan sizes can be reduced drastically. This in turn leads to much faster query execution.

If you have already defined a workload that you want to run in Firebolt, try out the CALL RECOMMEND_DDL command to find suitable primary index and partition key configurations.

Include columns used in selective predicates

The most commonly used columns that are used in a query’s WHERE clause can help to prune data. Depending on the predicates the columns are used in, Firebolt can prune more or less data.

For a predicate like WHERE event_ts < now(), almost all rows might satisfy the filter condition. This is called a predicate with low selectivity. Meanwhile, a predicate such as WHERE user_id = 49327 is often highly selective. There can be millions of users, and only very few rows might correspond to the user with id 49327.

Columns that are used in predicates with highly selective filters will allow Firebolt to prune the most data. This in turn leads to very fast query execution. When creating a primary index, choose columns used in highly selective predicates.

Add low-cardinality columns to the beginning of the Primary Index

In Firebolt’s F3 file format, the primary index for a table defines the sort order of data blocks on S3. F3 sorts the data in dictionary order.

If you define a primary index (a, b), this means that the data blocks on S3 are ordered by a. Rows with the same values of a are consecutive and ordered by b.

If column a has low cardinality (i.e. few distinct values) there will be long ordered runs of b in the F3 files. If column a has high cardinality (i.e. many distinct values), there will be almost no ordered runs of b.

Firebolt’s pruning is most effective on long runs of ordered data. To get good pruning for both a and b, it’s important that a has few distinct values. If a is a high-cardinality column with lots of distinct values, predicates on b will not be able to prune effectively.

For compound primary indexes, start with low-cardinality columns that have at most a few dozen distinct values. Once you get to high-cardinality columns, choose the one which is used in many queries with highly selective predicates.

Include as many columns as you need

The number of columns that you specify in the index won’t negatively affect query performance. Additional columns might slow down ingestion very slightly, but the benefit for flexibility and performance of analytics queries will almost certainly outweigh any impact to ingestion performance.

Consider how you alter values in WHERE clauses

The primary index isn’t effective if Firebolt can’t determine whether the data in the sparse index matches a predicate. If the WHERE clause in your query contains a complex expression that transforms the column values, Firebolt might not be able to use the index. Consider a table with the primary index definition shown below, where playerid is a INTEGER data type in a table named players.

  PRIMARY INDEX playerid

In the example analytics query over the players table, Firebolt can’t use the primary index with the WHERE clause. This is because playerid is wrapped in a call to UPPER on the left side of the comparison. To satisfy the conditions of comparison, Firebolt must read all values of playerid to apply the UPPER function.

SELECT
  playerid
FROM
  players
WHERE
  UPPER(playerid) LIKE AA%;

In contrast, Firebolt can use the primary index in the following example:

SELECT
  playerid
FROM
  players
WHERE
  playerid LIKE AAA%;

If you know that you will use a function in a predicate ahead of time, consider creating a virtual column to store the result of the function. You can then use that virtual column in your index and queries. This is particularly useful for hashing columns.

With a star schema, include join key columns in the fact table index

If you have a star schema with a fact table referring to many dimension tables, include the join keys (the foreign key columns) in the primary index of the fact table. This helps accelerate queries because Firebolt can use the join keys for data pruning in the F3 format.

Conversely, on the dimension table side, there is no benefit to including the join key in the dimension table primary index unless you use it as a filter on the dimension table itself.

Using partitions with primary indexes

In most cases, partitioning isn’t necessary because of the efficiency of primary indexes (and aggregating indexes). If you use partitions, the partition column is the first stage of sorting. Firebolt divides the table data into file segments according to the PARTITION BY definition. Then, within each of those segments, Firebolt applies the primary index to prune and sort the data into even smaller data ranges as described above.

For more information, see Working with partitions.

Primary index examples

This section demonstrates different primary indexes created on a fact table, totalscore, created with the DDL and sample values shown below.

Example fact table

The examples in this section are based on the fact table below. Each record stores the current level and high score of a player, identified by their playerid and nickname.

Table DDL

CREATE FACT TABLE player_information (
  registeredon DATE,
  playerid TEXT,
  nickname TEXT NOT NULL,
  currentscore BIGINT,
  level INTEGER NOT NULL
)
PRIMARY INDEX <see examples below>;

Table contents (excerpt)

registeredon level playerid nickname currentscore
2018-05-30 1 78152 kennethpark 137
2020-11-13 2 57328 sabrina21 104
2020-07-11 3 44963 rileyjon 111
2019-09-06 4 70147 ymatthews 49

Cardinality of columns

A COUNT DISTINCT query on each column returns the following. A higher number indicates higher cardinality.

distinct_dates distinct_levels distinct_currentscores
1461 35 89664

Example query pattern—date-based queries

Consider the two example queries below that return values with date-based filters.

Query 1

SELECT
  *
FROM
  player_registry
WHERE
  registeredon BETWEEN '2020-01-01' AND '2020-01-02'
  AND playerid = "11386"
  AND event_type = 'click'

Query 2

SELECT
  count(*),
  registeredon
FROM
  players
WHERE
  EXTRACT(YEAR FROM registeredon) = 2021

For both queries, the best primary index is:

PRIMARY INDEX (registeredon, playerid, event_type)
  • With visit_date in the first position in the primary index, Firebolt sorts and compresses records most efficiently for these date-based queries.
  • The addition of playerid in the second position and event_type in the third position further compresses data and accelerates query response.
  • playerid is in the second position because it has higher cardinality than event_type, which has only three possible values.

For query 2, you can improve performance further by partitioning the table according to year as shown in the query excerpt below.

PRIMARY INDEX (registeredon, playerid, event_type)
PARTITION BY (EXTRACT (YEAR FROM registeredon))

Without the partition, Firebolt likely must scan across file segments to return results for the year 2021. With the partition, segments exist for each year, and Firebolt can read all results from a single segment. If the query runs on a multi-node engine, the benefit may be greater. Firebolt can avoid pulling data from multiple engine nodes for results.

Example query pattern—customer-based query

Consider the example query below that returns the sum of click values for a particular player registry.

SELECT
  playerid,
  nickname,
  event_type,
  SUM(event_value)
FROM
  events
WHERE
  playerid = "14493"
  AND event_type = 'click'
  AND event_value > 0
GROUP BY
	1,
  2,
  3;

For this query, the best primary index is:

PRIMARY INDEX (playerid, asset_id, event_type)
  • playerid is in the first position because it has the highest cardinality and sorts and prunes data most efficiently for this query.
  • The addition of asset_id won’t accelerate this particular query, but adding it is not detrimental.
  • Although event_type has low cardinality, because it’s contained in the WHERE clause, adding it to the primary index has some benefit.

Example—using virtual columns

Virtual columns are most often used in a primary index to:

  • Accommodate functions that alter column values.
  • Calculate hash values for columns that contain long strings.

A virtual-column example for a function that transforms a column value is shown below.

Step 1—create the fact table with the virtual column in the index

The example DDL below creates a fact table similar to the one earlier in this section. However, it adds the upper_customer_id column. The table creates this virtual column to store the result of an UPPER function that upper-cases customer_id values during the INSERT INTO operation.

The PRIMARY INDEX clause uses the upper_customer_id column because that column is used in analytics queries.

CREATE FACT TABLE events_log (
  visit_date DATE,
  asset_id TEXT,
  customer_id TEXT NOT NULL,
  event_type TEXT,
  event_count INTEGER NOT NULL,
  uppder_customer_id TEXT NOT NULL
)
PRIMARY INDEX visit_date, upper _customer_id;

Step 2—use the function during ingestion (INSERT INTO statement)

INSERT INTO
  player_registry 
SELECT
  registeredon,
  asset_id,
  playerid,
  event_type,
  event_count,
  UPPER(playerid) AS upper_player_id
FROM
  players;

Step 3—query using the virtual column in predicates

The example SELECT query below uses the virtual column to produce query results and benefits from the index.

SELECT
  playerid
FROM
  players
WHERE
  upper_player_id LIKE AAA%;