Understand how to organize data for efficient retrieval in Firebolt
Firebolt optimizes data storage and retrieval by using indexing, partitioning, and compute scaling to enhance query efficiency. Design your data model and use databases, tables, and indexes to optimize query performance.
Firebolt’s indexing and partitioning strategies are designed to take advantage of a cloud-based architecture that scales to handle very large data sets. Data is queried using multiple nodes for parallel processing. Data is also stored by columns, so queries read only the columns they need and benefit from efficient compression.
Firebolt also separates compute resources from storage resources so you can scale either up or down depending on your use case. Optimize your resources based on your changing workloads, and pay only for what you use.
Firebolt’s data modeling strategies work best with Firebolt’s managed tables, which leverage Firebolt’s performance optimizations. External tables, which allow users to access data without loading it into Firebolt, provide flexibility for querying external data sources like Amazon S3, but they do not offer the same performance benefits as managed tables.
The following sections show you how to use the preceding data modeling strategies to decrease the number of bytes scanned, improve query performance, reduce storage costs, and optimize compute resources.
Logical structure
In Firebolt, a database is a logical structure that organizes and stores data for efficient querying and management. Databases are created using the `CREATE DATABASE` statement and can be modified or deleted using `ALTER` and `DROP`.
- **Compute and storage separation**: Firebolt separates compute and storage layers, ensuring databases are efficiently managed without impacting processing power.
- **Data security**: Databases integrate with Firebolt’s data security layers, enforcing access controls and user permissions to safeguard sensitive information.
- **Performance and integration**
To create a database, use `CREATE DATABASE`, which requires only the name of the database, along with an engine to run queries against it. Firebolt provides two types of engines.
The following code example creates a `test_library` database with an optional description:
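A minimal sketch of such a statement, where the description text is illustrative:

```sql
CREATE DATABASE IF NOT EXISTS test_library
WITH DESCRIPTION = 'Sample database for the library examples';  -- description text is illustrative
```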
After you create a database, you can create additional objects and run queries within it. You can also modify the database parameters.
The following code example modifies the description of a database:
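A sketch of such a statement; the exact clause for updating the description is an assumption and may vary by Firebolt version:

```sql
ALTER DATABASE test_library
SET DESCRIPTION = 'Updated description for the library examples';  -- clause shown is an assumption
```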
You can delete a database with DROP DATABASE, which permanently deletes the database and all its contents. Since this action cannot be undone, use it carefully. Create a backup before dropping the database to prevent data loss.
The following code example uses the optional `IF EXISTS` clause to see if the `test_library` database exists, and then deletes it:
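A minimal sketch:

```sql
DROP DATABASE IF EXISTS test_library;
```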
Active queries will continue using the latest version of the database, but new queries submitted after the database is dropped will fail. To minimize disruptions, monitor active queries and notify users in advance.
Efficient database design in Firebolt is key to optimizing query performance, managing scalability, and ensuring workload isolation. By using databases to logically separate data and workloads from each other, you can reduce the impact of schema changes, minimize query latency, and enhance access control through role-based policies. For example, when managing data for applications that operate across multiple regions, understanding when to use a single database versus separate databases can significantly affect performance and operational efficiency. The following best practices outline how to structure databases in Firebolt to achieve these goals:
The fictional company “Ultra Fast Gaming Inc.” plans to expand into a second region, where its game will generate data with the same structure and format as the first region, but with potentially overlapping primary keys.
The developers have two options: store the new region’s data in the existing database, or create a separate database for the new region.
**Recommended solution**: Create a separate database for the new region. A dedicated database prevents conflicts from overlapping primary keys and keeps each region’s workload isolated.
By creating a dedicated database for each region, the developers ensure optimal performance, scalability, and data management tailored to the needs of the UltraFast game expansion.
You can use the code in the Firebolt DDL Performance GitHub repository to measure the efficiency of your database design. Use the code to generate a set of PDF plots of query latency to evaluate your design choices.
A schema serves as a blueprint for how data is organized within a database, acting as a logical framework that groups tables, views, and indexes. Understanding how to structure and manage schemas is crucial for maintaining an organized database environment, making it easier to optimize queries and manage access. While schemas don’t directly improve query performance, the way you organize your data within them can significantly impact how efficiently queries run, especially as your data grows.
To enhance query performance, simplify access control, and ensure scalability as your data grows, use the following best practices:
In Firebolt, tables are the key components for organizing and storing data. They consist of rows (records) and columns (attributes), making them an integral tool in data modeling. Firebolt supports two main types of tables: Firebolt-managed tables and external tables.
Tables that are fully controlled by Firebolt make the best use of Firebolt’s optimization strategies. There are two types of managed tables, which serve different but complementary purposes: fact tables and dimension tables.
By designing tables with both fact and dimension roles, users can optimize data management for reporting and analytics.
Fact tables
The following code example creates a fact table:
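A minimal sketch of such a table; the column names and types are assumptions based on the columns referenced elsewhere in this section:

```sql
CREATE FACT TABLE IF NOT EXISTS borrowedbooks (
    transaction_id BIGINT,           -- one row per borrowing transaction
    book_id        BIGINT,
    borrower_id    BIGINT,
    checkout_date  DATE,
    return_date    DATE,
    late_fee       DOUBLE PRECISION  -- columns and types are illustrative
);
```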
The internal structure of a fact table facilitates fast access to data even for very large datasets.
Dimension tables
Dimension tables are often replicated across all nodes, which enhances query performance and ensures quick access to frequently referenced information.
The following code example creates a dimension table:
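A minimal sketch; the column list is illustrative:

```sql
CREATE DIMENSION TABLE IF NOT EXISTS books (
    book_id          BIGINT,
    title            TEXT,
    author           TEXT,
    genre            TEXT,
    publisher        TEXT,    -- columns and types are illustrative
    publication_year INTEGER
);
```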
In the previous example, the `books` table stores details about the publication and is ideal for joining with fact tables like `borrowedbooks` to analyze borrowing patterns or book popularity.
External tables allow users to access and query data stored outside the database, such as in Amazon S3, without loading it into Firebolt. This capability is particularly useful when working with large datasets stored externally that don’t require frequent access, enabling cost-efficient querying and avoiding data duplication. External tables generally have poorer performance compared to Firebolt managed tables because the data is not stored within Firebolt’s optimized infrastructure.
After you create a table, you can change its ownership or delete it, including all of its dependencies. Use the following SQL commands to manage your tables:
Change table ownership
To change the ownership of a table, use the ALTER TABLE statement with the OWNER TO clause.
The following command assigns a new owner to the `borrowedbooks` table:
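A sketch of the command; `library_admin` is a hypothetical user:

```sql
ALTER TABLE borrowedbooks OWNER TO library_admin;  -- "library_admin" is a hypothetical user name
```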
Delete a managed table
Dropping a Firebolt managed table permanently deletes it along with all its data.
The following code checks to see if the `borrowedbooks` table exists and then deletes it:
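A minimal sketch:

```sql
DROP TABLE IF EXISTS borrowedbooks;
```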
Delete a managed table that has dependencies
If the table that you want to delete has dependent objects such as views or aggregating indexes, you can use the `CASCADE` option to remove all dependencies together.
The following code deletes the `borrowedbooks` table and all related objects:
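A minimal sketch:

```sql
DROP TABLE borrowedbooks CASCADE;  -- also removes dependent views and aggregating indexes
```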
Drop an external table
If you drop an external table, you only remove its definition from Firebolt. The actual data remains in the external source. The following code drops the `external_books` external table:
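A minimal sketch:

```sql
DROP TABLE IF EXISTS external_books;  -- removes only the table definition; the data in Amazon S3 is untouched
```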
Primary indexes optimize query performance by logically organizing data for efficient access based on key columns, using sparse indexing to avoid unnecessary scans. This allows Firebolt’s engine to prune data during query execution, minimizing the amount of data read from disk and accelerating queries. By selecting columns frequently used in filters, the primary index enables fast, efficient data retrieval, which is crucial for large-scale, data-intensive applications. Properly configuring primary indexes ensures that Firebolt can maintain high performance, even with complex queries and large datasets.
How Firebolt uses a primary index to optimize performance
When new data is inserted into a table, Firebolt organizes it into tablets, which are logical partitions of the table. Each tablet holds a portion of the data, sorted according to the primary index. During query processing, Firebolt uses these indexes to eliminate blocks of rows that don’t match the query predicates and scan only the necessary data, minimizing input and output operations and optimizing performance.
For updates, Firebolt follows a “delete-and-insert” approach: the original row is marked for deletion, and the updated row is inserted into a new tablet. Deleted rows are not removed immediately but are flagged and later cleaned up during maintenance tasks.
A primary index can only be defined when creating a new table, so if you need to modify the index, you’ll have to create a new table. To define a primary index, use the `PRIMARY INDEX` clause to specify your columns, as shown in the following example:
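A sketch of such a statement, reusing the illustrative `borrowedbooks` columns from earlier in this section (assumes any previous version of the table has been dropped):

```sql
CREATE TABLE borrowedbooks (
    transaction_id BIGINT,
    book_id        BIGINT,
    borrower_id    BIGINT,
    checkout_date  DATE,
    return_date    DATE,
    late_fee       DOUBLE PRECISION
)
PRIMARY INDEX book_id, checkout_date;  -- composite primary index on the filter columns
```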
In the previous example, `CREATE TABLE` creates a fact table by default, and the primary index acts as a composite index of both the `book_id` and `checkout_date` columns.
You can also create a primary index on a dimension table as shown in the following code example:
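A sketch using a hypothetical `borrowers` dimension table, which is not defined elsewhere in this section:

```sql
CREATE DIMENSION TABLE borrowers (
    borrower_id     BIGINT,  -- "borrowers" and its columns are hypothetical
    name            TEXT,
    email           TEXT,
    membership_date DATE
)
PRIMARY INDEX borrower_id;
```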
The primary index should include columns that are often used in queries that filter or aggregate data, including those that use `WHERE` and `GROUP BY` clauses, so that Firebolt can use these columns for efficient retrieval. Firebolt physically clusters the data in the table by the columns of the primary index. The primary index significantly improves query performance because it reduces the time spent scanning irrelevant rows and allows the query engine to retrieve only the necessary data.
To optimize query performance and maximize indexing efficiency, use the following best practices for designing composite primary indexes:
Aggregating indexes precompute and store the results of aggregate functions like `SUM`, `COUNT`, and `AVG`, as well as more complex calculations. When the underlying data changes, Firebolt recalculates aggregating indexes automatically, so that they are always up to date when you query them. Firebolt handles the following scenarios:
When a query uses an aggregating index, Firebolt reads the pre-calculated values instead of computing them in real time. This reduces the computational burden at runtime and significantly speeds up query response times, especially for large datasets with high concurrency demands. Aggregating indexes are especially useful for frequently run queries that involve repeated aggregation operations, such as calculating totals, averages, or counts.
Aggregating indexes do require additional storage, because the precomputed data needs to be maintained. In write-heavy environments, frequent updates, inserts, or deletes can lead to increased computational overhead as the indexes must be recalculated and kept up to date, which can result in higher compute costs. Weigh the performance benefits of faster queries against the additional storage and processing costs, especially for frequently changing datasets.
Aggregating indexes can be created when a new table is made or afterward. You can define them as needed based on query patterns and performance requirements.
The following code example shows how to create an aggregating index to precompute the number of transactions per borrower and their average late fee on the existing `borrowedbooks` table created in the previous Fact tables section under Firebolt managed tables:
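A sketch of such an index; the index name is illustrative:

```sql
CREATE AGGREGATING INDEX borrower_activity_agg_idx ON borrowedbooks (
    borrower_id,            -- grouping column
    COUNT(transaction_id),  -- number of transactions per borrower
    AVG(late_fee)           -- average late fee per borrower
);
```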
When Firebolt runs a query that accesses either `transaction_id` or `late_fee`, it retrieves the precomputed results from the aggregating index, rather than computing them.
The following code example shows how to create an aggregating index to precompute the total late fees accumulated for each book:
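A sketch, again with an illustrative index name:

```sql
CREATE AGGREGATING INDEX book_late_fee_agg_idx ON borrowedbooks (
    book_id,       -- grouping column
    SUM(late_fee)  -- total late fees accumulated per book
);
```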
Additionally, aggregating indexes integrate seamlessly with Firebolt’s partitioning strategies, further improving query performance by allowing the query engine to access only the relevant partitions. This reduces the amount of data scanned and processed, particularly when dealing with large, partitioned datasets. The combination of partition pruning and aggregate indexing helps achieve superior query performance in data-intensive environments, allowing for quicker insights and more efficient use of system resources.
To optimize query performance and manage resources effectively, follow these best practices for creating and maintaining aggregating indexes:
- Create aggregating indexes for queries that use aggregate functions such as `SUM`, `COUNT`, and `AVG` and that are run repeatedly.
- Include the columns that those queries group or filter by, such as `borrower_id` or `book_id` from the examples in this section.
- Tailor each index to the columns your queries actually use, such as `customer_id` or `order_id` from the examples in this section. Avoid creating multiple indexes with overlapping aggregations on the same columns to minimize unnecessary overhead and costs.

Firebolt’s table partitioning improves query performance and data lifecycle management by physically organizing data according to values in one or more partition key columns — commonly date, region, or category. Effective partitioning reduces the volume of data scanned during queries by enabling partition pruning, where irrelevant partitions are skipped entirely. This improves performance, and also makes operations like vacuuming and archiving more efficient.
To optimize query performance and data lifecycle management, follow these best practices when defining and maintaining table partitions:
- Choose partition key columns with stable, fixed types: types such as `DATE` or `INTEGER` make it easier to reason about and control partition cardinality — such as maintaining one partition per day over the last three years, or one partition per month over the last 80 years. In contrast, variable-length types like `STRING` can evolve unpredictably. For example, a `product_category` column that starts with a dozen values may grow to thousands of distinct entries over time, unintentionally ballooning the number of partitions and hurting performance.

Partitioning is defined during table creation. This ensures that data is physically organized from the outset.
The following code example creates a new table with a primary index, and partitions the table by month:
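A sketch of such a table; the `DATE_TRUNC` partition expression mirrors the `RECOMMEND_DDL` output shown later in this section, and the column list is illustrative (assumes any earlier version of the table has been dropped):

```sql
CREATE TABLE borrowedbooks (
    transaction_id BIGINT,
    book_id        BIGINT,
    borrower_id    BIGINT,
    checkout_date  DATE,
    return_date    DATE,
    late_fee       DOUBLE PRECISION
)
PRIMARY INDEX book_id, checkout_date
PARTITION BY DATE_TRUNC('month', checkout_date);  -- one partition per calendar month
```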
In this example, the table is partitioned by the month of `checkout_date`, allowing Firebolt to skip entire months of data when queries filter by date ranges.
To drop an existing partition, use `ALTER TABLE ... DROP PARTITION`. For example, to remove all records for January 2023:
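A sketch, assuming the monthly partitioning shown above:

```sql
-- '2023-01-01' identifies the partition produced by DATE_TRUNC('month', checkout_date) for January 2023
ALTER TABLE borrowedbooks DROP PARTITION '2023-01-01';
```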
This removes the physical partition corresponding to data from that month.
If you understand your data and query patterns, you should select high-cardinality columns frequently used in `WHERE`, `JOIN`, or `GROUP BY` clauses to minimize the amount of data scanned and improve query efficiency.
If you don’t know how to effectively select a primary index or partition your data, you can use Firebolt’s `RECOMMEND_DDL` tool to provide automated insights. `RECOMMEND_DDL` makes recommendations to optimize database performance by analyzing your query patterns and suggesting the most efficient configurations for primary indexes and partitioning. By examining historical query data, the tool identifies columns frequently used in filtering, grouping, or aggregating operations and recommends appropriate primary indexes and partition keys. These suggestions help reduce the amount of data scanned during queries, enabling faster execution and improved resource utilization.
`RECOMMEND_DDL` is particularly useful in complex environments where query patterns evolve over time. By reviewing historical query data, Firebolt identifies columns that are frequently used in filtering or aggregation and recommends appropriate primary index and partitioning strategies.
The following code example uses `RECOMMEND_DDL` to analyze query patterns on the `books` table, created in the **Dimension tables** section under Firebolt managed tables, based on queries run in the past week:
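A sketch of such a call; the exact invocation form and the query-history view and column names are assumptions, so check the `RECOMMEND_DDL` reference for the precise syntax:

```sql
CALL recommend_ddl(
    books,
    (
        SELECT query_text
        FROM information_schema.engine_query_history  -- view and column names are assumptions
        WHERE submitted_time > CURRENT_TIMESTAMP - INTERVAL '7 days'
    )
);
```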
If the `books` table is frequently queried based on `genre` and `book_id`, Firebolt’s `RECOMMEND_DDL` command might provide the following example output:
| recommended_partition_key | recommended_primary_index | average_pruning_improvement | analyzed_queries |
|---|---|---|---|
| DATE_TRUNC('month', borrow_date) | book_id, borrower_id | 0.35 | 200 |
The example output under `recommended_partition_key` suggests partitioning the `borrowedbooks` table by month based on the `borrow_date` column. The `recommended_primary_index` column suggests creating a primary index on the `book_id` and `borrower_id` columns. An average pruning improvement of 0.35 indicates that 35% less data will be scanned on average by applying these recommendations. The `analyzed_queries` column shows that 200 queries were analyzed to generate these suggestions.