Understand how to organize data for efficient retrieval in Firebolt
Firebolt optimizes data storage and retrieval by using indexing, partitioning, and compute scaling to enhance query efficiency. Design your data model and use databases, tables, and indexes to optimize query performance.
Firebolt’s indexing and partitioning strategies are designed to take advantage of a cloud-based architecture that scales to handle very large data sets. Data is queried using multiple nodes for parallel processing. Data is also stored by columns, so queries read only the columns they need and benefit from efficient compression.
Firebolt also separates compute resources from storage resources so you can scale either up or down depending on your use case. Optimize your resources based on your changing workloads, and pay only for what you use.
Firebolt’s data modeling strategies work best with Firebolt’s managed tables, which leverage Firebolt’s performance optimizations. External tables, which allow users to access data without loading it into Firebolt, provide flexibility for querying external data sources like Amazon S3, but they do not offer the same performance benefits as managed tables.
The following sections show you how to use the preceding data modeling strategies to decrease the number of bytes scanned, improve query performance, reduce storage costs, and optimize compute resources.
Logical structure
In Firebolt, a database is a logical structure that organizes and stores data for efficient querying and management. Databases are created using the `CREATE DATABASE` statement and can be modified or deleted using `ALTER` and `DROP`.
- **Compute and storage separation**: Firebolt separates compute and storage layers, ensuring databases are efficiently managed without impacting processing power.
- **Data security**: Databases integrate with Firebolt’s data security layers, enforcing access controls and user permissions to safeguard sensitive information.
- **Performance and integration**
To create a database, use `CREATE DATABASE`, which requires only the name of the database, along with an engine to run queries against it. Firebolt provides two types of engines.
The following code example creates a `test_library` database with an optional description:
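A minimal sketch of such a statement, where the description text is illustrative:

```sql
CREATE DATABASE IF NOT EXISTS test_library
WITH DESCRIPTION = 'Sample database for the library examples';  -- description text is illustrative
```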
After you create a database, you can create additional objects and run queries within it. You can also modify the database parameters.
The following code example modifies the description of a database:
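A sketch of such a statement; the exact clause for updating the description is an assumption and may vary by Firebolt version:

```sql
ALTER DATABASE test_library
SET DESCRIPTION = 'Updated description for the library examples';  -- clause shown is an assumption
```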
You can delete a database with DROP DATABASE, which permanently deletes the database and all its contents. Since this action cannot be undone, use it carefully. Create a backup before dropping the database to prevent data loss.
The following code example uses the optional `IF EXISTS` clause to see if the `test_library` database exists, and then deletes it:
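A minimal sketch:

```sql
DROP DATABASE IF EXISTS test_library;
```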
Active queries will continue using the latest version of the database, but new queries submitted after the database is dropped will fail. To minimize disruptions, monitor active queries and notify users in advance.
Efficient database design in Firebolt is key to optimizing query performance, managing scalability, and ensuring workload isolation. By using databases to logically separate data and workloads from each other, you can reduce the impact of schema changes, minimize query latency, and enhance access control through role-based policies. For example, when managing data for applications that operate across multiple regions, understanding when to use a single database versus separate databases can significantly affect performance and operational efficiency. The following best practices outline how to structure databases in Firebolt to achieve these goals:
The fictional company “Ultra Fast Gaming Inc.” plans to expand into a second region, where its game will generate data with the same structure and format as the first region, but with potentially overlapping primary keys.
The developers have two options: store the new region’s data in the existing database, or create a separate database for the new region.
**Recommended solution**: Create a separate database for the new region. A dedicated database prevents conflicts from overlapping primary keys and keeps each region’s workload isolated.
By creating a dedicated database for each region, the developers ensure optimal performance, scalability, and data management tailored to the needs of the UltraFast game expansion.
You can use the code in the Firebolt DDL Performance GitHub repository to measure the efficiency of your database design. Use the code to generate a set of PDF plots of query latency to evaluate your design choices.
A schema serves as a blueprint for how data is organized within a database, acting as a logical framework that groups tables, views, and indexes. Understanding how to structure and manage schemas is crucial for maintaining an organized database environment, making it easier to optimize queries and manage access. While schemas don’t directly improve query performance, the way you organize your data within them can significantly impact how efficiently queries run, especially as your data grows.
To enhance query performance, simplify access control, and ensure scalability as your data grows, use the following best practices:
In Firebolt, tables are the key components for organizing and storing data. They consist of rows (records) and columns (attributes), making them an integral tool in data modeling. Firebolt supports two main types of tables: Firebolt-managed tables and external tables.
Tables that are fully controlled by Firebolt make the best use of Firebolt’s optimization strategies. There are two types of managed tables, which serve different but complementary purposes: fact tables and dimension tables.
By designing tables with both fact and dimension roles, users can optimize data management for reporting and analytics.
Fact tables
The following code example creates a fact table:
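A minimal sketch of such a table; the column names and types are assumptions based on the columns referenced elsewhere in this section:

```sql
CREATE FACT TABLE IF NOT EXISTS borrowedbooks (
    transaction_id BIGINT,           -- one row per borrowing transaction
    book_id        BIGINT,
    borrower_id    BIGINT,
    checkout_date  DATE,
    return_date    DATE,
    late_fee       DOUBLE PRECISION  -- columns and types are illustrative
);
```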
The internal structure of a fact table facilitates fast access to data even for very large datasets.
Dimension tables
Dimension tables are often replicated across all nodes, which enhances query performance and ensures quick access to frequently referenced information.
The following code example creates a dimension table:
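A minimal sketch; the column list is illustrative:

```sql
CREATE DIMENSION TABLE IF NOT EXISTS books (
    book_id          BIGINT,
    title            TEXT,
    author           TEXT,
    genre            TEXT,
    publisher        TEXT,    -- columns and types are illustrative
    publication_year INTEGER
);
```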
In the previous example, the `books` table stores details about the publication and is ideal for joining with fact tables like `borrowedbooks` to analyze borrowing patterns or book popularity.
External tables allow users to access and query data stored outside the database, such as in Amazon S3, without loading it into Firebolt. This capability is particularly useful when working with large datasets stored externally that don’t require frequent access, enabling cost-efficient querying and avoiding data duplication. External tables generally have poorer performance compared to Firebolt managed tables because the data is not stored within Firebolt’s optimized infrastructure.
After you create a table, you can change its ownership or delete it, including all of its dependencies. Use the following SQL commands to manage your tables:
Change table ownership
To change the ownership of a table, use the ALTER TABLE statement with the OWNER TO clause.
The following command assigns a new owner to the `borrowedbooks` table:
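A sketch of the command; `library_admin` is a hypothetical user:

```sql
ALTER TABLE borrowedbooks OWNER TO library_admin;  -- "library_admin" is a hypothetical user name
```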
Delete a managed table
Dropping a Firebolt managed table permanently deletes it along with all its data.
The following code checks to see if the `borrowedbooks` table exists and then deletes it:
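A minimal sketch:

```sql
DROP TABLE IF EXISTS borrowedbooks;
```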
Delete a managed table that has dependencies
If the table that you want to delete has dependent objects such as views or aggregating indexes, you can use the `CASCADE` option to remove all dependencies together.
The following code deletes the `borrowedbooks` table and all related objects:
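A minimal sketch:

```sql
DROP TABLE borrowedbooks CASCADE;  -- also removes dependent views and aggregating indexes
```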
Drop an external table
If you drop an external table, you only remove its definition from Firebolt. The actual data remains in the external source. The following code drops the `external_books` external table:
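A minimal sketch:

```sql
DROP TABLE IF EXISTS external_books;  -- removes only the table definition; the data in Amazon S3 is untouched
```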
Primary indexes optimize query performance by logically organizing data for efficient access based on key columns, using sparse indexing to avoid unnecessary scans. This allows Firebolt’s engine to prune data during query execution, minimizing the amount of data read from disk and accelerating queries. By selecting columns frequently used in filters, the primary index enables fast, efficient data retrieval, which is crucial for large-scale, data-intensive applications. Properly configuring primary indexes ensures that Firebolt can maintain high performance, even with complex queries and large datasets.
How Firebolt uses a primary index to optimize performance
When new data is inserted into a table, Firebolt organizes it into tablets, which are logical partitions of the table. Each tablet holds a portion of the data, sorted according to the primary index. During query processing, Firebolt uses these indexes to eliminate blocks of rows that don’t match the query predicates and scan only the necessary data, minimizing input and output operations and optimizing performance.
For updates, Firebolt follows a “delete-and-insert” approach: the original row is marked for deletion, and the updated row is inserted into a new tablet. Deleted rows are not removed immediately but are flagged and later cleaned up during maintenance tasks.
A primary index can only be defined when creating a new table, so if you need to modify the index, you’ll have to create a new table. To define a primary index, use the `PRIMARY INDEX` clause to specify your columns, as shown in the following example:
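A sketch of such a statement, reusing the illustrative `borrowedbooks` columns from earlier in this section (assumes any previous version of the table has been dropped):

```sql
CREATE TABLE borrowedbooks (
    transaction_id BIGINT,
    book_id        BIGINT,
    borrower_id    BIGINT,
    checkout_date  DATE,
    return_date    DATE,
    late_fee       DOUBLE PRECISION
)
PRIMARY INDEX book_id, checkout_date;  -- composite primary index on the filter columns
```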
In the previous example, `CREATE TABLE` creates a fact table by default, and the primary index acts as a composite index of both the `book_id` and `checkout_date` columns.
You can also create a primary index on a dimension table as shown in the following code example:
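A sketch using a hypothetical `borrowers` dimension table, which is not defined elsewhere in this section:

```sql
CREATE DIMENSION TABLE borrowers (
    borrower_id     BIGINT,  -- "borrowers" and its columns are hypothetical
    name            TEXT,
    email           TEXT,
    membership_date DATE
)
PRIMARY INDEX borrower_id;
```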
The primary index should include columns that are often used in queries that filter or aggregate data, including those that use `WHERE` and `GROUP BY` clauses, so that Firebolt can use these columns for efficient retrieval. Firebolt physically clusters the data in the table by the columns of the primary index. The primary index significantly improves query performance because it reduces the time spent scanning irrelevant rows and allows the query engine to retrieve only the necessary data.
To optimize query performance and maximize indexing efficiency, use the following best practices for designing composite primary indexes:
Aggregating indexes precompute and store the results of aggregate functions like `SUM`, `COUNT`, and `AVG`, as well as more complex calculations. When the underlying data changes, Firebolt recalculates aggregating indexes automatically, so that they are always up to date when you query them. Firebolt handles the following scenarios:
When a query uses an aggregating index, Firebolt reads the pre-calculated values instead of computing them in real time. This reduces the computational burden at runtime and significantly speeds up query response times, especially for large datasets with high concurrency demands. Aggregating indexes are especially useful for frequently run queries that involve repeated aggregation operations, such as calculating totals, averages, or counts.
Aggregating indexes do require additional storage, because the precomputed data needs to be maintained. In write-heavy environments, frequent updates, inserts, or deletes can lead to increased computational overhead as the indexes must be recalculated and kept up to date, which can result in higher compute costs. Weigh the performance benefits of faster queries against the additional storage and processing costs, especially for frequently changing datasets.
Aggregating indexes can be created when a new table is made or afterward. You can define them as needed based on query patterns and performance requirements.
The following code example shows how to create an aggregating index to precompute the number of transactions per borrower and their average late fee on the existing `borrowedbooks` table created in the previous Fact tables section under Firebolt managed tables:
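A sketch of such an index; the index name is illustrative:

```sql
CREATE AGGREGATING INDEX borrower_activity_agg_idx ON borrowedbooks (
    borrower_id,            -- grouping column
    COUNT(transaction_id),  -- number of transactions per borrower
    AVG(late_fee)           -- average late fee per borrower
);
```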
When Firebolt runs a query that accesses either `transaction_id` or `late_fee`, it retrieves the precomputed results from the aggregating index, rather than computing them.
The following code example shows how to create an aggregating index to precompute the total late fees accumulated for each book:
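A sketch, again with an illustrative index name:

```sql
CREATE AGGREGATING INDEX book_late_fee_agg_idx ON borrowedbooks (
    book_id,       -- grouping column
    SUM(late_fee)  -- total late fees accumulated per book
);
```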
Additionally, aggregating indexes integrate seamlessly with Firebolt’s partitioning strategies, further improving query performance by allowing the query engine to access only the relevant partitions. This reduces the amount of data scanned and processed, particularly when dealing with large, partitioned datasets. The combination of partition pruning and aggregate indexing helps achieve superior query performance in data-intensive environments, allowing for quicker insights and more efficient use of system resources.
To optimize query performance and manage resources effectively, follow these best practices for creating and maintaining aggregating indexes:
- Create aggregating indexes for queries that use aggregate functions such as `SUM`, `COUNT`, and `AVG` and that are run repeatedly.
- Include the columns that those queries group or filter by, such as `borrower_id` or `book_id` from the examples in this section.
- Tailor each index to the columns your queries actually use, such as `customer_id` or `order_id` from the examples in this section. Avoid creating multiple indexes with overlapping aggregations on the same columns to minimize unnecessary overhead and costs.

Firebolt’s table partitioning improves query performance and data lifecycle management by physically organizing data according to values in one or more partition key columns — commonly date, region, or category. Effective partitioning reduces the volume of data scanned during queries by enabling partition pruning, where irrelevant partitions are skipped entirely. This improves performance, and also makes operations like vacuuming and archiving more efficient.
To optimize query performance and data lifecycle management, follow these best practices when defining and maintaining table partitions:
- Choose partition key columns with stable, fixed types: types such as `DATE` or `INTEGER` make it easier to reason about and control partition cardinality — such as maintaining one partition per day over the last three years, or one partition per month over the last 80 years. In contrast, variable-length types like `STRING` can evolve unpredictably. For example, a `product_category` column that starts with a dozen values may grow to thousands of distinct entries over time, unintentionally ballooning the number of partitions and hurting performance.

Partitioning is defined during table creation. This ensures that data is physically organized from the outset.
The following code example creates a new table with a primary index, and partitions the table by month:
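A sketch of such a table; the `DATE_TRUNC` partition expression mirrors the `RECOMMEND_DDL` output shown later in this section, and the column list is illustrative (assumes any earlier version of the table has been dropped):

```sql
CREATE TABLE borrowedbooks (
    transaction_id BIGINT,
    book_id        BIGINT,
    borrower_id    BIGINT,
    checkout_date  DATE,
    return_date    DATE,
    late_fee       DOUBLE PRECISION
)
PRIMARY INDEX book_id, checkout_date
PARTITION BY DATE_TRUNC('month', checkout_date);  -- one partition per calendar month
```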
In this example, the table is partitioned by the month of `checkout_date`, allowing Firebolt to skip entire months of data when queries filter by date ranges.
To drop an existing partition, use `ALTER TABLE ... DROP PARTITION`. For example, to remove all records for January 2023:
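A sketch, assuming the monthly partitioning shown above:

```sql
-- '2023-01-01' identifies the partition produced by DATE_TRUNC('month', checkout_date) for January 2023
ALTER TABLE borrowedbooks DROP PARTITION '2023-01-01';
```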
This removes the physical partition corresponding to data from that month.
If you understand your data and query patterns, you should select high-cardinality columns frequently used in `WHERE`, `JOIN`, or `GROUP BY` clauses to minimize the amount of data scanned and improve query efficiency.
If you don’t know how to effectively select a primary index or partition your data, you can use Firebolt’s `RECOMMEND_DDL` tool to provide automated insights. `RECOMMEND_DDL` makes recommendations to optimize database performance by analyzing your query patterns and suggesting the most efficient configurations for primary indexes and partitioning. By examining historical query data, the tool identifies columns frequently used in filtering, grouping, or aggregating operations and recommends appropriate primary indexes and partition keys. These suggestions help reduce the amount of data scanned during queries, enabling faster execution and improved resource utilization.
`RECOMMEND_DDL` is particularly useful in complex environments where query patterns evolve over time. By reviewing historical query data, Firebolt identifies columns that are frequently used in filtering or aggregation and recommends appropriate primary index and partitioning strategies.
The following code example uses `RECOMMEND_DDL` to analyze query patterns on the `books` table, created in the **Dimension tables** section under Firebolt managed tables, based on queries run in the past week:
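A sketch of such a call; the exact invocation form and the query-history view and column names are assumptions, so check the `RECOMMEND_DDL` reference for the precise syntax:

```sql
CALL recommend_ddl(
    books,
    (
        SELECT query_text
        FROM information_schema.engine_query_history  -- view and column names are assumptions
        WHERE submitted_time > CURRENT_TIMESTAMP - INTERVAL '7 days'
    )
);
```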
If the `books` table is frequently queried based on `genre` and `book_id`, Firebolt’s `RECOMMEND_DDL` command might provide the following example output:
| recommended_partition_key | recommended_primary_index | average_pruning_improvement | analyzed_queries |
|---|---|---|---|
| DATE_TRUNC('month', borrow_date) | book_id, borrower_id | 0.35 | 200 |
The example output under `recommended_partition_key` suggests partitioning the `borrowedbooks` table by month based on the `borrow_date` column. The `recommended_primary_index` column suggests creating a primary index on the `book_id` and `borrower_id` columns. An average pruning improvement of 0.35 indicates that 35% less data will be scanned on average by applying these recommendations. The `analyzed_queries` column shows that 200 queries were analyzed to generate these suggestions.