Learn how to migrate your workflow from Redshift SORT and DIST keys to Firebolt’s dynamic indexing, denormalization, and flexible JSON handling, which replace manual distribution keys and rigid schemas for faster, simpler queries. This guide shows you how to:
- Replace SORTKEY and DISTKEY tuning with dynamic primary and aggregating indexes for faster query performance and automatic data pruning.
- Store JSON as TEXT and query nested fields using JSON functions without rigid schemas.
- Use TEXT for strings and JSON data, and ensure numeric precision and date compatibility.
Feature | Redshift | Firebolt | Recommendations |
---|---|---|---|
Compute and storage | Monolithic cluster with tightly coupled compute and storage. | Decoupled architecture with virtualized compute engines and separate storage. | Use multiple engines to separate workloads and improve flexibility. |
Scalability and elasticity | Resizing requires downtime and offers limited concurrency scaling. | Elastic scaling with independent compute and storage. | Assign separate engines for ingestion, analytics, and ETL to optimize resource use. |
Workload isolation | Shared cluster causes resource contention. | Fully isolated workloads with multiple independent engines. | Use dedicated engines to prevent interference between queries. |
Cost model | Billed for cluster uptime, with additional charges for concurrency scaling. | Pay-as-you-go pricing based on engine runtime. | Pause idle engines and right-size compute to control costs. |
Choosing the right engine TYPE is critical to optimizing performance and cost efficiency. Unlike Redshift, where all workloads run on the same cluster, Firebolt lets you assign separate engines for ingestion, analytics, and transformation tasks. To maximize efficiency, right-size engines based on workload needs, scale them independently of storage, and pause them when not in use to reduce costs.
Best practice for compute and storage | Impact | How to implement |
---|---|---|
Scale compute and storage independently. | Allows flexible resource allocation without over-provisioning. | Increase compute resources for analytics without expanding storage, and scale engines independently based on workload needs. |
Best practice for scalability and elasticity | Impact | How to implement |
---|---|---|
Use right-sized engines based on workload needs. | Optimizes performance and cost by allocating appropriate resources for each workload. | Start with small or medium engines for lighter workloads and scale up using ALTER ENGINE for high-concurrency or intensive tasks. |
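Scaling up with ALTER ENGINE might look like the following sketch; the engine name is hypothetical and the exact SET options depend on your Firebolt version:

```sql
-- Scale an engine up for a high-concurrency period (engine name is hypothetical).
ALTER ENGINE analytics_engine SET TYPE = M NODES = 4;
```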
Workload type | Recommended engine configuration |
---|---|
High-concurrency business intelligence (BI) | Use medium to large engines with high parallelism for supporting many concurrent queries efficiently. |
ETL and transformations | Use small to medium engines with moderate concurrency, optimized for data ingestion and transformations. |
Data science | Use small engines for iterative, ad-hoc queries or training datasets. |
Data-intensive applications | Use dedicated small engines with consistent SLA for low-latency queries. |
Ad-hoc analytics | Use on-demand engines that start quickly and scale based on query complexity. |
Best practice for workload isolation | Impact | How to implement |
---|---|---|
Use separate Firebolt engines for different workloads. | Prevents resource contention and improves workload isolation. | Assign dedicated engines for ingestion, analytics, and transformations to avoid competition for resources. |
Workload | Engine name | Configuration | Purpose |
---|---|---|---|
Data ingestion | ingestion_engine | TYPE = S, NODES = 1 | Continuous data loading. |
High-concurrency queries | analytics_engine | TYPE = M, NODES = 6 | Supports many concurrent queries. |
Real-time dashboards | dashboard_engine | TYPE = M, NODES = 2 | Low-latency SLA-sensitive queries. |
Ad-hoc analytics | ad_hoc_engine | TYPE = S, NODES = 2, AUTO_STOP | Cost-efficient for intermittent use. |
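The engine configurations in the table above could be created with statements like the following sketch; the WITH clause options shown are assumptions and may differ by Firebolt version:

```sql
-- One engine per workload, matching the table above.
CREATE ENGINE ingestion_engine WITH TYPE = S NODES = 1;
CREATE ENGINE analytics_engine WITH TYPE = M NODES = 6;
CREATE ENGINE dashboard_engine WITH TYPE = M NODES = 2;
-- AUTO_STOP pauses the engine after idle minutes (value here is illustrative).
CREATE ENGINE ad_hoc_engine   WITH TYPE = S NODES = 2 AUTO_STOP = 20;
```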
Best practice for scalability and elasticity | Impact | How to implement |
---|---|---|
Pause idle engines to reduce costs | Lowers operational costs while keeping data accessible. | Use auto-stop settings to pause unused engines and restart them on demand. |
Unlike Redshift, which relies on SORT and DIST keys for performance tuning, Firebolt automatically optimizes queries using primary and aggregating indexes. This eliminates the need for manual data distribution and sorting, reducing complexity while improving performance.
Additionally, semi-structured data handling differs. Redshift’s SUPER type allows nested JSON storage, while Firebolt stores JSON as TEXT, enabling flexible querying with JSON functions like JSON_VALUE and JSON_EXTRACT. Firebolt’s schema design also favors denormalization, reducing the need for complex joins and improving analytical query speed.
Feature | Redshift | Firebolt | Recommendations |
---|---|---|---|
Table design | Uses fact and dimension tables with explicit DISTKEY and SORTKEY properties for tuning. | Uses fact and dimension tables with indexing for optimized access. | Design fact tables for large datasets and dimension tables for reference data. |
Indexing | Uses SORTKEY and DISTKEY for performance tuning, requiring manual selection and optimization. | Uses primary and aggregating indexes instead. | Use primary indexes for efficient filtering and aggregating indexes for precomputed aggregations. |
Denormalization | Often normalized to reduce redundancy and optimize joins. | Encourages denormalization to reduce joins and improve query performance. | Store frequently joined reference columns in fact tables to minimize joins. |
Handling JSON | Supports the SUPER data type for nested JSON storage, queried with JSON_EXTRACT_PATH_TEXT. | Stores JSON as TEXT and uses JSON_VALUE and JSON_POINTER_EXTRACT_TEXT for querying. | Store JSON data as TEXT and use JSON functions to extract values on demand. |
Data types | Supports a broad range of native types including SUPER. | Uses similar types, but JSON is stored as TEXT. | Convert SUPER columns to TEXT and adjust data types to match Firebolt-supported types. |
Best practice for table design | Impact | How to implement |
---|---|---|
Store frequently queried large datasets in fact tables. | Improves query speed and reduces scan times. | Create fact tables for large datasets used in analytical queries. |
Use dimension tables for reference data. | Simplifies lookups and schema organization. | Define dimension tables for static or lookup datasets. |
Define primary indexes on frequently filtered columns. | Enables efficient pruning and faster filtering. | Add primary indexes to columns commonly used in WHERE or JOIN conditions. |
The following code example creates the orders fact table and the customers dimension table. A query filtering on customer_id then uses the customer_id primary index, scanning only relevant data:
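A sketch of these statements; the column names and types are assumptions, not taken from the original:

```sql
-- Fact table for large, frequently queried order data.
CREATE FACT TABLE orders (
  order_id     BIGINT,
  customer_id  BIGINT,
  order_date   DATE,
  total_amount DECIMAL(12, 2)
) PRIMARY INDEX customer_id;

-- Dimension table for customer reference data.
CREATE DIMENSION TABLE customers (
  customer_id   BIGINT,
  customer_name TEXT
);

-- This filter is served by the customer_id primary index,
-- so only the relevant data is scanned.
SELECT order_id, total_amount
FROM orders
WHERE customer_id = 42;
```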
Redshift relies on SORTKEY and DISTKEY to improve data distribution and query performance. The SORTKEY controls the physical order of stored data, optimizing filtering and range-based queries. The DISTKEY distributes data across cluster nodes to balance workloads. These keys require careful, manual tuning and must be set during table creation. Changing them requires rebuilding tables and reloading data, making schema adjustments rigid and time-consuming.
Poor DISTKEY choices in Redshift can lead to data skew, where some nodes store more data than others, creating performance bottlenecks. Additionally, scaling Redshift clusters requires redistributing data based on the defined DISTKEY. This process can result in downtime and degraded performance while workloads rebalance, and often forces additional manual optimization.
Firebolt eliminates manual distribution keys and sort keys by using dynamic indexing to optimize query processing. Indexes in Firebolt can be added, modified, or replaced without reloading data. Primary indexes physically organize data to speed up filtering and scanning during query runtime. Aggregating indexes precompute aggregate results, allowing faster retrieval without dynamic calculations.
Firebolt’s indexing model works with tablet-based storage, reducing the need for manual tuning tasks such as VACUUM and ANALYZE in Redshift. The decoupled compute-storage architecture allows dynamic data access without requiring distribution keys or physical data redistribution when scaling. Indexes enable the engine to prune irrelevant data and improve query performance automatically.
Understanding how indexes and tablet-based storage replace Redshift’s DISTKEY and SORTKEY is essential for successful schema migration. Firebolt’s indexing approach simplifies schema design, reduces maintenance overhead, and delivers faster query performance with greater flexibility.
Best practice for indexing | Impact | How to implement |
---|---|---|
Use a primary index instead of a SORTKEY . | Reduces scan times and improves query filtering. | Create primary indexes on columns used for filtering and range queries. |
Replace DISTKEY with an aggregating index. | Speeds up aggregations without manual data distribution. | Create aggregating indexes for precomputed sums or counts on frequently grouped columns. |
In Redshift, the following code example creates a sales table that distributes data by customer_id (DISTKEY) for optimized joins, sorts rows by sale_date (SORTKEY) for faster range queries, and defines a customers table with customer_id as the primary key for relational integrity. In Redshift, the primary key is informational only and not enforced at the database level:
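A sketch of the Redshift DDL; column names and types are assumptions:

```sql
CREATE TABLE sales (
  sale_id      BIGINT,
  customer_id  BIGINT,
  sale_date    DATE,
  total_amount NUMERIC(12, 2)
)
DISTKEY (customer_id)   -- co-locates rows for joins on customer_id
SORTKEY (sale_date);    -- physically orders rows for date-range queries

CREATE TABLE customers (
  customer_id   BIGINT PRIMARY KEY,  -- informational only; not enforced
  customer_name VARCHAR(100)
);
```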
In Firebolt, DISTKEY(customer_id) is replaced with the aggregating index (customer_id, SUM(total_amount)). This aggregation runs faster than in Redshift because SUM(total_amount) is precomputed in the aggregating index. SORTKEY(sale_date) is replaced with the primary index (sale_date).
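The equivalent Firebolt schema might look like the following sketch, with column names assumed to match the Redshift example:

```sql
CREATE FACT TABLE sales (
  sale_id      BIGINT,
  customer_id  BIGINT,
  sale_date    DATE,
  total_amount DECIMAL(12, 2)
) PRIMARY INDEX sale_date;  -- replaces SORTKEY(sale_date)

-- Replaces DISTKEY(customer_id): SUM(total_amount) per customer is precomputed.
CREATE AGGREGATING INDEX sales_by_customer
ON sales (customer_id, SUM(total_amount));
```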
The DISTKEY in Redshift partitions data across nodes by game_id. Firebolt’s primary index on GameID and PlayerID allows the engine to quickly prune data during query processing, achieving similar optimization without manual distribution.
In Redshift, the following schema uses DISTKEY and SORTKEY for the playstats table:
In Firebolt, define a primary index on GameID or PlayerID so that the query scans over only this index:
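A sketch of the Firebolt equivalent, with assumed column types:

```sql
CREATE FACT TABLE playstats (
  GameID          INT,
  PlayerID        INT,
  CurrentScore    BIGINT,
  CurrentPlayTime BIGINT
) PRIMARY INDEX GameID, PlayerID;

-- Filters on the primary index columns prune non-matching tablets.
SELECT PlayerID, CurrentScore
FROM playstats
WHERE GameID = 7;
```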
Best practice for denormalization | Impact | How to implement |
---|---|---|
Denormalize frequently joined tables. | Reduces query complexity and improves performance. | Store commonly joined attributes directly in fact tables. |
Avoid excessive joins. | Speeds up analytical queries and reduces resource usage. | Move frequently accessed reference data into the same table. |
Balance redundancy and maintenance. | Prevents excessive data duplication while simplifying queries. | Only denormalize attributes used in frequent filters or aggregations. |
In Redshift, a query joins the orders and customers tables as follows:
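A sketch of the normalized join; column names are assumptions:

```sql
-- Every query that needs the customer name must pay for this join.
SELECT o.order_id, o.total_amount, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```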
In Firebolt, denormalize by storing customer_name directly in the orders table:
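A sketch of the denormalized design, with assumed columns:

```sql
CREATE FACT TABLE orders (
  order_id      BIGINT,
  customer_id   BIGINT,
  customer_name TEXT,            -- denormalized from the customers table
  total_amount  DECIMAL(12, 2)
) PRIMARY INDEX customer_id;

-- No join required to retrieve the customer name.
SELECT order_id, total_amount, customer_name
FROM orders;
```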
The customer_name column is stored directly in the orders table, eliminating the need for a join and reducing query complexity.
Firebolt stores JSON as TEXT, allowing dynamic field extraction without predefined schemas. Firebolt includes specialized JSON functions to extract, convert, and manipulate JSON data stored in TEXT columns, enabling flexible and efficient querying without predefined structures. Unlike Redshift’s SUPER type, which requires fixed structures and functions like JSON_EXTRACT_PATH_TEXT, Firebolt enables flexible querying with JSON_VALUE, JSON_EXTRACT, and JSON_POINTER_EXTRACT_TEXT. These functions retrieve only the necessary JSON fields, reducing scan time and improving performance.
Firebolt’s approach removes the need for manual schema updates, simplifies query logic, and supports dynamic key lookups without restructuring the data model. Queries are easier to write and maintain because Firebolt avoids the nested function calls required in Redshift. By replacing SUPER with TEXT, users gain a more efficient and flexible way to query JSON data.
Best practice for handling JSON | Impact | How to implement |
---|---|---|
Store JSON data as TEXT | Enables flexible querying without predefined schemas. | Export JSON columns from Redshift as TEXT and define them as TEXT in Firebolt. |
Use JSON_VALUE for simple field extraction | Simplifies queries and reduces scan time. | Extract single fields directly with JSON_VALUE . |
Use JSON_POINTER_EXTRACT_TEXT for nested fields | Allows access to deeply nested fields in JSON documents. | Use JSON_POINTER_EXTRACT_TEXT with JSON pointer expressions. |
Use dynamic path construction with CONCAT when needed | Supports querying dynamic keys without fixed paths. | Build JSON pointer paths using CONCAT for dynamic extraction. |
In Redshift, JSON data is stored using the SUPER type as follows:
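A sketch of the Redshift table; the table and column names are assumptions:

```sql
CREATE TABLE events (
  event_id   BIGINT,
  event_data SUPER  -- nested JSON stored natively
);
```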
Export the JSON data as TEXT by unloading to Amazon S3 in JSON or CSV format. Then, define the column in Firebolt as TEXT as follows:
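A sketch of the Firebolt table; names mirror the assumed Redshift example:

```sql
CREATE FACT TABLE events (
  event_id   BIGINT,
  event_data TEXT  -- raw JSON stored as TEXT, queried with JSON functions
) PRIMARY INDEX event_id;
```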
Extract a single field using JSON_VALUE as follows:
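A sketch of single-field extraction; exact function signatures vary across Firebolt versions, and the composition shown here is an assumption:

```sql
-- JSON_POINTER_EXTRACT selects the field; JSON_VALUE unwraps the scalar.
SELECT JSON_VALUE(JSON_POINTER_EXTRACT(event_data, '/metadata/id')) AS user_id
FROM events;
```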
Build dynamic JSON pointer paths using CONCAT:
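A sketch of dynamic path construction; the key_name column is hypothetical:

```sql
-- The pointer path is assembled at query time from a column value.
SELECT JSON_POINTER_EXTRACT_TEXT(event_data, CONCAT('/metadata/', key_name)) AS dynamic_value
FROM events;
```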
Redshift uses SUPER for semi-structured data and fixed-length CHAR or VARCHAR columns, while Firebolt stores JSON as TEXT and uses TEXT for variable-length strings without length restrictions. Understanding these differences helps ensure consistent schema conversion and accurate query results when migrating.
Best practice for data types | Impact | How to implement |
---|---|---|
Use TEXT instead of fixed-length CHAR or VARCHAR | Avoids truncation issues and simplifies schema management. | Define string columns as TEXT without length constraints. |
Convert SUPER columns to TEXT | Enables flexible JSON handling. | Store JSON data as TEXT and query using JSON functions. |
Review numeric precision and scale | Prevents rounding or overflow errors. | Map NUMERIC(p, s) in Redshift to DECIMAL(p, s) in Firebolt. |
Validate date and timestamp usage | Ensures compatibility for temporal queries. | Use DATE and TIMESTAMP consistently, without time zones. |
Redshift type | Firebolt equivalent | Notes |
---|---|---|
SMALLINT | SMALLINT | Identical. |
INTEGER | INT or INTEGER | Firebolt uses INT for integer types. |
BIGINT | BIGINT | Identical. |
REAL | FLOAT4 | Alias mapping for single-precision floats. |
DOUBLE PRECISION | DOUBLE | Use DOUBLE for double-precision floats. |
NUMERIC(p, s) | DECIMAL(p, s) | Equivalent; Firebolt uses DECIMAL. |
BOOLEAN | BOOLEAN | Identical. |
CHAR(n) or VARCHAR(n) | TEXT | Firebolt uses TEXT with no length restrictions. |
DATE | DATE | Identical. |
TIMESTAMP | TIMESTAMP | Firebolt supports TIMESTAMP without time zone. |
JSON | TEXT | JSON is stored as TEXT and queried using JSON functions. |
ARRAY | ARRAY(TEXT) | Firebolt arrays require explicit type definitions. |
In Redshift, the following code example creates a players table with columns for player_id, nickname, registration date, and score using specified data types:
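A sketch of the Redshift table; the registration-date column name and the precision values are assumptions:

```sql
CREATE TABLE players (
  player_id     INTEGER,
  nickname      VARCHAR(50),
  registered_on DATE,
  score         NUMERIC(10, 2)
);
```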
In Firebolt, the following code example creates a players dimension table with columns for PlayerID, Nickname, registration date, and Score, and defines a primary index on PlayerID to optimize filtering and lookups:
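A sketch of the Firebolt equivalent, applying the type mappings from the table above; the registration-date column name is an assumption:

```sql
CREATE DIMENSION TABLE players (
  PlayerID     INT,
  Nickname     TEXT,            -- replaces VARCHAR(50)
  RegisteredOn DATE,
  Score        DECIMAL(10, 2)   -- replaces NUMERIC(10, 2)
) PRIMARY INDEX PlayerID;
```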
Use Redshift’s UNLOAD command to export tables or query results to Amazon S3.
The following code example unloads the sales table data in CSV format to the Amazon S3 bucket s3://your-redshift-data/sales/2024/01/.
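A sketch of the UNLOAD statement; the IAM role ARN is a placeholder:

```sql
UNLOAD ('SELECT * FROM sales')
TO 's3://your-redshift-data/sales/2024/01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'  -- placeholder
FORMAT AS CSV
HEADER
ALLOWOVERWRITE;
```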
The following code example loads data into the tournaments table in Firebolt, mapping specific columns from the file to the table’s fields and including metadata such as the source file name and timestamp:
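A sketch of the COPY statement; the column list, $n positional mapping, and metadata variables are assumptions that depend on your Firebolt version:

```sql
COPY INTO tournaments (
  tournament_id         $1,
  name                  $2,
  start_date            $3,
  source_file_name      $SOURCE_FILE_NAME,       -- file metadata (assumed)
  source_file_timestamp $SOURCE_FILE_TIMESTAMP   -- file metadata (assumed)
)
FROM 's3://your-redshift-data/tournaments/'
WITH TYPE = CSV HEADER = TRUE;
```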
Both Redshift and Firebolt support window functions such as ROW_NUMBER, RANK, and SUM with the OVER clause, but Firebolt can enhance performance through indexing, particularly primary indexes, which minimize the data that needs to be processed.
The following example shows how to calculate a running total of player scores by game in both Redshift and Firebolt, using the SUM window function.
In Redshift, the following query calculates the running total of the currentscore for each playerid within each gameid, ordered by playerid, using a window function:
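A sketch of the window query described above:

```sql
SELECT
  gameid,
  playerid,
  currentscore,
  SUM(currentscore) OVER (
    PARTITION BY gameid
    ORDER BY playerid
    ROWS UNBOUNDED PRECEDING
  ) AS running_total
FROM playstats;
```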
The following code example creates a PlayStats table in Firebolt with columns for gameid, playerid, and currentscore. It defines a primary index on gameid and playerid to optimize query performance, such as the running total calculation, by efficiently retrieving the relevant rows based on these columns:
Both Redshift and Firebolt support standard aggregation functions such as SUM, COUNT, and AVG, and the GROUP BY clause. Firebolt, however, enhances performance by precomputing and storing aggregated results through aggregating indexes, which reduce the need for runtime calculations.
The following example shows how to calculate the total playtime by game in both Redshift and Firebolt using the SUM aggregation function.
In Redshift, the following code calculates the total currentplaytime for each gameid in the PlayStats table by grouping the data by gameid and summing the currentplaytime for each group:
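A sketch of the aggregation query described above:

```sql
SELECT
  gameid,
  SUM(currentplaytime) AS total_playtime
FROM playstats
GROUP BY gameid;
```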
In Firebolt, an aggregating index precomputes the sum of currentplaytime for each gameid, avoiding runtime aggregation.
The following code example creates an aggregating index on the PlayStats table, precomputing the sum of currentplaytime for each gameid, which optimizes query performance by storing the aggregated results for faster retrieval:
Firebolt stores JSON data in TEXT columns. Unlike Redshift’s SUPER data type, which requires predefined structures, Firebolt allows users to work with raw JSON data directly, offering significant advantages in both performance and flexibility.
Firebolt’s JSON functions are divided into three main categories:
- Extract functions, which return parts of a JSON document as TEXT or ARRAY.
- Convert functions, which transform JSON values into standard SQL types such as TEXT or ARRAY. Key convert functions include the following:
The following code example extracts the value of the id key from the nested metadata object in the event_data JSON column of the events table and returns it as user_id:
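A sketch of the extraction; the three-argument JSON_EXTRACT signature shown here is an assumption based on older Firebolt releases and may differ in your version:

```sql
SELECT JSON_EXTRACT(event_data, '/metadata/id', 'TEXT') AS user_id
FROM events;
```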
Use JSON_POINTER_EXTRACT_TEXT to extract values from nested JSON objects using JSON pointer expressions. The following code example extracts the value of the id key from the nested metadata object in the event_data JSON column of the events table and returns it as user_id:
The following code example queries the event_data JSON column in the events table, accessing keys that contain the tilde and slash special characters by using escape sequences and returning them as key_with_tilde and key_with_slash:
In Redshift, the EXPLAIN command provides a static query plan, showing the sequence of operations, estimated costs, and resource allocation. This plan lacks runtime performance metrics and detailed insights into data movement across nodes.
In Redshift, the following code example analyzes a join query:
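A sketch of the Redshift plan inspection; the join query itself is illustrative:

```sql
EXPLAIN
SELECT p.playerid, SUM(p.currentscore) AS total_score
FROM playstats p
JOIN players pl ON p.playerid = pl.player_id
GROUP BY p.playerid;
```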
Firebolt’s EXPLAIN command adds runtime metrics, such as CPU time, thread time, rows processed, and runtime for each operation, allowing for detailed performance analysis.
In Firebolt, the following code example analyzes the same query and provides runtime metrics:
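A sketch of the same analysis in Firebolt; the (ANALYZE) option is an assumption about how runtime metrics are requested in your version:

```sql
EXPLAIN (ANALYZE)
SELECT p.playerid, SUM(p.currentscore) AS total_score
FROM playstats p
JOIN players pl ON p.playerid = pl.player_id
GROUP BY p.playerid;
```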
The following code example compares results from the playstats table in both Redshift and Firebolt:
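A simple row-count check is one way to compare; run it on both platforms and verify the counts match:

```sql
SELECT COUNT(*) AS row_count
FROM playstats;
```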
The following code example compares aggregation results using GROUP BY in both Redshift and Firebolt:
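A sketch of a per-group comparison; run the same query on both platforms and diff the output:

```sql
SELECT
  gameid,
  COUNT(*)          AS rows_per_game,
  SUM(currentscore) AS total_score
FROM playstats
GROUP BY gameid
ORDER BY gameid;
```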
Test edge cases such as NULL values, large datasets, and complex joins. This ensures that data migration doesn’t introduce errors in rare or complex scenarios.
The following code example shows complex joins from multiple tables in Redshift and Firebolt:
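A sketch of a multi-table join to run on both platforms; the games table and its columns are hypothetical:

```sql
SELECT
  p.playerid,
  g.game_name,
  SUM(p.currentscore) AS total_score
FROM playstats p
JOIN games   g  ON p.gameid   = g.gameid
JOIN players pl ON p.playerid = pl.player_id
GROUP BY p.playerid, g.game_name
ORDER BY total_score DESC;
```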
The following code example shows how NULL values are handled in both Redshift and Firebolt:
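A sketch of a NULL-handling check; COUNT(column) skips NULLs, so the difference counts them:

```sql
SELECT
  COUNT(*)                        AS total_rows,
  COUNT(currentscore)             AS non_null_scores,
  COUNT(*) - COUNT(currentscore)  AS null_scores
FROM playstats;
```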
The following code example uses UNION ALL to increase the size of the playstats table in Redshift and measures the runtime and resource usage:
In Redshift, use EXPLAIN and STV_RECENTS to evaluate query performance and resource consumption.
The following code example measures query runtime, resource usage, and queue time in Firebolt. Rerun it using different cluster sizes:
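A sketch of one way to inspect runtime and queue time after running the workload; the information_schema.engine_query_history view and its columns are assumptions that depend on your Firebolt version:

```sql
-- Inspect recent query history on the current engine.
SELECT query_text, status, start_time, end_time
FROM information_schema.engine_query_history
ORDER BY start_time DESC
LIMIT 10;
```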
Set the aws_access_key_id, aws_secret_access_key, and s3_redshift_data_bucket variables for your migration tasks. For more information, see the Apache Airflow documentation on variables.
Use Airflow to run Redshift’s UNLOAD command on a scheduled basis to export data from Redshift tables to Amazon S3 as follows:
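A sketch of the statement a scheduled Airflow task would execute against Redshift; the S3 path derives from the s3_redshift_data_bucket variable, and the credentials shown are placeholders:

```sql
-- Executed by a scheduled Airflow task (for example, via a Redshift SQL operator).
UNLOAD ('SELECT * FROM playstats')
TO 's3://your-redshift-data/playstats/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'  -- placeholder
FORMAT AS CSV
ALLOWOVERWRITE;
```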
Use dbt to convert SORTKEY and DISTKEY definitions into Firebolt’s primary and aggregating indexes. Create dbt models to define Firebolt-compatible schemas and apply them automatically.
The following code example uses dbt to create a Firebolt schema:
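A sketch of a dbt model for the Firebolt adapter; the table_type and primary_index configs follow the dbt-firebolt conventions as I understand them, and the source name is hypothetical:

```sql
-- models/sales.sql
{{ config(
    materialized='table',
    table_type='fact',
    primary_index=['sale_date']   -- replaces SORTKEY(sale_date)
) }}

select
    sale_id,
    customer_id,
    sale_date,
    total_amount
from {{ source('redshift_export', 'sales') }}
```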
Convert SUPER to TEXT for JSON handling as follows:
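A sketch of the conversion; JSON_SERIALIZE is Redshift's function for turning SUPER into a JSON string, and the table and column names are assumptions:

```sql
-- Redshift side: serialize the SUPER column to a JSON string before unloading.
SELECT order_id, JSON_SERIALIZE(order_details) AS order_details
FROM orders;

-- Firebolt side: store the serialized JSON as TEXT and query it with JSON functions.
CREATE FACT TABLE orders (
  order_id      BIGINT,
  order_details TEXT
) PRIMARY INDEX order_id;
```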