Load data using SQL
If the Load data wizard does not meet your needs, or you prefer to write directly in SQL, you can enter SQL and run it in the Firebolt Workspace, or use an API.
Before you can load data using a SQL script, you must register with Firebolt, and create a database and an engine.
A general workflow to load data using SQL is shown in the following diagram, with the highlighted path representing the SQL workflow and the muted path representing using the Load data wizard:
For more information on how to register and create a database and an engine using the Firebolt Workspace, see the Get Started guide. To create an engine using SQL, use CREATE ENGINE; to list the engines defined in your current account, use SHOW ENGINES. For more information and examples of how to create engines, see Work with engines using DDL. To create a database, use CREATE DATABASE; to list the databases (catalogs) defined in your current account, use SHOW CATALOGS. Next, log in to the Firebolt Workspace and enter SQL into the script tab in the SQL Editor.
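For example, the following is a minimal sketch of these setup statements that you could run in the SQL Editor. The engine and database names (my_engine, my_database) are placeholders; replace them with your own:
-- Create an engine and a database, then list what is defined in the account
CREATE ENGINE IF NOT EXISTS my_engine;
SHOW ENGINES;
CREATE DATABASE IF NOT EXISTS my_database;
SHOW CATALOGS;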
The following code examples show different workflows based on need and complexity:
- The simplest COPY FROM workflow
- Define a schema, create a table, and load data
- Load multiple files into a table
- Filter data before loading using OFFSET and LIMIT
- Aggregating data during data load
- Update an existing table from an external table
- Load source file metadata into a table
- Continue loading even with errors
- Log errors during data load
The simplest COPY FROM workflow
Although there are many options to handle different data loading workflows, COPY FROM requires only two parameters:
- The name of the table that you are loading data into.
- A location to load the data from.
An example of the simplest way to invoke COPY FROM is:
COPY INTO tutorial
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE;
The previous code creates a table named tutorial, reads a CSV file with headers from a public Amazon S3 bucket, automatically generates a schema, and loads the data.
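To confirm that the load succeeded, you can run a quick check against the new table. The following is a minimal sketch that assumes the tutorial table was created by the previous statement:
-- Preview the first few rows of the newly loaded table
SELECT * FROM tutorial LIMIT 5;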
If the data is contained in an Amazon S3 bucket with restricted access, you must provide credentials. The following example shows how to provide credentials, read a file with headers, and automatically generate a schema:
COPY INTO tutorial
FROM 's3://your_s3_bucket/your_file.csv'
WITH
CREDENTIALS = (
AWS_KEY_ID = '<aws_key_id>'
AWS_SECRET_KEY = '<aws_secret_key>'
)
HEADER=TRUE AUTO_CREATE=TRUE;
To provide your credentials in the previous example, do the following:
- Replace <aws_key_id> with an AWS access key associated with an AWS user or IAM role. The AWS access key is a 20-character string, such as 'AKIAIOSFODNN7EXAMPLE'.
- Replace <aws_secret_key> with the AWS secret access key associated with that user or role. The AWS secret access key is a 40-character string, such as 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'.
For more information about how to create an AWS access key and AWS secret key, see Creating Access Key and Secret ID in AWS.
Define a schema, create a table, and load data
You can also load data into an existing table using your own schema definition. Manually defining your own schema can give you finer control over data ingestion. This example contains the following two steps:
1. Create the target table.

Create a table to load the data into, as shown in the following code example:

CREATE TABLE IF NOT EXISTS levels (
    LevelID INT,
    Name TEXT,
    GameID INT,
    LevelType TEXT,
    MaxPoints INT,
    PointsPerLap DOUBLE,
    SceneDetails TEXT
);

The previous code example creates a table named levels, and defines each column with a name and data type. For more information about the data types that Firebolt supports, see Data types.
2. Run COPY FROM.

Use COPY FROM to load the data from an Amazon S3 bucket into the levels table, as shown in the following code example:

COPY INTO levels
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH TYPE = CSV HEADER = TRUE;

The previous code example reads data from a Firebolt test data set from the fictional Ultra Fast Gaming Inc. company. The levels data set is in CSV format, but you can also use COPY FROM to read files in Parquet format. If you are reading in a CSV file and specify HEADER = TRUE, Firebolt expects the first line of your file to contain column names.
Load multiple files into a table
You can use the PATTERN option in COPY FROM to load several files at the same time from an Amazon S3 bucket. The PATTERN option uses glob patterns, which match file names using wildcard characters. For more information, see the Wikipedia glob (programming) article.
COPY INTO nyc_restaurant_inspections FROM
's3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nyc_restaurant_inspections/parquet/'
WITH PATTERN = '*.parquet' AUTO_CREATE = TRUE TYPE = PARQUET;
In the previous code example, the following apply:
- COPY INTO: Specifies the target table to load the data into.
- FROM: Specifies the S3 bucket location of the data.
- WITH PATTERN = '*.parquet': Uses a glob pattern with the wildcard (*) to include all Parquet files in the directory.
- AUTO_CREATE=TRUE: Automatically creates the table and the schema if the table does not already exist. Parquet files embed schema information, which allows simple, high-fidelity schema creation. Setting AUTO_CREATE to TRUE ensures the schema in the Parquet file is preserved after loading.
- TYPE = PARQUET: Specifies the data format as Parquet.
Filter data before loading using OFFSET and LIMIT
You can use COPY FROM with the LIMIT and OFFSET clauses to filter out data before you load it into Firebolt. The following example demonstrates how to filter the source data by skipping the first five rows of data and inserting only the next three rows.
COPY offset_limit
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
OFFSET 5 LIMIT 3
WITH TYPE = CSV HEADER = TRUE;
In the previous code example, the following apply:
- OFFSET: Specifies a non-negative number of rows that are skipped before returning results from the query.
- LIMIT: Restricts the number of rows that are included in the result set.
- TYPE = CSV: Specifies the data format as CSV.
- HEADER: Specifies that the first row of the source file contains column headers.
For more information about OFFSET and LIMIT, see SELECT Query Syntax.
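The same clauses also work in ordinary queries. The following sketch assumes the offset_limit table created above has a LevelID column inferred from the CSV header; it orders by that column, skips the first five rows, and returns the next three:
-- Skip 5 rows, then return the next 3
SELECT *
FROM offset_limit
ORDER BY LevelID
LIMIT 3 OFFSET 5;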
Aggregating data during data load
If you frequently use aggregation functions such as COUNT, MAX, or SUM, you can perform these aggregations on top of an external table without loading the raw data into Firebolt. This approach allows you to avoid costs associated with importing and storing the dataset, particularly if you don't need to store the originating data set.
The following example shows how to aggregate data using an external table, define a table in the Firebolt database with the desired aggregations, and then insert data from the external table into the internal table. This example contains the following three steps:
1. Create an external table linked to files in an Amazon S3 bucket.

The following code creates an external table that links to files in an Amazon S3 bucket. The table has a defined schema that matches the types and names of the originating data:

CREATE EXTERNAL TABLE IF NOT EXISTS ex_playstats (
    GameID INTEGER,
    PlayerID INTEGER,
    StatTime TIMESTAMPNTZ,
    SelectedCar TEXT,
    CurrentLevel INTEGER,
    CurrentSpeed REAL,
    CurrentPlayTime BIGINT,
    CurrentScore BIGINT,
    Event TEXT,
    ErrorCode TEXT,
    TournamentID INTEGER)
URL = 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/'
OBJECT_PATTERN = '*'
TYPE = (PARQUET);

The previous code uses OBJECT_PATTERN to link all (*) files inside the directory specified by URL, and TYPE to specify the file format.
2. Define a table in the Firebolt database with the desired aggregations, as shown in the following code example:
CREATE TABLE IF NOT EXISTS playstats_max_scores (
    PlayerID INTEGER,
    TournamentID INTEGER,
    MaxCurrentLevel INTEGER,
    MaxCurrentSpeed REAL,
    MaxCurrentScore BIGINT
) PRIMARY INDEX TournamentID, PlayerID;
The previous code creates a table with the aggregate values MaxCurrentLevel, MaxCurrentSpeed, and MaxCurrentScore.
3. Insert data from the external table into the internal table using the aggregate functions, as shown in the following code example:
INSERT INTO playstats_max_scores
SELECT PlayerID,
TournamentID,
MAX(CurrentLevel),
MAX(CurrentSpeed),
MAX(CurrentScore)
FROM ex_playstats
GROUP BY ALL;
The previous code calculates the aggregate function MAX before loading the data into the playstats_max_scores table.
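Once loaded, you can query the aggregated table directly. For example, the following usage sketch returns the ten highest recorded scores per player and tournament:
-- Top scores from the pre-aggregated table
SELECT PlayerID, TournamentID, MaxCurrentScore
FROM playstats_max_scores
ORDER BY MaxCurrentScore DESC
LIMIT 10;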
Update an existing table from an external table
When mapping data from an Amazon S3 bucket to a Firebolt database, Firebolt saves metadata as virtual columns, including the source file's name, size, and timestamp. You can query this metadata directly for troubleshooting and analysis, or use it to find new data, as shown in this example.
To load only new and updated data from an Amazon S3 bucket into an existing table, use an external table and two temporary tables. This section guides you through creating a new table, which will serve as the existing table in a complete example. If you already have an existing table, its schema definition must include the file timestamp and file name metadata. For more information about these metadata columns, see Using metadata virtual columns in Work with external tables.
The full workflow involves creating an internal source data table, an external table linked to the source data, and two temporary tables for the latest timestamp and updated data. The updates_table selects new data and uses an inner join to insert these records into your existing table, as illustrated in the diagram below.
This example contains the following nine steps:
1. Create a table.

The following code example shows you how to create a players table for a sample players dataset; data from a Parquet file in an Amazon S3 bucket is copied into it in a later step:

CREATE TABLE IF NOT EXISTS players (
    PlayerID INTEGER,
    Nickname TEXT,
    Email TEXT,
    AgeCategory TEXT,
    Platforms ARRAY (TEXT NULL),
    RegisteredOn DATE,
    IsSubscribedToNewsletter BOOLEAN,
    InternalProbabilityToWin DOUBLE PRECISION,
    SOURCE_FILE_NAME TEXT,
    SOURCE_FILE_TIMESTAMP TIMESTAMPNTZ)
PRIMARY INDEX agecategory, registeredon;

The previous code example defines the schema for the players table, which includes the metadata columns SOURCE_FILE_NAME and SOURCE_FILE_TIMESTAMP.
2. Create an external table.

Use an external table to query the source data directly and compare it to the data in your existing table. The advantages of using an external table to check for new data are as follows:

- An external table links to the data source without loading it into a database, which avoids costs associated with importing and storing it.
- Using an external table isolates data operations, which reduces the risk of corrupting data contained in the main players table.

The following code example creates an external players_ext table linked to the source data:

CREATE EXTERNAL TABLE IF NOT EXISTS players_ext (
    PlayerID INTEGER,
    Nickname TEXT,
    Email TEXT,
    AgeCategory TEXT,
    Platforms ARRAY (TEXT NULL),
    RegisteredOn PGDATE,
    IsSubscribedToNewsletter BOOLEAN,
    InternalProbabilityToWin DOUBLE PRECISION)
URL = 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/players/'
OBJECT_PATTERN = '*'
TYPE = (PARQUET);

The previous code example defines the schema for the Parquet data and links to all Parquet files in the Amazon S3 bucket that contains the source data.

If you are using an external table to link to data in Parquet format, the order of the columns in the external table does not have to match the order of the columns in the source data. If you are reading data in CSV format, the order must match the order in the source data.
3. Copy data from the players_ext external table into the internal players table, as shown in the following code example:

INSERT INTO players
SELECT *, $SOURCE_FILE_NAME, $SOURCE_FILE_TIMESTAMP
FROM players_ext;
4. Create a temporary table that contains the most recent timestamp from your existing table, as shown in the following code example:

CREATE TABLE IF NOT EXISTS control_maxdate AS (
    SELECT MAX(source_file_timestamp) AS max_time FROM players
);
The previous code example uses the aggregate function MAX to select the most recent timestamp from the existing players table.

5. Create a temporary table to select and store data that has a newer timestamp than the one contained in the control_maxdate table, as shown in the following code example:
CREATE TABLE IF NOT EXISTS updates_table AS (
    WITH external_table AS (
        SELECT *,
            $SOURCE_FILE_NAME AS source_file_name_new,
            $SOURCE_FILE_TIMESTAMP AS source_file_timestamp_new
        FROM players_ext
        WHERE $source_file_timestamp > (SELECT max_time FROM control_maxdate)
            AND playerid IN (SELECT DISTINCT playerid FROM players)
    )
    SELECT e.*
    FROM players f
    INNER JOIN external_table e ON f.playerid = e.playerid
);
The previous code example creates an updates_table using a SELECT statement to filter out data that is older than the previously recorded timestamp. The code includes the table alias e, which refers to external_table, and the table alias f, which refers to the players table. The INNER JOIN uses playerid to match rows in the external table to existing rows in the players table, so updates_table contains only records for players that already exist and need to be updated.
table. - Delete records from the original players table that have been updated, based on matching player IDs in the
updates_table
, as shown in the following code example:DELETE FROM players WHERE playerid IN (SELECT playerid FROM updates_table);
7. Insert updated records from the updates_table, including the new timestamp, into the players table to replace the deleted records from the previous step, as shown in the following example:

INSERT INTO players
SELECT playerid, nickname, email, agecategory, platforms, registeredon,
    issubscribedtonewsletter, internalprobabilitytowin,
    source_file_name_new, source_file_timestamp_new
FROM updates_table;
8. Insert any entirely new, rather than updated, records into the players table, as shown in the following code example:

INSERT INTO players
SELECT *, $SOURCE_FILE_NAME, $SOURCE_FILE_TIMESTAMP
FROM players_ext
WHERE $SOURCE_FILE_TIMESTAMP > (SELECT max_time FROM control_maxdate)
    AND playerid NOT IN (SELECT playerid FROM players);
9. Clean up resources. Remove the temporary tables used in the update process, as shown in the following code example:

DROP TABLE IF EXISTS control_maxdate;
DROP TABLE IF EXISTS updates_table;
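After the cleanup, you can verify that the players table now reflects the newer source files. The following check is a minimal sketch that returns the row count and the most recent source file timestamp in the updated table:
-- Confirm the update: row count and latest source file timestamp
SELECT COUNT(*) AS row_count,
       MAX(SOURCE_FILE_TIMESTAMP) AS latest_file_timestamp
FROM players;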
Load source file metadata into a table
When you load data from an Amazon S3 bucket, Firebolt uses an external table, which holds metadata about your source files, to map the data into a Firebolt database. You can load metadata from the virtual columns of the external table into a table. You can use the name, timestamp, and file size to determine the source of a row of data in a table. When adding data to an existing table, you can use this information to check whether new data is available, or to determine the vintage of the data. The external table associated with your source file contains the following fields:
- source_file_name - the name of your source file.
- source_file_timestamp - the date that your source file was modified in the Amazon S3 bucket that it was read from.
- source_file_size - the size of your source file in bytes.
The following code example shows you how to create and load metadata into a levels_meta table, which contains only the metadata:
CREATE TABLE levels_meta (
day_of_creation date,
name_of_file text,
size_of_file int);
COPY levels_meta(
day_of_creation $source_file_timestamp,
name_of_file $source_file_name,
size_of_file $source_file_size)
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH AUTO_CREATE = TRUE
HEADER = TRUE
TYPE = CSV;
The following code shows you how to read the source data from the Amazon S3 bucket into a new table and add the metadata as additional columns in that table:
CREATE TABLE IF NOT EXISTS levels_meta_plus (
"LevelID" INT,
"Name" TEXT,
"GameID" INT,
"LevelType" TEXT,
"MaxPoints" INT,
"PointsPerLap" DOUBLE,
"SceneDetails" TEXT,
day_of_creation date,
name_of_file text,
size_of_file int
);
COPY INTO levels_meta_plus (
"LevelID",
"GameID",
"Name",
"LevelType",
"MaxPoints",
"PointsPerLap",
"SceneDetails",
day_of_creation $source_file_timestamp,
name_of_file $source_file_name,
size_of_file $source_file_size
)
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH
HEADER = TRUE
TYPE = CSV;
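With the metadata columns loaded, you can use them to trace where each row came from. For example, the following usage sketch counts the rows loaded from each source file:
-- Count rows per source file using the loaded metadata columns
SELECT name_of_file, size_of_file, COUNT(*) AS row_count
FROM levels_meta_plus
GROUP BY name_of_file, size_of_file;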
For more information about metadata, see Using metadata virtual columns in Work with external tables.
Continue loading even with errors
By default, if Firebolt runs into an error when loading your data, the job stops loading and ends in error. If you want to continue loading your data even in the presence of errors, set MAX_ERRORS_PER_FILE to a percentage or an integer larger than 0. COPY FROM will then continue to load data until the errors exceed the specified percentage of the total number of rows in your data. If you enter an integer between 0 and 100, COPY FROM interprets the integer as a percentage of rows. You can specify only 0% or 100%.
For example, if MAX_ERRORS_PER_FILE is set to 0 or 0%, COPY FROM loads data until one row has an error, and then returns an error. Setting MAX_ERRORS_PER_FILE to either 100 or 100% allows the loading process to continue even if every row has an error. If all rows have errors, no data is loaded into the target table.
The following code example loads a sample CSV data set with headers, and will finish the loading job even if every row contains an error.
COPY INTO new_levels_auto
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH AUTO_CREATE = TRUE
HEADER = TRUE
TYPE = CSV
MAX_ERRORS_PER_FILE = '100%';
In the previous code example, the following apply:
- COPY INTO new_levels_auto: Creates a new table named new_levels_auto. The INTO clause is optional. If the table already exists, COPY FROM adds the rows to the existing table.
- FROM: Specifies the S3 bucket location of the data. In this example, the dataset is located in a publicly accessible bucket, so you do not need to provide credentials.
- AUTO_CREATE=TRUE: Creates a target table and automatically infers the schema.
- HEADER=TRUE: Specifies that the first row of the source file contains column headers.
- TYPE: Specifies the data format of the incoming data.
- MAX_ERRORS_PER_FILE: Specified as an integer or literal text. In the previous example, MAX_ERRORS_PER_FILE uses text.
Log errors during data load
COPY FROM supports an option to generate error files that describe the errors encountered and note the rows with errors. To store these files in an Amazon S3 bucket, you must provide credentials that allow Firebolt to write to the bucket.
The following example sets an error handling threshold, and specifies one Amazon S3 bucket as the source of the data and another to write the error files to:
COPY INTO my_table
FROM 's3://my-bucket/data.csv'
WITH
CREDENTIALS = (
AWS_KEY_ID = 'YOUR_AWS_KEY_ID'
AWS_SECRET_KEY = 'YOUR_AWS_SECRET_KEY'
)
MAX_ERRORS_PER_FILE = '100%'
ERROR_FILE = 's3://my-bucket/error_logs/'
ERROR_FILE_CREDENTIALS = (
AWS_KEY_ID = 'YOUR_AWS_KEY_ID'
AWS_SECRET_KEY = 'YOUR_AWS_SECRET_KEY'
)
HEADER = TRUE;
In the previous code example, the following apply:
- COPY INTO: Specifies the target table to load the data into.
- FROM: Specifies the S3 bucket location of the data.
- CREDENTIALS: Specifies AWS credentials to access information in the Amazon S3 bucket that contains the source data. For more information about credentials and how to set them up, see the previous The simplest COPY FROM workflow section in this guide.
- Error handling:
  - MAX_ERRORS_PER_FILE = '100%': Allows errors in up to 100% of the rows per file before the load data job fails.
  - ERROR_FILE: Specifies the Amazon S3 bucket location to write the error files.
  - ERROR_FILE_CREDENTIALS: Specifies AWS credentials that allow Firebolt to write the error files to that location.
- HEADER = TRUE: Indicates that the first row of the CSV file contains column headers.
How to examine the error files
If you specify an S3 path with the necessary permissions for an error file and the COPY FROM process encounters errors, two different files are generated in your bucket. The following queries show you how to load these error files into new tables so that you can query and examine the error details and the corresponding rows.
The following query loads the error_reasons CSV file, which contains a header with column names:
COPY error_reasons FROM 's3://my-bucket/error_logs/'
WITH PATTERN='*error_reasons*.csv' HEADER=TRUE;
SELECT * from error_reasons;
The following query loads a file containing all rows that encountered errors. Although this file has no header, the table schema should match that of the source file where the errors occurred.
COPY rejected_rows FROM 's3://my-bucket/error_logs/'
WITH PATTERN='*rejected_rows*.csv' HEADER=FALSE;
SELECT * FROM rejected_rows;
Configure error handling parameters such as MAX_ERRORS_PER_FILE, ERROR_FILE, and ERROR_FILE_CREDENTIALS to manage how errors are handled, ensure data integrity, and record errors for future review. For more information about ERROR_FILE or ERROR_FILE_CREDENTIALS, see the Parameters section of COPY FROM.