Understand options for loading data into Firebolt using SQL statements.
`COPY FROM` requires only two parameters: the name of the target table and the Amazon S3 location of the source data. The simplest way to use `COPY FROM` creates a table named `tutorial`, reads a CSV file with headers from a public Amazon S3 bucket, automatically generates a schema, and loads the data.
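A minimal sketch of this form follows. The S3 path is a placeholder for a publicly readable CSV file, and the exact option syntax (for example, whether options follow a `WITH` keyword) should be checked against the COPY FROM reference:

```sql
COPY INTO tutorial
FROM 's3://your-public-bucket/path/tutorial.csv'  -- placeholder public bucket and file
WITH
    HEADER = TRUE       -- the first row contains column names
    AUTO_CREATE = TRUE  -- infer the schema and create the table automatically
    TYPE = CSV;
```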
If the data is contained in an Amazon S3 bucket with restricted access, you will need to provide credentials. The following examples show how to provide credentials, read a file with headers, and automatically generate a schema. In these examples:
- Replace `<aws_access_key_id>` with an AWS access key ID associated with an AWS user or IAM role. The access key ID is a 20-character string (e.g., `AKIAIOSFODNN7EXAMPLE`).
- Replace `<aws_secret_access_key>` with the AWS secret access key associated with the AWS user or IAM role. The secret access key is a 40-character string (e.g., `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`). You can also specify an `AWS_SESSION_TOKEN`.
Example:
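A sketch of a key-based credentials statement follows. It assumes the `CREDENTIALS` clause accepts `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` options, and the bucket path and values are placeholders; check the COPY FROM reference for the exact option names:

```sql
COPY INTO tutorial
FROM 's3://your-private-bucket/path/tutorial.csv'  -- placeholder restricted bucket
WITH HEADER = TRUE AUTO_CREATE = TRUE
CREDENTIALS = (
    AWS_ACCESS_KEY_ID = '<aws_access_key_id>'          -- e.g., AKIAIOSFODNN7EXAMPLE
    AWS_SECRET_ACCESS_KEY = '<aws_secret_access_key>'  -- e.g., wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
);
```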
Alternatively, replace `<aws_role_arn>` with the Amazon Resource Name (ARN) of the IAM role that you want Firebolt to assume. This method gives Firebolt temporary credentials to authenticate and access your Amazon S3 bucket.
Example:
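A corresponding sketch using an IAM role, assuming the `CREDENTIALS` clause accepts an `AWS_ROLE_ARN` option; the role ARN and bucket path are placeholders:

```sql
COPY INTO tutorial
FROM 's3://your-private-bucket/path/tutorial.csv'  -- placeholder restricted bucket
WITH HEADER = TRUE AUTO_CREATE = TRUE
CREDENTIALS = (
    AWS_ROLE_ARN = '<aws_role_arn>'  -- e.g., arn:aws:iam::123456789012:role/firebolt-read-role
);
```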
The following example creates a table named `levels` and defines each of its columns with a name and data type. For more information about the data types that Firebolt supports, see Data types.
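A sketch of a manually defined schema follows. The column names and types are illustrative assumptions rather than the actual layout of the sample data set, and the S3 path is a placeholder:

```sql
-- Define the target table explicitly instead of relying on AUTO_CREATE.
CREATE TABLE IF NOT EXISTS levels (
    levelid   INTEGER,  -- hypothetical column names and types
    name      TEXT,
    maxpoints INTEGER
);

-- Load the CSV into the predefined schema.
COPY INTO levels
FROM 's3://your-bucket/path/levels.csv'
WITH HEADER = TRUE TYPE = CSV;
```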
The `levels` data set is in CSV format, but you can also use `COPY FROM` to read files in Parquet format. If you are reading in a CSV file and specify `HEADER = TRUE`, then Firebolt expects the first line of your file to contain column names.
You can use the `PATTERN` option in `COPY FROM` to load several files at the same time from an Amazon S3 bucket. The `PATTERN` option uses standard glob patterns. For more information about glob patterns, see the Wikipedia glob (programming) article.
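The following sketch shows what such a statement might look like; the bucket path and target table name are placeholders:

```sql
COPY INTO games_parquet              -- placeholder target table
FROM 's3://your-bucket/parquet_data/'
WITH
    PATTERN = '*.parquet'   -- load every Parquet file in the directory
    AUTO_CREATE = TRUE      -- create the table and infer the schema if it does not exist
    TYPE = PARQUET;
```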
This example uses the following options:

- `COPY INTO`: Specifies the target table to load the data into.
- `FROM`: Specifies the S3 bucket location of the data.
- `PATTERN='*.parquet'`: Uses a glob pattern with the wildcard (*) to include all Parquet files in the directory.
- `AUTO_CREATE=TRUE`: Automatically creates the table and the schema if the table does not already exist. Parquet files typically include schema information, which allows simple, high-fidelity schema creation. Setting `AUTO_CREATE` to `TRUE` ensures the schema in the Parquet file is preserved after loading.
- `TYPE=PARQUET`: Specifies the data format as Parquet.

You can use `COPY FROM` with the `LIMIT` and `OFFSET` clauses to filter out data before you load it into Firebolt. The following example demonstrates how to filter the source data by skipping the first five rows of data and inserting only the next three rows.
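A sketch of such a statement follows. It assumes `LIMIT` and `OFFSET` can be combined with the other options as shown; the clause placement may differ, so see the COPY FROM reference, and the bucket path and table name are placeholders:

```sql
COPY INTO new_levels                 -- placeholder target table
FROM 's3://your-bucket/levels.csv'
LIMIT 3                              -- insert only the next three rows
OFFSET 5                             -- skip the first five rows
WITH
    TYPE = CSV
    HEADER = TRUE;
```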
This example uses the following options:

- `OFFSET`: Specifies a non-negative number of rows that are skipped before returning results from the query.
- `LIMIT`: Restricts the number of rows that are included in the result set.
- `TYPE=CSV`: Specifies the data format as CSV.
- `HEADER`: Specifies that the first row of the source file contains column headers.

For more information about `OFFSET` and `LIMIT`, see SELECT Query Syntax.
If you only need aggregated results produced by functions such as `COUNT`, `MAX`, or `SUM`, you can perform these aggregations on top of an external table without loading the raw data into Firebolt. This approach avoids the costs associated with importing and storing the dataset, which is particularly useful if you don't need to retain the originating data set.

The following example shows how to aggregate data using an external table: first create the external table, then define a table in the Firebolt database to hold the desired aggregations, and finally insert data from the external table into the internal table. The example contains the following three steps:
1. Create an external table, using `OBJECT_PATTERN` to link all (*) files inside the specified directory contained in `URL`, and `TYPE` to specify the file format.
2. Create a table in the Firebolt database with the aggregated columns `MaxCurrentLevel`, `MaxCurrentSpeed`, and `MaxCurrentScore`.
3. Insert data from the external table, applying `MAX` before loading the data into the `playstats_max_scores` table, as sketched after this list.
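The following sketch illustrates the three steps. The S3 path and the source column names (`CurrentLevel`, `CurrentSpeed`, `CurrentScore`) are assumptions for illustration; see the external table documentation for the exact `CREATE EXTERNAL TABLE` options:

```sql
-- Step 1: external table over the raw files (path and columns are hypothetical).
CREATE EXTERNAL TABLE playstats_external (
    CurrentLevel INTEGER,
    CurrentSpeed REAL,
    CurrentScore BIGINT
)
URL = 's3://your-bucket/playstats/'
OBJECT_PATTERN = '*'      -- link all files inside the directory
TYPE = (PARQUET);

-- Step 2: internal table that stores only the aggregated values.
CREATE TABLE playstats_max_scores (
    MaxCurrentLevel INTEGER,
    MaxCurrentSpeed REAL,
    MaxCurrentScore BIGINT
);

-- Step 3: aggregate on top of the external table and load only the results.
INSERT INTO playstats_max_scores
SELECT MAX(CurrentLevel), MAX(CurrentSpeed), MAX(CurrentScore)
FROM playstats_external;
```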
To handle incrementally updating data, you can create an `updates_table` that selects new data and uses an inner join to insert these records into your existing table, as illustrated in the diagram below:
The following example creates a `players` table from a sample players dataset, and then copies data from a Parquet file in an Amazon S3 bucket into it:
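A sketch of this initial load follows. The business columns and the S3 path are assumptions for illustration; only the `SOURCE_FILE_NAME` and `SOURCE_FILE_TIMESTAMP` metadata columns come from the description below:

```sql
-- Hypothetical players table; playerid and nickname are placeholder columns,
-- while SOURCE_FILE_NAME and SOURCE_FILE_TIMESTAMP hold per-row load metadata.
CREATE TABLE players (
    playerid              INTEGER,
    nickname              TEXT,
    SOURCE_FILE_NAME      TEXT,
    SOURCE_FILE_TIMESTAMP TIMESTAMP
);

-- Initial load from a Parquet file in S3 (the path is a placeholder).
COPY INTO players
FROM 's3://your-bucket/players/players.parquet'
WITH TYPE = PARQUET;
```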
This table includes two metadata columns, `SOURCE_FILE_NAME` and `SOURCE_FILE_TIMESTAMP`, which are later used to identify which rows are already present in the `players` table. The following steps copy new data from a `players_ext` external table into the internal `players` table. First, create the `players_ext` external table over the updated source files, as shown in the following code example:
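A sketch of the external table definition, in which the URL, column names, and file format are assumptions for illustration:

```sql
-- Hypothetical external table over the updated source files.
CREATE EXTERNAL TABLE players_ext (
    playerid INTEGER,
    nickname TEXT
)
URL = 's3://your-bucket/players_updates/'
OBJECT_PATTERN = '*.parquet'
TYPE = (PARQUET);
```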
Next, use `MAX` to select the most recent timestamp from the existing `players` table.
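A one-line sketch of this lookup:

```sql
-- Most recent source-file timestamp already loaded into players.
SELECT MAX(SOURCE_FILE_TIMESTAMP) FROM players;
```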
Then, create an `updates_table` using a `SELECT` statement to filter out data that is older than the previously recorded timestamp. The code includes a table alias `e`, which refers to the external table, and the table alias `f`, which refers to the `players` table. The `INNER JOIN` uses `playerid` to match rows in the external table to those in the `players` table; the matched rows are then used to update the `players` table in the following steps.
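A sketch of this step, reusing the placeholder names from the earlier sketches and assuming the external table exposes `SOURCE_FILE_NAME` and `SOURCE_FILE_TIMESTAMP` metadata columns:

```sql
-- Collect only rows from the external table that are newer than anything
-- already loaded, matching on playerid.
CREATE TABLE updates_table AS
SELECT e.playerid,
       e.nickname,
       e.SOURCE_FILE_NAME,
       e.SOURCE_FILE_TIMESTAMP
FROM players_ext AS e                                -- e: the external table
INNER JOIN players AS f ON e.playerid = f.playerid   -- f: the existing players table
WHERE e.SOURCE_FILE_TIMESTAMP > (
    SELECT MAX(SOURCE_FILE_TIMESTAMP) FROM players   -- previously recorded timestamp
);
```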
Next, delete the outdated rows from the `players` table that have newer versions in the `updates_table`, as shown in the following code example:
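A sketch of the delete step:

```sql
-- Remove rows from players that are about to be replaced by newer records.
DELETE FROM players
WHERE playerid IN (SELECT playerid FROM updates_table);
```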
Then, insert the records from `updates_table`, including a new timestamp, into the `players` table to replace the deleted records from the previous step, as shown in the following example:
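A sketch of the insert step:

```sql
-- Re-insert the updated records; the timestamp comes from the newer source files.
INSERT INTO players (playerid, nickname, SOURCE_FILE_NAME, SOURCE_FILE_TIMESTAMP)
SELECT playerid, nickname, SOURCE_FILE_NAME, SOURCE_FILE_TIMESTAMP
FROM updates_table;
```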
Finally, drop the `updates_table` once its records have been merged into the `players` table, as shown in the following code example:
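A sketch of the cleanup step:

```sql
-- The staging table is no longer needed after the merge.
DROP TABLE updates_table;
```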
`COPY FROM` can also capture metadata about each source file using the following columns:

- `source_file_name` - the name of your source file.
- `source_file_timestamp` - the date that your source file was modified in the Amazon S3 bucket that it was read from.
- `source_file_size` - the size of your source file in bytes.

The following example loads these metadata columns into a `levels_meta` table, which contains only the metadata:
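A sketch of such a load follows. The way metadata columns are referenced in the column mapping (the `$source_file_*` identifiers) is an assumption, and the bucket path is a placeholder; see the COPY FROM reference for the supported syntax:

```sql
-- Hypothetical table that stores only per-file metadata.
CREATE TABLE levels_meta (
    source_file_name      TEXT,
    source_file_timestamp TIMESTAMP,
    source_file_size      BIGINT
);

-- Map the metadata of each source file into the target table.
COPY INTO levels_meta (
    source_file_name      $source_file_name,
    source_file_timestamp $source_file_timestamp,
    source_file_size      $source_file_size
)
FROM 's3://your-bucket/levels.csv'
WITH HEADER = TRUE TYPE = CSV;
```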
To tolerate errors during a load job, set `MAX_ERRORS_PER_FILE` to a percentage or an integer larger than 0. `COPY FROM` will then continue to load data until it exceeds the specified percentage based on the total number of rows in your data. If you enter an integer between 0 and 100, `COPY FROM` will interpret the integer as a percentage of rows. You can specify only 0% or 100%.
For example, if `MAX_ERRORS_PER_FILE` is set to 0 or 0%, `COPY FROM` will load data until one row has an error, and then return an error. Setting `MAX_ERRORS_PER_FILE` to either 100 or 100% allows the loading process to continue even if every row has an error. If all rows have errors, no data will load into the target table.
The following code example loads a sample CSV data set with headers, and will finish the loading job even if every row contains an error.
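A sketch of the statement described below; the bucket path is a placeholder for a publicly accessible CSV file:

```sql
COPY INTO new_levels_auto
FROM 's3://your-public-bucket/levels.csv'
WITH
    AUTO_CREATE = TRUE
    HEADER = TRUE
    TYPE = CSV
    MAX_ERRORS_PER_FILE = '100%';  -- specified as literal text rather than an integer
```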
This example uses the following options:

- `COPY INTO new_levels_auto`: Creates a new table named `new_levels_auto`. The `INTO` clause is optional. If the table already exists, `COPY FROM` will add the rows to the existing table.
- `FROM`: Specifies the S3 bucket location of the data. In this example, the dataset is located in a publicly accessible bucket, so you do not need to provide credentials.
- `AUTO_CREATE=TRUE`: Creates a target table and automatically infers the schema.
- `HEADER=TRUE`: Specifies that the first row of the source file contains column headers.
- `TYPE`: Specifies the data format of the incoming data.
- `MAX_ERRORS_PER_FILE`: Specified as an integer or literal text. In the previous example, `MAX_ERRORS_PER_FILE` uses text.

`COPY FROM` supports an option to generate error files that describe the errors encountered and identify the rows that caused them. To store these files in an Amazon S3 bucket, you must provide credentials that allow Firebolt to write to the bucket.
The following example sets an error handling threshold, and specifies one Amazon S3 bucket that contains the source data and another to which the error files are written:
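A sketch of such a statement follows. It assumes `ERROR_FILE` takes an S3 location and `ERROR_FILE_CREDENTIALS` takes a credentials clause as shown; the paths, table name, and role ARN are placeholders, and the exact clause placement may differ from the COPY FROM reference:

```sql
COPY INTO levels_with_errors                           -- placeholder target table
FROM 's3://your-source-bucket/levels.csv'
WITH
    HEADER = TRUE
    TYPE = CSV
    MAX_ERRORS_PER_FILE = '100%'                       -- tolerate errors in up to 100% of rows per file
    ERROR_FILE = 's3://your-error-bucket/errors/'      -- where Firebolt writes the error files
    ERROR_FILE_CREDENTIALS = (AWS_ROLE_ARN = '<aws_role_arn>')
CREDENTIALS = (AWS_ROLE_ARN = '<aws_role_arn>');       -- AssumeRole credentials for the source bucket
```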
This example uses the following options:

- `COPY INTO`: Specifies the target table to load the data into.
- `FROM`: Specifies the S3 bucket location of the data.
- `CREDENTIALS`: Specifies AWS credentials to access information in the Amazon S3 bucket that contains the source data. AWS AssumeRole authentication is used for dynamic, temporary credentials. For more information about credentials and how to set them up, see The simplest COPY FROM workflow.
- `MAX_ERRORS_PER_FILE = '100%'`: Allows errors in up to 100% of the rows per file before the load data job fails.
- `ERROR_FILE`: Specifies the Amazon S3 bucket location to write the error file.
- `HEADER = TRUE`: Indicates that the first row of the CSV file contains column headers.

If the `COPY FROM` process encounters errors, two different files will be generated in your bucket. The following queries show you how to load these error files into new tables so that you can query and examine the error details and the corresponding rows.

The following query loads the `error_reasons` CSV file, which contains a header with column names:
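A sketch of this load; the error bucket path, file pattern, and credentials are placeholders, since the actual error file names are generated by Firebolt:

```sql
-- Load the generated error_reasons file into a new table for inspection.
COPY INTO error_reasons
FROM 's3://your-error-bucket/errors/'
WITH
    PATTERN = '*error_reasons*.csv'  -- match the generated error_reasons file
    AUTO_CREATE = TRUE
    HEADER = TRUE
    TYPE = CSV
CREDENTIALS = (AWS_ROLE_ARN = '<aws_role_arn>');
```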
Use `MAX_ERRORS_PER_FILE`, `ERROR_FILE`, and `ERROR_FILE_CREDENTIALS` to manage how errors are handled, ensure data integrity, and record errors for future review. For more information about `ERROR_FILE` or `ERROR_FILE_CREDENTIALS`, see the Parameters section of COPY FROM.