COPY FROM

Loads data from an AWS S3 bucket into Firebolt. COPY FROM supports different data loading workflows including the following:

  • Automatically discover the schema during data loading.
  • Filter by metadata during loading.
  • Load multiple files in parallel into a target table.
  • Automatically create a table if it doesn’t already exist.
  • Load metadata about your source files into a table.
  • Handle errors during data loading.

Syntax

COPY
[INTO] <table_name> [ <column_mapping> ] 
FROM <externalLocations>
[ LIMIT <count> ]
[ OFFSET <start> ]
[ WITH <options> ]
[ WHERE <condition> ]

<column_mapping>:
    ( <column_name> [DEFAULT <default_value>] [ { $<source_column_index> | <source_column_name> } ] [, ...] )

<options>:
    [ CREDENTIALS = ( <credentials> ) ] 
    [ PATTERN = <regex_pattern> ]
    [ TYPE = { **AUTO** | CSV | TSV | PARQUET } ]
    [ AUTO_CREATE = { **TRUE** | FALSE } ]
    [ ALLOW_COLUMN_MISMATCH = { **TRUE** | FALSE } ]
    [ ERROR_FILE = <directoryLocation> ]
    [ ERROR_FILE_CREDENTIALS = <credentials> ]
    [ MAX_ERRORS_PER_FILE = { integer | '<percentage>' } ]
    [ <csv_options> ]

<credentials>:
    { AWS_KEY_ID = '<aws_key_id>' AWS_SECRET_KEY = '<aws_secret_key>' }

<csv_options>:
    [ HEADER = { **FALSE** | TRUE  } ]
    [ DELIMITER = 'character' ]
    [ NEWLINE = 'string' ]
    [ QUOTE = { **DOUBLE_QUOTE** | 'character' |  SINGLE_QUOTE } ]
    [ ESCAPE = 'character' ]
    [ NULL_STRING = 'string' ]
    [ EMPTY_FIELD_AS_NULL = { **TRUE** | FALSE } ]
    [ SKIP_BLANK_LINES = { **FALSE** | TRUE } ]
    [ DATE_FORMAT = <date_format> ]
    [ TIMESTAMP_FORMAT = <timestamp_format> ]

Parameters

Parameter Description
<table_name> The name of the target table.
<column_mapping> This feature is only available if the target table already exists. You can use column_mapping to specify the mapping between the source and target schemas. Select a column in the source file to map to a target column using either the column's name or its index.
<column_name> The name of a target column in a table. If <source_column_index/name> is not specified, source columns will be automatically mapped to target columns based on name.
<default_value> A replacement value for any NULL value generated by mapping the source to the target. The data type of default_value must be compatible with the data type of the target column.
<source_column_index> The index position of the column in the source data to be mapped. If you are specifying multiple source files, source_column_index specifies the index for all source files. The index starts at 1 and can be as large as 2^64-1, which allows for a very large number of columns. When you reference the index inside a command or query, precede the column index with a dollar sign ($). For example, the following COPY INTO statement maps source columns by index: CREATE TABLE t(a text, b text); COPY INTO t(a $1, b $2) FROM 's3://my_bucket/my_folder/my_file';.
<source_column_name> The name of the column in the source data to be mapped. If you are specifying multiple source files, source_column_name specifies the column name for all source files.
<externalLocations> One or multiple paths to a location inside an Amazon S3 bucket that contains source files. If externalLocations ends with a forward slash (/), Firebolt interprets its value as a folder. Otherwise, externalLocations is treated as the location for a single file. An example folder has the following format: s3://my_bucket/my_folder/. An example file has the following format: s3://my_bucket/my_folder/my_file.
<directoryLocation> The Amazon S3 path to a directory.
CREDENTIALS The Amazon S3 credentials for accessing the specified <externalLocations>. For more information, see CREDENTIALS.
PATTERN A string containing a glob pattern used to match the names of the source files to load. For example, '*.parquet' matches any file ending in .parquet.
TYPE The file type that Firebolt should expect when loading files identified by PATTERN. If TYPE is unspecified, Firebolt automatically detects the file type using the file’s suffix. If a file matched by PATTERN does not match the specified TYPE, Firebolt will generate an error. The default value is AUTO.
AUTO_CREATE Specify whether Firebolt should automatically create a table if it doesn’t already exist. If AUTO_CREATE is set to FALSE, Firebolt will generate an error if the target table is missing. If the target table already exists, AUTO_CREATE is ignored. The default setting is TRUE, allowing automatic table creation.
ALLOW_COLUMN_MISMATCH Set to FALSE to require that all required columns appear in the source file. Required columns are those listed in <column_mapping> by name or index. If no column mapping is provided, the target table's columns are required. Missing required columns cause row-based errors for CSV or TSV files, and file-based errors for Parquet files. When ALLOW_COLUMN_MISMATCH is set to TRUE, any columns missing from the source file are filled with NULL values. The default setting is TRUE.
ERROR_FILE The Amazon S3 location where error files are written. Use the same path format described for <externalLocations>. No error files are created by default or if the specified path doesn't exist. If ERROR_FILE is specified, a subdirectory is created based on the time of load submission, in the format YearMonthDay-HourMinuteSecond + QueryID, such as 20220330-173205. For CSV files, this directory contains a rejected_rows.csv file with the erroneous data rows, and an error_reasons.csv file with the reasons that the errors were generated. Because Parquet doesn't produce row-based error files, on error, only an error_reasons.csv file is generated.
ERROR_FILE_CREDENTIALS The Amazon S3 credentials required to write an error file to <externalLocations>. For more information, see CREDENTIALS.
MAX_ERRORS_PER_FILE Specify the maximum number of rows that can be rejected per file, either as an integer or as a percentage in the format 'integer%', such as '100%'. The only valid percentage values are 0% and 100%. If you specify an integer value, COPY FROM loads data until it encounters that number of errors in a file, and then stops the loading job and returns an error. For example, if you specify 3, COPY FROM loads data until it encounters 3 errors, and then ends the job with an error. By default, no errors are allowed. If MAX_ERRORS_PER_FILE is set to '100%', all errors are allowed.
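
For example, a minimal sketch that combines several of these options might look like the following. The bucket, table, and credentials are placeholders, and the target table is assumed to already exist because AUTO_CREATE is set to FALSE:

COPY INTO my_existing_table
FROM 's3://my_private_bucket/my_folder/'
WITH
    CREDENTIALS = ( AWS_KEY_ID = '<aws_key_id>' AWS_SECRET_KEY = '<aws_secret_key>' )  -- read access to the private bucket
    PATTERN = '*.csv'          -- load only files ending in .csv
    TYPE = CSV
    AUTO_CREATE = FALSE        -- fail instead of creating the table if it is missing
    MAX_ERRORS_PER_FILE = 10;  -- stop loading a file after 10 rejected rows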

Parameters for CSV files

Parameter Description
HEADER Specify if the file contains a header line containing the column names. If HEADER is TRUE, the first line will be interpreted to contain column names. The default value is FALSE.
DELIMITER Specify the character used to separate fields. The default delimiter is a comma (,).
NEWLINE Specify the character used to delimit rows. The default newline character is \n. If NEWLINE is \n, then \r, \r\n, and \n\r are also treated as newline characters. Custom characters are allowed only if the target table already exists.
QUOTE Specify the character used for quoting fields. The default quote character is DOUBLE_QUOTE. You can specify either SINGLE_QUOTE or DOUBLE_QUOTE. You can also specify the single quote literal character (') or the double quote literal character (").
ESCAPE Specify the character used to escape special characters. The default escape character is the quote (') character.
NULL_STRING Specify the string used to represent NULL values. The default null string is an empty string, which means that no specific null string is defined.
EMPTY_FIELD_AS_NULL Specify whether empty fields should be interpreted as NULL values. The default value is TRUE.
SKIP_BLANK_LINES Specify whether to ignore blank lines. If SKIP_BLANK_LINES is TRUE, COPY FROM ignores blank lines. The default value is FALSE.
DATE_FORMAT Specify the date format for parsing text into date columns. This format will apply to all columns loaded as date columns. For supported formats, see TO_DATE.
TIMESTAMP_FORMAT Specify the timestamp format for parsing text into timestamp columns. The format will apply to all columns loaded as timestamp columns. For supported formats, see TO_TIMESTAMP.
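
As an illustration, the following sketch loads a pipe-delimited CSV file that marks NULL values with \N and uses a non-default date format. The bucket, file, and table names are placeholders, and the format string is only an example of the formats documented for TO_DATE:

COPY csv_options_example
FROM 's3://my_bucket/my_folder/my_file.csv'
WITH TYPE=CSV
    HEADER=TRUE              -- the first line contains column names
    DELIMITER='|'            -- fields are separated by a pipe instead of a comma
    NULL_STRING='\N'         -- \N in the file is loaded as NULL
    SKIP_BLANK_LINES=TRUE    -- ignore blank lines instead of rejecting them
    DATE_FORMAT='YYYY-MM-DD';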

COPY FROM supports the following file formats:

  • Parquet
  • CSV/TSV

Examples

Automatic Schema Discovery

You can use the automatic schema discovery feature in COPY FROM to handle even very large data sources without manually defining the schema. The following apply:

  • Parquet files - Firebolt automatically reads metadata in Parquet files to create corresponding target tables.
  • CSV files - Firebolt infers column types based on the data content itself, which can streamline the initial data loading process significantly. Use WITH HEADER=TRUE if your CSV file contains column names in the first line.

Automatic schema discovery operates on a “best effort” basis, and attempts to balance accuracy with practical usability, but it may not always be error-free.

The following query reads levels.csv, a sample dataset from the fictional Ultra Fast Gaming Inc. Because AUTO_CREATE defaults to TRUE, the example implicitly creates the target table and discovers its schema from the file:

COPY automatic_schema_table 
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE;

Use the following example code to display the table contents, ordered by the fourth column, followed by the fifth column:

SELECT * FROM automatic_schema_table ORDER BY 4,5;

Use PATTERN to insert data into an existing table

You can use the PATTERN option, which accepts glob patterns, to select several files that match the specified pattern and populate a target table. The following example uses the *.csv pattern to read all files ending in .csv into the pattern_target table:

COPY pattern_target FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'
WITH TYPE=CSV HEADER=TRUE PATTERN='*.csv';

In the previous example, there are two CSV files in the firebolt_sample_dataset folder: levels.csv and tournaments.csv. These files have different schemas. COPY FROM reads them into a single table and infers the schema from the first file. Any column mismatches are filled with NULL values.

Filter by metadata during loading

When loading data into tables, you can filter data using the following options:

  1. LIMIT: Restricts the number of rows loaded, which can be useful to preview or create sample datasets.

  2. OFFSET: Skips a specified number of rows in the final result set before ingestion. The OFFSET clause in COPY FROM behaves the same way as the OFFSET clause in SELECT queries.

    • COPY FROM currently does not support the ORDER BY clause. Thus, using OFFSET may result in different outcomes every time you run the command.

    • Both LIMIT and OFFSET apply to the entire result set, not to individual files.

  3. WHERE: Filters data based on source file metadata, as follows:

    • $source_file_name - The full path of the source file in an Amazon S3 bucket, without the name of the bucket. For example, if your bucket is: s3://my_bucket/xyz/year=2018/month=01/part-00001.parquet, then $source_file_name is xyz/year=2018/month=01/part-00001.parquet.
    • $source_file_timestamp - The timestamp, in UTC and with second precision, at which the source file was last modified in an Amazon S3 bucket.
    • $source_file_size - The size in bytes of the source file.
    • $source_file_etag - The ETag of the file, often used for version control.

In the following code example, COPY FROM first reads all the files in the specified directory that were modified in the last three years. Then, it applies the OFFSET and LIMIT clauses. As long as all source files modified in the last three years have at least 100 rows combined, the result set will have exactly 50 rows.

COPY tournament_results
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/rankings/TournamentID=1/'
LIMIT 50 OFFSET 50
WHERE $source_file_timestamp > NOW() - interval '3 YEARS';

The previous code example returns a table containing 50 rows of data that was modified in the last three years.
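
The other metadata columns can be used in the same way. For example, the following sketch loads only files under a month=01 partition that are smaller than roughly 100 MB; the bucket layout and table name are hypothetical, and it assumes standard comparison and LIKE expressions are available in the WHERE clause:

COPY tournament_results_filtered
FROM 's3://my_bucket/rankings/'
WHERE $source_file_name LIKE '%month=01%'  -- only files in the month=01 partition
  AND $source_file_size < 100000000;       -- skip files of 100 MB or more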

Load multiple files and directories in parallel

You can use COPY FROM to read multiple files from multiple directories into a single table simultaneously. The following code example reads any file ending in .parquet from multiple directories into table_from_multiple_directories:

COPY table_from_multiple_directories FROM 
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/TournamentID=1/',
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/TournamentID=10/',
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/TournamentID=100/'
WITH PATTERN='*.parquet';

The following code example reads two CSV files into table_from_multiple_files:

COPY table_from_multiple_files FROM 
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv',
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/tournaments.csv'
WITH HEADER=TRUE;

In the previous example, there are two CSV files in the firebolt_sample_dataset folder: levels.csv and tournaments.csv. These files have different schemas. COPY FROM reads them into a single table and infers the schema from the first file. Any column mismatches are filled with NULL values.

Load metadata into a table

You can load metadata information about your source file into your table so that you can track the source name, timestamp, size, and ETag information for each row. You can use the following metadata columns:

  • $source_file_name - The full path of the source file in an Amazon S3 bucket, without the name of the bucket. For example, if your bucket is: s3://my_bucket/xyz/year=2018/month=01/part-00001.parquet, then $source_file_name is xyz/year=2018/month=01/part-00001.parquet.
  • $source_file_timestamp - The timestamp, in UTC and with second precision, at which the source file was last modified in an Amazon S3 bucket.
  • $source_file_size - The size in bytes of the source file.
  • $source_file_etag - The ETag of the file, often used for version control.

The following code creates the levels table, and populates it with information from the LevelID column and the timestamp from the source data:

CREATE TABLE levels ("LevelID" TEXT NOT NULL, date_of_creation TIMESTAMP);
COPY INTO levels("LevelID", date_of_creation $source_file_timestamp)
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv' WITH HEADER=TRUE;

The following code example displays the contents of levels:

SELECT * FROM levels;

The first three rows of the sample output follow:

LevelID (TEXT) date_of_creation (TIMESTAMP)
1 2023-02-27 10:06:52
2 2023-02-27 10:06:52
3 2023-02-27 10:06:52
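
All four metadata columns can be captured in the same way. The following sketch, with a hypothetical table name and a BIGINT column assumed for the file size, records the source file name, size, ETag, and timestamp alongside each row:

CREATE TABLE levels_with_metadata (
    "LevelID" TEXT,
    src_name TEXT,           -- file path within the bucket
    src_size BIGINT,         -- file size in bytes
    src_etag TEXT,           -- file ETag
    src_modified TIMESTAMP); -- last-modified timestamp

COPY INTO levels_with_metadata("LevelID", src_name $source_file_name, src_size $source_file_size,
    src_etag $source_file_etag, src_modified $source_file_timestamp)
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE;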

Allow column name mismatch

If you specify a column mapping during data loading, COPY FROM treats the columns listed in the <column_mapping> as required. If no column mapping is specified, the columns in the target table are considered required. To allow the data to continue loading when some required columns are missing from the source file, you can use ALLOW_COLUMN_MISMATCH, which is enabled by default.

For example, if you create a table with LevelID2 and Name columns and attempt to load data from the levels.csv dataset, which lacks a LevelID2 column, COPY FROM will populate the Name column as specified and fill the LevelID2 column with NULL values. This allows flexible handling of missing data without raising errors, as shown in the following code example:

CREATE TABLE col_mismatch ("LevelID2" int, "Name" text);
COPY col_mismatch
  FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE MAX_ERRORS_PER_FILE='0%';
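
To make the same load fail on the missing column instead of filling it with NULL values, a sketch like the following sets ALLOW_COLUMN_MISMATCH to FALSE explicitly. Because levels.csv has no LevelID2 column, this version is expected to produce row-based errors rather than load NULL values:

COPY col_mismatch
  FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE ALLOW_COLUMN_MISMATCH=FALSE MAX_ERRORS_PER_FILE='0%';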

Error handling

The following sections show you how to handle errors for both CSV and Parquet files.

Row-based errors in CSV

COPY FROM generates a row-based error when there's a mismatch between source and target table columns. In the following example, the col_mismatch_csv table includes a LevelID2 column, defined as NOT NULL, that does not exist in the source levels.csv file:

CREATE TABLE col_mismatch_csv ("LevelID2" int NOT NULL);

COPY col_mismatch_csv FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE MAX_ERRORS_PER_FILE='0%';

In the previous code example, when COPY FROM does not find the target table's LevelID2 column in the source file, it tries to fill the column with NULL values. Because LevelID2 is defined with a constraint that it cannot contain NULL values, the query stops loading into the table and generates the following error: ERROR: The INSERT INTO statement failed because it tried to insert a NULL into the column LevelID2, which is NOT NULL. Please specify a value or redefine the column's logical constraints.

Allow all row-based errors in CSV

The previous code example uses MAX_ERRORS_PER_FILE='0%', which causes the loading job to fail if there is a single error. You can change this behavior to allow errors. The following code example allows all errors, and the load job completes even if no data loads into the target table:

CREATE TABLE table_all_errors ("LevelID2" int NOT NULL);

COPY table_all_errors FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE MAX_ERRORS_PER_FILE='100%';

Column data type mismatch in CSV

If you try to load data into a column in an existing table that has a different data type than the source data, COPY FROM will attempt to cast the data into the specified data type. If the cast fails, COPY FROM generates an error. To demonstrate this error, the following example intentionally creates a table that defines the Name column incorrectly as an integer, instead of as text, and then attempts to copy data into it:

CREATE TABLE col_mismatch_type_csv ("Name" int);

COPY col_mismatch_type_csv("Name" name)
  FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH TYPE=CSV HEADER=TRUE;

The previous code example generates the following error: Line 1, Column 8: Unable to cast text 'Thunderbolt Circuit' to integer.

In the previous code example, the query generates an error because the default value for MAX_ERRORS_PER_FILE is 0. You can set MAX_ERRORS_PER_FILE to 100% to allow all errors, as shown in the following section.

Allow all errors, and write them to file

You can also allow all errors, so that the loading job continues until it has attempted to load all rows in your dataset. Firebolt can write these errors to an Amazon S3 bucket as CSV files. If your specified S3 bucket requires access credentials, you must specify them so that Firebolt can write the files on your behalf. Data rows that load without error are ingested in row order. A loading job that specifies writing error files writes files with the following names to your Amazon S3 bucket:

  • error_reasons.csv - An error file that contains all the reasons that a row generated an error, and also file-based errors.
  • rejected_rows.csv - An error file that contains all the rejected rows in row order.

Producing an error while reading Parquet files doesn't generate row-based error files. On error, only an error_reasons.csv file is generated.

Each of these files has an ordinal number appended to its name, such as error_reasons_1.csv.

The following code example allows all errors, provides credentials, and writes two error files to an Amazon S3 bucket:

CREATE TABLE table_write_errors(TournamentID INT, Name INT);

COPY table_write_errors(TournamentID TournamentId, Name Name) FROM 's3://firebolt-publishing-public/help_center_assets/tournaments.csv'
WITH ERROR_FILE_CREDENTIALS = (
    AWS_KEY_ID = 'YOUR_AWS_KEY_ID'
    AWS_SECRET_KEY = 'YOUR_AWS_SECRET_KEY'
)
MAX_ERRORS_PER_FILE='100%' HEADER=TRUE ERROR_FILE='s3://bucket_name/error_directory/';

To provide your credentials in the previous example, do the following:

  • Replace YOUR_AWS_KEY_ID with an AWS access key that is associated with an AWS user or AWS IAM role. The AWS access key is a 20-character string such as AKIAIOSFODNN7EXAMPLE.
  • Replace YOUR_AWS_SECRET_KEY with the AWS secret access key for the same user or role. The AWS secret access key is a 40-character string such as wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY.

Read errors from file

The previous COPY FROM example shows how to create two error files, one that describes the error reasons and one that contains the rows that had errors. This example shows how to load and view the contents of these files.

The following code example reads all files that begin with error_reasons and end with .csv into an error_reasons table:

COPY error_reasons FROM 's3://bucket_name/error_directory/' 
WITH PATTERN='*error_reasons*.csv' HEADER=TRUE;

The following code returns the contents of the error_reasons table:

SELECT * FROM error_reasons ORDER BY source_line_num LIMIT 1;

The following output shows an example of the contents of the error_reasons table:

file_name (TEXT) source_line_num (BIGINT) error_message (TEXT)
help_center_assets/tournaments.csv 1 Error while casting

The following code reads all files that begin with rejected_rows and end with .csv into a rejected_rows table:

COPY rejected_rows FROM 's3://bucket_name/error_directory/'
WITH PATTERN='*rejected_rows*.csv' HEADER=FALSE;

Use SELECT to view the loaded error rows. The following code returns the contents of the rejected_rows table:

SELECT * FROM rejected_rows ORDER BY f0 LIMIT 1;

The following output shows the contents of the rejected_rows table after running the previous SELECT statement:

f0 (TEXT) f1 (TEXT) f2 (TEXT) f3 (TEXT) f4 (TEXT) f5 (TEXT) f6 (TEXT)
1 The Snow Park Grand Prix 1 20903 2021-05-28 20:40:54 2021-05-29 04:51:43

Column mapping and default values

You can map a specific source column to a target column, and specify a default value to replace any NULL values generated during mapping.

The following code example maps the LevelID column from the levels.csv sample dataset into the LevelID_team_A column of the target_default_mapping table. It also maps a non-existent Country column in the levels.csv dataset into LevelID_team_B:

CREATE TABLE target_default_mapping ("LevelID_team_A" text, "LevelID_team_B" text);
COPY target_default_mapping("LevelID_team_A" "LevelID", "LevelID_team_B" default 50 "Country")
  FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE;

In the previous example, all rows under LevelID_team_B will contain the value 50.

Type mismatch errors

If you read a column from a source file into a table with an incompatible data type, the mapping generates a casting error. A loading job that specifies writing error files will write files starting with the following prefixes to a specified Amazon S3 bucket:

  • error_reasons - An error file that contains all the reasons that a row generated an error, and also file-based errors.
  • rejected_rows - An error file that contains all the rejected rows in row order.

The following code uses the Firebolt sample players dataset, which has a PlayerID column with an INTEGER data type, and attempts to read it into an existing column with a DATE data type:

CREATE TABLE IF NOT EXISTS
 players (
   PlayerID DATE,
   Nickname TEXT,
   Email TEXT);

COPY players FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/players/'
WITH TYPE=PARQUET MAX_ERRORS_PER_FILE='100%'
ERROR_FILE='s3://bucket_name/parquet_error_directory/';

Use the following sample code to load all error files whose names contain error_reasons into a table and view its contents:

COPY error_reasons FROM 's3://bucket_name/parquet_error_directory/'
WITH PATTERN='*error_reasons*' HEADER=TRUE;

SELECT * FROM error_reasons;

The output resembles the following:

file_name (TEXT) source_line_num (BIGINT) error_message (TEXT)
gaming/parquet/players/11afd184-d2d4-4471-b23c-a14f4c0945a2_1_0_0.snappy.parquet 0 Can not assignment cast column playerid from type integer null to type date null

The type mismatch error in this example creates a file-based error, or one that affects the entire file during processing, such as errors caused by incorrect format. As a result, the query does not produce the rejected_rows error file.