COPY FROM

Loads data from an AWS S3 bucket into Firebolt. COPY FROM supports different data loading workflows including the following:

  • Automatically discover the schema during data loading.
  • Filter by metadata during loading.
  • Load multiple files in parallel into a target table.
  • Automatically create a table if it doesn’t already exist.
  • Load metadata about your source files into a table.
  • Handle errors during data loading.

Syntax

COPY
[INTO] <table_name> [ <column_mapping> ] 
FROM <externalLocations>
[ LIMIT <count> ]
[ OFFSET <start> ]
[ WITH <options> ]
[ WHERE <condition> ]

<column_mapping>:
    ( <column_name> [DEFAULT <default_value>] [ { $<source_column_index> | <source_column_name> } ] [, ...] )

<options>:
    [ CREDENTIALS = ( <credentials> ) ] 
    [ PATTERN = <regex_pattern> ]
    [ TYPE = { **AUTO** | CSV | TSV | PARQUET } ]
    [ AUTO_CREATE = { **TRUE** | FALSE } ]
    [ ALLOW_COLUMN_MISMATCH = { **TRUE** | FALSE } ]
    [ ERROR_FILE = <directoryLocation> ]
    [ ERROR_FILE_CREDENTIALS = <credentials> ]
    [ MAX_ERRORS_PER_FILE = { integer | '<percentage>' } ]
    [ <csv_options> ]

<credentials>:
    { AWS_KEY_ID = '<aws_key_id>' AWS_SECRET_KEY = '<aws_secret_key>' }

<csv_options>:
    [ HEADER = { **FALSE** | TRUE  } ]
    [ DELIMITER = 'character' ]
    [ NEWLINE = 'string' ]
    [ QUOTE = { **DOUBLE_QUOTE** | 'character' |  SINGLE_QUOTE } ]
    [ ESCAPE = 'character' ]
    [ NULL_STRING = 'string' ]
    [ EMPTY_FIELD_AS_NULL = { **TRUE** | FALSE } ]
    [ SKIP_BLANK_LINES = { **FALSE** | TRUE } ]
    [ DATE_FORMAT = <date_format> ]
    [ TIMESTAMP_FORMAT = <timestamp_format> ]

Parameters

Parameter Description
<table_name> The name of the target table.
<column_mapping> This feature is only available if the target table already exists. You can use column_mapping to specify the mapping between the source and target schemas. Select a column in the source file to map to a target column using either the column's name or its index.
<column_name> The name of a target column in a table. If <source_column_index/name> is not specified, source columns will be automatically mapped to target columns based on name.
<default_value> A replacement value for any NULL value generated by mapping the source to the target. The data type of default_value must be compatible with the data type of the target column.
<source_column_index> The index position of the column in the source data to be mapped. If you are specifying multiple source files, source_column_index specifies the index for all source files. The index starts at 1 and can be as large as 2^64-1, which allows for a very large number of columns. When you reference the index inside a command or query, precede the column index with a dollar sign ($). For example, the following COPY INTO statement maps source columns by index: CREATE TABLE t(a text, b text); COPY INTO t(a $1, b $2) FROM 's3://my_bucket/my_folder/my_file';.
<source_column_name> The name of the column in the source data to be mapped. If you are specifying multiple source files, source_column_name specifies the column name for all source files.
<externalLocations> One or multiple paths to a location inside an Amazon S3 bucket that contains source files. If externalLocations ends with a forward slash (/), Firebolt interprets its value as a folder. Otherwise, externalLocations is treated as the location for a single file. An example folder has the following format: s3://my_bucket/my_folder/. An example file has the following format: s3://my_bucket/my_folder/my_file.
<directoryLocation> The Amazon S3 path to a directory.
CREDENTIALS The Amazon S3 credentials for accessing the specified <externalLocations>. For more information, see CREDENTIALS.
PATTERN A string containing a glob pattern used to match the names of the source files to load. For example, '*.parquet' matches any file ending in .parquet.
TYPE The file type that Firebolt should expect when loading files identified by PATTERN. If TYPE is unspecified, Firebolt automatically detects the file type using the file’s suffix. If a file matched by PATTERN does not match the specified TYPE, Firebolt will generate an error. The default value is AUTO.
AUTO_CREATE Specify whether Firebolt should automatically create a table if it doesn’t already exist. If AUTO_CREATE is set to FALSE, Firebolt will generate an error if the target table is missing. If the target table already exists, AUTO_CREATE is ignored. The default setting is TRUE, allowing automatic table creation.
ALLOW_COLUMN_MISMATCH Set to FALSE to require that all required columns appear in the source file. Required columns are those listed in <column_mapping> by name or index. If no column mapping is provided, the target table's columns are required. Missing required columns cause row-based errors for CSV or TSV files, and file-based errors for Parquet files. When ALLOW_COLUMN_MISMATCH is set to TRUE, any columns missing from the source file are filled with NULL values. The default setting is TRUE.
ERROR_FILE The Amazon S3 location where error files are written. Use the same path format described for <externalLocations>. No error files are created by default or if the specified path doesn't exist. If ERROR_FILE is specified, a subdirectory is created based on the time of load submission, in the format YearMonthDay-HourMinuteSecond + QueryID, such as 20220330-173205. For CSV files, this directory contains a rejected_rows.csv file with the erroneous data rows, and an error_reasons.csv file with the reasons that the errors were generated. Because Parquet doesn't produce row-based error files, on error, only an error_reasons.csv file is generated.
ERROR_FILE_CREDENTIALS The Amazon S3 credentials required to write an error file to <externalLocations>. For more information, see CREDENTIALS.
MAX_ERRORS_PER_FILE Specify the maximum number of rows that can be rejected per file, either as an integer or as a percentage in the format 'integer%', such as '100%'. The only valid percentage values are 0% and 100%. If you specify an integer value, COPY FROM loads data until it encounters that number of errors in a file, and then stops the loading job and returns an error. For example, if you specify 3, COPY FROM loads data until it encounters 3 errors, and then ends the job with an error. By default, no errors are allowed. If MAX_ERRORS_PER_FILE is set to '100%', all errors are allowed.
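
For example, a minimal sketch that combines several of these options might look like the following. The bucket, table, and credentials are placeholders, and the target table is assumed to already exist because AUTO_CREATE is set to FALSE:

COPY INTO my_existing_table
FROM 's3://my_private_bucket/my_folder/'
WITH
    CREDENTIALS = ( AWS_KEY_ID = '<aws_key_id>' AWS_SECRET_KEY = '<aws_secret_key>' )  -- read access to the private bucket
    PATTERN = '*.csv'          -- load only files ending in .csv
    TYPE = CSV
    AUTO_CREATE = FALSE        -- fail instead of creating the table if it is missing
    MAX_ERRORS_PER_FILE = 10;  -- stop loading a file after 10 rejected rows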

Parameters for CSV files

Parameter Description
HEADER Specify if the file contains a header line containing the column names. If HEADER is TRUE, the first line will be interpreted to contain column names. The default value is FALSE.
DELIMITER Specify the character used to separate fields. The default delimiter is a comma (,).
NEWLINE Specify the character used to delimit rows. The default newline character is \n. If NEWLINE is \n, then \r, \r\n, and \n\r are also treated as newline characters. Custom characters are allowed only if the target table already exists.
QUOTE Specify the character used for quoting fields. The default quote character is DOUBLE_QUOTE. You can specify either SINGLE_QUOTE or DOUBLE_QUOTE. You can also specify the single quote literal character (') or the double quote literal character (").
ESCAPE Specify the character used to escape special characters. The default escape character is the quote (') character.
NULL_STRING Specify the string used to represent NULL values. The default null string is an empty string, which means that no specific null string is defined.
EMPTY_FIELD_AS_NULL Specify whether empty fields should be interpreted as NULL values. The default value is TRUE.
SKIP_BLANK_LINES Specify whether to ignore blank lines. If SKIP_BLANK_LINES is TRUE, COPY FROM ignores blank lines. The default value is FALSE.
DATE_FORMAT Specify the date format for parsing text into date columns. This format will apply to all columns loaded as date columns. For supported formats, see TO_DATE.
TIMESTAMP_FORMAT Specify the timestamp format for parsing text into timestamp columns. The format will apply to all columns loaded as timestamp columns. For supported formats, see TO_TIMESTAMP.
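
As an illustration, the following sketch loads a pipe-delimited CSV file that marks NULL values with \N and uses a non-default date format. The bucket, file, and table names are placeholders, and the format string is only an example of the formats documented for TO_DATE:

COPY csv_options_example
FROM 's3://my_bucket/my_folder/my_file.csv'
WITH TYPE=CSV
    HEADER=TRUE              -- the first line contains column names
    DELIMITER='|'            -- fields are separated by a pipe instead of a comma
    NULL_STRING='\N'         -- \N in the file is loaded as NULL
    SKIP_BLANK_LINES=TRUE    -- ignore blank lines instead of rejecting them
    DATE_FORMAT='YYYY-MM-DD';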

COPY FROM supports the following file formats:

  • Parquet
  • CSV/TSV

Examples

Automatic Schema Discovery

You can use the automatic schema discovery feature in COPY FROM to handle even very large data sources without manually defining the schema. The following apply:

  • Parquet files - Firebolt automatically reads metadata in Parquet files to create corresponding target tables.
  • CSV files - Firebolt infers column types based on the data content itself, which can streamline the initial data loading process significantly. Use WITH HEADER=TRUE if your CSV file contains column names in the first line.

Automatic schema discovery operates on a “best effort” basis, and attempts to balance accuracy with practical usability, but it may not always be error-free.

The following query reads levels.csv, a sample dataset from the fictional Ultra Fast Gaming Inc. Because AUTO_CREATE defaults to TRUE, the example implicitly creates the target table and discovers its schema from the file:

COPY automatic_schema_table 
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE;

Use the following example code to display the table contents, ordered by the fourth column, followed by the fifth column:

SELECT * FROM automatic_schema_table ORDER BY 4,5;

Use PATTERN to insert data into an existing table

You can use the PATTERN option, which accepts glob patterns, to select several files that match the specified pattern and populate a target table. The following example uses the *.csv pattern to read all files ending in .csv into the pattern_target table:

COPY pattern_target FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'
WITH TYPE=CSV HEADER=TRUE PATTERN='*.csv';

In the previous example, there are two CSV files in the firebolt_sample_dataset folder: levels.csv and tournaments.csv. These files have different schemas. COPY FROM reads them into a single table and infers the schema from the first file. Any column mismatches are filled with NULL values.

Filter by metadata during loading

When loading data into tables, you can filter data using the following options:

  1. LIMIT: Restricts the number of rows loaded, which can be useful to preview or create sample datasets.

  2. OFFSET: Skips a specified number of rows in the final result set before ingestion. The OFFSET clause in COPY FROM behaves the same way as the OFFSET clause in SELECT queries.

    • COPY FROM currently does not support the ORDER BY clause. Thus, using OFFSET may result in different outcomes every time you run the command.

    • Both LIMIT and OFFSET apply to the entire result set, not to individual files.

  3. WHERE: Filters data based on source file metadata, as follows:

    • $source_file_name - The full path of the source file in an Amazon S3 bucket, without the name of the bucket. For example, if your bucket is: s3://my_bucket/xyz/year=2018/month=01/part-00001.parquet, then $source_file_name is xyz/year=2018/month=01/part-00001.parquet.
    • $source_file_timestamp - The timestamp, in UTC and with second precision, at which the source file was last modified in an Amazon S3 bucket.
    • $source_file_size - The size in bytes of the source file.
    • $source_file_etag - The ETag of the file, often used for version control.

In the following code example, COPY FROM first reads all the files in the specified directory that were modified in the last three years. Then, it applies the OFFSET and LIMIT clauses. As long as all source files modified in the last three years have at least 100 rows combined, the result set will have exactly 50 rows.

COPY tournament_results
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/rankings/TournamentID=1/'
LIMIT 50 OFFSET 50
WHERE $source_file_timestamp > NOW() - interval '3 YEARS';

The previous code example returns a table containing 50 rows of data that was modified in the last three years.
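
The other metadata columns can be used in the same way. For example, the following sketch loads only files under a month=01 partition that are smaller than roughly 100 MB; the bucket layout and table name are hypothetical, and it assumes standard comparison and LIKE expressions are available in the WHERE clause:

COPY tournament_results_filtered
FROM 's3://my_bucket/rankings/'
WHERE $source_file_name LIKE '%month=01%'  -- only files in the month=01 partition
  AND $source_file_size < 100000000;       -- skip files of 100 MB or more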

Load multiple files and directories in parallel

You can use COPY FROM to read multiple files from multiple directories into a single table simultaneously. The following code example reads any file ending in .parquet from multiple directories into table_from_multiple_directories:

COPY table_from_multiple_directories FROM 
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/TournamentID=1/',
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/TournamentID=10/',
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/TournamentID=100/'
WITH PATTERN='*.parquet';

The following code example reads two CSV files into table_from_multiple_files:

COPY table_from_multiple_files FROM 
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv',
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/tournaments.csv'
WITH HEADER=TRUE;

In the previous example, there are two CSV files in the firebolt_sample_dataset folder: levels.csv and tournaments.csv. These files have different schemas. COPY FROM reads them into a single table and infers the schema from the first file. Any column mismatches are filled with NULL values.

Load metadata into a table

You can load metadata information about your source file into your table so that you can track the source name, timestamp, size, and ETag information for each row. You can use the following metadata columns:

  • $source_file_name - The full path of the source file in an Amazon S3 bucket, without the name of the bucket. For example, if your bucket is: s3://my_bucket/xyz/year=2018/month=01/part-00001.parquet, then $source_file_name is xyz/year=2018/month=01/part-00001.parquet.
  • $source_file_timestamp - The timestamp, in UTC and with second precision, at which the source file was last modified in an Amazon S3 bucket.
  • $source_file_size - The size in bytes of the source file.
  • $source_file_etag - The ETag of the file, often used for version control.

The following code creates the levels table, and populates it with information from the LevelID column and the timestamp from the source data:

CREATE TABLE levels ("LevelID" TEXT NOT NULL, date_of_creation TIMESTAMP);
COPY INTO levels("LevelID", date_of_creation $source_file_timestamp)
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv' WITH HEADER=TRUE;

The following code example displays the contents of levels:

SELECT * FROM levels;

The first three rows of the sample output follow:

LevelID (TEXT) date_of_creation (TIMESTAMP)
1 2023-02-27 10:06:52
2 2023-02-27 10:06:52
3 2023-02-27 10:06:52
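
All four metadata columns can be captured in the same way. The following sketch, with a hypothetical table name and a BIGINT column assumed for the file size, records the source file name, size, ETag, and timestamp alongside each row:

CREATE TABLE levels_with_metadata (
    "LevelID" TEXT,
    src_name TEXT,           -- file path within the bucket
    src_size BIGINT,         -- file size in bytes
    src_etag TEXT,           -- file ETag
    src_modified TIMESTAMP); -- last-modified timestamp

COPY INTO levels_with_metadata("LevelID", src_name $source_file_name, src_size $source_file_size,
    src_etag $source_file_etag, src_modified $source_file_timestamp)
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE;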

Allow column name mismatch

If you specify a column mapping during data loading, COPY FROM treats the columns listed in the <column_mapping> as required. If no column mapping is specified, the columns in the target table are considered required. To allow the data to continue loading when some required columns are missing from the source file, you can use ALLOW_COLUMN_MISMATCH, which is enabled by default.

For example, if you create a table with LevelID2 and Name columns and attempt to load data from the levels.csv dataset, which lacks a LevelID2 column, COPY FROM will populate the Name column as specified and fill the LevelID2 column with NULL values. This allows flexible handling of missing data without raising errors, as shown in the following code example:

CREATE TABLE col_mismatch ("LevelID2" int, "Name" text);
COPY col_mismatch
  FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE MAX_ERRORS_PER_FILE='0%';
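
To make the same load fail on the missing column instead of filling it with NULL values, a sketch like the following sets ALLOW_COLUMN_MISMATCH to FALSE explicitly. Because levels.csv has no LevelID2 column, this version is expected to produce row-based errors rather than load NULL values:

COPY col_mismatch
  FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE ALLOW_COLUMN_MISMATCH=FALSE MAX_ERRORS_PER_FILE='0%';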

Error handling

The following sections show you how to handle errors for both CSV and Parquet files.

Row-based errors in CSV

COPY FROM generates a row-based error when there's a mismatch between source and target table columns. In the following example, the col_mismatch_csv table includes a LevelID2 column, defined as NOT NULL, that does not exist in the source levels.csv file:

CREATE TABLE col_mismatch_csv ("LevelID2" int NOT NULL);

COPY col_mismatch_csv FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE MAX_ERRORS_PER_FILE='0%';

In the previous code example, when COPY FROM does not find the target table's LevelID2 column in the source file, it tries to fill the column with NULL values. Because LevelID2 is defined with a constraint that it cannot contain NULL values, the query stops loading into the table and generates the following error: ERROR: The INSERT INTO statement failed because it tried to insert a NULL into the column LevelID2, which is NOT NULL. Please specify a value or redefine the column's logical constraints.

Allow all row-based errors in CSV

The previous code example uses MAX_ERRORS_PER_FILE='0%', which causes the loading job to fail if there is a single error. You can change this behavior to allow errors. The following code example allows all errors, and the load job completes even if no data loads into the target table:

CREATE TABLE table_all_errors ("LevelID2" int NOT NULL);

COPY table_all_errors FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE MAX_ERRORS_PER_FILE='100%';

Column data type mismatch in CSV

If you try to load data into a column in an existing table that has a different data type than the source data, COPY FROM will attempt to cast the data into the specified data type. If the cast fails, COPY FROM generates an error. To demonstrate this error, the following example intentionally creates a table that defines the Name column incorrectly as an integer, instead of as text, and then attempts to copy data into it:

CREATE TABLE col_mismatch_type_csv ("Name" int);

COPY col_mismatch_type_csv("Name" name)
  FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH TYPE=CSV HEADER=TRUE;

The previous code example generates the following error: Line 1, Column 8: Unable to cast text 'Thunderbolt Circuit' to integer.

In the previous code example, the query generates an error because the default value for MAX_ERRORS_PER_FILE is 0. You can set MAX_ERRORS_PER_FILE to 100% to allow all errors, as shown in the following section.

Allow all errors, and write them to file

You can also allow all errors, so that the loading job continues until it has attempted to load all rows in your dataset. Firebolt can write these errors to an Amazon S3 bucket as CSV files. If your specified S3 bucket requires access credentials, you must specify them so that Firebolt can write the files on your behalf. Data rows that load without error are ingested in row order. A loading job that specifies writing error files writes files with the following names to your Amazon S3 bucket:

  • error_reasons.csv - An error file that contains all the reasons that a row generated an error, and also file-based errors.
  • rejected_rows.csv - An error file that contains all the rejected rows in row order.

Producing an error while reading Parquet files doesn't generate row-based error files. On error, only an error_reasons.csv file is generated.

Each of these files has an ordinal number appended to its name, such as error_reasons_1.csv.

The following code example allows all errors, provides credentials, and writes two error files to an Amazon S3 bucket:

CREATE TABLE table_write_errors(TournamentID INT, Name INT);

COPY table_write_errors(TournamentID TournamentId, Name Name) FROM 's3://firebolt-publishing-public/help_center_assets/tournaments.csv'
WITH ERROR_FILE_CREDENTIALS = (
    AWS_KEY_ID = 'YOUR_AWS_KEY_ID'
    AWS_SECRET_KEY = 'YOUR_AWS_SECRET_KEY'
)
MAX_ERRORS_PER_FILE='100%' HEADER=TRUE ERROR_FILE='s3://bucket_name/error_directory/';

To provide your credentials in the previous example, do the following:

  • Replace YOUR_AWS_KEY_ID with an AWS access key that is associated with an AWS user or AWS IAM role. The AWS access key is a 20-character string such as AKIAIOSFODNN7EXAMPLE.
  • Replace YOUR_AWS_SECRET_KEY with the AWS secret access key for the same user or role. The AWS secret access key is a 40-character string such as wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY.

Read errors from file

The previous COPY FROM example shows how to create two error files, one that describes the error reasons and one that contains the rows that had errors. This example shows how to load and view the contents of these files.

The following code example reads all files that begin with error_reasons and end with .csv into an error_reasons table:

COPY error_reasons FROM 's3://bucket_name/error_directory/' 
WITH PATTERN='*error_reasons*.csv' HEADER=TRUE;

The following code returns the contents of the error_reasons table:

SELECT * FROM error_reasons ORDER BY source_line_num LIMIT 1;

The following output shows an example of the contents of the error_reasons table:

file_name (TEXT) source_line_num (BIGINT) error_message (TEXT)
help_center_assets/tournaments.csv 1 Error while casting

The following code reads all files that begin with rejected_rows and end with .csv into a rejected_rows table:

COPY rejected_rows FROM 's3://bucket_name/error_directory/'
WITH PATTERN='*rejected_rows*.csv' HEADER=FALSE;

Use SELECT to view the loaded error rows. The following code returns the contents of the rejected_rows table:

SELECT * FROM rejected_rows ORDER BY f0 LIMIT 1;

The following output shows the contents of the rejected_rows table after running the previous SELECT statement:

f0 (TEXT) f1 (TEXT) f2 (TEXT) f3 (TEXT) f4 (TEXT) f5 (TEXT) f6 (TEXT)
1 The Snow Park Grand Prix 1 20903 2021-05-28 20:40:54 2021-05-29 04:51:43

Column mapping and default values

You can map a specific source column to a target column, and specify a default value to replace any NULL values generated during mapping.

The following code example maps the LevelID column from the levels.csv sample dataset into the LevelID_team_A column of the target_default_mapping table. It also maps a non-existent Country column in the levels.csv dataset into LevelID_team_B:

CREATE TABLE target_default_mapping ("LevelID_team_A" text, "LevelID_team_B" text);
COPY target_default_mapping("LevelID_team_A" "LevelID", "LevelID_team_B" default 50 "Country")
  FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE;

In the previous example, all rows under LevelID_team_B will contain the value 50.

Type mismatch errors

If you read a column from a source file into a table with an incompatible data type, the mapping generates a casting error. A loading job that specifies writing error files will write files starting with the following prefixes to a specified Amazon S3 bucket:

  • error_reasons - An error file that contains all the reasons that a row generated an error, and also file-based errors.
  • rejected_rows - An error file that contains all the rejected rows in row order.

The following code uses the Firebolt sample players dataset, which has a PlayerID column with an INTEGER data type, and attempts to read it into an existing column with a DATE data type:

CREATE TABLE IF NOT EXISTS
 players (
   PlayerID DATE,
   Nickname TEXT,
   Email TEXT);

COPY players FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/players/'
WITH TYPE=PARQUET MAX_ERRORS_PER_FILE='100%'
ERROR_FILE='s3://bucket_name/parquet_error_directory/';

Use the following sample code to load all error files whose names contain error_reasons into a table and view its contents:

COPY error_reasons FROM 's3://bucket_name/parquet_error_directory/'
WITH PATTERN='*error_reasons*' HEADER=TRUE;

SELECT * FROM error_reasons;

The output resembles the following:

file_name (TEXT) source_line_num (BIGINT) error_message (TEXT)
gaming/parquet/players/11afd184-d2d4-4471-b23c-a14f4c0945a2_1_0_0.snappy.parquet 0 Can not assignment cast column playerid from type integer null to type date null

The type mismatch error in this example creates a file-based error, or one that affects the entire file during processing, such as errors caused by incorrect format. As a result, the query does not produce the rejected_rows error file.