> ## Documentation Index
> Fetch the complete documentation index at: https://docs.firebolt.io/llms.txt
> Use this file to discover all available pages before exploring further.

<AgentInstructions>

## Submitting Feedback

If you encounter incorrect, outdated, or confusing documentation on this page, submit feedback:

POST https://docs.firebolt.io/feedback

```json
{
  "path": "/reference-sql/commands/data-management/copy-from",
  "feedback": "Description of the issue"
}
```

Only submit feedback when you have something specific and actionable to report.

</AgentInstructions>

> Reference and syntax for the COPY command that copies data from S3 files into a Firebolt table.

# COPY FROM

Loads data from an AWS S3 bucket into Firebolt. Use `COPY FROM` to load data from external sources, such as from Amazon S3, into Firebolt tables. `COPY FROM` supports various data loading workflows, including schema discovery, metadata filtering, and parallel processing, offering flexibility for different use cases. This page provides an overview of the `COPY FROM` syntax, parameters, and best practices.

## Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
-- Using location object (recommended)
COPY
[INTO] <table_name> [ <column_mapping> ] 
FROM location_name
[ LIMIT <count> ]
[ OFFSET <start> ]
[ [ WITH ] ( <option> [, ...] )  ]
[ WHERE <condition> ]
|
-- Using static credentials
COPY
[INTO] <table_name> [ <column_mapping> ] 
FROM '<url>'
[ LIMIT <count> ]
[ OFFSET <start> ]
[[ WITH ] ( <option> [, ...] ) ]
[ WHERE <condition> ]

<column_mapping>:
    ( <column_name> [DEFAULT <default_value>] [ { $c<source_column_index> | <source_column_name> } ] [, ...] )

<option>:
    [ CREDENTIALS = ( <credentials> ) ] 
    [ PATTERN = <glob_pattern> ]
    [ TYPE = { AUTO | CSV | TSV | PARQUET } ]
    [ AUTO_CREATE = { TRUE | FALSE } ]
    [ ALLOW_COLUMN_MISMATCH = { TRUE | FALSE } ]
    [ ERROR_FILE = <directoryLocation> ]
    [ ERROR_FILE_CREDENTIALS = <credentials> ]
    [ MAX_ERRORS_PER_FILE = { integer | '<percentage>' } ]
    [ CASE_SENSITIVE_COLUMN_MAPPING = { FALSE | TRUE } ]
    [ <csv_options> ]
    [ <parquet_options> ]
    [ <settings> ]

<credentials>: (    -- Only applicable when using URL destination
  AWS_ACCESS_KEY_ID = '<aws_access_key_id>'
  AWS_SECRET_ACCESS_KEY = '<aws_secret_access_key>'
  [ AWS_SESSION_TOKEN = '<aws_session_token>' ]
  |
  AWS_ROLE_ARN = '<aws_role_arn>'
  [ AWS_ROLE_EXTERNAL_ID = '<aws_role_external_id>' ]
)

<csv_options>:
    [ HEADER = { FALSE | TRUE  } ]
    [ DELIMITER = '<character>' ]
    [ NEWLINE = '<string>' ]
    [ QUOTE = { DOUBLE_QUOTE | '<character>' |  SINGLE_QUOTE } ]
    [ ESCAPE = '<character>' ]
    [ NULL_STRING = '<string>' ]
    [ EMPTY_FIELD_AS_NULL = { TRUE | FALSE } ]
    [ SKIP_BLANK_LINES = { FALSE | TRUE } ]
    [ DATE_FORMAT = '<date_format>' ]
    [ TIMESTAMP_FORMAT = '<timestamp_format>' ]

<parquet_options>:
    [ SUPPORT_STRUCTS = { FALSE | TRUE } ]
    [ REPLACE_NON_UTF8_BYTES = { FALSE | TRUE } ]
    [ PARSE_JSON_AS = { 'TEXT' | 'JSON' }]
```

## Parameters

| Parameter                       | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `<table_name>`                  | The name of the target table.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| `<column_mapping>`              | This feature is only available if the target table already exists. You can use `column_mapping` to specify the mapping between the source and target schema. Select a column in the source file to map to the target file using either the name of the column or its index.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| `<column_name>`                 | The name of a target column in a table. If `<source_column_index/name>` is not specified, source columns will be automatically mapped to target columns based on name.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| `<default_value>`               | A replacement value for any `NULL` value generated by mapping the source to the target. This data type of `default_value` must be compatible with the data type of the target column.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| `<source_column_index>`         | The index position of the column in the source data to be mapped. The index starts at `1`. Accessing columns by index is not supported for `TYPE=PARQUET`. If you are specifying multiple source files, `source_column_index` specifies the index for all source files. Source column indexes are expressed by `$c<index>`, e.g., `$c1`, `$c2`, `$c4`. For example, prefix the index as shown in the following `COPY FROM` statement: `CREATE TABLE t(a text, b text); COPY INTO t(a $c1, b $c2) FROM 's3://my_bucket/my_folder/my_file';`.                                                                                                                                                                                                                                                                                    |
| `<source_column_name>`          | The name of the column in the source data to be mapped. If you are specifying multiple source files, `source_column_name` specifies the column name for all source files.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| `location_name`                 | The name of a location object that contains the S3 URL and credentials. This is the recommended approach for specifying the source. See [CREATE LOCATION](/reference-sql/commands/data-definition/create-location) for details.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| `<url>`                         | The path to an S3 URL where the source files are located. For example, `s3://my_bucket/my_folder/`. If the URL ends with a forward slash (`/`), it's interpreted as a folder; otherwise, it's treated as a single file.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| `<directoryLocation>`           | The Amazon S3 path to a directory.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| `CREDENTIALS`                   | The Amazon S3 credentials for accessing the specified `<url>`. For more information, see [CREDENTIALS](/reference-sql/commands/data-definition/create-external-table#credentials).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| `PATTERN`                       | A string that represents a [glob pattern](https://en.wikipedia.org/wiki/Glob_\(programming\)) used to match filenames or other strings.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| `TYPE`                          | The file type that Firebolt should expect when loading files identified by `PATTERN`. If `TYPE` is unspecified, Firebolt automatically detects the file type using the file's suffix. If a file matched by `PATTERN` does not match the specified `TYPE`, Firebolt will generate an error. The default value is `AUTO`.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| `AUTO_CREATE`                   | Specify whether Firebolt should automatically create a table if it doesn't already exist. If `AUTO_CREATE` is set to `FALSE`, Firebolt will generate an error if the target table is missing. If the target table already exists, `AUTO_CREATE` is ignored. The default setting is `TRUE`, allowing automatic table creation.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| `ALLOW_COLUMN_MISMATCH`         | Set to `FALSE` to specify that all required columns must appear in the source file. If `ALLOW_COLUMN_MISMATCH` is set to `FALSE`, all required columns must be present in the source file. Required columns are those listed in `<column_mapping>` by name or index. If no column mapping is provided, the target table's columns are required. Missing required columns will cause row-based errors for CSV or TSV files, and file-based errors for Parquet files. When `ALLOW_COLUMN_MISMATCH` is set to `TRUE`, any missing columns in the source file are filled with `NULL` values.                                                                                                                                                                                                                                       |
| `ERROR_FILE`                    | The Amazon S3 `URL` where error files will be written. No error files are created by default or if the specified path doesn't exist. If `ERROR_FILE` is specified, a subdirectory is created based on the time of load submission in the format: `YearMonthDay-HourMinuteSecond + QueryID`, such as: `20220330-173205`). For CSV files, this directory will contain a `rejected_rows.csv` file containing erroneous data rows, and an `error_reasons.csv` file, containing the reasons that the errors were generated. Because Parquet doesn't produce row-based error files, on error, only an `error_reasons.csv` file is generated.                                                                                                                                                                                         |
| `ERROR_FILE_CREDENTIALS`        | The Amazon S3 credentials required to write an error file to an Amazon S3 URL. For more information, see [CREDENTIALS](/reference-sql/commands/data-definition/create-external-table#credentials).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| `MAX_ERRORS_PER_FILE`           | Specify the maximum number of rows that can be rejected per file. `MAX_ERRORS_PER_FILE` can be an integer or percentage in the format "integer%", such as `100%`. The only valid percentage options are `0%` and `100%`. If you specify an integer value, the `COPY FROM` job will load the job until it encounters the number of errors specified, and then stop the loading job. For example, if you specify `3`, then `COPY FROM` will load data until it encounters `3` errors, and then end the job with an error. If the threshold is exceeded, `COPY FROM` job will stop and return an error. By default, no errors are allowed. If `MAX_ERRORS_PER_FILE` is set to `100%`, then all errors are allowed.                                                                                                                |
| `CASE_SENSITIVE_COLUMN_MAPPING` | Specify if column names should be case-sensitive when mapping from source to target tables. When set to `TRUE`, mapping between source and target columns is case-sensitive, and `COPY FROM` will fail or populate columns with `NULL` values when there is a case mismatch. You can use [QUOTED\_IDENTIFERS](/reference-sql/lexical-structure/object-identifiers#quoted-identifiers) as an alternative when requiring case-sensitive behavior. When set to the default `FALSE` value, and if no target table exists, it is created with lowercase column identifiers if the source file's column identifiers have capital, lowercase, or a combination of the two. When set to the default `FALSE` value, and the table already exists, `CASE_SENSITIVE_COLUMN_MAPPING` is ignored, and data will load from source to target. |
| `<settings>`                    | List of [query-specific settings](/reference-sql/system-settings#setting-via-with) overrides.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |

### Parameters for CSV files

| Parameter             | Description                                                                                                                                                                                                                                                      |
| :-------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `HEADER`              | Specify if the file contains a header line containing the column names. If `HEADER` is `TRUE`, the first line will be interpreted to contain column names. The default value is `FALSE`.                                                                         |
| `DELIMITER`           | Specify the character used to separate fields. The default delimiter is a comma (`,`).                                                                                                                                                                           |
| `NEWLINE`             | Specify the character used to delimit rows. The default newline character is `\n`. If `NEWLINE` is `\n`, then `\r`, `\r\n`, and `\n\r` are also treated as newline characters. Custom characters are allowed only if the target table already exists.            |
| `QUOTE`               | Specify the character used for quoting fields. The default quote character is `DOUBLE_QUOTE`. You can specify either `SINGLE_QUOTE` or `DOUBLE_QUOTE`. You can also specify the single quote literal character (`'`) or the double quote literal character(`"`). |
| `ESCAPE`              | Specify the character used to escape special characters. The default escape character is the quote (`'`) character.                                                                                                                                              |
| `NULL_STRING`         | Specify the string used to represent `NULL` values. The default null string is an empty string, which means that no specific null string is defined.                                                                                                             |
| `EMPTY_FIELD_AS_NULL` | Specify whether empty fields should be interpreted as `NULL` values. The default value is `TRUE`.                                                                                                                                                                |
| `SKIP_BLANK_LINES`    | Specify whether to ignore blank lines. If `SKIP_BLANK_LINES` is `TRUE`, then `COPY FROM` will ignore blank lines. The default value is `FALSE`.                                                                                                                  |
| `DATE_FORMAT`         | Specify the date format for parsing text into date columns. This format will apply to all columns loaded as date columns. For supported formats, see [TO\_DATE](/reference-sql/functions-reference/date-and-time/to-date).                                       |
| `TIMESTAMP_FORMAT`    | Specify the timestamp format for parsing text into timestamp columns. The format will apply to all columns loaded as timestamp columns. For supported formats, see [TO\_TIMESTAMP](/reference-sql/functions-reference/date-and-time/to-timestamp).               |

### Parameters for Parquet files

| Parameter                | Description                                                                                                                                                                                                                                                                                                                                                                                                            |
| :----------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `SUPPORT_STRUCTS`        | A boolean which specifies whether structs in the source data are to be inferred as STRUCT data types. The default value is `TRUE` for new accounts (previously `FALSE`), and existing accounts are being migrated based on usage. This parameter is expected to be deprecated in the future. For more information, see [Load Parquet structs](/reference-sql/commands/data-management/copy-from#load-parquet-structs). |
| `REPLACE_NON_UTF8_BYTES` | Whether to replace non-UTF8 bytes in string columns with the Unicode replacement character. Since Firebolt's [TEXT data type](/reference-sql/data-types#text) requires all values to be valid UTF8, non-UTF8 bytes in Parquet string columns are replaced with the Unicode replacement character (�, U+FFFD) when using this option. Otherwise, an error is raised. Default: false.                                    |
| `PARSE_JSON_AS`          | Specifies the Firebolt data type which should be inferred for source columns of the native Parquet JSON type. Can be either `'TEXT'` or `'JSON'`, with the default being `'JSON'`.                                                                                                                                                                                                                                     |

## Best practice

Firebolt recommends using a `LOCATION` object to specify credentials.

Location objects provide a secure, centralized way to manage Amazon S3 credentials and URLs. `LOCATION` objects let you specify credentials with the following:

* Centralized credential management.
* Reduced exposure of credentials in queries.
* Role-based access control.
* Simplified maintenance and updates.

For a comprehensive guide to LOCATION objects, see [LOCATION objects](/guides/security/location).

For detailed information, see [CREATE LOCATION](/reference-sql/commands/data-definition/create-location).

### Settings to control behavior

* `insert_sharding` to [enforce partition locality](/reference-sql/system-settings#insert-sharding) during ingestion into partitioned tables.
* `tablet_min_size_bytes` and `tablet_max_size_bytes` to [control](/reference-sql/system-settings#target-tablet-size) min/max tablet sizes during ingestion.
* `cross_region_request_mode` to [access data](/reference-sql/system-settings#access-cross-region-data) in cross-region Amazon S3 buckets.

## Examples

### COPY FROM with LOCATION

The following code example copies data from files matching the \*.parquet pattern in the specified my\_location into `my_table`:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
COPY my_table 
FROM my_location
WITH (PATTERN = '*.parquet');
```

### Filter by metadata during loading

When loading data into tables, you can filter data using the following options:

1. `LIMIT`: Restricts the number of rows loaded, which can be useful to preview or create sample datasets.

2. `OFFSET`: Skips a specified number of rows in the final result set before ingestion. The `OFFSET` clause in `COPY FROM` behaves the same way as the `OFFSET` clause in `SELECT` queries behaves.

   * `COPY FROM` currently does not support the `ORDER BY` clause. Thus, using `OFFSET` may result in different outcomes every time you run the command.

   * Both `LIMIT` and `OFFSET` apply to the entire result set, not to individual files.

3. `WHERE`: Filters data based on source file metadata, as follows:

   * `$source_file_name` - The full path of the source file in an Amazon S3 bucket, without the name of the bucket. For example, if your bucket is: `s3://my_bucket/xyz/year=2018/month=01/part-00001.parquet`, then `$source_file_name` is `xyz/year=2018/month=01/part-00001.parquet`.
   * `$source_file_timestamp` - The timestamp in UTC, to the second when the source file was last modified in an Amazon S3 bucket.
   * `$source_file_size` - The size in bytes of the source file.
   * `$source_file_etag` - The [ETag](https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html#API_Object_Contents) of the file, often used for version control.

In the following code example, `COPY FROM` first reads all the files in the specified directory that were modified in the last three years. Then, it applies the offset and limit clause. As long as all source files modified in the last three years have at least 100 rows combined, the result set will have exactly 50 rows.

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
COPY tournament_results
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/rankings/TournamentID=1/'
LIMIT 50 OFFSET 50
WHERE $source_file_timestamp > NOW() - interval '3 YEARS';
```

The previous code example returns a table containing 50 rows of data that was modified in the last three years.

**Recommended best practice**

The following code example builds on the previous example by using a `LOCATION` to store credentials to authenticate:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
COPY tournament_results
FROM my_tournament_location
LIMIT 50 OFFSET 50
WHERE $source_file_timestamp > NOW() - interval '3 YEARS';
```

### Use LOCATION to copy data

The following code example creates a table and uses a `LOCATION` object to copy Parquet files that match the specified pattern into it:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
-- Create a table
CREATE TABLE levels (
    LevelID INT,
    Name TEXT
);

-- Copy data using location object
COPY INTO levels
FROM my_game_data_location
WITH (
    PATTERN = 'levels/*.parquet',
    TYPE = PARQUET
);
```

### Load multiple directories with LOCATION

The following code example copies data from Parquet files matching a pattern in `my_tournaments_location` to the `tournament_results` table:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
COPY tournament_results 
FROM my_tournaments_location
WITH 
(
    PATTERN = 'tournaments/*/data/*.parquet',
    TYPE = PARQUET
);
```

### Load multiple files and directories in parallel

You can use `COPY FROM` to read multiple sources and from multiple directories into a single table, simultaneously. The following code example reads any file ending in `.parquet` from multiple directories into `table_from_multiple_directories`.

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
COPY table_from_multiple_directories FROM 
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/TournamentID=1/',
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/TournamentID=10/',
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/TournamentID=100/'
WITH (pattern='*.parquet');
```

The following code example reads two CSV files into `table_from_multiple_files`:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
COPY table_from_multiple_files FROM 
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv',
    's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/tournaments.csv'
WITH (HEADER=TRUE);
```

In the previous example, there are two CSV files in the `firebolt_sample_dataset` folder: `levels.csv` and `tournaments.csv`. These files have a different schema. `COPY FROM` reads these files into a single table, and infers the schema from the first file. Any column mismatches are filled with `NULL` values.

### Automatic schema discovery

You can use the automatic schema discovery feature in `COPY FROM` to handle even very large data sources instead of manually defining it. The following apply:

* **Parquet files** - Firebolt automatically reads metadata in Parquet files to create corresponding target tables.
* **CSV files** - Firebolt infers column types based on the data content itself, which can streamline the initial data loading process significantly. Use `WITH HEADER=TRUE` if your CSV file contains column names in the first line.

<Note>
  **When loading multiple files, Firebolt infers the schema from the most recently modified file.** The remaining files must have compatible data types. If types vary between files (e.g., a column contains integers in one file but doubles in another, or is numeric in one file but text in another), the inferred schema may not match all files and thus cause data type errors or query failures. In such cases, we recommend defining an explicit schema using either [external tables](/reference-sql/commands/data-definition/create-external-table) or [`COPY FROM`](/reference-sql/commands/data-management/copy-from) into existing tables.
</Note>

Automatic schema discovery operates on a "best effort" basis, and attempts to balance accuracy with practical usability, but it may not always be error-free.

The following query reads `levels.csv`, a sample dataset from the fictional “Ultra Fast Gaming Inc.” company. The example implicitly uses automatic schema creation with `AUTO_CREATE=TRUE`, which defaults to `TRUE`, and also triggers automatic table creation:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
COPY automatic_schema_table 
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH (HEADER=TRUE);
```

Use the following example code to display the table contents, ordered by the fourth column, followed by the fifth column:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT * FROM automatic_schema_table ORDER BY 4,5;
```

### Use PATTERN to insert data into an existing table

You can use the `PATTERN` feature, which uses [regular expressions](https://en.wikipedia.org/wiki/Glob_\(programming\)), to select several files that match the specified pattern to populate a target table. The following example uses the `*.csv` pattern to read all files ending in `.csv` into the `pattern_target` table:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
COPY pattern_target FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'
WITH (TYPE=CSV, HEADER=TRUE, PATTERN='*.csv');
```

In the previous example, there are two CSV files in the `firebolt_sample_dataset` folder: `levels.csv` and `tournaments.csv`. These files have a different schema. `COPY FROM` reads these files into a single table, and infers the schema from the first file. Any column mismatches are filled with `NULL` values.

### Load metadata into a table

You can load metadata information about your source file into your table so that you can track the source name, timestamp, size, and etag information for each row. You can use the following metadata columns:

* `$source_file_name` - The full path of the source file in an Amazon S3 bucket, without the name of the bucket. For example, if your bucket is: `s3://my_bucket/xyz/year=2018/month=01/part-00001.parquet`, then `$source_file_name` is `xyz/year=2018/month=01/part-00001.parquet`.
* `$source_file_timestamp` - The timestamp in UTC, to the second when the source file was last modified in an Amazon S3 bucket.
* `$source_file_size` - The size in bytes of the source file.
* `$source_file_etag` - The [ETag](https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html#API_Object_Contents) of the file, often used for version control.

The following code creates the `levels` table, and populates it with information from the `LevelID` column and the timestamp from the source data:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE levels ("LevelID" TEXT NOT NULL, date_of_creation TIMESTAMP);
COPY INTO levels("LevelID", date_of_creation $source_file_timestamp)
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv' WITH HEADER=TRUE;
```

The following code example displays the contents of `levels`:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT * FROM levels;
```

The first three rows of the sample output follow:

| LevelID (TEXT) | date\_of\_creation (TIMESTAMP) |
| -------------- | ------------------------------ |
| 1              | `2023-02-27 10:06:52`          |
| 2              | `2023-02-27 10:06:52`          |
| 3              | `2023-02-27 10:06:52`          |

### Column mapping

When loading data into a target table, you can manually map source and target schemas by [column name](#mapping-by-column-name) or [index position](#mapping-by-index-position). Column mapping ensures that data from the source file is correctly inserted into the appropriate columns in the target table.

##### Handling case sensitivity

By default, column mapping in Firebolt is **case-insensitive**, meaning column names in the `COPY FROM` statement are matched without quotes and are treated as lowercase. The [`CASE_SENSITIVE_COLUMN_MAPPING`](#parameters) parameter is ignored, and data loads into the table regardless of case differences. If you need to enforce case-sensitive column mapping, set the `CASE_SENSITIVE_COLUMN_MAPPING` parameter to `TRUE`. When enabled, column names must match exactly, including case, and `COPY FROM` will either fail or populate columns with `NULL` values if there is a case mismatch. You can also use quoted identifiers to preserve case-sensitive behavior.

##### Mapping by column name

Before mapping schemas and ingesting data, you must first create a target table. You can explicitly map source columns to target table columns by matching column names.\
The following code example creates a table and maps columns by name:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE column_mapping_by_name (
    LevelID INT,
    Name TEXT
);

COPY INTO column_mapping_by_name
(
   LevelID,
   Name
)
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH (HEADER = TRUE);
```

This approach allows the column order to differ between the source file and target table, as long as the column names in the `COPY FROM` statement match exactly with the source file.

##### Mapping by index position

Instead of mapping by column name, you can reference column positions in the source file using `$cN` notation, where `N` represents the column index starting from one.\
The following code example maps the first column in the source file to the `id` column in the target table and the fourth column to `name`:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE column_mapping_positional (
    id INT,
    name TEXT
);

COPY INTO column_mapping_positional
(
   id $c1,
   name $c4
)
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH (HEADER = TRUE);
```

This method is useful when column names do not match or when working with files that lack headers.

##### Mapping source columns and metadata

You can also map both source columns and metadata columns to store additional details about the ingested data. Metadata columns provide useful information such as file creation date, size, and last modified timestamp.\
The following code example maps source file columns by name and includes metadata:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE levels (
  "LevelID"          TEXT NOT NULL,
  "NumberOfLaps"     INT,
  "SceneDetails"     TEXT,
  date_of_creation   TIMESTAMP,
  file_name          TEXT,
  file_last_modified TIMESTAMP,
  file_size          BIGINT
);

COPY INTO levels (
  "LevelID",
  "NumberOfLaps",
  "SceneDetails",
  date_of_creation   $source_file_timestamp,
  file_name          $source_file_name,
  file_last_modified $source_file_timestamp,
  file_size          $source_file_size
)
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH (HEADER=TRUE);
```

In the previous example:

* Standard columns such as `LevelID`, `NumberofLaps`, and `SceneDetails` are mapped by name.
* Metadata columns such as `date_of_creation`, `file_name`, `file_last_modified`, and `file_size` capture file-level information from the source file.
  This approach is particularly useful for auditing, tracking data lineage, or managing incremental loads.

Note that if a required column in the mapping is missing from the source file, the load will fail unless `ALLOW_COLUMN_MISMATCH` is enabled. For cases where column names may change or be missing in the source file, see the following section [`ALLOW_COLUMN_MISMATCH`](#allow-column-name-mismatch).

### Allow column name mismatch

If you specify a column mapping during data loading, `COPY FROM` treats the columns listed in the `<column_mapping>` as required. If no column mapping is specified, the columns in the target table are considered required. To allow the data to continue loading when some required columns are missing from the source file, you can use `ALLOW_COLUMN_MISMATCH`, which is enabled by default.

For example, if you create a table with `LevelID2` and `Name` columns and attempt to load data from the `levels.csv` dataset, which lacks a `LevelID2` column, `COPY FROM` will populate the `Name` column as specified and fill the `LevelID2` column with `NULL` values. This allows flexible handling of missing data without raising errors, as shown in the following code example:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE col_mismatch ("LevelID2" int, "Name" text);
COPY col_mismatch
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH (HEADER=TRUE, MAX_ERRORS_PER_FILE='0%');
```

### Load Parquet structs

The `SUPPORT_STRUCTS` parameter controls how structs in Parquet files are handled during schema inference.

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
COPY INTO new_table FROM 's3://path/to/file.parquet'
WITH (TYPE=parquet, SUPPORT_STRUCTS=TRUE);
```

* When `SUPPORT_STRUCTS=TRUE`: Nested structures in the source data will be treated as `STRUCT` data types.
* When `SUPPORT_STRUCTS=FALSE`: Nested structures in the source data will be treated as individual shredded columns.

`SUPPORT_STRUCTS` applies only **during schema discovery**: It controls the type inference process for structs.
When working with **predefined schemas**: It has no effect, as the system maps incoming data to your defined schema.

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
-- Example with STRUCT column. SUPPORT_STRUCTS is not used because it would have no effect.
CREATE TABLE table_with_struct_type (kv STRUCT(key INT, value TEXT);
COPY INTO table_with_struct_type FROM 's3://path/to/file.parquet';

-- Example with shredded struct columns. Again, SUPPORT_STRUCTS is not used because it would have no effect.
CREATE TABLE table_without_struct_type ("kv.key" INT, "kv.value" TEXT);
COPY INTO table_without_struct_type FROM 's3://path/to/file.parquet';
```

Note that we do not fully support mixed cases with regular structs and shredded structs. Internally, we check if the header has at least one struct column. If it does, it will prioritize ingest into struct columns and will not attempt to match shredded columns. We recommend using a consistent schema design that follows either the struct-based or shredded column pattern.

### Error handling

The following sections show you how to handle errors for both CSV and Parquet files.

#### Row-based errors in CSV

`COPY FROM` generates a row-based error when there's a mismatch between source and target table columns. In the following example, the `col_mismatch_csv` table includes a `LevelID2` column defined as `NOT NULL`, that does not exist in the source `levels.csv` table:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE col_mismatch_csv ("LevelID2" int NOT NULL);

COPY col_mismatch_csv FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH (HEADER=TRUE, MAX_ERRORS_PER_FILE='0%');
```

In the previous code example, when `COPY FROM` does not see the same column name in the target table `col_mismatch_csv`, it tries to fill the column with `NULL` values. Because `LevelID2` is defined with a constraint that it cannot have `NULL` values, the query generates the following error:
`ERROR: The INSERT INTO statement failed because it tried to insert a NULL into the column LevelID2, which is NOT NULL. Please specify a value or redefine the column's logical constraints.` and stops loading into the table.

##### Allow all row-based errors in CSV

The previous code example uses `MAX_ERRORS_PER_FILE='0%'`, which causes the loading job to fail if there is a single error. You can change this behavior to allow errors. The following code example allows all errors, and the load job completes even if no data loads into the target table:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE table_all_errors ("LevelID2" int NOT NULL);

COPY table_all_errors FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH (HEADER=TRUE, MAX_ERRORS_PER_FILE='100%');
```

##### Column data type mismatch in CSV

If you try to load data into a column in an existing table that has a different data type than the source data, `COPY FROM` will attempt to cast the data into the specified data type. If the cast fails, `COPY FROM` generates an error. To demonstrate this error, the following example intentionally creates a table that defines the `LevelID` column incorrectly as an integer, instead of as text, and then attempts to copy data into it:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE col_mismatch_type_csv ("Name" int);

COPY col_mismatch_type_csv("Name" name)
  FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH (TYPE=CSV, HEADER=TRUE);
```

The previous code example generates the following error:
`Line 1, Column 8: Unable to cast text 'Thunderbolt Circuit' to integer`.

In the previous code example, the query generates an error because the default value for `MAX_ERRORS_PER_FILE` is `0`. You can set `MAX_ERRORS_PER_FILE` to `100%` to allow all errors, as shown in the following section.

#### Allow all errors, and write them to file

You can also allow all errors, so that the loading job continues until it has attempted to load all rows in your dataset. Firebolt can write these errors to an Amazon S3 bucket as CSV files.  If your specified S3 bucket requires access credentials, you must specify them so that Firebolt can write the files on your behalf. Data rows that load without error are ingested in row order. A loading job that specifies writing error files will write files with the following syntax to your Amazon S3 bucket:

* `error_reasons.csv` - An error file that contains all the reasons that a row generated an error, and also file-based errors.
* `rejected_rows.csv` - An error file that contains all the rejected rows in row order.

Producing an error while reading Parquet files doesn't generate row-based error files. On error, only a `error_reasons.csv` file is generated.

The previous files will have an order appended to the name such as `error_reasons_1.csv`.

The following code example allows all errors, provides credentials, and writes two error files to an Amazon S3 bucket:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE table_write_errors(TournamentID INT, Name INT);

COPY table_write_errors(TournamentID TournamentId, Name Name) FROM 's3://firebolt-publishing-public/help_center_assets/tournaments.csv'
WITH (ERROR_FILE_CREDENTIALS = (
    AWS_ACCESS_KEY_ID = 'YOUR_AWS_ACCESS_KEY_ID'
    AWS_SECRET_ACCESS_KEY = 'YOUR_AWS_SECRET_ACCESS_KEY'
),
MAX_ERRORS_PER_FILE='100%', HEADER=TRUE, ERROR_FILE='s3://bucket_name/error_directory/');
```

To provide your credentials in the previous example, do the following:

* Replace the `<aws_access_key_id>` with an AWS access key ID that is associated with an AWS user or AWS role. The AWS access key ID is a 20-character string such as `AKIAIOSFODNN7EXAMPLE`.
* Replace the `<aws_secret_access_key>` with an AWS secret access key associated with an AWS user or IAM role associated with the AWS access key ID. The AWS secret access key is a 40-character string such as `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`.

#### Read errors from file

The previous `COPY FROM` example shows how to create two error files, one that describes the error reasons and one that contains the rows that had errors. This example shows how to load and view the contents of these files.

The following code example reads all files that begin with `error_reasons` and end with `.csv` into an `error_reasons` table:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
COPY error_reasons FROM 's3://bucket_name/error_directory/' 
WITH (PATTERN='*error_reasons*.csv', HEADER=TRUE);
```

The following code returns the contents of the `error_reasons` table:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT * FROM error_reasons ORDER BY source_line_num LIMIT 1;
```

The following output shows an example of the contents of the `error_reasons` table:

| file\_name (TEXT)                    | source\_line\_num (BIGINT) | error\_message (TEXT) |
| ------------------------------------ | -------------------------- | --------------------- |
| help\_center\_assets/tournaments.csv | 1                          | Error while casting   |

The following code reads all files that begin with `rejected_rows` and ends with `.csv` into a rejected\_rows table:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
COPY rejected_rows FROM 's3://bucket_name/error_directory/'
WITH (PATTERN='*rejected_rows*.csv', HEADER=FALSE);
```

Use `SELECT` to view the contents of a file.
The following code returns the contents of the `rejected_rows` table:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT * FROM rejected_rows ORDER BY f0 LIMIT 1;
```

The following output shows the contents of the `rejected_rows` table after running the previous `SELECT` statement:

| f0 (TEXT) | f1 (TEXT)                | f2 (TEXT) | f3 (TEXT) | f4 (TEXT)             | f5 (TEXT)             | f6 (TEXT) |
| --------- | ------------------------ | --------- | --------- | --------------------- | --------------------- | --------- |
| 1         | The Snow Park Grand Prix | 1         | 20903     | `2021-05-28 20:40:54` | `2021-05-29 04:51:43` | ...       |

#### Column mapping and default values

You can map a specific source column to a target column, and specify a default value to replace any `NULL` values generated during mapping.

The following code example maps the `LevelID` column from the `levels.csv` sample dataset, into a column `LevelID_team_A` in a `target_default_mapping` table. It also maps a non-existent `Country` column in the `levels.csv` dataset, into `LevelsID_team_B`:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE target_default_mapping ("LevelID_team_A" text, "LevelID_team_B" text);
COPY target_default_mapping("LevelID_team_A", "LevelID", "LevelID_team_B" default 50 "Country")
  FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH (HEADER=TRUE);
```

In the previous example, all rows under `LevelsID_team_B` will contain the value `50`.

#### Type mismatch errors

If you read a column from a source file into a table with an incompatible data type, the mapping generates a casting error. A loading job that specifies writing error files will write files starting with the following prefixes to a specified Amazon S3 bucket:

* `error_reasons` - An error file that contains all the reasons that a row generated an error, and also file-based errors.
* `rejected_rows` - An error file that contains all the rejected rows in row order.

The following code uses the Firebolt sample `players` dataset which has a column `PlayerID` with a data type of `INTEGER`, and attempts to read it into an existing column with a `DATE` data type:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE IF NOT EXISTS
 players (
   PlayerID DATE,
   Nickname TEXT,
   Email TEXT);

COPY players FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/players/'
WITH (TYPE=PARQUET, MAX_ERRORS_PER_FILE='100%',
ERROR_FILE='s3://bucket_name/parquet_error_directory/');
```

Use the following sample code to view a table that contains the contents of all error files that contain `error_reasons`:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
COPY error_reasons FROM 's3://bucket_name/parquet_error_directory/'
WITH (PATTERN='*error_reasons*', HEADER=TRUE);

SELECT * FROM error_reasons;
```

| file\_name (TEXT)                                                                  | source\_line\_num (BIGINT) | error\_message (TEXT)                                                            |
| ---------------------------------------------------------------------------------- | -------------------------- | -------------------------------------------------------------------------------- |
| `gaming/parquet/players/11afd184-d2d4-4471-b23c-a14f4c0945a2_1_0_0.snappy.parquet` | 0                          | Can not assignment cast column playerid from type integer null to type date null |

The type mismatch error in this example creates a file-based error, which is an error that affects the entire file during processing. This query only produces a file-based error and does not produce the `rejected_rows` error file.
