CREATE EXTERNAL TABLE
Creates an external table. External tables serve as connectors to your external data sources. They contain no data within Firebolt other than virtual metadata columns, which are populated automatically. For more information, see Working with external tables. Data that you ingest must be in an Amazon S3 bucket in the same AWS Region as the Firebolt database.
Syntax
CREATE EXTERNAL TABLE [IF NOT EXISTS] <table>
(
<column_name> <column_type> [PARTITION('<regex>')]
[, <column_name2> <column_type2> [PARTITION('<regex>')]]
[, ...<column_nameN> <column_typeN> [PARTITION('<regex>')]]
)
[CREDENTIALS = (<awsCredentials>)]
URL = 's3://<bucket_name>[/<folder>][/...]/'
OBJECT_PATTERN = '<object_pattern>'[, '<object_pattern>'[, ...n]]
TYPE = (<type>)
[ <type option> ]
[COMPRESSION = <compression_type>]
Parameters
Parameter | Description |
---|---|
<table> | An identifier that specifies the name of the external table. This name should be unique within the database. For identifier usage and syntax, see Object identifiers. |
<column_name> | An identifier that specifies the name of the column. This name should be unique within the table. Note: If column names use mixed case, wrap your column name definitions in double quotes ( " ); otherwise, they are translated to lower case and will not match a mixed-case Parquet schema. |
<column_type> | Specifies the data type for the column. |
PARTITION | An optional keyword. When specified, allows you to use a regular expression <regex> to extract a value from the file prefix to be stored as the column value. For more information, see PARTITION. |
CREDENTIALS | Specifies the AWS credentials with permission to access the S3 location specified using URL . For more information, see CREDENTIALS. |
URL and OBJECT_PATTERN | Specifies the S3 location and the file naming pattern that Firebolt ingests when using this table. For more information, see URL & OBJECT_PATTERN. |
TYPE | Specifies the file type Firebolt expects to ingest given the OBJECT_PATTERN . If a file referenced using OBJECT_PATTERN does not conform to the specified TYPE , an error occurs. For more information, see TYPE. |
<type option> | Allows configuration for ingesting different CSV file formats. A type option can be set at this top level, or as an option within the TYPE parameter. |
COMPRESSION | See COMPRESSION. |
PARTITION
In some applications, such as Hive partitioning, files use a folder naming convention to identify which partition their data belongs to. The PARTITION keyword allows you to specify a regular expression, <regex>, to extract a portion of the file path and store it in the specified column when Firebolt uses the external table to ingest partitioned data.
Using PARTITION in this way is one method of extracting partition data from file paths. Another method is to use the table metadata column, $source_file_name, during the INSERT operation. For more information, see Example–extracting partition values using INSERT.
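As a minimal sketch of the second method, the statement below copies rows from a hypothetical external table ext_events into a hypothetical fact table fact_events and derives the partition value from $source_file_name. The table names and the REGEXP_EXTRACT call are assumptions for illustration; confirm the exact regexp function and signature available in your Firebolt version.
INSERT INTO fact_events
SELECT
    c_id,
    c_name,
    -- Hypothetical: extract the value after "c_type=" from the source file path.
    REGEXP_EXTRACT($source_file_name, 'c_type=([^/]+)')
FROM ext_events;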
Guidelines for creating the regex
- The regular expression is matched against the object prefix, not including the s3://<bucket_name>/ portion of the prefix.
- You can specify as many PARTITION columns as you need, each extracting a different portion of the object prefix.
- For each PARTITION column, you must specify a RE2 regular expression that contains a capturing group, which determines the column value.
- When <column_type> is DATE, Firebolt requires up to three capturing groups that must be in the order of year, month, and day.
- When <column_type> is TIMESTAMP, Firebolt requires up to six capturing groups that must be in the order of year, month, day, hour, minute, and second.
- Firebolt tries to convert the captured string to the specified <column_type>. If the type conversion fails, the ingest fails with an error.
In most cases, the easiest way to build a regular expression is as follows:
- Count the number of folders in the path, not including the bucket name.
- Concatenate the string [^/]+/ according to the number of folders.
- Append an additional [^/]+ to the regex for the file name.
- Wrap the [^/]+ for the relevant folder in capturing group parentheses, such as ([^/]+).
For more information, see Match groups on the RegexOne website. To test your regular expressions, online tools such as regex101 are available.
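For example, suppose files follow the hypothetical layout s3://my_bucket/metrics/year=2020/month=01/file.parquet (the column name part_year below is also an assumption for illustration). There are three folders after the bucket name, so the regex starts with three repetitions of [^/]+/, an additional [^/]+ is appended for the file name, and the year folder is matched with a capturing group that includes the literal year= so only the value is captured:
part_year TEXT PARTITION('[^/]+/year=([^/]+)/[^/]+/[^/]+')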
Example–extract Hive-compatible partitions
The example below demonstrates a CREATE EXTERNAL TABLE statement that creates the table my_ext_table. This table is used to ingest all files with a *.parquet file extension in any sub-folder of the S3 bucket s3://my_bucket.
Consider an example where folders and files in the bucket have the following consistent pattern, which is common for Hive partitions:
s3://my_bucket/c_type=xyz/year=2018/month=01/part-00001.parquet
s3://my_bucket/c_type=xyz/year=2018/month=01/part-00002.parquet
s3://my_bucket/c_type=abc/year=2018/month=01/part-00001.parquet
s3://my_bucket/c_type=abc/year=2018/month=01/part-00002.parquet
[...]
In the example CREATE EXTERNAL TABLE statement below, the PARTITION keyword in the column definition for c_type specifies a regular expression. This expression extracts the portion of the S3 path that corresponds to the xyz or abc within c_type=xyz or c_type=abc.
CREATE EXTERNAL TABLE my_ext_table (
c_id INTEGER,
c_name TEXT,
c_type TEXT PARTITION('[^/]+/c_type=([^/]+)/[^/]+/[^/]+')
)
CREDENTIALS = (AWS_KEY_ID = 'AKIAIOSFODNN7EXAMPLE' AWS_SECRET_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY')
URL = 's3://my_bucket/'
OBJECT_PATTERN= '*.parquet'
TYPE = (PARQUET)
When Firebolt ingests the data from a Parquet file stored in that path, the c_type column for each row contains the extracted portion of the path. For the files listed above, the extraction results in the following values. c_id and c_name are values stored within the respective Parquet files, while c_type values are extracted from the file path.
c_id | c_name | c_type |
---|---|---|
1ef4302294 | Njimba | xyz |
8b98470659 | Yuang | xyz |
98734hkk89 | Cole | xyz |
38cjodjlo8 | Blanda | xyz |
448dfgkl12 | Harris | abc |
j987rr3233 | Espinoza | abc |
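Per the DATE guideline above, a date column can also be assembled from multiple capturing groups. The following sketch assumes a hypothetical layout that adds a day folder, such as s3://my_bucket/c_type=xyz/year=2018/month=01/day=15/part-00001.parquet; the table and column names are illustrative only.
CREATE EXTERNAL TABLE my_ext_table_by_date (
  c_id INTEGER,
  c_name TEXT,
  -- Three capturing groups in year, month, day order, per the guidelines above.
  c_date DATE PARTITION('[^/]+/year=([0-9]+)/month=([0-9]+)/day=([0-9]+)/[^/]+')
)
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '******')
URL = 's3://my_bucket/'
OBJECT_PATTERN = '*.parquet'
TYPE = (PARQUET)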
CREDENTIALS
The credentials for accessing your data on AWS S3, using an access key and secret.
Syntax–authenticating using an access key and secret
CREDENTIALS = (AWS_KEY_ID = '<ID>' AWS_SECRET_KEY = '<secret>' )
Parameters
Parameter | Description | Data type |
---|---|---|
<ID> | The AWS access key ID for the authorized app (Firebolt) | TEXT |
<secret> | The AWS secret access key for the app (Firebolt) | TEXT |
If you don't have an access key and secret with access to your S3 bucket, see the AWS documentation on managing access keys for IAM users for how to obtain them.
URL and OBJECT_PATTERN
An external table enables reading some (or all) files from an S3 bucket that you have read access to. Note that the S3 bucket that you reference must be in the same AWS Region as the Firebolt database.
The URL and OBJECT_PATTERN parameters identify which files represent the data for the external table. URL must be a listable directory: either an entire bucket or a subfolder within it. OBJECT_PATTERN is a glob that selects files within the URL.
Syntax
URL = 's3://<bucket>[/<folder>][/...]/'
OBJECT_PATTERN = '<object_pattern>'
Parameters
Parameters | Description | Data type |
---|---|---|
<url> | This is the URL of the specific bucket and path within the bucket where the relevant files are located (common path prefix). | TEXT |
<object_pattern> | Specify the data pattern to be found in your data source. For example, *.parquet indicates that all parquet files should be found. | TEXT |
The following wildcards are supported:
- '*' matches any sequence of characters.
- '?' matches any single character.
- [SET] matches any single character in the specified set.
- [!SET] matches any single character not in the specified set.
Example
In the following layout of objects in a bucket, the data is partitioned according to client type, year, and month, with multiple parquet files in each partition. The examples demonstrate how choosing both URL and OBJECT_PATTERN impacts the objects that are retrieved from S3.
s3://bucket/c_type=xyz/year=2018/month=01/part-00001.parquet
s3://bucket/c_type=xyz/year=2018/month=01/part-00002.parquet
...
s3://bucket/c_type=xyz/year=2018/month=12/part-00001.parquet
s3://bucket/c_type=xyz/year=2018/month=12/part-00002.parquet
...
s3://bucket/c_type=xyz/year=2019/month=01/part-00001.parquet
s3://bucket/c_type=xyz/year=2019/month=01/part-00002.parquet
...
s3://bucket/c_type=xyz/year=2020/month=01/part-00001.parquet
s3://bucket/c_type=xyz/year=2020/month=01/part-00002.parquet
...
s3://bucket/c_type=abc/year=2018/month=01/part-00001.parquet
s3://bucket/c_type=abc/year=2018/month=01/part-00002.parquet
...
Following are some common use cases for URL and object pattern combinations:
Use cases | Syntax |
---|---|
Get all files for file type xyz | URL = 's3://bucket/c_type=xyz/' OBJECT_PATTERN = '*' or URL = 's3://bucket/' OBJECT_PATTERN = 'c_type=xyz/*' |
Get one specific file: c_type=xyz/year=2018/month=01/part-00001.parquet | URL = 's3://bucket/c_type=xyz/year=2018/month=01/' OBJECT_PATTERN = 'part-00001.parquet' |
Get all parquet files for type xyz | URL = 's3://bucket/c_type=xyz/' OBJECT_PATTERN = '*.parquet' |
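The [SET] wildcard can be combined with these patterns to select specific years. The following sketch (the table name is hypothetical; the bucket layout is as above) ingests Parquet files for type xyz from 2018 and 2019 only:
CREATE EXTERNAL TABLE my_xyz_2018_2019 (
  c_id INTEGER,
  c_name TEXT
)
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '******')
URL = 's3://bucket/c_type=xyz/'
-- [89] matches a single character, so this selects year=2018 and year=2019.
OBJECT_PATTERN = 'year=201[89]/*.parquet'
TYPE = (PARQUET)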
TYPE
Specifies the type of the files in S3. The following types and type options are supported.
CSV Types
TYPE = (CSV [ <type option> ])
or
TYPE = (CSV)
[ <type option> ]
The following type options allow configuration for ingesting different CSV file formats.
- [ALLOW_DOUBLE_QUOTES = {TRUE|FALSE}] [ALLOW_SINGLE_QUOTES = {TRUE|FALSE}]
With ALLOW_DOUBLE_QUOTES = TRUE or ALLOW_SINGLE_QUOTES = TRUE, unescaped double or single quotes in a CSV input file do not cause an error to be generated on ingest. By default, ALLOW_DOUBLE_QUOTES and ALLOW_SINGLE_QUOTES are set to TRUE.
- [ALLOW_COLUMN_MISMATCH = {TRUE|FALSE}]
With ALLOW_COLUMN_MISMATCH = TRUE, the number of delimited columns in a CSV input file can be fewer than the number of columns in the corresponding table. By default, ALLOW_COLUMN_MISMATCH is set to FALSE, and an error is generated if the number of columns is fewer than the number of columns defined in the external table. If set to TRUE, and an input file record contains fewer columns than defined in the external table, the non-matching columns in the table are loaded with NULL values.
- [ALLOW_UNKNOWN_FIELDS = {TRUE|FALSE}]
With ALLOW_UNKNOWN_FIELDS = TRUE, the number of delimited columns in a CSV input file can be more than the number of columns in the corresponding table. By default, ALLOW_UNKNOWN_FIELDS is set to FALSE, and an error is generated if the number of columns is more than the number of columns defined in the external table. If set to TRUE, and an input file record contains more columns than defined in the external table, the non-matching columns are ignored.
- [ESCAPE_CHARACTER = {'<character>'|NONE}]
With ESCAPE_CHARACTER = '<character>', you can define which character is used to escape, changing the interpretation of the character that follows. By default, the ESCAPE_CHARACTER value is set to \. If, for example, you want to use " as a literal value rather than as a string delimiter, you can escape it as \" using the default escape character.
- [FIELD_DELIMITER = '<field_delimiter>']
With FIELD_DELIMITER = '<field_delimiter>', you can define a custom field delimiter to separate fields for ingest. By default, the FIELD_DELIMITER is set to ,.
- [NEW_LINE_CHARACTER = '<new_line_character>']
With NEW_LINE_CHARACTER = '<new_line_character>', you can define a custom new line delimiter to separate entries for ingest. By default, the NEW_LINE_CHARACTER is set to the end of line character \n, but other end of line conventions are also supported, such as \r\n, \n\r, and \r, as well as multi-character delimiters, such as #*~.
- [NULL_CHARACTER = '<null_character>']
With NULL_CHARACTER = '<null_character>', you can define which character is interpreted as NULL. By default, the NULL_CHARACTER value is set to \\N.
- [SKIP_BLANK_LINES = {TRUE|FALSE}]
With SKIP_BLANK_LINES = TRUE, any blank lines encountered in the CSV input file are skipped. By default, SKIP_BLANK_LINES is set to FALSE, and an error is generated if blank lines are encountered on ingest.
- [SKIP_HEADER_ROWS = {TRUE|FALSE}]
With SKIP_HEADER_ROWS = TRUE, Firebolt assumes that the first row in each file read from S3 is a header row and skips it when ingesting data. When set to FALSE, which is the default if not specified, Firebolt ingests the first row as data.
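To show how these options combine, the sketch below (hypothetical table and bucket names; it assumes multiple type options can be listed together inside TYPE) ingests pipe-delimited CSV files and skips a header row in each file:
CREATE EXTERNAL TABLE my_csv_table (
  c_id INTEGER,
  c_name TEXT
)
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '******')
URL = 's3://my_bucket/'
OBJECT_PATTERN = '*.csv'
-- Fields are separated by "|" and the first row of each file is skipped.
TYPE = (CSV FIELD_DELIMITER = '|' SKIP_HEADER_ROWS = TRUE)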
JSON Types
TYPE = (JSON [PARSE_AS_TEXT = {'TRUE'|'FALSE'}])
With TYPE = (JSON PARSE_AS_TEXT = 'TRUE'), Firebolt ingests each JSON object literal in its entirety into a single column of type TEXT. With TYPE = (JSON PARSE_AS_TEXT = 'FALSE'), Firebolt expects each key in a JSON object literal to map to a column in the table definition. During ingestion, Firebolt inserts the key's value into the corresponding column.
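For example, the following sketch (the table, column, and bucket names are hypothetical) ingests each JSON object in its entirety into a single TEXT column, matching the PARSE_AS_TEXT = 'TRUE' behavior described above:
CREATE EXTERNAL TABLE my_json_table (
  raw_json TEXT
)
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '******')
URL = 's3://my_bucket/'
OBJECT_PATTERN = '*.json'
TYPE = (JSON PARSE_AS_TEXT = 'TRUE')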
Other Types
TYPE = (ORC)
TYPE = (PARQUET)
TYPE = (AVRO)
TYPE = (TSV)
All type options for CSV above, except for FIELD_DELIMITER, are also supported for the TSV file type.
Example
The following statement creates an external table that reads Parquet files from S3:
CREATE EXTERNAL TABLE my_external_table
(
c_id INTEGER,
c_name TEXT
)
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '******')
URL = 's3://bucket/'
OBJECT_PATTERN= '*.parquet'
TYPE = (PARQUET)
COMPRESSION
Specifies the compression type of the files matching the specified OBJECT_PATTERN
in S3.
Syntax
[COMPRESSION = <compression_type>]
Parameters
Parameters | Description |
---|---|
<compression_type> | Specifies the compression type of files. GZIP is supported. |
Example
The example below creates an external table to ingest CSV files from S3 that are compressed using gzip. The credentials for an IAM user with access to the bucket are provided.
CREATE EXTERNAL TABLE my_external_table
(
c_id INTEGER,
c_name TEXT
)
CREDENTIALS = (AWS_KEY_ID = 'AKIAIOSFODNN7EXAMPLE' AWS_SECRET_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY')
URL = 's3://mybucket/'
OBJECT_PATTERN= '*.csv.gz'
TYPE = (CSV)
COMPRESSION = GZIP