Defining external table columns for Parquet arrays and maps
When you set up an external table to ingest Parquet data files, you use a hierarchical dotted notation syntax to define table columns. Firebolt uses this notation to identify the field to ingest.Syntax for defining a Parquet nested structure
You specify the top grouping element of a nested structure in Parquet followed by the field in that structure that contains the data to ingest. You then declare the column type using theARRAY(<data_type>)
notation, where <data type>
is the Firebolt data type corresponding to the data type of the field in Parquet.
CREATE EXTERNAL TABLE
queries are demonstrated below.
Example–ingest and work with structs inside Parquet arrays
Consider the Parquet schema example below. The following elements define an array of structs:- A single, optional group field,
hashtags
, contains any number of another group,bag
. This is the top grouping element. - The
bag
groups each contain a single, optional group,array_element
. - The
array_element
group contains a single, optional field,s
. - The field
some_value
contains a value that is aTEXT
type (in binary primitive format).
Step 1–create an external table
TheCREATE EXTERNAL TABLE
example below creates a column in an external table from the Parquet schema shown in the example above. The column definition uses the top level grouping hashtags
followed by the field some_value
. Intermediate nesting levels are omitted.
Step 2–create a fact or dimension table
Create a fact or dimension table that defines a column of the sameARRAY(TEXT)
type that you defined in the external table in step 1. The example below demonstrates this for a fact table.
Step 3–insert into the fact table from the external table
The example below demonstrates anINSERT
statement that selects the array values from Parquet data files using the external table column definition in step 1, and then inserts them into the specified fact table column, some_value
.
Step 4–query array values
After you ingest array values into the fact table, you can query and manipulate the array using array functions and Lambda functions. For more information, see Working with arrays.Use multipart Parquet column names to extract data from nested structures. For simple
ARRAY(TEXT)
, use a single top-level field name.Example–ingest and work with maps
External tables connected to AWS Glue currently do not support reading maps from Parquet files. Parquet stores maps as arrays of key-value pairs, where each key_value group contains a key and its corresponding value. Consider the Parquet schema example below. The following define the key-value elements of the map:- A single, optional group,
context
, is a group of mappings that contains any number of the groupkey_value
. - The
key_value
groups each contain a required field,key
, which contains the key name as aTEXT
. Each group also contains an optional fieldvalue
, which contains the value as aTEXT
corresponding to the key name in the samekey_value
group.
Step 1–create an external table
When you create an external table for a Parquet map, you use the same syntax that you use in the example for arrays above. You create one column for keys and another column for values. TheCREATE EXTERNAL TABLE
example below demonstrates this.
Step 2–create a fact or dimension table
Create a Firebolt fact or dimension table that defines columns of the sameARRAY(TEXT)
types that you defined in the external table in step 1. The example below demonstrates this for a fact table.
Step 3–insert into the fact table from the external table
The example below demonstrates anINSERT INTO
statement that selects the array values from Parquet data files using the external table column definition in step 1, and inserts them into the specified fact table columns, my_parquet_array_keys
and my_parquet_array_values
.