Ingest semi-structured data
There are three major approaches to ingesting and handling semi-structured data as shown below. All approaches can be combined depending on the nature of the input data and the queries to run over the data.
-
Transforming the input using JSON and
ARRAY
functions to fit the target schema during ingestion. -
Ingesting the JSON object as raw
TEXT
rows, and later using JSON andARRAY
functions to query and manipulate them. -
When the input JSON adheres to a fixed schema—that is, each object has a known set of keys, and the nesting level of at most two (not including nesting of arrays, which can be arbitrary)—the data can be ingested directly. Omitted keys can be handled by specifying default values for the respective columns, but keys that are defined at table creation time are ignored. This approach is not common with true, semi-structured sources, usually applies to exports from table-oriented storage.
Assume that each JSON record is stored as plain text in the column raw_json
of a (potentially external) table named source_json
in the format shown below.
// 1st record
{
"id": 1,
"StartTime": "2020-01-06 17:00:00",
"Duration": 450,
"tags": ["summer-sale","sports"],
"user_agent":{
"agent": "Mozilla/5.0",
"platform": "Windows NT 6.1",
"resolution": "1024x4069"
}
}
// 2nd record
{
"id": 2,
"StartTime": "2020-01-05 12:00:00",
"Duration": 959,
"tags": ["gadgets","audio"],
"user_agent":{
"agent": "Safari",
"platform": "iOS 14"
}
}
Recall that we want the target Firebolt table named visits
to have columns and values similar to the table shown below.
id (INTEGER) | start_time (TIMESTAMP) | duration (INTEGER) | tags (ARRAY(TEXT)) |
---|---|---|---|
1 | 2020-01-06 17:00:00 | 450 | [“summer-sale”,”sports”] |
2 | 2020-01-05 12:00:00 | 959 | [“gadgets”,”audio”] |
Extracting top-level scalars and arrays
For the top-level keys (id
, Duration
, and tags
), the task is straightforward using the JSON_EXTRACT function.
The following statement takes the raw JSON input and uses INSERT
to load the results into the table visits
. The result is provided as an illustration, since an INSERT
returns only the number of affected rows.
INSERT INTO visits
SELECT JSON_POINTER_EXTRACT(raw_json, '/id')::INTEGER as id,
JSON_POINTER_EXTRACT(raw_json, '/StartTime')::TIMESTAMP as StartTime,
JSON_POINTER_EXTRACT(raw_json, '/Duration')::INTEGER as duration,
JSON_POINTER_EXTRACT(raw_json, '/tags')::ARRAY(TEXT) as tags
FROM doc_visits_source
Result (if the script had been excecuted without the INSERT
clause):
id | StartTime | duration | tags |
---|---|---|---|
1 | 2020-01-06 17:00:00 | 450 | [“summer-sale”,”sports”] |
2 | 2020-01-05 12:00:00 | 959 | [“gadgets”,”audio”] |