PARQUET_METADATA - Firebolt Documentation

OSS preview A table-valued function (TVF) that returns the internal metadata of Parquet files stored in Amazon S3 without reading their data. The function can use either a location object (recommended) or direct credentials to access the files. For each file it reports the file’s row groups and, within each row group, every column chunk: its encoding, compression, byte sizes, value counts, and column statistics. Optionally it also reports per-page metadata. Use it to understand file layout and decide how to tune a workload: row-group sizing, compression and encoding choices, the column statistics available for pruning, and dictionary usage. It complements READ_PARQUET, which returns the data itself.

Syntax

PARQUET_METADATA accepts the same LOCATION / URL access and authentication parameters as READ_PARQUET, plus the parameter below. It does not accept the data-parsing parameters of READ_PARQUET (SCHEMA, REPLACE_NON_UTF8_BYTES, ESTIMATED_ROWS, PARSE_JSON_AS).

PARQUET_METADATA (
  -- Access and authentication: same as READ_PARQUET (LOCATION [, PATTERN], or URL
  -- plus the relevant AWS_* params).
  { LOCATION => '<location_name>' | URL => '<file_url>' }
  [, ... ]
  [, SHOW_PAGES_STATS => { TRUE | FALSE } ]
)

Parameters

Parameter	Description	Supported input types
`SHOW_PAGES_STATS`	When `TRUE`, adds per-page columns to the result, returning one row per page within each column chunk instead of one row per column chunk. Default: `FALSE`.	`BOOLEAN`

For the access and authentication parameters (LOCATION, PATTERN, URL, and AWS_*), see READ_PARQUET parameters.

Return Type

The result is a table whose grain depends on SHOW_PAGES_STATS. With the default FALSE, there is one row per column chunk (one row group’s worth of a single column). With TRUE, there is one row per page, and the per-page columns below are appended. Base columns:

Column	Type	Description
`file_name`	`TEXT`	Path of the source Parquet file.
`row_group_id`	`BIGINT`	Index of the row group within the file.
`row_group_num_rows`	`BIGINT`	Number of rows in the row group.
`row_group_num_columns`	`BIGINT`	Number of columns in the row group.
`row_group_bytes`	`BIGINT`	Total byte size of the row group.
`column_id`	`BIGINT`	Index of the column within the row group.
`file_offset`	`BIGINT`	Byte offset of the column chunk within the file. Nullable.
`num_values`	`BIGINT`	Number of values in the column chunk.
`path_in_schema`	`ARRAY(TEXT)`	Path of the column in the Parquet schema, as a sequence of nested field names.
`type`	`TEXT`	Physical Parquet type of the column.
`stats_null_count`	`BIGINT`	Number of null values recorded in the column-chunk statistics. Nullable.
`stats_distinct_count`	`BIGINT`	Number of distinct values recorded in the column-chunk statistics. Nullable.
`stats_min_value`	`TEXT`	Minimum value recorded in the column-chunk statistics, as text. Nullable.
`stats_max_value`	`TEXT`	Maximum value recorded in the column-chunk statistics, as text. Nullable.
`compression`	`TEXT`	Compression codec of the column chunk.
`encodings`	`ARRAY(TEXT)`	Encodings used in the column chunk.
`dictionary_page_offset`	`BIGINT`	Byte offset of the dictionary page, when present. Nullable.
`data_page_offset`	`BIGINT`	Byte offset of the first data page.
`total_compressed_size`	`BIGINT`	Compressed size of the column chunk in bytes.
`total_uncompressed_size`	`BIGINT`	Uncompressed size of the column chunk in bytes.

Additional columns when SHOW_PAGES_STATS => TRUE:

Column	Type	Description
`page_type`	`TEXT`	Type of the page (for example data page or dictionary page).
`page_encoding`	`TEXT`	Encoding of the page.
`page_compressed_size`	`BIGINT`	Compressed size of the page in bytes.
`page_uncompressed_size`	`BIGINT`	Uncompressed size of the page in bytes.
`page_num_values`	`BIGINT`	Number of values in the page.
`page_num_nulls`	`BIGINT`	Number of null values in the page. Nullable.
`page_num_rows`	`BIGINT`	Number of rows in the page. Nullable.

Examples

The following code example reads the metadata of a Parquet file with SHOW_PAGES_STATS => TRUE, returning one row per page with the full base and page-level columns:

SELECT *
FROM PARQUET_METADATA(
  URL => 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/TournamentID=92/cc2a2a0b4e8b4fb39abf20a956e7cc3e-0.parquet',
  SHOW_PAGES_STATS => TRUE
);

​Syntax

​Parameters

​Return Type

​Examples

Syntax

Parameters

Return Type

Examples