> ## Documentation Index
> Fetch the complete documentation index at: https://docs.firebolt.io/llms.txt
> Use this file to discover all available pages before exploring further.

<AgentInstructions>

## Submitting Feedback

If you encounter incorrect, outdated, or confusing documentation on this page, submit feedback:

POST https://docs.firebolt.io/feedback

```json
{
  "path": "/reference-sql/system-settings",
  "feedback": "Description of the issue"
}
```

Only submit feedback when you have something specific and actionable to report.

</AgentInstructions>

> Lists Firebolt system settings that you can configure using SQL.

# System settings

You can use a `SET` statement in a SQL script to configure aspects of Firebolt's system behavior. Each statement is a query in its own right and must be terminated with a semi-colon (;). The `SET` statement cannot be included in other queries. This topic provides a list of available settings by function.

## Setting via `WITH`

You can override settings by appending `WITH (<setting_1_name> = <setting_1_value>, ...)` to the query. This lets you apply settings directly to specific queries without affecting the entire session.

### Example

Instead of:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET timezone = 'America/Chicago';
SELECT TIMESTAMPTZ '2023-1-29 12:21:49';  --> 2023-01-29 12:21:49-06
```

You can write:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT TIMESTAMPTZ '2023-1-29 12:21:49' WITH (timezone = 'America/Chicago');  --> 2023-01-29 12:21:49-06
```

### Supported Commands

The `WITH` clause is supported for the following commands:

* [`SELECT`](/reference-sql/commands/queries/select)
* [`INSERT`](/reference-sql/commands/data-management/insert)
* [`COPY FROM`](/reference-sql/commands/data-management/copy-from)
* [`COPY TO`](/reference-sql/commands/data-management/copy-to)
* [`UPDATE`](/reference-sql/commands/data-management/update)
* [`DELETE`](/reference-sql/commands/data-management/delete)

### Supported Settings

The following settings can be configured using the `WITH` clause and the `SET` command:

* [timezone](#setting-the-time-zone)
* [standard\_conforming\_strings](#enable-parsing-for-literal-strings)
* [max\_result\_rows](#limit-the-number-of-result-rows)
* [statement\_timeout](#statement-timeout)
* [cancel\_query\_on\_connection\_drop](#query-cancellation-mode-on-connection-drop)
* [enable\_automated\_column\_statistics](#using-automated-column-statistics-in-the-optimizer)
* [enable\_result\_cache](#result-cache)
* [enable\_subresult\_cache](#subresult-cache)
* [enable\_scan\_cache](#scan-cache)
* [insert\_sharding](#insert-sharding)
* [tablet\_min\_size\_bytes and tablet\_max\_size\_bytes](#target-tablet-size)
* [max\_table\_partitions\_on\_insert](#partition-limits)
* [cross\_region\_request\_mode](#access-cross-region-data)
* [optimizer\_mode](#changing-the-optimizer-mode)
* [enable\_storage\_statistics](#using-storage-manager-statistics-in-the-optimizer)
* [max\_threads](#limiting-max-threads)
* [max\_insert\_threads](#limiting-max-insert-threads)
* [iceberg\_unsafe\_version\_guessing\_mode](#guessing-iceberg-metadata-version)
* [enable\_iceberg\_partitioned\_scan](#using-iceberg-partitioned-scans-for-co-located-joins-and-aggregations)
* [iceberg\_insert\_sharding](#iceberg-insert-sharding)
* [late\_materialization\_max\_rows](#control-late-materialization)
* [derive\_ndv\_from\_unique\_keys](#derive-ndv-from-unique-keys)
* [enable\_type\_based\_ndv\_heuristic](#enable-type-based-ndv-heuristic)
* [enable\_date\_time\_func\_ndv\_heuristic](#enable-datetime-function-ndv-heuristic)
* [force\_shuffle\_hash\_join](#forcing-hash-shuffled-joins)
* [as\_hbs\_training\_data](#marking-queries-as-hbs-training-data)
* [hbs\_object\_id](#using-an-hbs-snapshot-for-query-planning)

The following settings can be configured only using the `SET` command:

* [query\_label](#query-labeling-and-tagging)
* [warmup](#multi-cluster-engine-warmup)

## Setting the time zone

Use this setting to specify the session time zone. Time zone names are from the [Time Zone Database](http://www.iana.org/time-zones). You can see the list of tz database time zones [here](http://en.wikipedia.org/wiki/List_of_tz_database_time_zones). For times in the future, the latest known rule for the given time zone is applied. Firebolt does not support time zone abbreviations, as they cannot account for daylight savings time transitions, and some time zone abbreviations have meant different UTC offsets at different times. The default value of the `timezone` setting is UTC.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET timezone = '<time_zone>'
```

### Example

The following code example demonstrates how setting the timezone parameter affects the interpretation and conversion of `TIMESTAMPTZ` values:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET timezone = 'UTC';
SELECT TIMESTAMPTZ '1996-09-03 11:19:33.123456 Europe/Berlin';  --> 1996-09-03 09:19:33.123456+00
SELECT TIMESTAMPTZ '2023-1-29 6:3:42.7-3:30';  --> 2023-01-29 09:33:42.7+00

SET timezone = 'America/Chicago';
SELECT TIMESTAMPTZ '2023-1-29 12:21:49';  --> 2023-01-29 12:21:49-06
SELECT TIMESTAMPTZ '2023-1-29Z';  --> 2023-01-28 18:00:00-06
```

## Enable parsing for literal strings

If set to `true`, strings are parsed without escaping, treating backslashes literally. By default, this setting is enabled.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET standard_conforming_strings = [true|false]
```

### Example

The following code example demonstrates how setting `standard_conforming_strings` affects the interpretation of escape sequences in string literals:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET standard_conforming_strings = false;
SELECT '\x3132'; --> 132

SET standard_conforming_strings = true;
SELECT '\x3132'; --> \x3132
```

## Statement timeout

Specifies the number of milliseconds a SQL statement is allowed to run. Any SQL statement or query exceeding the specified time is canceled. A value of zero disables the timeout by default.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET statement_timeout = <number_of_milliseconds>;
```

### Example

The following SQL example sets the query timeout to three seconds:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET statement_timeout = 3000;
```

## Limit the number of result rows

When set to a value greater than zero, this setting limits the number of rows returned by `SELECT` statements. The query is executed as if an additional `LIMIT` clause is added to the SQL query. A value of zero or less means that no limit is applied. By default, no limit to the number of result rows is applied.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET max_result_rows = <integer>;
```

### Example

The following queries all return the same result. For the first query, no explicit settings are set:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT * FROM table LIMIT 10000;

SET max_result_rows = 10000;
SELECT * FROM table;

SET max_result_rows = 10000;
SELECT * FROM table LIMIT 20000;
```

## Query cancellation mode on connection drop

Specify how the query should behave when the HTTP connection to Firebolt is dropped, such as when the UI window is closed. For this, you can choose between 3 different modes:

* `NONE`: The query will not be canceled on connection drop
* `ALL` : The query will be canceled on connection drop
* `TYPE_DEPENDENT`: Only queries without side effects will be canceled, such as `SELECT`.

The default is `TYPE_DEPENDENT`.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET cancel_query_on_connection_drop = <mode>
```

### Example

The following code example demonstrates how to control query cancellation behavior when a connection drops using `none`, `all`, and `type_dependent` modes for `SET cancel_query_on_connection_drop`:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET cancel_query_on_connection_drop = none;
INSERT INTO X [...]
SELECT * FROM X;

SET cancel_query_on_connection_drop = all;
INSERT INTO X [...]
SELECT * FROM X;

SET cancel_query_on_connection_drop = type_dependent;
INSERT INTO X [...]
SELECT * FROM X;
```

## Query labeling and tagging

Use this option to label your query with a custom text. This simplifies query cancellation and retrieving the query status from system tables.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET query_label = '<text>'
```

### Example

The following code example assigns a query label to a query using `SET query_label`, allowing you to track it in `information_schema`, `engine_running_queries`, and `information_schema.engine_query_history`. It then demonstrates how to retrieve the `QUERY_ID` for the labeled query and cancel it using `CANCEL QUERY`:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET query_label = 'Hello Firebolt';
SELECT * FROM X;

SET query_label = '';

SELECT query_id, * FROM information_schema.engine_running_queries WHERE query_label = 'Hello Firebolt'
SELECT query_id, * FROM information_schema.engine_query_history WHERE query_label = 'Hello Firebolt'

CANCEL QUERY WHERE query_id = '<retrieved query_id>'
```

## Multi-cluster engine warmup

Use this option to distribute queries across all clusters of an engine, simplifying the process of initializing cached data to a consistent state across all clusters after a `START ENGINE` or `ALTER ENGINE` operation.

Warmup queries complete after they have run on all clusters of the engine. The queries return an empty result if they succeed on all clusters. If the query fails on any cluster, it returns an error. If multiple errors occur, only one error is returned.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET warmup = true;
```

### Example

The following code example activates the warmup mode so that the query runs on `production_table` using all clusters of an engine, and returns an empty result upon success:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
USE ENGINE multi_cluster_engine;
SET warmup = true;
SELECT checksum(*) FROM production_table;
SET warmup = false;
```

## Result cache

Set `enable_result_cache` to `FALSE` to disable the use of Firebolt's [result cache](/overview/queries/understand-query-performance-subresult), which is set to `TRUE` by default. Disabling result cashing can be useful for benchmarking query performance. When `enable_result_cache` is disabled, resubmitting the same query will recompute the results rather than retrieving them from cache.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET enable_result_cache = [true|false];
```

### Example

The following code example disables the result cache so that no previously cached results are used, and no new cache entries are written:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET enable_result_cache = false;
SELECT checksum(*) FROM production_table;
```

## Subresult cache

Firebolt implements [advanced cross-query optimization](/overview/queries/understand-query-performance-subresult) that allows SQL queries to reuse intermediate query execution states from previous requests.
Subresult caching operates at a semantic level, which allows Firebolt to understand and optimize queries based on the meaning and context of the data rather than solely based on their syntax or structure. This capability allows Firebolt to optimize across different query patterns for improved efficiency.

Set `enable_subresult_cache` to `FALSE` to disable Firebolt’s subresult caching, which is set to `TRUE` by default.

Disabling subresult caching is generally **not recommended**, as it can negatively impact query performance, especially for complex workloads. For most benchmarking scenarios, disable the result cache instead, as described in the previous [Result cache](#result-cache) section. This approach affects only the final result caching while preserving the benefits of subresult optimizations.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET enable_subresult_cache = [true|false];
```

### Example

The following code example disables the subresult cache so no previously cached subresult is used and no new cache entries are written by this query:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET enable_subresult_cache = false;
SELECT count(*) FROM fact_table INNER JOIN dim_table ON (a = b);
```

Setting `enable_subresult_cache` to `FALSE` disables the use of all [cached subresults](/overview/queries/understand-query-performance-subresult). In particular, it deactivates two caching mechanisms that normally speed up query runtimes: the use of the `MaybeCache` operator, which includes the full result cache, and the hash-table cache used by the `Join` operator.

## Scan cache

The scan cache is a feature that caches all columns accessed by a query right after they have been read by the scan of the underlying table. Note that no filters are applied, also any primary indexes will not be used. The cached columns simply represent the entire data for these fields in the table. Since they are in RAM once cached, certain queries / workloads may benefit from this. E.g., if there are many consecutive queries which access the exact same fields, but have changing filters.

Set `enable_scan_cache` to `TRUE` to enable Firebolt’s scan caching, which is set to `FALSE` by default. Note that RAM usage may be high when enabling the scan cache and the impact depends a lot on the workload.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET enable_scan_cache = [true|false];
```

### Example

The following code example enables the scan cache. Once enabled, we run several queries which access exactly the same columns, but change the filters. The first query will populate the cache, the following can use the cached columns.

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET enable_scan_cache = false;
SELECT a, b, c FROM production_table WHERE c > 0 LIMIT 10;
SELECT a, b, c FROM production_table WHERE c = 42 LIMIT 10;
SELECT a, b, c FROM production_table WHERE a = 17 AND b LIKE '%string%' LIMIT 10;
```

## Insert sharding

When working with [partitioned tables](/reference-sql/commands/data-definition/create-fact-dimension-table#partition-by), Firebolt enforces separation of data between tablets: rows of different partitions cannot be stored together in the same tablet.

Consider a scenario where you're ingesting historical data for the last 3 years with date-based partitioning: this could result in around 1000 tablets. For large datasets, a common practice is [to scale out](/reference-sql/commands/engines/alter-engine#scale-up-and-out-an-engine) for ingestion. However, this creates a challenge: each date might be processed from multiple nodes: for example, for 10 nodes it can result in up to 10,000 tablets, instead of 1,000. This not only slows down data persistence due to increased storage requests but can also degrade [query performance](/reference-sql/commands/data-management/vacuum#space-and-performance-considerations).

To address this, Firebolt provides controls for partitioned tables ingestion:

* `insert_sharding='shard_on_read'`: Use when the partition expression is based on [`$source_file_name`](/guides/loading-data/loading-data-sql#load-source-file-metadata-into-a-table). This allows Firebolt to determine the target partition before reading data and group files of the same partition on the same nodes. This is most effective when your source files are already organized by partition (e.g., files named like `data_20240101.csv`, `data_20240102.csv`).

* `insert_sharding='shuffle_on_write'`: Use when the partition expression is based on the data itself. In this case, data must be read first to determine partitioning. Just before insertion and after any transformations, the data is re-shuffled for partitions locality. Use this when your partition values come from the data content rather than file names.

### Notes

* This setting overrides default load-based sharding of input files. Be cautious as a single partition with heavy data could overload a single shard.
* This setting is only available via the `WITH SETTINGS` syntax, not with `SET`.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
INSERT INTO ...
WITH (insert_sharding = ['auto'|'no_sharding'|'shard_on_read'|'shuffle_on_write']);
```

### Example

The following examples demonstrate when to use each sharding option:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE partitioned_table (
  "date" TEXT,
  f0 TEXT,
  f1 TEXT,
  f2 TEXT,
  f3 TEXT,
  f4 TEXT
) PARTITION BY "date";

-- Example 1: Using shard_on_read with files named like payments-incremental_20240101.csv
-- This works because the partition value comes from the file name
INSERT INTO
  partitioned_table
SELECT
  REGEXP_EXTRACT($source_file_name, 'payments-incremental_(\d{8})', '', 1) AS "date",
  *
FROM
  READ_CSV(url => 's3://firebolt-publishing-public/help_center_assets/ledgering_sample/payments-incremental_*')
WITH (insert_sharding = 'shard_on_read');

-- Example 2: Using shuffle_on_write when partition value comes from the data
-- This is necessary because the partition value is only known after reading the data
INSERT INTO
  partitioned_table
SELECT
  f0,  -- date comes from the CSV data
  *
FROM
  READ_CSV(url => 's3://firebolt-publishing-public/help_center_assets/ledgering_sample/payments-incremental_*')
WITH (insert_sharding = 'shuffle_on_write');
```

Setting `insert_sharding` to `shard_on_read` changes the file distribution strategy across nodes: with this each date is processed by exactly one node, but only when the partition value can be determined from the source file name.

## Target tablet size

During ingestion, Firebolt attempts to create optimally sized tablets to balance ingestion speed and future scan performance. When all ingested data has been read, Firebolt prefers creating relatively small tablets to prioritize data persistence, leaving further optimization to [Auto Vacuum](/reference-sql/commands/data-management/vacuum). However, if this behavior isn't desirable, you can control it using the `tablet_min_size_bytes` and `tablet_max_size_bytes` settings:

* `tablet_min_size_bytes`: Controls the minimum size of tablets. If there isn't enough data in the ingestion, smaller tablets are created nevertheless. When possible, data is compacted into tablets of at least this size. Default: `1.5 GiB`. Minimum: `1 GiB`.
* `tablet_max_size_bytes`: Controls the maximum size of tablets. Default: `4 GiB`. Should be greater or equal than `tablet_min_size_bytes`.

### Note

Larger target tablet sizes may require more memory during ingestion.

### Example

The following example sets both minimum and maximum tablet sizes to `4 GiB`:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE playstats (
  "GameID" BIGINT NULL,
  "PlayerID" BIGINT NULL,
  "Timestamp" TEXT NULL,
  "SelectedCar" TEXT NULL,
  "CurrentLevel" BIGINT NULL,
  "CurrentSpeed" BIGINT NULL,
  "CurrentPlayTime" DOUBLE PRECISION NULL,
  "CurrentScore" BIGINT NULL,
  "Event" TEXT NULL,
  "ErrorCode" TEXT NULL
);

INSERT INTO
  playstats
SELECT
  *
FROM
  READ_PARQUET(URL => 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/*.parquet')
WITH (tablet_min_size_bytes = 4294967296, tablet_max_size_bytes = 4294967296);
```

Changing both `tablet_min_size_bytes` and `tablet_max_size_bytes` to `4 GiB` ensures that larger tablets are created.

## Partition limits

Excessive partitioning can degrade table health and query performance. To prevent tables from having too many partitions, Firebolt enforces a limit on the number of partitions per table. This limit is checked during data ingestion operations and will cause the operation to fail if the resulting partition count would exceed the configured threshold.

For accounts created after December 1, 2025, the default limit is `10000` partitions. (Accounts created before this date do not have a limit applied.) You can adjust this limit using the `max_table_partitions_on_insert` setting.

### Example

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET max_table_partitions_on_insert = 20000;
INSERT INTO my_table VALUES ...;
```

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
COPY INTO my_table FROM my_location WITH max_table_partitions_on_insert = 20000;
```

### Notes

* Setting `max_table_partitions_on_insert` to `0` disables the partition limit entirely.
* This limit operates on a best-effort basis, attempting to accuracy with performance. Strict enforcement is not guaranteed, especially on multi-node engines.

## Access Cross-Region Data

By default, Firebolt restricts importing from and exporting to Amazon S3 buckets located outside your engine's default region to minimize latencies and network costs. For workloads requiring cross-region access, you can enable this functionality using the `cross_region_request_mode` setting. Firebolt offers three modes:

* `disabled`: The default setting. Any attempt to access a bucket in another region will cause the query to fail.
* `auto`: Recommended for most scenarios when you need to access a bucket in a different region. Firebolt will automatically infer each bucket's region using the Amazon `HeadBucket` API and route requests to the appropriate regional endpoint. The region information is cached on the node for one hour. If you're using a role to authenticate to Amazon S3, it must include ListBuckets permissions.
  * For Iceberg queries using vended credentials, the approach is different because these credentials typically do not expose `HeadBucket` permissions. In this case, Firebolt will first attempt to resolve the region from the REST catalog; if not possible, it will use the pseudo-region `aws-global`.
* `enforced`: Use when you need to access a bucket in a different region and do not have the permissions for `auto` mode. Firebolt will use the pseudo-region `aws-global` for every request. This may lead to higher latencies since AWS needs to resolve the region internally for each request.

### Notes

* Firebolt currently offers limited cross-region data transfer and will throttle requests if the limit is exceeded. The current limit is 100 GB per hour per node.
* This setting is only available via the `WITH SETTINGS` syntax, not with `SET`.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
-- Example 1: COPY FROM
COPY INTO playstats
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/'
LIMIT 5
WITH (TYPE = PARQUET, CROSS_REGION_REQUEST_MODE = 'auto');

-- Example 2: COPY TO
COPY (SELECT * FROM playstats)
TO 's3://sink/'
WITH (TYPE = PARQUET, SINGLE_FILE = TRUE, CROSS_REGION_REQUEST_MODE = 'auto');

-- Example 3: Read TVFs
SELECT * FROM read_parquet('s3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/')
LIMIT 5
WITH (CROSS_REGION_REQUEST_MODE = 'auto');
```

## Changing the optimizer mode

If set to `'user_guided'`, the optimizer will disable all cost-based optimization rules.
This ensures that the relative order of joins and aggregations in the resulting execution plan will follow the syntactic order defined in the `SELECT` blocks of your SQL query
The default value is `automatic`.
For more details, see [User-guided mode](/performance-and-observability/query-planning/user-guided-mode).

### Syntax

You can use both the `SET`

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET optimizer_mode = 'user_guided';
SELECT ... ;
SET optimizer_mode = 'automatic';
```

and the `WITH` syntax.

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT ... WITH (optimizer_mode = 'user_guided');
```

If you want to limit the scope of this setting to a specific query, we recommend the latter.

To learn more, read [the dedicated "User-guided mode" guide](/performance-and-observability/query-planning/user-guided-mode).

### Example

The following examples demonstrate how to use the `optimizer_mode` setting to ensure that in the resulting execution plan, the join between `r` and `s` is executed before the join between `r` and `t`.

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT
  *
FROM
  r JOIN s ON(r.x = s.x) JOIN t ON(r.y = t.y)
WITH (optimizer_mode = 'user_guided');
```

## Using storage manager statistics in the optimizer

You can control the use of storage manager statistics in the optimizer with the `enable_storage_statistics` session parameter.
Per default, storage manager statistics are enabled (the default value is `true`).
If you want to disable them, set the value to `false`.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET enable_storage_statistics = [true|false];
```

### Example

Consider the following example.

When `enable_storage_statistics` is set to `true`, the optimizer will use the storage manager statistics to estimate the number of rows in the fact table `ft` and the dimension tables `d1` and `d2`.
The storage manager reports that `d2` is much smaller (50 rows) compared to `d1` (1000 rows).
The optimizer estimates that even after applying the local predicates on these dimension tables, it will be more efficient to join `ft` with `d2` first and then with `d1`.
The join order in this plan will adapt to the size of the underlying tables.
For example, if the size of `d2` becomes 400 rows, the optimizer will first join `ft` with `d1` and then with `d2`.

However, when `enable_storage_statistics` is set to `false`, the optimizer uses the hardcoded default estimates for the fact table `ft` (100,000,000 rows) and the dimension tables `d1` and `d2` (100,000 rows).
Because the two dimension tables are estimated to be equally large, the join order in this case is fully determined by the number of local predicates.
Applying two local predicates on `d1` is estimated to produce a smaller estimated row count (20000 rows) than applying one local predicate on `d2` (70000 rows).
This join order will remain stable even if the underlying table sizes change.

<CodeGroup>
  ```sql SQL query theme={"theme":{"light":"github-light","dark":"github-dark"}}
  EXPLAIN(LOGICAL, STATISTICS)
  SELECT
    *
  FROM
    fact_table ft JOIN dim_table1 d1 ON(ft.x = d1.x) JOIN dim_table2 d2 ON(ft.y = d2.y)
  WHERE
    d1.a > 4 AND d1.a < 9 AND d2.b > 5;
  ```

  ```text EXPLAIN plan with enable_storage_statistics = true {9-10,13-14,17-18} theme={"theme":{"light":"github-light","dark":"github-dark"}}
  [0] [Projection] fact_table.x, fact_table.y, fact_table.z, fact_table.x, dim_table1.a, fact_table.y, dim_table2.b
  |   [Logical Profile]: [est. #rows=16922, source: estimated]
   \_[1] [Join] Mode: Inner [(fact_table.x = dim_table1.x)]
     |   [Logical Profile]: [est. #rows=16922, source: estimated]
      \_[2] [Projection] fact_table.x, fact_table.y, fact_table.z, dim_table2.b
      | |   [Logical Profile]: [est. #rows=5677, source: estimated]
      |  \_[3] [Join] Mode: Inner [(fact_table.y = dim_table2.y)]
      |    |   [Logical Profile]: [est. #rows=5677, source: estimated]
      |     \_[4] [StoredTable] Name: "fact_table"
      |     |     [Logical Profile]: [est. #rows=10000, source: metadata]
      |     \_[5] [Filter] (dim_table2.b > 5)
      |       |   [Logical Profile]: [est. #rows=35, source: estimated]
      |        \_[6] [StoredTable] Name: "dim_table2"
      |              [Logical Profile]: [est. #rows=50, source: metadata]
      \_[7] [Filter] (dim_table1.a > 4), (dim_table1.a < 9)
        |   [Logical Profile]: [est. #rows=200, source: estimated]
         \_[8] [StoredTable] Name: "dim_table1"
               [Logical Profile]: [est. #rows=1000, source: metadata]
  ```

  ```text EXPLAIN plan with enable_storage_statistics = false {9-10,13-14,17-18} theme={"theme":{"light":"github-light","dark":"github-dark"}}
  [0] [Projection] fact_table.x, fact_table.y, fact_table.z, fact_table.x, dim_table1.a, fact_table.y, dim_table2.b
  |   [Logical Profile]: [est. #rows=6.18316e+12, source: estimated]
   \_[1] [Join] Mode: Inner [(fact_table.y = dim_table2.y)]
     |   [Logical Profile]: [est. #rows=6.18316e+12, source: estimated]
      \_[2] [Projection] fact_table.x, fact_table.y, fact_table.z, dim_table1.a
      | |   [Logical Profile]: [est. #rows=1.46831e+10, source: estimated]
      |  \_[3] [Join] Mode: Inner [(fact_table.x = dim_table1.x)]
      |    |   [Logical Profile]: [est. #rows=1.46831e+10, source: estimated]
      |     \_[4] [StoredTable] Name: "fact_table"
      |     |     [Logical Profile]: [est. #rows=1e+08, source: hardcoded]
      |     \_[5] [Filter] (dim_table1.a > 4), (dim_table1.a < 9)
      |       |   [Logical Profile]: [est. #rows=20000, source: estimated]
      |        \_[6] [StoredTable] Name: "dim_table1"
      |              [Logical Profile]: [est. #rows=100000, source: hardcoded]
      \_[7] [Filter] (dim_table2.b > 5)
        |   [Logical Profile]: [est. #rows=70000, source: estimated]
         \_[8] [StoredTable] Name: "dim_table2"
               [Logical Profile]: [est. #rows=100000, source: hardcoded]
  ```
</CodeGroup>

## Using automated column statistics in the optimizer

The `enable_automated_column_statistics` setting controls the [Automated Column Statistics (ACS)](/performance-and-observability/query-planning/automated-column-statistics) feature.
ACS enables the query optimizer to leverage existing aggregating indexes to obtain statistical information about columns during query planning.
When enabled, this feature provides statistical information that helps the planner to make better optimization decisions.

**Default value:** `false` (opt-in feature)

### Prerequisites

To use automated column statistics, you must:

1. **Create aggregating indexes** to collect needed column statistics
2. **Enable the feature** by setting the query-level setting

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET enable_automated_column_statistics = [true|false];
```

### Example

Consider two tables `employees` and `departments` with columns frequently used for grouping and filtering.
Create an aggregating index for each table to enable statistics collection:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE AGGREGATING INDEX employees_statistics
ON employees
(
    APPROX_COUNT_DISTINCT(gender),      -- for filtering by gender
    APPROX_COUNT_DISTINCT(title),       -- for filtering by title
    APPROX_COUNT_DISTINCT(dept_id)      -- for joining on department
)

CREATE AGGREGATING INDEX departments_statistics
ON departments
(
    APPROX_COUNT_DISTINCT(dept_name),   -- for grouping by department name
    APPROX_COUNT_DISTINCT(location)     -- for filtering by location
);
```

Enable and use automated column statistics:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
-- Enable the feature
SET enable_automated_column_statistics = true;

SELECT
    d.dept_name,
    e.gender,
    COUNT(*) as employee_count,
    AVG(e.salary) as avg_salary
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
WHERE d.location = 'New York'
  AND e.title = 'Senior Developer'
GROUP BY d.dept_name, e.gender
ORDER BY employee_count DESC;
```

## Limiting max threads

Controls the maximum number of threads per query, limiting the degree of parallelism for query processing.
This setting can affect query performance and memory usage and applies per stage, meaning distributed queries may use more threads.

By default, `max_threads` equals the number of physical CPU cores on each node.
The default number of threads will also be used if `max_threads=0`.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET max_threads = <number_of_threads_per_node>;
```

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT ... WITH (max_threads = <number_of_threads_per_node>);
```

### Example

Limit per-node parallelism to eight threads per stage for a single query:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT *
FROM READ_PARQUET(URL => 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/*.parquet')
WITH (max_threads = 8);
```

## Limiting max insert threads

Controls the maximum number of threads for `INSERT` statements, limiting the degree of parallelism for tablet writing operations.
This can reduce memory footprint during ingestion.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
INSERT INTO ... WITH (max_insert_threads = <number_of_threads_per_node>);
```

### Example

Limit insert write operations to a single thread so that only one tablet is written at a time, reducing memory usage by avoiding multiple concurrent buffers:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
INSERT INTO t
SELECT * FROM READ_PARQUET(URL => 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/*.parquet')
WITH (max_insert_threads = 1);
```

## Guessing Iceberg metadata version

In Apache Iceberg, REST catalogs provide metadata file locations through their REST API, while file-based catalogs provide a `version-hint.text` file. If a file-based catalog lacks a `version-hint.text` file, Firebolt can attempt to guess the correct metadata file.

This behavior is disabled by default because it may violate atomicity guarantees. To enable, set `iceberg_unsafe_version_guessing_mode`.

### Accepted values

* `disabled` (default): Disables version guessing.
* `version_num`: Selects metadata files based on the numeric portion of filenames such as `v1.metadata.json` or `00002-abcdabcd-abcd-abcd-abcd-abcdabcdabcd.metadata.json`. If multiple files share the same version number, the behavior is non-deterministic.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET iceberg_unsafe_version_guessing_mode = <mode>;
```

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT * FROM read_iceberg( ... ) ... WITH iceberg_unsafe_version_guessing_mode = <mode>;
```

### Example

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT * FROM read_iceberg('s3://my-bucket/path/to/iceberg/table')
LIMIT 1
WITH iceberg_unsafe_version_guessing_mode = 'version_num';
```

## Using Iceberg partitioned scans for co-located joins and aggregations

The `enable_iceberg_partitioned_scan` setting controls whether to scan partitioned Iceberg tables in a partitioned manner on multi-node engines. When enabled, Firebolt assigns entire partitions of the tables to nodes, allowing joins and aggregations on the tables' partition keys to be run fully locally.  For joins, this requires compatible partitioning schemes on both sides of the join.  Note that enabling this setting cause all partinioned Iceberg tables to be scanned in a partitioned manner, even if no joins or aggregations eligible for co-located execution are present in the query.

**Caution:** when partition sizes are imbalanced or the number of partitions is smaller than the number of nodes, enabling this setting can cause significant load imbalance between the nodes, degrading query performance.

### Example

In the following example, we join two Iceberg tables partitioned by `month("timestamp")`.  The join is eligible for co-located execution because both tables are partitioned by the same key.  Note how with `enable_iceberg_partitioned_scan` set to `true`, the *iceberg file listing* is shuffled by the tables' partition values, but no further shuffling is performed before the join.  With the setting disabled, the iceberg file listing is shuffled by a weighted partitioning of the file sizes to make sure that each node is assigned approximately the same amount of data to scan, ignoring the partition values.  This requires an additional hash shuffle of both tables before the join, which can be expensive for very large tables.  Enabling partitioned scans allows this shuffle to be avoided.

<CodeGroup>
  ```sql Join query theme={"theme":{"light":"github-light","dark":"github-dark"}}
  EXPLAIN (PHYSICAL)
  SELECT
    *
  FROM
    read_iceberg(location => 'table1') table1  -- partitioned by `month("timestamp")`
  JOIN
    read_iceberg(location => 'table2') table2  -- partitioned by `month("timestamp")`
  USING ("timestamp");
  ```

  ```text Join plan with enable_iceberg_partitioned_scan = true {9-10,18-19} theme={"theme":{"light":"github-light","dark":"github-dark"}}
  [0] [MaybeCache]
   \_[1] [Shuffle] Gather
     |   [Affinity]: many nodes
      \_[2] [Projection] timestamp, description, user
         \_[3] [Join] Mode: Inner [(timestamp = timestamp)]
            \_[4] [TableFuncScan] timestamp: $0.timestamp, description: $0.description
            | |   $0 = read_from_s3(url='s3://.../table1', format='PARQUET', object_pattern='*', type=Iceberg, file_format, file_size, partition_columns, partition_transforms, partition_values, file_name, file_bucket, file_storage_api, file_etag, min_timestamp, max_timestamp, min_description, max_description, partition_value_timestamp_1000)
            | |   [Types]: timestamp: timestamp not null, description: text not null
            |  \_[5] [Shuffle] Hash by [partition_values]
            |    |   [Affinity]: many nodes
            |     \_[6] [MaybeCache]
            |        \_[7] [TableFuncScan] file_format: $0.file_format, file_size: $0.file_size, partition_columns: $0.partition_columns, partition_transforms: $0.partition_transforms, partition_values: $0.partition_values, file_name: $0.file_name, file_bucket: $0.file_bucket, file_storage_api: $0.file_storage_api, file_etag: $0.file_etag, min_timestamp: $0.min_timestamp, max_timestamp: $0.max_timestamp, min_description: $0.min_description, max_description: $0.max_description, partition_value_timestamp_1000: $0.partition_value_timestamp_1000
            |              $0 = list_iceberg_files(url => 's3://.../table1/metadata/v1.metadata.json', metadata_json_content => '****', snapshot_id => '1888593444364154112', snapshot_timestamp => NULL)
            |              [Types]: file_format: text not null, file_size: bigint not null, partition_columns: array(text not null) not null, partition_transforms: array(text not null) not null, partition_values: array(text not null) not null, file_name: text not null, file_bucket: text not null, file_storage_api: text not null, file_etag: text not null, min_timestamp: timestamp null, max_timestamp: timestamp null, min_description: text null, max_description: text null, partition_value_timestamp_1000: integer not null
            \_[8] [TableFuncScan] timestamp: $0.timestamp, user: $0.user
              |   $0 = read_from_s3(url='s3://.../table2', format='PARQUET', object_pattern='*', type=Iceberg, file_format, file_size, partition_columns, partition_transforms, partition_values, file_name, file_bucket, file_storage_api, file_etag, min_timestamp, max_timestamp, min_user, max_user, partition_value_timestamp_1000)
              |   [Types]: timestamp: timestamp not null, user: text not null
               \_[9] [Shuffle] Hash by [partition_values]
                 |   [Affinity]: many nodes
                  \_[10] [MaybeCache]
                     \_[11] [TableFuncScan] file_format: $0.file_format, file_size: $0.file_size, partition_columns: $0.partition_columns, partition_transforms: $0.partition_transforms, partition_values: $0.partition_values, file_name: $0.file_name, file_bucket: $0.file_bucket, file_storage_api: $0.file_storage_api, file_etag: $0.file_etag, min_timestamp: $0.min_timestamp, max_timestamp: $0.max_timestamp, min_user: $0.min_user, max_user: $0.max_user, partition_value_timestamp_1000: $0.partition_value_timestamp_1000
                           $0 = list_iceberg_files(url => 's3://.../table2/metadata/v1.metadata.json', metadata_json_content => '****', snapshot_id => '2306990001985110276', snapshot_timestamp => NULL)
                           [Types]: file_format: text not null, file_size: bigint not null, partition_columns: array(text not null) not null, partition_transforms: array(text not null) not null, partition_values: array(text not null) not null, file_name: text not null, file_bucket: text not null, file_storage_api: text not null, file_etag: text not null, min_timestamp: timestamp null, max_timestamp: timestamp null, min_user: text null, max_user: text null, partition_value_timestamp_1000: integer not null
  ```

  ```text Join plan with enable_iceberg_partitioned_scan = false {6-7,12-15,22-23,28-31} theme={"theme":{"light":"github-light","dark":"github-dark"}}
  [0] [MaybeCache]
   \_[1] [Shuffle] Gather
     |   [Affinity]: many nodes
      \_[2] [Projection] timestamp, description, user
         \_[3] [Join] Mode: Inner [(timestamp = timestamp)]
            \_[4] [Shuffle] Hash by [timestamp]
            | |   [Affinity]: many nodes
            |  \_[5] [TableFuncScan] timestamp: $0.timestamp, description: $0.description
            |    |   $0 = read_from_s3(url='s3://.../table1', format='PARQUET', object_pattern='*', type=Iceberg, file_format, file_size, partition_columns, partition_transforms, partition_values, file_name, file_bucket, file_storage_api, file_etag, min_timestamp, max_timestamp, min_description, max_description, partition_value_timestamp_1000)
            |    |   [Types]: timestamp: timestamp not null, description: text not null
            |     \_[6] [Projection] file_format, file_size, partition_columns, partition_transforms, partition_values, file_name, file_bucket, file_storage_api, file_etag, min_timestamp, max_timestamp, min_description, max_description, partition_value_timestamp_1000
            |        \_[7] [Shuffle] KeyIdentity by [weighted_partition_0]
            |          |   [Affinity]: many nodes
            |           \_[8] [Window] weighted_partition_0: weighted_partition(file_size, c_0) OVER (ORDER BY file_size Descending Last, file_name StrictlyAscending Last, file_format StrictlyAscending Last, partition_columns StrictlyAscending Last, partition_transforms StrictlyAscending Last, partition_values StrictlyAscending Last, file_bucket StrictlyAscending Last, file_storage_api StrictlyAscending Last, file_etag StrictlyAscending Last, min_timestamp StrictlyAscending Last, max_timestamp StrictlyAscending Last, min_description StrictlyAscending Last, max_description StrictlyAscending Last, partition_value_timestamp_1000 StrictlyAscending Last, c_0 StrictlyAscending Last, c_0 StrictlyAscending Last)
            |             |   [Types]: weighted_partition_0: bigint not null
            |              \_[9] [Projection] file_format, file_size, partition_columns, partition_transforms, partition_values, file_name, file_bucket, file_storage_api, file_etag, min_timestamp, max_timestamp, min_description, max_description, partition_value_timestamp_1000, c_0: 1
            |                |   [Types]: c_0: bigint not null
            |                 \_[10] [MaybeCache]
            |                    \_[11] [TableFuncScan] file_format: $0.file_format, file_size: $0.file_size, partition_columns: $0.partition_columns, partition_transforms: $0.partition_transforms, partition_values: $0.partition_values, file_name: $0.file_name, file_bucket: $0.file_bucket, file_storage_api: $0.file_storage_api, file_etag: $0.file_etag, min_timestamp: $0.min_timestamp, max_timestamp: $0.max_timestamp, min_description: $0.min_description, max_description: $0.max_description, partition_value_timestamp_1000: $0.partition_value_timestamp_1000
            |                          $0 = list_iceberg_files(url => 's3://.../table1/metadata/v1.metadata.json', metadata_json_content => '****', snapshot_id => '1888593444364154112', snapshot_timestamp => NULL)
            |                          [Types]: file_format: text not null, file_size: bigint not null, partition_columns: array(text not null) not null, partition_transforms: array(text not null) not null, partition_values: array(text not null) not null, file_name: text not null, file_bucket: text not null, file_storage_api: text not null, file_etag: text not null, min_timestamp: timestamp null, max_timestamp: timestamp null, min_description: text null, max_description: text null, partition_value_timestamp_1000: integer not null
            \_[12] [Shuffle] Hash by [timestamp]
              |   [Affinity]: many nodes
               \_[13] [TableFuncScan] timestamp: $0.timestamp, user: $0.user
                 |   $0 = read_from_s3(url='s3://.../table2', format='PARQUET', object_pattern='*', type=Iceberg, file_format, file_size, partition_columns, partition_transforms, partition_values, file_name, file_bucket, file_storage_api, file_etag, min_timestamp, max_timestamp, min_user, max_user, partition_value_timestamp_1000)
                 |   [Types]: timestamp: timestamp not null, user: text not null
                  \_[14] [Projection] file_format, file_size, partition_columns, partition_transforms, partition_values, file_name, file_bucket, file_storage_api, file_etag, min_timestamp, max_timestamp, min_user, max_user, partition_value_timestamp_1000
                     \_[15] [Shuffle] KeyIdentity by [weighted_partition_1]
                       |   [Affinity]: many nodes
                        \_[16] [Window] weighted_partition_1: weighted_partition(file_size, c_1) OVER (ORDER BY file_size Descending Last, file_name StrictlyAscending Last, file_format StrictlyAscending Last, partition_columns StrictlyAscending Last, partition_transforms StrictlyAscending Last, partition_values StrictlyAscending Last, file_bucket StrictlyAscending Last, file_storage_api StrictlyAscending Last, file_etag StrictlyAscending Last, min_timestamp StrictlyAscending Last, max_timestamp StrictlyAscending Last, min_user StrictlyAscending Last, max_user StrictlyAscending Last, partition_value_timestamp_1000 StrictlyAscending Last, c_1 StrictlyAscending Last, c_1 StrictlyAscending Last)
                          |   [Types]: weighted_partition_1: bigint not null
                           \_[17] [Projection] file_format, file_size, partition_columns, partition_transforms, partition_values, file_name, file_bucket, file_storage_api, file_etag, min_timestamp, max_timestamp, min_user, max_user, partition_value_timestamp_1000, c_1: 1
                             |   [Types]: c_1: bigint not null
                              \_[18] [MaybeCache]
                                 \_[19] [TableFuncScan] file_format: $0.file_format, file_size: $0.file_size, partition_columns: $0.partition_columns, partition_transforms: $0.partition_transforms, partition_values: $0.partition_values, file_name: $0.file_name, file_bucket: $0.file_bucket, file_storage_api: $0.file_storage_api, file_etag: $0.file_etag, min_timestamp: $0.min_timestamp, max_timestamp: $0.max_timestamp, min_user: $0.min_user, max_user: $0.max_user, partition_value_timestamp_1000: $0.partition_value_timestamp_1000
                                       $0 = list_iceberg_files(url => 's3://.../table2/metadata/v1.metadata.json', metadata_json_content => '****', snapshot_id => '2306990001985110276', snapshot_timestamp => NULL)
                                       [Types]: file_format: text not null, file_size: bigint not null, partition_columns: array(text not null) not null, partition_transforms: array(text not null) not null, partition_values: array(text not null) not null, file_name: text not null, file_bucket: text not null, file_storage_api: text not null, file_etag: text not null, min_timestamp: timestamp null, max_timestamp: timestamp null, min_user: text null, max_user: text null, partition_value_timestamp_1000: integer not null
  ```
</CodeGroup>

Below, we show the effect of partitioned scans on aggregations.  We use the another table that is value-partitioned by `type`.  We aggregate on `type` and see that the optimizer uses partitioned scans to avoid shuffling the data.

<CodeGroup>
  ```sql Aggregation query theme={"theme":{"light":"github-light","dark":"github-dark"}}
  EXPLAIN (PHYSICAL)
  SELECT
    type,
    count(*) as count
  FROM
    read_iceberg(location => 'table3')  -- partitioned by `type`
  GROUP BY type;
  ```

  ```text Aggregation plan with enable_iceberg_partitioned_scan = true {11-12} theme={"theme":{"light":"github-light","dark":"github-dark"}}
  [0] [MaybeCache]
   \_[1] [Shuffle] Gather
     |   [Affinity]: many nodes
      \_[2] [Projection] type, max_0
         \_[3] [Aggregate] GroupBy: [type] Aggregates: [max_0: max(name)]
           |   [Types]: max_0: text null
            \_[4] [Projection] type, name
               \_[5] [TableFuncScan] name: $0.name, type: $0.type
                 |   $0 = read_from_s3(url='s3://.../table3', format='PARQUET', object_pattern='*', type=Iceberg, file_format, file_size, partition_columns, partition_transforms, partition_values, file_name, file_bucket, file_storage_api, file_etag, min_name, max_name, min_type, max_type, partition_value_type_1000)
                 |   [Types]: name: text null, type: bigint null
                  \_[6] [Shuffle] Hash by [partition_values]
                    |   [Affinity]: many nodes
                     \_[7] [MaybeCache]
                        \_[8] [TableFuncScan] file_format: $0.file_format, file_size: $0.file_size, partition_columns: $0.partition_columns, partition_transforms: $0.partition_transforms, partition_values: $0.partition_values, file_name: $0.file_name, file_bucket: $0.file_bucket, file_storage_api: $0.file_storage_api, file_etag: $0.file_etag, min_name: $0.min_name, max_name: $0.max_name, min_type: $0.min_type, max_type: $0.max_type, partition_value_type_1000: $0.partition_value_type_1000
                              $0 = list_iceberg_files(url => 's3://.../table3/metadata/v1.metadata.json', metadata_json_content => '****', snapshot_id => '7056553103719432372', snapshot_timestamp => NULL)
                              [Types]: file_format: text not null, file_size: bigint not null, partition_columns: array(text not null) not null, partition_transforms: array(text not null) not null, partition_values: array(text not null) not null, file_name: text not null, file_bucket: text not null, file_storage_api: text not null, file_etag: text not null, min_name: text null, max_name: text null, min_type: bigint null, max_type: bigint null, partition_value_type_1000: bigint null
  ```

  ```text Aggregation plan with enable_iceberg_partitioned_scan = false {7-8,15-18} theme={"theme":{"light":"github-light","dark":"github-dark"}}
  [0] [MaybeCache]
   \_[1] [Shuffle] Gather
     |   [Affinity]: many nodes
      \_[2] [Projection] type, max_1
         \_[3] [AggregateMerge] GroupBy: [type] Aggregates: [max_1: maxmerge(max_0)]
           |   [Types]: max_1: text null
            \_[4] [Shuffle] Hash by [type]
              |   [Affinity]: many nodes
               \_[5] [AggregateState partial] GroupBy: [type] Aggregates: [max_0: max(name)]
                 |   [Types]: max_0: aggregatefunction(max2ornull, text null) not null
                  \_[6] [TableFuncScan] name: $0.name, type: $0.type
                    |   $0 = read_from_s3(url='s3://.../table3', format='PARQUET', object_pattern='*', type=Iceberg, file_format, file_size, partition_columns, partition_transforms, partition_values, file_name, file_bucket, file_storage_api, file_etag, min_name, max_name, min_type, max_type, partition_value_type_1000)
                    |   [Types]: name: text null, type: bigint null
                     \_[7] [Projection] file_format, file_size, partition_columns, partition_transforms, partition_values, file_name, file_bucket, file_storage_api, file_etag, min_name, max_name, min_type, max_type, partition_value_type_1000
                        \_[8] [Shuffle] KeyIdentity by [weighted_partition_0]
                          |   [Affinity]: many nodes
                           \_[9] [Window] weighted_partition_0: weighted_partition(file_size, c_0) OVER (ORDER BY file_size Descending Last, file_name StrictlyAscending Last, file_format StrictlyAscending Last, partition_columns StrictlyAscending Last, partition_transforms StrictlyAscending Last, partition_values StrictlyAscending Last, file_bucket StrictlyAscending Last, file_storage_api StrictlyAscending Last, file_etag StrictlyAscending Last, min_name StrictlyAscending Last, max_name StrictlyAscending Last, min_type StrictlyAscending Last, max_type StrictlyAscending Last, partition_value_type_1000 StrictlyAscending Last, c_0 StrictlyAscending Last, c_0 StrictlyAscending Last)
                             |   [Types]: weighted_partition_0: bigint not null
                              \_[10] [Projection] file_format, file_size, partition_columns, partition_transforms, partition_values, file_name, file_bucket, file_storage_api, file_etag, min_name, max_name, min_type, max_type, partition_value_type_1000, c_0: 1
                                |   [Types]: c_0: bigint not null
                                 \_[11] [MaybeCache]
                                    \_[12] [TableFuncScan] file_format: $0.file_format, file_size: $0.file_size, partition_columns: $0.partition_columns, partition_transforms: $0.partition_transforms, partition_values: $0.partition_values, file_name: $0.file_name, file_bucket: $0.file_bucket, file_storage_api: $0.file_storage_api, file_etag: $0.file_etag, min_name: $0.min_name, max_name: $0.max_name, min_type: $0.min_type, max_type: $0.max_type, partition_value_type_1000: $0.partition_value_type_1000
                                          $0 = list_iceberg_files(url => 's3://.../table3/metadata/v1.metadata.json', metadata_json_content => '****', snapshot_id => '7056553103719432372', snapshot_timestamp => NULL)
                                          [Types]: file_format: text not null, file_size: bigint not null, partition_columns: array(text not null) not null, partition_transforms: array(text not null) not null, partition_values: array(text not null) not null, file_name: text not null, file_bucket: text not null, file_storage_api: text not null, file_etag: text not null, min_name: text null, max_name: text null, min_type: bigint null, max_type: bigint null, partition_value_type_1000: bigint null
  ```
</CodeGroup>

## Iceberg insert sharding

Firebolt provides controls for how data is distributed across nodes when [writing to partitioned Apache Iceberg tables](/reference-sql/commands/data-definition/create-iceberg-table-as-select). This is similar to [`insert_sharding`](#insert-sharding) but applies specifically to Iceberg table writes.

Use `iceberg_insert_sharding` to control the trade-off between parallelism and the number of output files per partition. By default, Firebolt shuffles data so that each partition is written by fewer nodes, which reduces per-file overhead and produces fewer, larger output files. Having fewer files lowers I/O overhead and generally improves Firebolt's query performance. This works best when you have enough partitions to distribute evenly across nodes.

### Accepted values

* `shuffle_on_write` (default): Data is shuffled by partition values before writing. This groups files of the same partition on the same nodes, resulting in fewer output files per partition. This is optimal for most use cases as it produces fewer, larger files that are more efficient to read. However, this may be inefficient if there are single partitions with heavy data - which would not be distributed across multiple nodes.

* `no_sharding`: No shuffling is performed before writing. On multi-node engines, this can result in multiple files per partition since different nodes may write data for the same partition independently. This can also increase the load per node when there are many partitions. Use this when you want to maximize write parallelism and are less concerned about the number of output files or load per node.

### Notes

* This setting only affects partitioned Iceberg table writes. Non-partitioned tables are not affected.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE ICEBERG TABLE <table_name>
  PARTITION BY (...)
  AS <select_query>
WITH LOCATION = <location_name>
     iceberg_insert_sharding = '<mode>';
```

### Example

The following example uses `no_sharding` to maximize write parallelism at the cost of producing more output files:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE ICEBERG TABLE my_partitioned_iceberg
  PARTITION BY iceberg_bucket(user_id, 16)
  AS SELECT * FROM my_source_table
WITH LOCATION = my_iceberg_location
     iceberg_insert_sharding = 'no_sharding';
```

## Control late materialization

You can control the late materialization optimization using `late_materialization_max_rows`.
By default `late_materialization_max_rows` is set to 10.
This means that late materialization will only be applied for top-k queries with a limit of 10 or smaller.
If you want to use late materialization for a query with a larger limit you can increase this value.
If you want to disable late materialization, just set it to zero.

See [late materialization](/performance-and-observability/query-planning/late-materialization) for more details.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET late_materialization_max_rows = <integer>;
```

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT ... WITH (late_materialization_max_rows = <integer>);
```

### Example

The following example demonstrates how to use the `late_materialization_max_rows` setting to apply late materialization to a top-k query with a limit larger than 10.

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT * FROM table ORDER BY a LIMIT 100
WITH (late_materialization_max_rows = 100);
```

## Derive NDV from unique keys

The `derive_ndv_from_unique_keys` setting enables the query optimizer to derive the number of distinct values (NDV) for columns from a unique key. For example, if the optimizer sees that a column `a` is unique at some point and the row count is estimated to be 100, with the feature enabled, the optimizer assumes the NDV of column `a` is 100 because `a` is unique. When the feature is disabled, the optimizer does not make this assumption.

When enabled, it can help the optimizer make better cardinality estimates and plan transformations, potentially leading to improved query performance.

This is an opt-in feature. For more details on how unique constraints affect query planning, see [UNIQUE constraints as query hints](/performance-and-observability/query-planning/unique-constraint).

**Default value:** `false`

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET derive_ndv_from_unique_keys = [true|false];
SELECT ... WITH (derive_ndv_from_unique_keys = [true|false]);
```

### Example

The following example shows the difference in the join output cardinality estimation the optimizer produces with the feature turn on and off.

* With the feature turned on, the optimizer understands that the join outputs at most the same number of rows as the `orders` table.
* With the feature turned off, the optimizer over estimates by assuming the join is explosive.

<CodeGroup>
  ```sql Example case theme={"theme":{"light":"github-light","dark":"github-dark"}}
  -- 1500000 rows
  CREATE FACT TABLE orders (
     id INT NOT NULL,
     price DOUBLE,
     city TEXT
  );

  -- 10000 rows
  CREATE DIMENSION TABLE accounts (
     id INT NOT NULL,
     city TEXT UNIQUE  -- this column is unique
  )

  EXPLAIN (STATISTICS)
  SELECT *
  FROM orders o, accounts a
  WHERE o.city = a.city;
  ```

  ```text Plan with derive_ndv_from_unique_keys = true {4,10} theme={"theme":{"light":"github-light","dark":"github-dark"}}
  [0] [Projection] orders.id, orders.price, orders.city, accounts.id, orders.city
  |   [Logical Profile]: [est. #rows=1.5e+06, column profiles={[orders.city: #distinct=10000]}, source: estimated]
   \_[1] [Join] Mode: Inner [equals(orders.city, accounts.city)]
     |   [Logical Profile]: [est. #rows=1.5e+06, column profiles={[orders.city: #distinct=10000], [accounts.city: #distinct=10000]}, source: estimated]
      \_[2] [StoredTable] Name: "orders"
      |     [Types]: orders.id: integer not null, orders.price: double precision not null, orders.city: text not null
      |     [Logical Profile]: [est. #rows=1.5e+06, source: hinted]
      \_[3] [StoredTable] Name: "accounts"
            [Types]: accounts.id: integer not null, accounts.city: text not null
            [Logical Profile]: [est. #rows=10000, column profiles={[accounts.city: #distinct=10000]}, source: hinted]
  ```

  ```text Plan with derive_ndv_from_unique_keys = false {4,10} theme={"theme":{"light":"github-light","dark":"github-dark"}}
  [0] [Projection] orders.id, orders.price, orders.city, accounts.id, orders.city
  |   [Logical Profile]: [est. #rows=1.33118e+08, source: estimated]
   \_[1] [Join] Mode: Inner [equals(orders.city, accounts.city)]
     |   [Logical Profile]: [est. #rows=1.33118e+08, source: estimated]
      \_[2] [StoredTable] Name: "orders"
      |     [Types]: orders.id: integer not null, orders.price: double precision not null, orders.city: text not null
      |     [Logical Profile]: [est. #rows=1.5e+06, source: hinted]
      \_[3] [StoredTable] Name: "accounts"
            [Types]: accounts.id: integer not null, accounts.city: text not null
            [Logical Profile]: [est. #rows=10000, source: hinted]
  ```
</CodeGroup>

## Enable type based NDV heuristic

The `enable_type_based_ndv_heuristic` setting enables the query optimizer to assume the number of distinct values (NDV) for columns based on their data types.

When enabled:

* Boolean columns are assumed to have 2 distinct values.
* Date columns are assumed to have 3650 distinct values (10 years of dates) if the row count is more than 13,322,500.

<Note>
  The heuristic makes the optimizer estimate selectivity towards common cases rather than edge cases. It does not guarantee the estimated values are strictly better. For example, it might seem that the feature strictly improves the filter selectivity on boolean columns. However, the actual data in a boolean column might be extremely skewed that the default wild guess is actually closer to reality. In that case, use [automated column statistics](/performance-and-observability/query-planning/automated-column-statistics) instead.
</Note>

**Default value:** `true`

This is an opt-out feature. If a regression is observed due to the type-based NDV heuristic, turn the feature off.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET enable_type_based_ndv_heuristic = [true|false];
```

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT ... WITH (enable_type_based_ndv_heuristic = [true|false]);
```

### Example

The following example shows how the heuristic works. The `orders` table has more than 13,322,500 rows.

* When the feature is enabled, the optimizer assumes `cancelled` column has 2 distinct values and `order_date` column has 3650 distinct values. The filter is then estimated to select half of the rows.
* When the feature is disabled, the optimizer does not assume any column NDVs. The selectivity of the filter is a wild guess.

<CodeGroup>
  ```sql Example case theme={"theme":{"light":"github-light","dark":"github-dark"}}
  -- 15000000 rows
  CREATE FACT TABLE orders (
     id INT NOT NULL,
     price DOUBLE,
     cancelled BOOL,
     order_date DATE
  );

  EXPLAIN (STATISTICS)
  SELECT id FROM orders WHERE NOT cancelled;
  ```

  ```text Plan with enable_type_based_ndv_heuristic = true {4,7} theme={"theme":{"light":"github-light","dark":"github-dark"}}
  [0] [Projection] orders.id, orders.order_date
  |   [Logical Profile]: [est. #rows=7.5e+06, column profiles={[orders.order_date: #distinct=3650]}, source: estimated]
   \_[1] [Filter] equals(orders.cancelled, FALSE)
     |   [Logical Profile]: [est. #rows=7.5e+06, column profiles={[orders.cancelled: #distinct=1], [orders.order_date: #distinct=3650]}, source: estimated]
      \_[2] [StoredTable] Name: "orders"
            [Types]: orders.id: integer not null, orders.cancelled: boolean not null, orders.order_date: date not null
            [Logical Profile]: [est. #rows=1.5e+07, column profiles={[orders.cancelled: #distinct=2], [orders.order_date: #distinct=3650]}, source: hinted]
  ```

  ```text Plan with enable_type_based_ndv_heuristic = false {4,7} theme={"theme":{"light":"github-light","dark":"github-dark"}}
  [0] [Projection] orders.id, orders.order_date
  |   [Logical Profile]: [est. #rows=3873, source: estimated]
   \_[1] [Filter] equals(orders.cancelled, FALSE)
     |   [Logical Profile]: [est. #rows=3873, column profiles={[orders.cancelled: #distinct=1]}, source: estimated]
      \_[2] [StoredTable] Name: "orders"
            [Types]: orders.id: integer not null, orders.cancelled: boolean not null, orders.order_date: date not null
            [Logical Profile]: [est. #rows=1.5e+07, source: hinted]
  ```
</CodeGroup>

## Enable datetime function NDV heuristic

The `enable_date_time_func_ndv_heuristic` setting enables the query optimizer to infer the number of distinct values (NDV) for the output of some datetime functions.

When enabled, the heuristic is applied to the result of the following functions:

* [`to_yyyymm`](/reference-sql/functions-reference/date-and-time/to-yyyymm)
* [`to_yyyymmdd`](/reference-sql/functions-reference/date-and-time/to-yyyymmdd)
* [`date_trunc`](/reference-sql/functions-reference/date-and-time/date-trunc)
* [`extract`](/reference-sql/functions-reference/date-and-time/extract)

The general heuristic is that a big data set contains about 10 years of data. Based on that, the NDV of the result of `to_yyyymm` is assumed to be 120, for example.

Some NDV inferences are independent from the above heuristic. For example, the NDV of the result of `extract(quarter from some_column)` is assumed to be 4. This is purely based on the semantics of the function, not the data size.

<Note>
  The heuristic makes the optimizer estimate selectivity towards common cases rather than edge cases. It does not guarantee the estimated values are strictly better.
</Note>

**Default value:** `true`

This is an opt-out feature. If a regression is observed due to the NDV heuristic for datetime functions, turn the feature off.

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET enable_date_time_func_ndv_heuristic = [true|false];
```

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT ... WITH (enable_date_time_func_ndv_heuristic = [true|false]);
```

### Example

The following example shows how the heuristic works.

* When the feature is enabled, the optimizer assumes
  * `to_yyyymm(order_date)` produces 120 distinct values (10 years with 12 months each).
  * `date_trunc('year', order_date)` produces 10 distinct values (10 years).
  * `extract(dow FROM order_date)` produces 7 distinct values (a week has 7 days).
* When the feature is disabled, the optimizer does not assume any NDVs.

<CodeGroup>
  ```sql Example case theme={"theme":{"light":"github-light","dark":"github-dark"}}
  -- 15000000 rows
  CREATE FACT TABLE orders (
     id INT NOT NULL,
     price DOUBLE,
     cancelled BOOL,
     order_date DATE
  );

  EXPLAIN (STATISTICS)
  SELECT id, to_yyyymm(order_date), date_trunc('year', order_date), extract(dow FROM order_date) FROM orders;
  ```

  ```text Plan with enable_date_time_func_ndv_heuristic = true {3} theme={"theme":{"light":"github-light","dark":"github-dark"}}
  [0] [Projection] orders.id, to_yyyymm_0: to_yearmonth(orders.order_date), date_trunc_0: timestampTrunc('year', orders.order_date), extractDow_0: extractDow(orders.order_date)
  |   [Types]: to_yyyymm_0: integer not null, date_trunc_0: date not null, extractDow_0: integer not null
  |   [Logical Profile]: [est. #rows=1.5e+07, column profiles={[to_yyyymm_0: #distinct=120], [date_trunc_0: #distinct=10], [extractDow_0: #distinct=7]}, source: estimated]
   \_[1] [StoredTable] Name: "orders"
         [Types]: orders.id: integer not null, orders.order_date: date not null
         [Logical Profile]: [est. #rows=1.5e+07, source: hinted]
  ```

  ```text Plan with enable_date_time_func_ndv_heuristic = false {3} theme={"theme":{"light":"github-light","dark":"github-dark"}}
  [0] [Projection] orders.id, to_yyyymm_0: to_yearmonth(orders.order_date), date_trunc_0: timestampTrunc('year', orders.order_date), extractDow_0: extractDow(orders.order_date)
  |   [Types]: to_yyyymm_0: integer not null, date_trunc_0: date not null, extractDow_0: integer not null
  |   [Logical Profile]: [est. #rows=1.5e+07, source: hinted]
   \_[1] [StoredTable] Name: "orders"
         [Types]: orders.id: integer not null, orders.order_date: date not null
         [Logical Profile]: [est. #rows=1.5e+07, source: hinted]
  ```
</CodeGroup>

## Forcing hash-shuffled joins

On multi-node engines, Firebolt automatically picks a suitable execution strategy for joins. Fundamentally, the choice is between broadcast joins and hash-shuffled joins.

* Broadcast joins replicate the build side on every node, which avoids having to reshuffle the probe side. This can improve the speed of query execution substantially if the probe side is much larger than the build side.
* Hash-shuffled joins reshuffle both inputs of the join by the join key. These plans are the most scalable and work well even when both inputs are very large.

Firebolt chooses broadcast joins when it knows that the build side is below 5 million rows or the probe side is at least 100 times larger than the build side, as well as for semi joins (`WHERE col IN (SELECT ...)`).
By setting `force_shuffle_hash_join=true`, users can override this decision-making and force the query engine to use a hash-shuffled join.
This setting is only applicable on multi-node engines.

````sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT * FROM orders
WHERE user_id IN (SELECT user_id FROM users WHERE email IS NOT NULL)
WITH (force_shuffle_hash_join=true);
## Marking queries as HBS training data

The `as_hbs_training_data` setting marks queries so that their execution metrics are collected for [History-based statistics](/performance-and-observability/query-planning/history-based-statistics).
When set to `true` for a session or for individual queries, those queries are used as training data to build an HBS snapshot.

**Default value:** `false`

### Syntax

```sql
SET as_hbs_training_data = [true|false];
````

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT ... WITH (as_hbs_training_data = [true|false]);
```

## Using an HBS snapshot for query planning

The `hbs_object_id` setting specifies which [History-based statistics](/performance-and-observability/query-planning/history-based-statistics) snapshot the optimizer should use when planning the next query.
Set it to the UUID returned by `CALL CREATE_HBS_OBJECT();` (see [History-based statistics](/performance-and-observability/query-planning/history-based-statistics#snapshot-creation)) or from `fb_catalog.public.history_based_statistics`.

**Default value:** not set (no HBS snapshot used)

### Syntax

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SET hbs_object_id = '<uuid>';
```

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
SELECT ... WITH (hbs_object_id = '<uuid>');
```
