Skip to main content
Automated column statistics (ACS) is a feature that enables Firebolt’s query planner to leverage aggregating indexes to obtain statistical information about columns, leading to better query plans and improved performance. For example, it provides distinct count statistics (also known as ndistinct or number of distinct values (NDV) statistics).

Overview

With automated column statistics enabled, the query planner can automatically extract and utilize statistical information from aggregating indexes to make more informed decisions during query optimization. This results in:
  • Better query plans - The planner can choose more optimal join orders, aggregation strategies, and execution paths
  • Improved performance - Statistical information helps estimate cardinalities more accurately
  • Minimal impact on query planning times - Statistics are cached in memory for fast access to reduce impact on query planning times to a minimum (usually only a few microseconds)
Statistics are refreshed under transactional consistency, so they are consistent with the underlying data at all times.

Usage

There are two ways of using automated column statistics:
  1. By requesting statistics when creating a table (which manages aggregating indexes automatically). When creating a table, you can request Firebolt to automatically collect statistics for the columns you specify: CREATE TABLE employees (name TEXT, dept_id INT STATISTICS). You can read more about this in Create tables with statistics.
  2. By configuring Firebolt to read already existing aggregating indexes. This has the advantage that you get the benefits of ACS without Firebolt having to manage additional aggregating indexes. However, the information that the planner can use depends on what it can extract from the aggregating indexes. This is controlled by the boolean query-level setting infer_statistics_from_indexes to true. You can read more about this in Use existing aggregating indexes.

How it works

Statistical Information Extraction

When automated column statistics are enabled, the query planner:
  1. Identifies relevant aggregating indexes - Scans for aggregating indexes on tables referenced in the query that provide statistical information
  2. Extracts statistics - Reads the precomputed statistics from the aggregating indexes into the in-memory cache
  3. Caches statistics in memory - Caches the extracted statistics in memory for fast subsequent access (within microseconds)
  4. Serves statistics to query planning – Statistics stored in the in-memory cache are served to the query planner, so it can make more informed decisions
  5. Maintains consistency - Automatically refreshes cached statistics when the underlying data changes

Performance Characteristics

  • Initial read: Reading statistics from aggregating indexes typically takes tens of milliseconds
  • Cached access: Subsequent queries access cached statistics within microseconds
  • Automatic refresh: Cache is invalidated and refreshed after DML operations on the origin table (the table that the aggregating index is defined on)

Best Practices

Index Design

  1. Identify key columns - Maintain statistics with aggregating indexes for columns frequently used in:
    • GROUP BY clauses
    • WHERE predicates
    • JOIN conditions
  2. Balance overhead - While aggregating indexes provide benefits, they also have maintenance overhead. Always measure the effect on query performance, data ingestion, and storage cost.

Query Optimization

  1. Enable selectively - Use infer_statistics_from_indexes = true for queries that can benefit from better cardinality estimation, or create selected tables with statistics.
  2. Monitor performance - Compare query plans and execution times with and without the feature enabled
  3. Combine with other optimizations - Use alongside other Firebolt optimization features for maximum benefit

Monitoring and Troubleshooting

Inspecting Statistics Usage

Use EXPLAIN with the statistics option to see how automated column statistics affect query planning:
SET infer_statistics_from_indexes = true; -- when using the manual approach

EXPLAIN (logical, statistics)
SELECT gender, COUNT(*)
FROM employees
GROUP BY gender;
Look for entries in the logical profile that indicate statistics source and distinct count estimates. You can find an explained query plan with statistics in Using Automated Column Statistics.

Cache Behavior

  • Cache invalidation - Statistics cache is automatically invalidated after DML statements
  • Memory usage - Cached statistics consume memory; monitor engine memory usage (however, the amount of memory required should be negligible)
  • Refresh timing - First query after data modification will refresh the cache