Overview
With automated column statistics enabled, the query planner can automatically extract and utilize statistical information from aggregating indexes to make more informed decisions during query optimization. This results in:- Better query plans - The planner can choose more optimal join orders, aggregation strategies, and execution paths
- Improved performance - Statistical information helps estimate cardinalities more accurately
- Minimal impact on query planning times - Statistics are cached in memory for fast access to reduce impact on query planning times to a minimum (usually only a few microseconds)
Configuration
Automated column statistics is controlled by the boolean query-level settingenable_automated_column_statistics
.
Default value: false
(opt-in feature)
How it works
Statistical Information Extraction
Whenenable_automated_column_statistics
is enabled, the query planner:
- Identifies relevant aggregating indexes - Scans for aggregating indexes on tables referenced in the query that provide statistical information
- Extracts statistics - Reads the precomputed statistics from the aggregating indexes into the in-memory cache
- Caches statistics in memory - Caches the extracted statistics in memory for fast subsequent access (within microseconds)
- Serves statistics to query planning – Statistics stored in the in-memory cache are served to the query planner, so it can make more informed decisions
- Maintains consistency - Automatically refreshes cached statistics when the underlying data changes
Performance Characteristics
- Initial read: Reading statistics from aggregating indexes typically takes tens of milliseconds
- Cached access: Subsequent queries access cached statistics within microseconds
- Automatic refresh: Cache is invalidated and refreshed after DML operations on the origin table (the table that the aggregating index is defined on)
Example Usage
Consider two tables with columns that are frequently used for grouping, filtering, or joining:Creating an Aggregating Index for Statistics
To enable automated column statistics collection for relevant columns, create an aggregating index:Using Automated Column Statistics
Let’s run a query with the aggregating index in place and the feature enabled.departments.dept_name
, departments.location
, employees.gender
, employees.title
, and employees.dept_id
.
These statistics help with further estimation.
Here are some key observations:
- The filter
[6] [Filter] (employees.title = 'Senior Developer')
is estimated to produce 6 rows because ACS provides an appropriate distinct count of 27 foremployees.title
. This estimation can be far off if data is skewed, but it often improves significantly over default estimations. - The filter
[9] [Filter] (departments.location = 'New York')
is estimated to produce only 3 rows because ACS provides an appropriate distinct count of 3 fordepartments.location
. This estimation can be far off if data is skewed, but it often improves significantly over default estimations. - The join
[4] [Join] Mode: Inner [(employees.dept_id = departments.dept_id)]
is estimated to produce 5 rows. The query planner is able to infer from the distinct count statistics ondepartments.dept_id
that this column is unique (and likely a primary key). Consequently, the join cannot produce more rows than are coming from the foreign key side. Because some rows fromdepartments
have been filtered out, the planner cannot guarantee that every row from the foreign key side will find a join partner and be produced. It performs a conservative estimation that one row will not be produced and the other 5 rows will be produced. - The aggregation
[2] [Aggregate] GroupBy: [employees.gender, departments.dept_name] Aggregates: [count_0: count(*), avg2_0: avg2(employees.salary)]
is estimated to produce 5 groups. While ACS suggests 2 distinct genders and 3 distinct department names, the product2 * 3 = 6
is bounded by the expected row count of 5. Hence, the planner predicts 5 groups.
Current Limitations
Currently, automated column statistics supports only distinct count statistics (APPROX_COUNT_DISTINCT
).
We plan to extend ACS by:
- Minimum and maximum value statistics for better range estimation
- Additional aggregate statistics from aggregating indexes
APPROX_COUNT_DISTINCT()
functions.
In the second milestone of ACS, we will enable users to define through DDL for which columns ACS should be provided.
This allows for a more declarative style of leveraging ACS and unburdens users from manual index creation, design, and management.
The syntax will be roughly
Best Practices
Index Design
-
Identify key columns - Maintain statistics with aggregating indexes for columns frequently used in:
GROUP BY
clausesWHERE
predicatesJOIN
conditions
- Balance overhead - While aggregating indexes provide benefits, they also have maintenance overhead. Always measure the effect on query performance, data ingestion, and storage cost.
Query Optimization
-
Enable selectively - Use
enable_automated_column_statistics = true
for queries that can benefit from better cardinality estimation - Monitor performance - Compare query plans and execution times with and without the feature enabled
- Combine with other optimizations - Use alongside other Firebolt optimization features for maximum benefit
Monitoring and Troubleshooting
Inspecting Statistics Usage
UseEXPLAIN
with the statistics
option to see how automated column statistics affect query planning:
Cache Behavior
- Cache invalidation - Statistics cache is automatically invalidated after DML statements
- Memory usage - Cached statistics consume memory; monitor engine memory usage (however, the amount of memory required should be negligible)
- Refresh timing - First query after data modification will refresh the cache
Related Topics
- Aggregating Index - Learn about creating and managing aggregating indexes
- Cardinality Estimation - Understand how statistics affect query planning
- Inspecting Query Plans - Learn to analyze query execution plans
- System Settings - Complete reference of Firebolt system settings