Syntax
In the create table statement, you can specify the columns you want to collect statistics for by adding theSTATISTICS keyword to the column definition.
STATISTICS ALL to collect statistics for all columns.
Alternatives
You can also allow Firebolt to infer column statistics from existing aggregating indexes. To do this, you have to manually create the aggregating indexes, and modify the queries to use the statistics. This has the advantage of giving you more control over which queries use the statistics. See Using existing aggregating indexes.Example
Consider two tables with columns that are frequently used for grouping, filtering, or joining. Because of this, we configure the columns to collect statistics.departments.dept_name, departments.location, employees.gender, employees.title, and employees.dept_id.
These statistics help with further estimation.
Here are some key observations:
- The filter
[6] [Filter] (employees.title = 'Senior Developer')is estimated to produce 6 rows because ACS provides an appropriate distinct count of 27 foremployees.title. This estimation can be far off if data is skewed, but it often improves significantly over default estimations. - The filter
[9] [Filter] (departments.location = 'New York')is estimated to produce only 3 rows because ACS provides an appropriate distinct count of 3 fordepartments.location. This estimation can be far off if data is skewed, but it often improves significantly over default estimations. - The join
[4] [Join] Mode: Inner [(employees.dept_id = departments.dept_id)]is estimated to produce 5 rows. The query planner is able to infer from the distinct count statistics ondepartments.dept_idthat this column is unique (and likely a primary key). Consequently, the join cannot produce more rows than are coming from the foreign key side. Because some rows fromdepartmentshave been filtered out, the planner cannot guarantee that every row from the foreign key side will find a join partner and be produced. It performs a conservative estimation that one row will not be produced and the other 5 rows will be produced. - The aggregation
[2] [Aggregate] GroupBy: [employees.gender, departments.dept_name] Aggregates: [count_0: count(*), avg2_0: avg2(employees.salary)]is estimated to produce 5 groups. While ACS suggests 2 distinct genders and 3 distinct department names, the product2 * 3 = 6is bounded by the expected row count of 5. Hence, the planner predicts 5 groups.