Infer from user created indexes

Automated Column Statistics is a feature that enables Firebolt’s query planner to leverage aggregating indexes to obtain statistical information about columns. This enables the planner to make more informed decisions, potentially leading to better query plans and improved performance. On this page, we explain how to create tables that automatically collect statistics.

Syntax

To enable inferring column statistics from existing aggregating indexes, use the query-level setting infer_statistics_from_indexes. The default value is false, so this is an opt-in feature.

SET infer_statistics_from_indexes = true;
SELECT ...;

Alternatives

You can also let Firebolt manage the aggregating indexes for you. This has the advantage that you do not have to manually change the query settings and always profit when using a table that has the feature enabled. See Create tables with statistics.

Example

Consider two tables with columns that are frequently used for grouping, filtering, or joining.

CREATE TABLE employees
(
    name TEXT,
    date_of_birth DATE,
    gender TEXT,
    title TEXT,
    salary REAL,
    dept_id INT
);

CREATE TABLE departments
(
    dept_id INT,
    dept_name TEXT,
    location TEXT,
    budget REAL,
    manager_id INT
);

To enable automated column statistics collection for relevant columns, create an aggregating index:

CREATE AGGREGATING INDEX employees_statistics
ON employees
(
    APPROX_COUNT_DISTINCT(gender),      -- for filtering by gender
    APPROX_COUNT_DISTINCT(title),       -- for filtering by title
    APPROX_COUNT_DISTINCT(dept_id)      -- for joining on department id
);

CREATE AGGREGATING INDEX departments_statistics
ON departments
(
    APPROX_COUNT_DISTINCT(dept_id),     -- for joining on department id
    APPROX_COUNT_DISTINCT(dept_name),   -- for grouping by department name
    APPROX_COUNT_DISTINCT(location)     -- for filtering by location
);

Finally, before running a query, enable the feature by calling SET infer_statistics_from_indexes = true;. Then you can see the following effect on the query plan:

SELECT
    d.dept_name,
    e.gender,
    COUNT(*) as employee_count,
    AVG(e.salary) as avg_salary
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
WHERE d.location = 'New York'
  AND e.title = 'Senior Developer'
GROUP BY d.dept_name, e.gender
ORDER BY employee_count DESC;

We can observe that the logical profiles contain distinct counts for departments.dept_name, departments.location, employees.gender, employees.title, and employees.dept_id. These statistics help with further estimation. Here are some key observations:

The filter [6] [Filter] (employees.title = 'Senior Developer') is estimated to produce 6 rows because ACS provides an appropriate distinct count of 27 for employees.title. This estimation can be far off if data is skewed, but it often improves significantly over default estimations.
The filter [9] [Filter] (departments.location = 'New York') is estimated to produce only 3 rows because ACS provides an appropriate distinct count of 3 for departments.location. This estimation can be far off if data is skewed, but it often improves significantly over default estimations.
The join [4] [Join] Mode: Inner [(employees.dept_id = departments.dept_id)] is estimated to produce 5 rows. The query planner is able to infer from the distinct count statistics on departments.dept_id that this column is unique (and likely a primary key). Consequently, the join cannot produce more rows than are coming from the foreign key side. Because some rows from departments have been filtered out, the planner cannot guarantee that every row from the foreign key side will find a join partner and be produced. It performs a conservative estimation that one row will not be produced and the other 5 rows will be produced.
The aggregation [2] [Aggregate] GroupBy: [employees.gender, departments.dept_name] Aggregates: [count_0: count(*), avg2_0: avg2(employees.salary)] is estimated to produce 5 groups. While ACS suggests 2 distinct genders and 3 distinct department names, the product 2 * 3 = 6 is bounded by the expected row count of 5. Hence, the planner predicts 5 groups.

What is Firebolt?

Overview

Performance and Observability

Guides

SQL reference

General reference

API reference

Syntax

Alternatives

Example

What is Firebolt?

Overview

Performance and Observability

Guides

SQL reference

General reference

API reference

​Syntax

​Alternatives

​Example

Syntax

Alternatives

Example