> ## Documentation Index
> Fetch the complete documentation index at: https://docs.firebolt.io/llms.txt
> Use this file to discover all available pages before exploring further.

<AgentInstructions>

## Submitting Feedback

If you encounter incorrect, outdated, or confusing documentation on this page, submit feedback:

POST https://docs.firebolt.io/feedback

```json
{
  "path": "/performance-and-observability/query-planning/automated-column-statistics/table-with-statistics",
  "feedback": "Description of the issue"
}
```

Only submit feedback when you have something specific and actionable to report.

</AgentInstructions>

> How to create a table that automatically collects and updates column statistics.

# Create tables with column statistics

[Automated Column Statistics](/performance-and-observability/query-planning/automated-column-statistics)
is a feature that enables Firebolt's query planner to leverage aggregating indexes to obtain statistical information about columns.
This enables the planner to make more informed decisions, potentially leading to better query plans and improved performance.
On this page, we explain how to create a table that automatically collects and updates statistics for the columns you specify.

## Creating tables with statistics

In the [create table](/reference-sql/commands/data-definition/create-fact-dimension-table) statement,
you can specify the columns you want to collect statistics for by adding the `STATISTICS` keyword to the column definition.

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE employees (name TEXT, dept_id INT STATISTICS);
```

Alternatively, as a short-hand, you can use `STATISTICS ALL` to collect statistics for all columns.

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE employees (name TEXT, dept_id INT, STATISTICS ALL);
```

Optionally, you can specify which type of statistics to collect:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE employees (name TEXT, dept_id INT STATISTICS (NDISTINCT, MINMAX));
CREATE TABLE employees (name TEXT, dept_id INT, STATISTICS (dept_id) TYPE NDISTINCT TYPE MINMAX);
```

If any query references a column that has statistics collected, the query planner will leverage these statistics during planning.
This can potentially lead to a better plans and therefore improved performance.

## Altering tables to add statistics

You can also [alter an existing table](/reference-sql/commands/data-definition/alter-table) to add all types of statistics for a column.

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
ALTER TABLE employees ADD STATISTICS (dept_id);
```

Alternatively, you can specify which type of statistics to collect:

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
ALTER TABLE employees ADD STATISTICS (dept_id) TYPE NDISTINCT TYPE MINMAX;
```

## Alternatives

You can also allow Firebolt to infer column statistics from existing aggregating indexes.
To do this, you have to manually create the aggregating indexes, and modify the queries to use the statistics.
This has the advantage of giving you more control over which queries use the statistics.
See [Using existing aggregating indexes](/performance-and-observability/query-planning/automated-column-statistics/user-created-indexes).

## Example

Consider two tables with columns that are frequently used for grouping, filtering, or joining.
Because of this, we configure the columns to collect statistics.

```sql theme={"theme":{"light":"github-light","dark":"github-dark"}}
CREATE TABLE employees
(
    name TEXT,
    date_of_birth DATE,
    gender TEXT STATISTICS,
    title TEXT STATISTICS,
    salary REAL,
    dept_id INT STATISTICS
);

CREATE TABLE departments
(
    dept_id INT STATISTICS,
    dept_name TEXT STATISTICS,
    location TEXT STATISTICS,
    budget REAL,
    manager_id INT
);
```

Because these columns are now configured to collect statistics, the query planner can leverage the statistics, in this case improving the query plan.
The following example shows how the statistics are used.

<CodeGroup>
  ```sql Example query theme={"theme":{"light":"github-light","dark":"github-dark"}}
  SELECT
      d.dept_name,
      e.gender,
      COUNT(*) as employee_count,
      AVG(e.salary) as avg_salary
  FROM employees e
  JOIN departments d ON e.dept_id = d.dept_id
  WHERE d.location = 'New York'
    AND e.title = 'Senior Developer'
  GROUP BY d.dept_name, e.gender
  ORDER BY employee_count DESC;
  ```

  ```plain EXPLAIN plan with statistics {9,13,17,20,24,27} theme={"theme":{"light":"github-light","dark":"github-dark"}}
  explain TEXT
  [0] [Projection] departments.dept_name, employees.gender, employee_count: count_0, avg_salary: avg2_0
  |   [Types]: employee_count: bigint not null, avg_salary: double precision null
  |   [Logical Profile]: [est. #rows=5, source: estimated]
   \_[1] [Sort] OrderBy: [count_0 Descending First, departments.dept_name Ascending Last, employees.gender Ascending Last]
     |   [Logical Profile]: [est. #rows=5, source: estimated]
      \_[2] [Aggregate] GroupBy: [employees.gender, departments.dept_name] Aggregates: [count_0: count(*), avg2_0: avg2(employees.salary)]
        |   [Types]: count_0: bigint not null, avg2_0: double precision null
        |   [Logical Profile]: [est. #rows=5, column profiles={[employees.gender: #distinct=2], [departments.dept_name: #distinct=3]}, source: estimated]
         \_[3] [Projection] employees.gender, employees.salary, departments.dept_name
           |   [Logical Profile]: [est. #rows=5, column profiles={[employees.gender: #distinct=2], [departments.dept_name: #distinct=3]}, source: estimated]
            \_[4] [Join] Mode: Inner [(employees.dept_id = departments.dept_id)]
              |   [Logical Profile]: [est. #rows=5, column profiles={[employees.gender: #distinct=2], [employees.dept_id: #distinct=3], [departments.dept_id: #distinct=3], [departments.dept_name: #distinct=3]}, source: estimated]
               \_[5] [Projection] employees.gender, employees.salary, employees.dept_id
               | |   [Logical Profile]: [est. #rows=6, column profiles={[employees.gender: #distinct=2], [employees.dept_id: #distinct=4]}, source: estimated]
               |  \_[6] [Filter] (employees.title = 'Senior Developer')
               |    |   [Logical Profile]: [est. #rows=6, column profiles={[employees.gender: #distinct=2], [employees.title: #distinct=1], [employees.dept_id: #distinct=4]}, source: estimated]
               |     \_[7] [StoredTable] Name: "employees"
               |           [Types]: employees.gender: text null, employees.title: text null, employees.salary: real null, employees.dept_id: integer null
               |           [Logical Profile]: [est. #rows=141, column profiles={[employees.gender: #distinct=2], [employees.title: #distinct=27], [employees.dept_id: #distinct=8]}, source: automated column statistics]
               \_[8] [Projection] departments.dept_id, departments.dept_name
                 |   [Logical Profile]: [est. #rows=3, column profiles={[departments.dept_id: #distinct=3], [departments.dept_name: #distinct=3]}, source: estimated]
                  \_[9] [Filter] (departments.location = 'New York')
                    |   [Logical Profile]: [est. #rows=3, column profiles={[departments.dept_id: #distinct=3], [departments.dept_name: #distinct=3], [departments.location: #distinct=1]}, source: estimated]
                     \_[10] [StoredTable] Name: "departments"
                           [Types]: departments.dept_id: integer null, departments.dept_name: text null, departments.location: text null
                           [Logical Profile]: [est. #rows=8, column profiles={[departments.dept_id: #distinct=8], [departments.dept_name: #distinct=6], [departments.location: #distinct=3]}, source: automated column statistics]
  ```
</CodeGroup>

We can observe that the logical profiles contain distinct counts for `departments.dept_name`, `departments.location`, `employees.gender`, `employees.title`, and `employees.dept_id`.
These statistics help with further estimation.
Here are some key observations:

* The filter `[6] [Filter] (employees.title = 'Senior Developer')` is estimated to produce 6 rows because ACS provides an appropriate distinct count of 27 for `employees.title`.
  This estimation can be far off if data is skewed, but it often improves significantly over default estimations.
* The filter `[9] [Filter] (departments.location = 'New York')` is estimated to produce only 3 rows because ACS provides an appropriate distinct count of 3 for `departments.location`.
  This estimation can be far off if data is skewed, but it often improves significantly over default estimations.
* The join `[4] [Join] Mode: Inner [(employees.dept_id = departments.dept_id)]` is estimated to produce 5 rows.
  The query planner is able to infer from the distinct count statistics on `departments.dept_id` that this column is unique (and *likely* a primary key).
  Consequently, the join cannot produce more rows than are coming from the foreign key side.
  Because some rows from `departments` have been filtered out, the planner cannot guarantee that every row from the foreign key side will find a join partner and be produced.
  It performs a conservative estimation that one row will not be produced and the other 5 rows will be produced.
* The aggregation `[2] [Aggregate] GroupBy: [employees.gender, departments.dept_name] Aggregates: [count_0: count(*), avg2_0: avg2(employees.salary)]` is estimated to produce 5 groups.
  While ACS suggests 2 distinct genders and 3 distinct department names, the product `2 * 3 = 6` is bounded by the expected row count of 5.
  Hence, the planner predicts 5 groups.
