> ## Documentation Index
> Fetch the complete documentation index at: https://docs.firebolt.io/llms.txt
> Use this file to discover all available pages before exploring further.

> Skip granules that cannot match a predicate using min/max data skipping indexes.

# Data skipping index

## Overview

A data skipping index lets Firebolt skip [granules](/performance-and-observability/storage-and-indexing#granule) that are guaranteed to contain no row matching a query's filter, so they are never read from storage. It is the granule-level pruning step for columns that are *not* the primary index, [stage 3 of the scan pipeline](/performance-and-observability/storage-and-indexing#stage-3-data-skipping-minmax-index).

```text theme={"theme":{"light":"css-variables","dark":"css-variables"}}
Firebolt storage and scan pipeline reference (self-contained)...
Architecture: object storage (S3) holds immutable tablets; each node caches blocks on local SSD (LRU) and prefetches.
Tablet = immutable file, rows sorted in PRIMARY INDEX order, auto-vacuumed toward tablet_max_size_bytes (4 GiB default).
Granule = smallest readable block, 8192 rows by default (index_granularity, power of 2 in [128,8192]); sparse PI stores 1 key per granule.
Deletes/updates: no in-place rewrite. One cumulative per-tablet Roaring deletion mask (merge-on-read): a new delete starts from the current mask and adds rows, so a tablet has exactly one active mask, folded in at read time, never OR-ed; prior versions kept for snapshot isolation. Empty tablet is dropped as metadata. UPDATE = delete old row + insert new row (lands in a new tablet).
Roaring bitmaps back both deletion masks and inverted-index posting lists.
Scan pipeline (each step only removes tablets/granules/rows; exact, except approximate vector search which is a separate path):
  [1] tablet pruning     partition key + per-tablet min/max         -> drop whole tablets
  [2] primary index      binary search over sparse marks            -> drop granule ranges
  [3] data skipping      stored min/max per index granule (minmax)  -> drop granules
  [4] inverted/text      row-level Roaring posting lists            -> drop granules + produce a row filter
  [5] seed selection     deletion mask intersect inverted filter  -> initial live-row set per granule
  [6] refine + read      iterative predicate pushdown, most-selective first, lazy column reads -> narrow, read survivors
  [7] output
  [8] late materialization (plan-level, above the scan; rank on key cols, fetch wide payload for top-k finalists only)
join pruning (see query-planning/join-pruning): a join build side becomes a runtime pruning set (make_pruning_set) pushed into the probe-side scan; prunes tablets/granules like WHERE key IN (...); inner/right/semi joins; adaptive: the set is built at run time and discarded with no penalty if the build side is not selective, so it is applied freely.
This page is stage 3: per-granule min/max pruning for a non-primary-index column or expression. minmax is the only type.
Pruning needs the query predicate's expression to match the indexed expression exactly. GRANULARITY groups N data granules per min/max pair.
```

Firebolt currently supports one kind of data skipping index, the **`minmax`** index.
It records the **minimum** and **maximum** value of an indexed column (or expression) within each granule (or, with a higher [`GRANULARITY`](#choosing-a-granularity), each group of granules).
At query time, the engine compares the predicate in your `WHERE` clause against these precomputed bounds and skips any granule whose `[min, max]` range cannot contain a matching row.
It is most effective on columns whose values are clustered or roughly ordered within the table, so that filters can rule out large contiguous ranges of granules.

### When to use

| Use case                                   | Example                                    |
| :----------------------------------------- | :----------------------------------------- |
| Range filter on a non-primary-index column | `WHERE amount > 1000000`                   |
| Equality filter on a clustered column      | `WHERE status_code = 500`                  |
| Date / time range scans                    | `WHERE event_date < '2024-01-01'`          |
| `BETWEEN` and `IN` predicates              | `WHERE b BETWEEN 100000 AND 200000`        |
| Filtering on a derived value               | Index `a + b`, query `WHERE a + b > 50000` |

A `minmax` index gives little benefit when the indexed column's values are randomly distributed, because every granule then tends to span the whole value range and nothing can be skipped.

## How it relates to other pruning techniques

A `minmax` index is the **granule-level** pruning step for non-primary-index columns. It runs after the coarser tablet-level steps have chosen which tablets to open, and refines the work within them. For the full ordering of pruning, read, and filter stages, see [the scan pipeline](/performance-and-observability/storage-and-indexing#the-scan-pipeline); the table below summarizes how a `minmax` index relates to the other pruning mechanisms.

| Technique                                                                                        | Scope               | Driven by                                                                                                  | Applies to                                                                                       | Relationship to a `minmax` index                                                                                                                                                                                                                                                 |
| :----------------------------------------------------------------------------------------------- | :------------------ | :--------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Tablet pruning**                                                                               | Whole tablets       | Per-tablet min/max statistics kept for every column                                                        | Any column with a comparison predicate                                                           | Runs first, at a coarser scope. Also min/max-based, but per tablet, **not** the data skipping index. A `minmax` index then refines what survives, granule by granule.                                                                                                            |
| **Primary index pruning**                                                                        | Granules            | The sorted order of the [primary index](/performance-and-observability/storage-and-indexing/primary-index) | The `PRIMARY INDEX` key columns                                                                  | Both prune granules, but the primary index is far more effective because data is physically sorted on its keys. A `minmax` index brings granule pruning to non-sort-key columns, and works best when their values are still naturally clustered (for example, a load timestamp). |
| **Data skipping (`minmax`) index**                                                               | Granules            | Stored min/max per index granule                                                                           | Range / comparison predicates (`>`, `<`, `=`, `BETWEEN`, `IN`) on indexed columns or expressions | This index.                                                                                                                                                                                                                                                                      |
| **[Inverted index](/performance-and-observability/storage-and-indexing/inverted-index) pruning** | Granules **+ rows** | Per-value posting lists (Roaring bitmaps)                                                                  | Exact value membership                                                                           | A different access pattern: row-level membership versus granule-level bounds. Use it for point lookups, value lists, and narrow ranges; a `minmax` index for wide, open-ended, or continuous-type ranges. They coexist, and the planner picks the one that fits each predicate.  |

These mechanisms stack: one query can prune whole tablets from statistics, skip granules via the primary index on one column, and skip more via a `minmax` index on another.

## Quick start

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
-- 1. Create a table
CREATE TABLE sales (
  id     INT,
  region TEXT,
  amount BIGINT
) PRIMARY INDEX id;

-- 2. Create a minmax data skipping index on a non-primary-index column
CREATE INDEX dsi_amount ON sales
  USING SKIP_INDEX(amount) WITH (TYPE = minmax, GRANULARITY = 1);

-- 3. Load data
INSERT INTO sales SELECT i, 'us', i * 10 FROM generate_series(1, 30000) gs(i);

-- 4. Query: granules whose amount range is entirely <= 250000 are skipped
SELECT count(*) FROM sales WHERE amount > 250000;
```

## SQL reference

### CREATE INDEX

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
CREATE INDEX <index_name>
  ON <table_name>
  USING SKIP_INDEX(<expression>)
  WITH (TYPE = minmax, GRANULARITY = <granularity>);
```

| Parameter      | Description                                                                                                                                        |
| :------------- | :------------------------------------------------------------------------------------------------------------------------------------------------- |
| `<index_name>` | Unique name for the index across all tables, indexes, and views in the database.                                                                   |
| `<table_name>` | The `FACT` or `DIMENSION` table to index. External tables are not supported.                                                                       |
| `<expression>` | A single column reference or scalar expression of a comparable type (see [Supported types](#supported-types)).                                     |
| `TYPE`         | The index type. Only `minmax` is currently supported.                                                                                              |
| `GRANULARITY`  | A positive integer: how many consecutive data granules are summarized by one index granule. See [Choosing a granularity](#choosing-a-granularity). |

The keywords `SKIP_INDEX`, `TYPE`, `GRANULARITY`, and `minmax` are case-insensitive; this reference uses uppercase for the method and parameter keywords to match the other `USING` forms.
Both `TYPE` and `GRANULARITY` are required; supplying any other set of parameters is an error.

You may create multiple data skipping indexes on the same table, but **at most one index per expression**.
Each index must have a unique name.

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
-- One index per column is fine
CREATE INDEX dsi_b ON t USING SKIP_INDEX(b) WITH (TYPE = minmax, GRANULARITY = 1);
CREATE INDEX dsi_c ON t USING SKIP_INDEX(c) WITH (TYPE = minmax, GRANULARITY = 1);
```

#### Supported types

The indexed expression must resolve to one of the following comparable types (both nullable and non-nullable variants are allowed):

`INT`, `BIGINT`, `REAL`, `DOUBLE PRECISION`, `DECIMAL`, `DATE`, `TIMESTAMP`, `TIMESTAMPTZ`, `TEXT`, `BOOLEAN`, `BYTEA`.

<Warning>
  `ARRAY` and `STRUCT` expressions are not supported and produce an error:

  ```
  ERROR: Data skipping index expression must have a comparable type
  (int, bigint, real, double, decimal, date, timestamp, timestamptz,
  string, boolean, bytea). Got: ARRAY(INTEGER)
  ```

  For token membership on `TEXT` or `ARRAY(TEXT)` columns, use an [inverted index](/performance-and-observability/storage-and-indexing/inverted-index) instead.
</Warning>

#### Expression-based indexing

You can index any scalar expression, not just a bare column.
The expression is evaluated once at ingest time and its min/max are stored.
At query time the predicate's expression must match the indexed expression exactly for pruning to apply:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
-- Index a derived value
CREATE INDEX dsi_sum ON t USING SKIP_INDEX(a + b) WITH (TYPE = minmax, GRANULARITY = 1);
SELECT sum(a + b) FROM t WHERE a + b > 50000;   -- pruned

-- Index a function result on a text column c
CREATE INDEX dsi_lower ON t USING SKIP_INDEX(lower(c)) WITH (TYPE = minmax, GRANULARITY = 1);
SELECT * FROM t WHERE lower(c) = 'high';          -- pruned
SELECT * FROM t WHERE c = 'high';                 -- NOT pruned (expression differs)
```

#### Creating an index on a table that already has data

When you create a data skipping index on a table that already contains data, the existing tablets have no index data and are not pruned until they are rewritten.
Run [`VACUUM`](/reference-sql/commands/data-management/vacuum) with `REINDEX=TRUE` to backfill the index for pre-existing tablets:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
CREATE INDEX dsi_amount ON sales USING SKIP_INDEX(amount) WITH (TYPE = minmax, GRANULARITY = 1);
VACUUM (REINDEX=TRUE) sales;
```

All subsequent inserts automatically populate the index.

### DROP INDEX

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
DROP INDEX [IF EXISTS] <index_name>;
```

Dropping a data skipping index is a metadata operation.
To reclaim the physical storage occupied by index data, run `VACUUM` on the table afterward.

<Note>
  A column that is referenced by a data skipping index cannot be dropped while the index exists:

  ```
  ERROR: Cannot drop column 'c' because it is indexed by the following one or more
  data skipping indexes: dsi_c. Drop the data skipping indexes first.
  ```

  Renaming a column does **not** require dropping the index.
  The index follows the rename automatically.
</Note>

## Capabilities

### Supported predicates

The optimizer derives pruning bounds for the common comparison and range predicates.
For a predicate on an indexed column `b`, a granule is kept only when its `[min, max]` range could still contain a match:

| Predicate             | A granule is pruned when                             |
| :-------------------- | :--------------------------------------------------- |
| `b > k`               | `max(b) <= k`                                        |
| `b >= k`              | `max(b) < k`                                         |
| `b < k`               | `min(b) >= k`                                        |
| `b <= k`              | `min(b) > k`                                         |
| `b = k`               | `min(b) > k` or `max(b) < k`                         |
| `b BETWEEN lo AND hi` | `max(b) < lo` or `min(b) > hi`                       |
| `b IN (k1, k2, ...)`  | every value's range falls outside `[min(b), max(b)]` |

Multiple predicates combined with `AND` each contribute independent pruning conditions.

### NULL handling

`NULL` values never satisfy a comparison predicate.
A granule that contains only `NULL` values in the indexed column is pruned for predicates such as `b > 150`, because no non-null value exists that could match.

### Coexistence with other index types

Data skipping indexes coexist with all other index types on the same table: primary indexes, [aggregating indexes](/performance-and-observability/storage-and-indexing/aggregating-index), [inverted indexes](/performance-and-observability/storage-and-indexing/inverted-index), and [vector search indexes](/performance-and-observability/storage-and-indexing/vector-search-index).
Each serves a different access pattern, and their pruning combines within a single query.
See [How it relates to other pruning techniques](#how-it-relates-to-other-pruning-techniques) for how a `minmax` index fits alongside tablet, primary index, and inverted index pruning.

## Choosing a granularity

`GRANULARITY` controls how many consecutive **data granules** are summarized by a single **index granule** (a single min/max pair):

* **`GRANULARITY = 1`** stores one min/max pair per data granule (\~8,192 rows). This gives the finest-grained pruning and the most precise skipping, at the cost of a larger index.
* **Higher values** group more data granules under one min/max pair. The index is smaller, but pruning is coarser: a single matching row anywhere in the group prevents the whole group from being skipped.

Start with `GRANULARITY = 1` for the most aggressive pruning.
Increase it only if index size becomes a concern and your filters still prune effectively at a coarser resolution.

<Note>
  The size of a *data* granule is governed by the table's [`index_granularity`](/performance-and-observability/storage-and-indexing#index-granularity) (default 8,192 rows).
  The data skipping index's `GRANULARITY` then sets how many of those granules share one min/max summary.
</Note>

## Observability

### information\_schema.indexes

Use [`information_schema.indexes`](/reference-sql/information-schema/indexes) to list data skipping indexes. They report an `index_type` of `data_skipping`:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
SELECT index_name, index_type
FROM information_schema.indexes
WHERE table_name = 'sales'
ORDER BY index_name;
```

```
index_name TEXT, index_type TEXT
'dsi_amount',    'data_skipping'
'primary_sales', 'primary'
```

### EXPLAIN (ANALYZE)

Use `EXPLAIN (ANALYZE)` to confirm the index is used and to see how many granules were pruned. The scan operator gains two annotations and a pruning metric:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
EXPLAIN (ANALYZE) SELECT * FROM sales WHERE amount > 250000;
```

```
\_[3] [TableFuncScan] sales.id: $0.id, sales.region: $0.region, sales.amount: $0.amount
  |   $0 = read_tablets(table_name => sales, tablet)
  |   [DataSkippingIndexes] {index=dsi_amount, granularity=1}
  |   [DataSkippingConditions] ((max_dsi_amount <= 250000) IS DISTINCT FROM TRUE)
  |   [Types]: sales.id: integer null, sales.region: text null, sales.amount: bigint null
  |   [Execution Metrics]: ..., granules: 1/4, data skipping index pruned granules: 3
```

| Annotation / metric                      | Meaning                                                                                      |
| :--------------------------------------- | :------------------------------------------------------------------------------------------- |
| `[DataSkippingIndexes]`                  | Confirms the optimizer attached the index to the scan. Shows the index name and granularity. |
| `[DataSkippingConditions]`               | The derived bound(s) evaluated against the stored min/max columns.                           |
| `granules: X/Y`                          | `X` granules were read out of `Y` total.                                                     |
| `data skipping index pruned granules: N` | `N` granules were skipped by the index.                                                      |

If these annotations do not appear, verify that:

1. The predicate's expression matches the indexed expression exactly.
2. The table's tablets contain index data. Run `VACUUM (REINDEX=TRUE)` if the index was created after the data was loaded.

## How it works

During ingestion, as each data granule is written, Firebolt scans the indexed expression's values in that granule and records their minimum and maximum.
Every `GRANULARITY` data granules, these are flushed as one index granule's min/max pair and stored alongside the tablet's data.

At query time, the optimizer rewrites each supported `WHERE` predicate into an equivalent condition over the stored `min`/`max` columns (the `[DataSkippingConditions]` shown in `EXPLAIN`).
For every tablet the query reads, the storage layer evaluates those conditions per index granule and skips the granules whose `[min, max]` range provably cannot contain a match.
Only the surviving granules are decompressed and scanned.
Because the conditions only ever exclude granules that cannot contain a matching row, query results are identical with or without the index.
The index changes only how much data is read.