Skip to main content
Firebolt is a real-time analytical database built for low-latency queries over large datasets on object storage. The harder you push on latency, the more it pays to know exactly how a scan turns a WHERE clause into the least possible work: which data it skips without reading, which it reads, and the order it filters in. This page is that reference. It describes how a managed table is laid out and the precise sequence a scan runs, and each stage links to the index that controls it. Everything here applies to Firebolt managed tables. External and Iceberg tables are scanned differently; see Iceberg and external data.

Object storage and local caching

Firebolt stores table data in object storage (such as Amazon S3) and keeps storage and compute fully separated. The data files are immutable and shared: any engine reads them straight from object storage, so adding, removing, or resizing compute moves no data and copies nothing between nodes. To keep reads fast, each node caches the data it touches on local SSD. The disk cache serves repeated reads at local-storage latency and evicts least-recently-used blocks when it fills, and the engine prefetches the byte ranges a scan is about to need so object-storage round trips overlap with computation instead of stalling it. Because files are never mutated in place, cached blocks never go stale, reads run lock-free, and transactions stay snapshot-isolated.

How a table is stored

Tablet

The unit of storage. A tablet is an immutable file in object storage holding a contiguous slice of a table’s rows, sorted on the table’s PRIMARY INDEX. It carries its own column data, sparse index, min/max statistics, and any data-skipping, inverted, or vector index files. Tablets are the unit of pruning, caching, and compaction. A tablet is never edited in place: an INSERT writes new tablets, and a DELETE adds a separate deletion mask rather than rewriting (see Writes, updates, and deletes). That immutability is what makes reads lock-free and transactions snapshot-isolated.

Granule

The smallest block of rows the engine can read or skip: a contiguous run within a tablet, 8,192 rows by default. The sparse primary index stores one key sample per granule boundary (a mark), so pruning narrows a scan to a set of granule ranges before any column bytes are read. A granule is also the unit at which min/max statistics are evaluated.

Mark

The boundary between two granules, and the entry the sparse index stores there: the primary-key values of the granule’s first row plus the offsets needed to locate its column data. Selecting which granules to read is a binary search over marks, not a per-row index traversal, so the index stays small enough to keep in memory even for tables with billions of rows.

Index granularity

The index_granularity storage parameter sets the maximum rows per granule. Default 8192; accepted values are powers of two from 128 to 8192. Lowering it makes pruning finer-grained, with fewer wasted rows per matched granule, at the cost of a larger index and more per-granule overhead. See Advanced option: index granularity.

Writes, updates, and deletes

Because tablets are immutable, Firebolt never edits data in place. An INSERT writes new tablets. A DELETE rewrites no data: for each tablet it touches, it writes a deletion mask, a compact Roaring bitmap marking the deleted row positions. The mask is cumulative: a later delete on the same tablet starts from the current mask and adds its rows, so a tablet always has exactly one active mask, not a growing pile to combine. Firebolt retains the earlier versions so each transaction reads the mask that was current as of its snapshot, which is how deletes stay isolated. When a DELETE would remove every remaining row in a tablet, the engine writes no mask and drops the tablet as a pure metadata operation. An UPDATE is a delete of the old row plus an insert of the new one: the new value lands in a new tablet, and the old row is masked in its original tablet. Deletes and updates are therefore lightweight metadata commits, with no gigabyte-scale rewrites, no locks, and no effect on concurrent readers. A scan folds that one mask into the read, a pattern called merge-on-read: the bitmap seeds the row selection vector for each granule, so deleted rows drop out before any predicate runs (stage 5). There is a single mask per tablet to apply, never a set to OR together, and pushing it into the scan uses the same machinery as predicate pushdown, so it is cheap. Auto vacuum later rewrites the tablet to remove the masked rows and reclaim space. These Roaring bitmaps are a recurring primitive in the engine: the same structure backs inverted-index posting lists. Granule pruning, row filtering, and deletion all reduce to fast bitmap set operations, which is why a delete costs about the same whether it removes one row or a million.

Keeping storage optimized

Two things degrade scans over time, and compaction fixes both:
  • Streaming writes produce many small tablets. A scan then opens more files, and pruning is coarser because each tablet covers a wider range of values. Compacting them into fewer, larger tablets improves object-storage access and pruning.
  • Deletes and updates accumulate. Deleted rows stay in their tablets behind the deletion mask until a rewrite removes them, and the masking fragments the tablet. Compaction physically drops the masked rows.
Auto vacuum does this continuously on each engine, with no intervention: it merges small tablets toward a target size (up to tablet_max_size_bytes, 4 GiB by default) and rewrites tablets to drop deleted rows. You can also run VACUUM manually, for example to compact right after a large load or to backfill an index on existing data with VACUUM (REINDEX = TRUE). Storage optimization is resource-intensive, so we recommend running heavy manual vacuums on a dedicated engine. Firebolt decouples storage from compute, so several engines can operate on the same tablets at once and an operation on one never draws resources from another. A maintenance engine can compact tablets or backfill an index while the engine serving your application runs at full speed on the result. To give an engine entirely to queries, turn its background compaction off with ALTER ENGINE ... SET AUTO_VACUUM = OFF.

The scan pipeline

A scan applies a sequence of independent narrowing steps. The first four run during query planning and storage access and choose which granules to read: they touch only metadata and index files, never column data. The read steps then run per granule, seeding a row selection vector and refining it as they evaluate predicates, so a value is decoded only once the engine knows its row can survive. A final optimization, late materialization, can run above the scan. The predicates that drive the pruning steps come from your WHERE clause and, at run time, from join pruning: a selective join is turned into a filter on the table it reads, pruning the same tablets and granules (stages 1 and 2) that a hand-written WHERE key IN (...) would, and switching itself off when it would not help. Every step in this pipeline is exact: it can only remove tablets, granules, or rows, never add them, so results are identical with or without these indexes. They change only how much data is read and decoded. (The vector search index is the exception. It returns approximate nearest neighbors and is queried through a separate path, not this pipeline.)
Scan pipeline
=============

WHERE predicates + indexes + run-time join-pruning filters
  |
  |  == planning / storage access: metadata + index files only, no column data read yet ==
  |
All tablets in the table
  |
  |  [1] tablet pruning          (partition key + per-tablet min/max)  --> drop whole tablets
  v
Surviving tablets
  |
  |  [2] primary index pruning   (binary search over sparse-index marks)  --> drop granule ranges
  |  [3] data skipping (minmax)  (stored min/max per index granule)        --> drop more granules
  |  [4] inverted / text index   (row-level Roaring posting lists)         --> drop granules + seed a row filter
  v
Granules to read
  |
  |  == per-granule read: seed a row selection vector, then iteratively refine it, reading the least data possible ==
  |
  |  [5] seed selection vector    deletion mask + inverted row filter from [4]   --> known-live rows
  |
  |  [6] refine + read   (loop over predicates, most selective first):
  |        +--> read this predicate's columns --> evaluate --> narrow the selection vector --+
  |        |                                                                                 |
  |        +------------------------- next predicate, fewer rows each pass -----------------+
  |      then read the output columns for the surviving rows only
  v
[7] output chunk (+ virtual columns)
  |
  |  == above the scan, when it pays off ==
  |
  |  [8] late materialization    (e.g. rank on key columns, fetch wide payload for the top-k finalists only)
  v
rest of the query plan

Stage 1: Tablet pruning

Before any tablet is opened, the planner evaluates the query’s filters against two pieces of per-tablet metadata: the partition key, and the min/max value kept for every column in the tablet. Any tablet whose ranges cannot satisfy the predicate is dropped whole. This is the coarsest and cheapest step: it discards gigabytes by reading a few bytes of metadata. Per-tablet min/max statistics exist for every column automatically; you do not declare them. They are distinct from a data skipping index, which refines pruning within a tablet at granule resolution.

Stage 2: Primary index pruning

The primary index is per tablet, not a single global structure. Within each tablet that survived stage 1, rows are stored in the sort order of the PRIMARY INDEX, and a sparse index holds one key sample per granule (one entry per 8,192 rows by default). The engine turns the predicate into a range over the sorted key and finds the matching granules by binary search over those marks, so selecting them costs on the order of log2(rows_in_tablet / 8192) comparisons per tablet, negligible even at billions of rows. Only granule ranges that can contain a matching key survive. Because the rows are physically ordered on these columns, this is the most effective pruning step when your filters align with the primary index. A point lookup on the key does not search every tablet: stage 1 first rules out the tablets whose min/max range cannot hold the value, and the binary search then runs only in the few that remain. The predicate here does not have to come from the query text. When a primary-index column is a join key, join pruning supplies the set of matching values from the other side of the join at run time, and this stage prunes on it exactly as it would on a literal WHERE key IN (...).

Stage 3: Data skipping (minmax) index

For predicates on columns that are not the primary index, a data skipping index stores the min/max of an indexed column or expression per index granule. The optimizer rewrites each supported comparison into a condition over those stored bounds and drops any granule whose [min, max] range provably cannot match. It is most effective on columns whose values land together in the sort order, such as a load timestamp, where it rules out long runs of granules.

Stage 4: Inverted and text index

For exact token membership (has_all_tokens) and substring search (search), an inverted index maps each token (or, for full-text search, each n-gram) to a Roaring bitmap of the row IDs that contain it. This is finer-grained than the granule-level steps above: intersecting the relevant posting lists yields the exact set of matching rows, which both drops granules with no matches and seeds a row-level filter that the reader reuses at stage 6, so matching rows are never re-evaluated. When index files live in object storage, these lookups issue posting-list reads concurrently and cancel in-flight reads the moment the running intersection becomes empty.

Stage 5: Seed the selection vector

The reader builds a per-granule selection vector, the set of rows still in play, and seeds it from what the earlier steps already established. The tablet’s one deletion mask removes deleted rows, and if an inverted or text index matched, its row-level bitmap restricts the granule to the rows that contain the searched tokens. Both are Roaring bitmaps, so seeding is a bitmap intersection. Every later step only narrows this vector.

Stage 6: Refine the selection and read

Firebolt does not read every referenced column up front and then filter. It reads columns lazily and refines the selection vector in steps. The optimizer orders the predicates from most to least selective and walks them one at a time: read only the columns the current predicate needs, evaluate it against the rows still alive, and narrow the selection vector. Each pass carries the smaller vector into the next predicate, so a complex conjunction loads and filters iteratively, reading less on every step. Only once the surviving rows are known does the engine read the output columns, for those rows alone. A granule eliminated by an early, cheap predicate never pays to decode the expensive columns. This in-scan filtering always runs, and is itself a basic form of late materialization. Firebolt takes the idea further at the plan level (stage 8).

Stage 7: Output

Surviving rows are assembled into an output chunk, any internal row-identifier columns the plan asked for (used for deletes and late materialization) are attached, and the chunk leaves the scan.

Stage 8: Late materialization

Stages 5 and 6 live inside the scan and always run. Late materialization is a further, plan-level optimization that the optimizer applies when it pays off, most visibly for top-k queries (ORDER BY ... LIMIT k). Instead of carrying wide columns through the sort, the plan ranks on the few columns the ORDER BY needs, takes the top k rows, and only then fetches the wide payload for those k. Because it sits above the scan instead of being folded into it, it can skip work the in-scan filtering (stages 5 and 6) cannot. See Late materialization.

Confirming what fired

EXPLAIN (ANALYZE) annotates the scan operator with the index metadata the planner attached and the per-step granule counts it achieved. The most useful signals:
SignalMeaning
granules: X/YX granules were read out of Y candidates after all pruning steps.
primary index pruned granules: NGranules dropped by stage 2.
data skipping index pruned granules: NGranules dropped by stage 3.
inverted index pruned granules: NGranules dropped by stage 4.
[DataSkippingIndexes], [InvertedIndexFilters], [TextSearchFilters]Confirm the optimizer attached a given index to the scan.
Each index page shows the exact annotations to look for and how to read them: