> ## Documentation Index
> Fetch the complete documentation index at: https://docs.firebolt.io/llms.txt
> Use this file to discover all available pages before exploring further.

> Speed up point lookups with inverted indexes in Firebolt.

# Inverted index

## Overview

Inverted indexes in Firebolt are general-purpose [inverted indexes](https://en.wikipedia.org/wiki/Inverted_index) that map discrete values to the rows that contain them. They accelerate queries that filter on exact value membership: a point lookup like `customer_id = 90218`, a value list like `status IN ('error', 'timeout')`, or a bounded range like `event_date BETWEEN DATE '2026-01-01' AND DATE '2026-01-07'`.

Once an index exists on a column, the planner uses it automatically. You write ordinary SQL, and `=`, `IN`, and bounded range (`BETWEEN`, `>= AND <=`) predicates over that column are pushed into the scan with no change to your queries. Index a scalar column (`BOOLEAN`, `INTEGER`, `BIGINT`, `NUMERIC`, `DATE`, `TIMESTAMP`, `TIMESTAMPTZ`) or a `TEXT` or `ARRAY(TEXT)` expression. For `ARRAY(TEXT)` columns, the [`has_all_tokens`](#multi-value-matching-with-has_all_tokens) function matches rows whose array contains every value you list.

Under the hood, every token's posting list is stored as a compressed [Roaring bitmap](https://roaringbitmap.org/), giving the engine a compact, set-operation-friendly representation that scales from a handful of rows to billions. Inverted-index filtering is [stage 4 of the scan pipeline](/performance-and-observability/storage-and-indexing#stage-4-inverted-and-text-index): it both prunes [granules](/performance-and-observability/storage-and-indexing#granule) and supplies a row bitmap that the reader reuses to filter rows.

```text theme={"theme":{"light":"css-variables","dark":"css-variables"}}
Firebolt storage and scan pipeline reference (self-contained)...
Architecture: object storage (S3) holds immutable tablets; each node caches blocks on local SSD (LRU) and prefetches.
Tablet = immutable file, rows sorted in PRIMARY INDEX order, auto-vacuumed toward tablet_max_size_bytes (4 GiB default).
Granule = smallest readable block, 8192 rows by default (index_granularity); sparse PI stores 1 key per granule.
Deletes/updates: no in-place rewrite. One cumulative per-tablet Roaring deletion mask (merge-on-read): a new delete starts from the current mask and adds rows, so a tablet has exactly one active mask, folded in at read time, never OR-ed; prior versions kept for snapshot isolation. Empty tablet is dropped as metadata. UPDATE = delete old row + insert new row (lands in a new tablet).
Roaring bitmaps back both deletion masks and inverted-index posting lists.
Scan pipeline (each step only removes tablets/granules/rows; exact, except approximate vector search which is a separate path):
  [1] tablet pruning     partition key + per-tablet min/max         -> drop whole tablets
  [2] primary index      binary search over sparse marks            -> drop granule ranges
  [3] data skipping      stored min/max per index granule (minmax)  -> drop granules
  [4] inverted/text      row-level Roaring posting lists            -> drop granules + produce a row filter
  [5] seed selection     deletion mask intersect inverted filter  -> initial live-row set per granule
  [6] refine + read      iterative predicate pushdown, most-selective first, lazy column reads -> narrow, read survivors
  [7] output
  [8] late materialization (plan-level, above the scan; rank on key cols, fetch wide payload for top-k finalists only)
join pruning (see query-planning/join-pruning): a join build side becomes a runtime pruning set (make_pruning_set) pushed into the probe-side scan; prunes tablets/granules like WHERE key IN (...); inner/right/semi joins; adaptive: the set is built at run time and discarded with no penalty if the build side is not selective, so it is applied freely.
This page is stage 4: exact value membership. Posting lists are row-level Roaring bitmaps (row IDs), so it is
finer-grained than minmax granule pruning: it drops granules with no matches and seeds the stage-5 row filter. Pushdown applies only
when the predicate is in a top-level AND (not under OR/NOT/subquery). filter_mode => 'prune_only' skips the row-level filter.
Indexed expression must match the query expression exactly. Full-text search (search()) uses the same on-disk format with n-gram tokens.
```

<Note>
  Inverted indexes are designed for **exact value matching**, not fuzzy or substring search. If you need substring or pattern-based search, use a [full-text search index](/performance-and-observability/storage-and-indexing/full-text-search-index) with n-grams instead.
</Note>

### Key capabilities

* **Automatic predicate matching**: with an index in place, ordinary `=`, `IN`, and bounded range (`BETWEEN`, `>= AND <=`) predicates over the column are pushed into the scan with no change to your SQL.
* **Text and scalar columns**: index scalar `BOOLEAN`, `INTEGER`, `BIGINT`, `NUMERIC`, `DATE`, `TIMESTAMP`, and `TIMESTAMPTZ` columns, or `TEXT` and `ARRAY(TEXT)` expressions.
* **Granule pruning**: entire [granules](/performance-and-observability/storage-and-indexing#granule) are skipped when no row in the range matches, cutting I/O.
* **Row-level filtering**: within qualifying granules, only matching rows are passed to downstream operators, cutting CPU work.
* **Membership semantics**: `=` matches a single value; `IN` and bounded ranges union posting lists, matching rows that hold *any* of the values; multi-value array matching intersects them, matching rows that contain *all* listed values.
* **Expression-based indexing**: index the output of any scalar expression, such as `LOWER(tags)` or a struct field like `event.type`.
* **ARRAY(TEXT) support**: each array element is indexed as a separate token, enabling efficient multi-value tag filtering.
* **Async cloud I/O**: index lookups against cloud storage use coroutine-based parallelism to overlap multiple S3 reads, minimizing latency.

### When to use

| Use case                          | Example                                                            |
| :-------------------------------- | :----------------------------------------------------------------- |
| Point lookup on a scalar column   | `WHERE customer_id = 90218`                                        |
| Value-list filtering              | `WHERE status IN ('error', 'timeout', 'rejected')`                 |
| Bounded date / integer range      | `WHERE event_date BETWEEN DATE '2026-01-01' AND DATE '2026-01-07'` |
| Case-insensitive matching         | Index `LOWER(hostname)`, query `WHERE LOWER(hostname) = 'web-01'`  |
| Struct field filtering            | Index `event.category`, query `WHERE event.category = 'click'`     |
| High-cardinality column pruning   | Prune tablets for a column with millions of distinct values        |
| Multi-value tag / label filtering | `WHERE has_all_tokens(tags, ARRAY['billing', 'error'])`            |

## Quick start

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
-- 1. Create a table
CREATE TABLE events (
  id          INT,
  customer_id INT,
  status      TEXT,
  event_date  DATE
) PRIMARY INDEX id;

-- 2. Create an inverted index on each column you filter on
CREATE INDEX idx_customer ON events USING INVERTED_INDEX(customer_id);
CREATE INDEX idx_status   ON events USING INVERTED_INDEX(status);
CREATE INDEX idx_date     ON events USING INVERTED_INDEX(event_date);

-- 3. Load data
INSERT INTO events VALUES
  (1, 90218, 'success', DATE '2026-01-03'),
  (2, 90218, 'timeout', DATE '2026-01-05'),
  (3, 90311, 'success', DATE '2026-01-06'),
  (4, 90218, 'error',   DATE '2026-02-10');

-- 4. Query as usual. The planner uses the indexes automatically, with no
--    change to your SQL:
SELECT id FROM events WHERE customer_id = 90218;
SELECT id FROM events WHERE status IN ('error', 'timeout');
SELECT id FROM events WHERE event_date BETWEEN DATE '2026-01-01' AND DATE '2026-01-07';
```

The last query returns:

```
id
----
 1
 2
 3
```

## SQL reference

### CREATE INDEX

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
CREATE INDEX [IF NOT EXISTS] <index_name>
  ON <table_name>
  USING INVERTED_INDEX(<expression>);
```

| Parameter      | Description                                                                                                         |
| :------------- | :------------------------------------------------------------------------------------------------------------------ |
| `<index_name>` | Unique name for the index across all tables, indexes, and views in the database.                                    |
| `<table_name>` | The table to index.                                                                                                 |
| `<expression>` | A column reference or scalar expression that returns `TEXT`, `ARRAY(TEXT)`, or a supported scalar type (see below). |

**Supported expression types:**

| Expression type | Example                           | Behavior                                                                                 |
| :-------------- | :-------------------------------- | :--------------------------------------------------------------------------------------- |
| Plain column    | `INVERTED_INDEX(tags)`            | Each cell value (or array element) becomes a token.                                      |
| Scalar function | `INVERTED_INDEX(LOWER(hostname))` | The function is evaluated at ingest time; query predicates must use the same expression. |
| Struct field    | `INVERTED_INDEX(event.category)`  | Indexes a nested struct field directly.                                                  |

You can create multiple inverted indexes on the same table with different expressions. Each index must have a unique name.

**Supported data types:**

The tokenizing expression must return one of the following. Equality (`=`) and `IN` matching applies to every supported type; bounded range matching (`BETWEEN`, `>= AND <=`) applies only to the discrete, unit-step ordered types (`INTEGER`, `BIGINT`, `DATE`, `BOOLEAN`).

| Type                       | Equality / `IN` | Bounded range |
| :------------------------- | :-------------: | :-----------: |
| `TEXT`, `ARRAY(TEXT)`      |       yes       |       no      |
| `BOOLEAN`                  |       yes       |      yes      |
| `INTEGER`, `BIGINT`        |       yes       |      yes      |
| `DATE`                     |       yes       |      yes      |
| `NUMERIC`                  |       yes       |       no      |
| `TIMESTAMP`, `TIMESTAMPTZ` |       yes       |       no      |

<Warning>
  The tokenizing expression must return `TEXT`, `ARRAY(TEXT)`, or a supported scalar type. Attempting to create an index on an unsupported type results in an error:

  ```
  ERROR: Inverted index tokenizing expression must return TEXT, ARRAY(TEXT), or a supported scalar type (BOOLEAN, INTEGER, BIGINT, NUMERIC, DATE, TIMESTAMP, TIMESTAMPTZ). Got: REAL
  ```
</Warning>

**Creating an index on an existing table with data:**

When you create an inverted index on a table that already contains data, existing tablets do not automatically have index data. Run [`VACUUM`](/reference-sql/commands/data-management/vacuum) with `REINDEX=TRUE` to rebuild the index for pre-existing tablets:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
CREATE INDEX idx_tags ON my_table USING INVERTED_INDEX(tags);
VACUUM (REINDEX=TRUE) my_table;
```

All subsequent inserts automatically populate the index.

### DROP INDEX

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
DROP INDEX [IF EXISTS] <index_name>;
```

Dropping an inverted index is a metadata-only operation. To reclaim the physical storage occupied by index files, run `VACUUM` on the table after dropping the index.

<Note>
  You cannot drop a table while inverted indexes still depend on it. Either drop the indexes first, or use `DROP TABLE ... CASCADE` to drop the table and all its dependent indexes in one statement.
</Note>

## Capabilities

### Automatic predicate matching

You do not call any special function to use an inverted index. When an index exists on a column, the planner recognizes ordinary comparison predicates against that column and pushes them into the scan, reusing the same posting-list machinery. This happens by default; you write the query you would write without an index.

| Predicate                                          | Probe                                                        | Matched types                          |
| :------------------------------------------------- | :----------------------------------------------------------- | :------------------------------------- |
| `col = lit`, `lit = col`                           | Single value, intersect mode (`match=all`)                   | All supported types                    |
| `col IN (lit, ...)`                                | Union of point lookups (`match=any`)                         | All supported types                    |
| `col BETWEEN lo AND hi`, `col >= lo AND col <= hi` | Bounded range enumerated to a value set, union (`match=any`) | `INTEGER`, `BIGINT`, `DATE`, `BOOLEAN` |

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
CREATE INDEX idx_customer ON orders USING INVERTED_INDEX(customer_id);
CREATE INDEX idx_date     ON orders USING INVERTED_INDEX(order_date);

-- All three use the index automatically, with no special syntax:
SELECT * FROM orders WHERE customer_id = 90218;
SELECT * FROM orders WHERE customer_id IN (90218, 90219, 90220);
SELECT * FROM orders WHERE order_date BETWEEN DATE '2026-01-01' AND DATE '2026-01-07';
```

**Type coercion.** Equality matching unifies the literal to the column type, so `b = 300` matches a `BIGINT` index even though `300` is written as an `INTEGER`.

**`IN` semantics.** `col IN (...)` is the union of point lookups: the index reads each value's posting list and unions them, returning rows that contain *any* listed value. `NULL` list elements are dropped (they can never match).

**Range matching is adaptive.** An inverted-index probe is a union of point lookups, so a range belongs on the index only when it covers few enough values to enumerate cheaply. The planner makes that judgment per query: it computes how many discrete values the range spans and routes the predicate to whichever access path wins, with no hint and no tuning.

* A **small, discrete range** over a unit-step ordered type (`INTEGER`, `BIGINT`, `DATE`, `BOOLEAN`) is enumerated into its exact set of values and probed against the index as a union of point lookups, giving row-level precision. `BETWEEN lo AND hi` is desugared to `col >= lo AND col <= hi`, so both spellings (and the reversed `lo <= col AND col <= hi`) behave identically. Exclusive bounds are normalized to the adjacent inclusive value (`col > 9 AND col <= 12` covers `10, 11, 12`), a single-point range folds to an equality probe, and an empty range (`lo > hi`) is dropped before the index is consulted.
* A **wide range** that would expand into a large value set stays off the index. Materializing thousands of point lookups would cost more than it saves, so the planner leaves it to the [min/max data-skipping index](/performance-and-observability/storage-and-indexing#stage-3-data-skipping-minmax-index), which prunes by bound without enumerating a single value. The switchover is automatic and bounded, so planning never blows up on a large range.
* An **open-ended comparison** (`col > 5`, `col < 100`, with no bound on the other side) spans an unbounded set and likewise falls to the min/max index.
* A range over a type with **no natural unit step** (`NUMERIC`, `TIMESTAMP`, `TIMESTAMPTZ`) is never enumerated: even a one-second timestamp window spans millions of microsecond ticks. These ranges are served by the min/max index; equality and `IN` on the same column still use the inverted index.

The result is identical whichever path runs; only the cost differs. The inverted index earns its keep on exact membership and tight ranges, the min/max index on wide and continuous ones, and the planner picks between them so you do not have to.

### Granule pruning

When a matched predicate (`has_all_tokens`, `=`, `IN`, or a bounded range) appears in a top-level `AND` conjunction, the query planner attaches `[InvertedIndexFilters]` metadata to the `read_tablets` scan operator. At execution time, the storage layer uses the index to skip entire [granules](/performance-and-observability/storage-and-indexing#granule) whose posting lists have no matching rows, significantly reducing the amount of data read from disk or cloud storage.

<Warning>
  Pushdown is **only** applied to predicates in a top-level `AND` conjunction. The index is **not** used for pruning when the predicate is inside an `OR`, behind a `NOT`, or nested in a subquery.
</Warning>

### Expression-based indexing

You can index any scalar expression, not just a bare column. The expression is evaluated once at ingest time, and the resulting tokens are stored in the index. At query time, the predicate's expression must match the indexed expression exactly for the planner to push down the filter:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
-- Create an index on the lowercased column
CREATE INDEX idx_lower_host ON logs USING INVERTED_INDEX(LOWER(hostname));

-- Query must use the same expression
SELECT * FROM logs WHERE LOWER(hostname) = 'web-01';
```

### Struct field indexing

You can index a field nested inside a `STRUCT` column:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
CREATE TABLE traces (
  id INT,
  event STRUCT(category TEXT, severity TEXT)
) PRIMARY INDEX id;

CREATE INDEX idx_category ON traces USING INVERTED_INDEX(event.category);

SELECT * FROM traces WHERE event.category = 'click';
```

### Coexistence with other index types

Inverted indexes coexist with all other index types on the same table: primary indexes, aggregating indexes, [full-text search indexes](/performance-and-observability/storage-and-indexing/full-text-search-index), and vector search indexes. Each index type serves a different access pattern, and the planner selects the appropriate one based on the query predicate.

## Multi-value matching with has\_all\_tokens

Equality, `IN`, and range predicates cover scalar columns automatically. For an `ARRAY(TEXT)` column, where a single row holds many values, use `has_all_tokens` to match rows whose array contains every value you list. It is also available on scalar columns when you want to call the index explicitly.

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
-- Single token
has_all_tokens(<expression>, <token>)

-- Multiple tokens: row matches only if it contains all of them
has_all_tokens(<expression>, ARRAY[<token1>, <token2>, ...])

-- With filter mode
has_all_tokens(<expression>, <token>, filter_mode => '<mode>')
```

| Parameter                | Description                                                                                                                                                             |
| :----------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `<expression>`           | A `TEXT` or `ARRAY(TEXT)` column or expression. Must match the indexed expression exactly for the index to be used.                                                     |
| `<token>` / `ARRAY[...]` | One or more constant string tokens. All supplied tokens must be present in the row for it to match.                                                                     |
| `filter_mode`            | Optional. `'filter'` (default): prunes granules *and* filters rows. `'prune_only'`: prunes granules but does not apply a row-level filter, trading precision for speed. |

For example, find rows tagged both `billing` and `timeout`:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
CREATE INDEX idx_tags ON events USING INVERTED_INDEX(tags);

SELECT id FROM events WHERE has_all_tokens(tags, ARRAY['billing', 'timeout']);
```

**Behavior with different column types:**

| First argument type             | Semantics                                                                                                                                                                                |
| :------------------------------ | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Scalar (`TEXT`, `INTEGER`, ...) | The entire cell value is treated as a single token. A single-token query checks equality; a multi-token query can only match if there is exactly one token and it equals the cell value. |
| `ARRAY(TEXT)`                   | Each array element is a separate token. A multi-token query returns `true` when every queried token appears as an element somewhere in the array.                                        |

**Scalar token typing:** for a scalar expression, the token type must match the expression type exactly (modulo nullability). `has_all_tokens` does not coerce: probing a `BIGINT` column requires `has_all_tokens(b, 300::bigint)`, not `has_all_tokens(b, 300)`. A `NULL` scalar token is rejected at planning time, because it has no meaningful index key to probe with. (Automatic `=` matching does coerce, so plain `b = 300` works without a cast.)

**NULL handling:** `NULL` values in the indexed column never match any token. `has_all_tokens(NULL, 'x')` returns `NULL`.

### Prune-only mode

By default, `has_all_tokens` both prunes non-matching granules and applies a row-level filter. When the token is highly selective, the row-level filter may be redundant. Use `filter_mode => 'prune_only'` to skip the row-level filter and let downstream operators handle any false positives:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
SELECT *
FROM events
WHERE has_all_tokens(tags, 'billing', filter_mode => 'prune_only')
  AND id > 100;
```

In prune-only mode, the `has_all_tokens` predicate is removed from the `Filter` node in the query plan. Only the remaining predicates (for example, `id > 100`) are evaluated per row.

## Observability

### information\_schema.indexes

Use [`information_schema.indexes`](/reference-sql/information-schema/indexes) to inspect inverted index metadata:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
SELECT
  index_name,
  index_type,
  index_definition,
  compressed_bytes,
  number_of_tablets
FROM information_schema.indexes
WHERE index_type = 'inverted_index';
```

| Column              | Description                                                          |
| :------------------ | :------------------------------------------------------------------- |
| `index_name`        | The name of the index.                                               |
| `index_type`        | `'inverted_index'` for inverted indexes.                             |
| `index_definition`  | The full `CREATE INDEX` statement, including the indexed expression. |
| `compressed_bytes`  | Total on-disk size of all index files across all tablets.            |
| `number_of_tablets` | Number of tablets that contain index data.                           |

### EXPLAIN (ANALYZE)

Use `EXPLAIN (ANALYZE)` to verify that the index is being used and to see how many granules were pruned:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
EXPLAIN (ANALYZE)
SELECT id FROM events WHERE customer_id = 90218;
```

Look for these indicators in the output:

```
[TableFuncScan] ...
|   $0 = read_tablets(table_name => events, tablet)
|   [InvertedIndexFilters] {index=idx_customer, tokens=[90218], match=all}
|   [Execution Metrics]: output cardinality = ..., granules: 1/4,
|       inverted index pruned granules: 3
```

| Metric                              | Meaning                                                                                                                                                      |
| :---------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `[InvertedIndexFilters]`            | Confirms the planner pushed the predicate into the scan. Shows the index name, the queried tokens, and the match mode. There is one entry per matched index. |
| `match=all` / `match=any`           | `all` intersects posting lists (from `has_all_tokens` and `=`); `any` unions them (from `IN` and range predicates).                                          |
| `granules: X/Y`                     | `X` granules were read out of `Y` total.                                                                                                                     |
| `inverted index pruned granules: N` | `N` granules were skipped because the index determined they contain no matching rows.                                                                        |

A pushed-down `=`, `IN`, or range predicate produces the same `[InvertedIndexFilters]` annotation. For example, `WHERE n IN (10, 20)` shows `{index=t_inv_n, tokens=[10,20], match=any}`, and `WHERE n BETWEEN 10 AND 12` shows `{index=t_inv_n, tokens=[10,11,12], match=any}`.

If `[InvertedIndexFilters]` does not appear, verify that:

1. The predicate's expression matches the indexed expression exactly.
2. The predicate is in a top-level `AND` conjunction (not inside `OR` or `NOT`).
3. For a range predicate, the type is enumerable (`INTEGER`, `BIGINT`, `DATE`, `BOOLEAN`) and the range is narrow; wide, open-ended, or continuous-type ranges are served by the min/max index instead.

## How it works

This section explains the internal architecture for users who want to understand the performance characteristics in depth.

### Three-file on-disk layout

Each inverted index is stored as three files per tablet:

```
<index_name>.idx      Sparse index (maps token prefixes to dict block offsets)
<index_name>.dict     Dictionary (sorted tokens + posting list metadata)
<index_name>.posting  Posting lists (Roaring bitmaps with matching row IDs)
```

```
Lookup flow
===========

Query token
  |
  v
.idx (sparse index) -- binary search --> block offset
  |
  v
.dict (dictionary block) -- sequential scan --> posting offset / inline
  |
  v
.posting (Roaring bitmap) -- deserialize --> row ID set
  |
  v
Intersect across tokens --> final row IDs
```

**Sparse index (.idx):** A lightweight file that stores the first token of each dictionary block and the corresponding byte offset into the `.dict` file. The lookup begins with a binary search over these entries to locate the relevant dictionary block. Offsets are delta-encoded (v2) to minimize file size.

**Dictionary (.dict):** Stores all tokens sorted in byte order (for text, this is alphabetical; for scalars, it is the order-preserving key encoding), grouped into blocks of 512 tokens. Each block is individually LZ4-compressed. For each token, the dictionary records its encoding type and a pointer into the posting file. Because blocks are independently compressed, a lookup reads and decompresses only the single block that may contain the target token.

**Posting (.posting):** Contains the actual row ID sets. The encoding varies by posting list size to balance space efficiency with decode speed (see [Tiered encoding](#tiered-encoding) below).

### Tiered encoding

Not all posting lists are the same size, and using a single encoding for all of them would waste either space (for small lists) or CPU (for large lists). The index uses three encoding tiers:

| Tier            | Condition       | Storage location              | Format                          |
| :-------------- | :-------------- | :---------------------------- | :------------------------------ |
| **Inline**      | ≤ 6 row IDs     | Directly in the `.dict` block | VarUInt-encoded row IDs         |
| **SmallSorted** | 7 to 12 row IDs | `.posting` file               | Delta-encoded VarUInt list      |
| **Roaring**     | > 12 row IDs    | `.posting` file               | Full Roaring64Map serialization |

**Inline** entries avoid a round-trip to the posting file entirely, because the row IDs are embedded in the dictionary block itself. This is particularly effective for high-cardinality, low-frequency tokens (for example, unique request IDs) where most tokens appear in only a few rows.

**SmallSorted** entries use delta encoding: the first row ID is stored as an absolute value, and each subsequent ID is stored as the difference from the previous one. This produces small VarUInt values that compress well.

**Roaring** entries use the full [Roaring64Map](https://roaringbitmap.org/) serialization, which internally switches between array, bitmap, and run-length containers depending on the density of each 2^16-element chunk. This ensures efficient storage across the full range of posting list sizes.

### LZ4 dictionary compression

Each 512-token dictionary block is individually LZ4-compressed before being written to the `.dict` file. The file stores: `VarUInt(compressed_size) || compressed_bytes` for each block. This block-level compression scheme has two advantages:

1. **Selective decompression**: only the block containing the target token is decompressed, avoiding wasted CPU on unrelated blocks.
2. **High compression ratio**: tokens within a block tend to share common prefixes (because they are sorted), and LZ4 exploits this locality effectively.

### Async coroutine-based cloud I/O

When index files reside in cloud storage (S3), the engine uses C++ coroutines ([folly::coro](https://github.com/facebook/folly/blob/main/folly/coro/README.md)) to issue non-blocking reads. The lookup flow for a multi-token query proceeds as follows:

1. **Read `.idx` file**: a single small async read retrieves the entire sparse index.
2. **Locate dictionary blocks**: binary search (CPU-only, no I/O) finds the block offset for each token.
3. **Read dictionary blocks in parallel**: one async read per block, all launched concurrently.
4. **Read posting entries in parallel**: for non-inline tokens, the exact byte range of each posting list is fetched concurrently.
5. **Eager intersection**: as each bitmap arrives (in completion order, not submission order), it is immediately intersected with the running result. If the intersection becomes empty, all remaining in-flight I/O is cancelled.

Each async read suspends the calling coroutine until the buffer manager completes the I/O, allowing the thread to service other coroutines in the meantime. This achieves high throughput without dedicating a thread per I/O.

**Local caching:** When index files are fully primed in the local layered storage (disk cache), the engine switches to POSIX-based reads, bypassing the S3 path entirely. This provides the lowest possible latency for repeated queries against the same tablets.

### Write path

During data ingestion (INSERT), the index is built incrementally:

1. **Tokenization**: for each row, the indexed expression is evaluated and tokens are extracted. For scalar columns, each cell value is a single token. For `ARRAY(TEXT)` columns, each array element is a separate token.
2. **In-memory accumulation**: tokens are collected in memory with their row IDs. To bound memory, accumulated entries are flushed to Roaring bitmaps when they reach 512 entries.
3. **Bitmap optimization**: before serialization, all bitmaps are run-optimized and shrunk to minimize on-disk size.
4. **Sorted serialization**: tokens are sorted in byte order, then the posting file, dictionary, and sparse index are written in a single pass.

The write path is designed so that row IDs arrive in sorted order within each vector, enabling efficient `addMany` batch insertions into the Roaring bitmaps without per-element overhead.

### Merge and VACUUM

When tablets are merged (via `VACUUM` or automatic background compaction), the inverted index is rebuilt from scratch for the merged tablet. If you create an inverted index on a table that already contains data, the existing tablets do not have index data until you run `VACUUM (REINDEX=TRUE)`, which reconstructs the index for all tablets.
