Skip to main content

Overview

Inverted indexes in Firebolt are general-purpose inverted indexes that map discrete tokens to the rows that contain them. They accelerate queries that filter on exact token membership, such as finding log lines tagged "error", matching rows whose tags array contains both "urgent" and "billing", or filtering events by a lowercased hostname. Under the hood, every token’s posting list is stored as a compressed Roaring bitmap, giving the engine a compact, set-operation-friendly representation that scales from a handful of rows to billions. Inverted-index filtering is stage 4 of the scan pipeline: it both prunes granules and supplies a row bitmap that the reader reuses to filter rows.
Inverted indexes are designed for exact token matching, not fuzzy or substring search. If you need substring or pattern-based search, use a full-text search index with n-grams instead.

Key capabilities

  • Granule pruning: entire granules are skipped when no row in the range matches, cutting I/O.
  • Row-level filtering: within qualifying granules, only matching rows are passed to downstream operators, cutting CPU work.
  • AND semantics: when you supply multiple tokens, the index intersects their posting lists and returns only rows that contain all of them.
  • Expression-based indexing: index the output of any scalar expression, such as LOWER(tags) or a struct field like event.type.
  • ARRAY(TEXT) support: each array element is indexed as a separate token, enabling efficient multi-value tag filtering.
  • Async cloud I/O: index lookups against cloud storage use coroutine-based parallelism to overlap multiple S3 reads, minimizing latency.

When to use

Use caseExample
Tag / label filteringWHERE has_all_tokens(tags, ARRAY['billing', 'error'])
Log-level searchWHERE has_all_tokens(level, 'ERROR')
Case-insensitive matchingIndex LOWER(hostname), query has_all_tokens(LOWER(hostname), 'web-01')
Struct field filteringIndex event.category, query has_all_tokens(event.category, 'click')
High-cardinality column pruningPrune tablets for a column with millions of distinct string values

Quick start

-- 1. Create a table
CREATE TABLE events (
  id    INT,
  level TEXT,
  tags  ARRAY(TEXT NOT NULL) NOT NULL,
  body  TEXT
) PRIMARY INDEX id;

-- 2. Create a text index on the tags column
CREATE INDEX idx_tags ON events USING INVERTED_INDEX(tags);

-- 3. Load data
INSERT INTO events VALUES
  (1, 'INFO',  ['billing', 'success'],        'Payment processed'),
  (2, 'ERROR', ['billing', 'timeout'],         'Gateway timeout'),
  (3, 'WARN',  ['auth', 'rate-limit'],         'Too many attempts'),
  (4, 'ERROR', ['billing', 'timeout', 'retry'],'Retry after timeout');

-- 4. Query: find all billing errors
SELECT id, body
FROM events
WHERE has_all_tokens(tags, ARRAY['billing', 'timeout']);
id | body
---+---------------------
 2 | Gateway timeout
 4 | Retry after timeout

SQL reference

CREATE INDEX

CREATE INDEX [IF NOT EXISTS] <index_name>
  ON <table_name>
  USING INVERTED_INDEX(<expression>);
ParameterDescription
<index_name>Unique name for the index across all tables, indexes, and views in the database.
<table_name>The table to index.
<expression>A column reference or scalar expression that returns TEXT or ARRAY(TEXT).
Supported expression types:
Expression typeExampleBehavior
Plain columnINVERTED_INDEX(tags)Each cell value (or array element) becomes a token.
Scalar functionINVERTED_INDEX(LOWER(hostname))The function is evaluated at ingest time; query predicates must use the same expression.
Struct fieldINVERTED_INDEX(event.category)Indexes a nested struct field directly.
You can create multiple inverted indexes on the same table with different expressions. Each index must have a unique name.
The tokenizing expression must return TEXT or ARRAY(TEXT). Attempting to create an index on a non-text type results in an error:
ERROR: Inverted index tokenizing expression must return TEXT or ARRAY(TEXT). Got: INTEGER
Creating an index on an existing table with data: When you create an inverted index on a table that already contains data, existing tablets do not automatically have index data. Run VACUUM with REINDEX=TRUE to rebuild the index for pre-existing tablets:
CREATE INDEX idx_tags ON my_table USING INVERTED_INDEX(tags);
VACUUM (REINDEX=TRUE) my_table;
All subsequent inserts automatically populate the index.

has_all_tokens

has_all_tokens is the query-time function that evaluates token membership against an inverted index (or falls back to a full scan when no matching index exists).
-- Single token
has_all_tokens(<expression>, <token>)

-- Multiple tokens (AND semantics)
has_all_tokens(<expression>, ARRAY[<token1>, <token2>, ...])

-- With filter mode
has_all_tokens(<expression>, <token>, filter_mode => '<mode>')
ParameterDescription
<expression>A TEXT or ARRAY(TEXT) column or expression. Must match the indexed expression exactly for the index to be used.
<token> / ARRAY[...]One or more constant string tokens. All supplied tokens must be present in the row for it to match (AND semantics).
filter_modeOptional. 'filter' (default): prunes granules and filters rows. 'prune_only': prunes granules but does not apply a row-level filter, trading precision for speed.
Behavior with different column types:
First argument typeSemantics
TEXTThe entire cell value is treated as a single token. A single-token query checks equality; a multi-token query can only match if there is exactly one token and it equals the cell value.
ARRAY(TEXT)Each array element is a separate token. A multi-token query returns true when every queried token appears as an element somewhere in the array.
NULL handling: NULL values in the indexed column never match any token. has_all_tokens(NULL, 'x') returns NULL.

DROP INDEX

DROP INDEX [IF EXISTS] <index_name>;
Dropping an inverted index is a metadata-only operation. To reclaim the physical storage occupied by index files, run VACUUM on the table after dropping the index.
You cannot drop a table while inverted indexes still depend on it. Either drop the indexes first, or use DROP TABLE ... CASCADE to drop the table and all its dependent indexes in one statement.

Capabilities

Granule pruning

When has_all_tokens appears in a top-level AND conjunction, the query planner attaches [InvertedIndexFilters] metadata to the read_tablets scan operator. At execution time, the storage layer uses the index to skip entire granules whose posting lists have no matching rows, significantly reducing the amount of data read from disk or cloud storage.
Pushdown is only applied when has_all_tokens appears in a top-level AND conjunction. The index is not used for pruning when the predicate is inside an OR, behind a NOT, or nested in a subquery.

Prune-only mode

By default, has_all_tokens both prunes non-matching granules and applies a row-level filter. When the token is highly selective, the row-level filter may be redundant. Use filter_mode => 'prune_only' to skip the row-level filter and let downstream operators handle any false positives:
SELECT *
FROM events
WHERE has_all_tokens(tags, 'billing', filter_mode => 'prune_only')
  AND id > 100;
In prune-only mode, the has_all_tokens predicate is removed from the Filter node in the query plan. Only the remaining predicates (for example, id > 100) are evaluated per row.

Expression-based indexing

You can index any scalar expression, not just a bare column. The expression is evaluated once at ingest time, and the resulting tokens are stored in the index. At query time, the predicate’s expression must match the indexed expression exactly for the planner to push down the filter:
-- Create an index on the lowercased column
CREATE INDEX idx_lower_host ON logs USING INVERTED_INDEX(LOWER(hostname));

-- Query must use the same expression
SELECT * FROM logs WHERE has_all_tokens(LOWER(hostname), 'web-01');

Struct field indexing

You can index a text field nested inside a STRUCT column:
CREATE TABLE traces (
  id INT,
  event STRUCT(category TEXT, severity TEXT)
) PRIMARY INDEX id;

CREATE INDEX idx_category ON traces USING INVERTED_INDEX(event.category);

SELECT * FROM traces WHERE has_all_tokens(event.category, 'click');

Coexistence with other index types

Inverted indexes coexist with all other index types on the same table: primary indexes, aggregating indexes, full-text search indexes, and vector search indexes. Each index type serves a different access pattern, and the planner selects the appropriate one based on the query predicate.

Observability

information_schema.indexes

Use information_schema.indexes to inspect inverted index metadata:
SELECT
  index_name,
  index_type,
  index_definition,
  compressed_bytes,
  number_of_tablets
FROM information_schema.indexes
WHERE index_type = 'inverted_index';
ColumnDescription
index_nameThe name of the index.
index_type'inverted_index' for inverted indexes.
index_definitionThe full CREATE INDEX statement, including the indexed expression.
compressed_bytesTotal on-disk size of all index files across all tablets.
number_of_tabletsNumber of tablets that contain index data.

EXPLAIN (ANALYZE)

Use EXPLAIN (ANALYZE) to verify that the index is being used and to see how many granules were pruned:
EXPLAIN (ANALYZE)
SELECT id FROM events WHERE has_all_tokens(tags, 'billing');
Look for these indicators in the output:
[TableFuncScan] ...
|   $0 = read_tablets(table_name => events, tablet)
|   [InvertedIndexFilters] {index=idx_tags, tokens=[billing]}
|   [Execution Metrics]: output cardinality = ..., granules: 1/4,
|       inverted index pruned granules: 3
MetricMeaning
[InvertedIndexFilters]Confirms the planner pushed the predicate into the scan. Shows the index name and queried tokens.
granules: X/YX granules were read out of Y total.
inverted index pruned granules: NN granules were skipped because the index determined they contain no matching rows.
If [InvertedIndexFilters] does not appear, verify that:
  1. The has_all_tokens expression matches the indexed expression exactly.
  2. The predicate is in a top-level AND conjunction (not inside OR or NOT).

How it works

This section explains the internal architecture for users who want to understand the performance characteristics in depth.

Three-file on-disk layout

Each inverted index is stored as three files per tablet:
<index_name>.idx      Sparse index (maps token prefixes to dict block offsets)
<index_name>.dict     Dictionary (sorted tokens + posting list metadata)
<index_name>.posting  Posting lists (Roaring bitmaps with matching row IDs)
Lookup flow
===========

Query token
  |
  v
.idx (sparse index) -- binary search --> block offset
  |
  v
.dict (dictionary block) -- sequential scan --> posting offset / inline
  |
  v
.posting (Roaring bitmap) -- deserialize --> row ID set
  |
  v
Intersect across tokens --> final row IDs
Sparse index (.idx): A lightweight file that stores the first token of each dictionary block and the corresponding byte offset into the .dict file. The lookup begins with a binary search over these entries to locate the relevant dictionary block. Offsets are delta-encoded (v2) to minimize file size. Dictionary (.dict): Stores all tokens sorted alphabetically, grouped into blocks of 512 tokens. Each block is individually LZ4-compressed. For each token, the dictionary records its encoding type and a pointer into the posting file. Because blocks are independently compressed, a lookup reads and decompresses only the single block that may contain the target token. Posting (.posting): Contains the actual row ID sets. The encoding varies by posting list size to balance space efficiency with decode speed (see Tiered encoding below).

Tiered encoding

Not all posting lists are the same size, and using a single encoding for all of them would waste either space (for small lists) or CPU (for large lists). The index uses three encoding tiers:
TierConditionStorage locationFormat
Inline≤ 6 row IDsDirectly in the .dict blockVarUInt-encoded row IDs
SmallSorted7 to 12 row IDs.posting fileDelta-encoded VarUInt list
Roaring> 12 row IDs.posting fileFull Roaring64Map serialization
Inline entries avoid a round-trip to the posting file entirely, because the row IDs are embedded in the dictionary block itself. This is particularly effective for high-cardinality, low-frequency tokens (for example, unique request IDs) where most tokens appear in only a few rows. SmallSorted entries use delta encoding: the first row ID is stored as an absolute value, and each subsequent ID is stored as the difference from the previous one. This produces small VarUInt values that compress well. Roaring entries use the full Roaring64Map serialization, which internally switches between array, bitmap, and run-length containers depending on the density of each 2^16-element chunk. This ensures efficient storage across the full range of posting list sizes.

LZ4 dictionary compression

Each 512-token dictionary block is individually LZ4-compressed before being written to the .dict file. The file stores: VarUInt(compressed_size) || compressed_bytes for each block. This block-level compression scheme has two advantages:
  1. Selective decompression: only the block containing the target token is decompressed, avoiding wasted CPU on unrelated blocks.
  2. High compression ratio: tokens within a block tend to share common prefixes (because they are sorted), and LZ4 exploits this locality effectively.

Async coroutine-based cloud I/O

When index files reside in cloud storage (S3), the engine uses C++ coroutines (folly::coro) to issue non-blocking reads. The lookup flow for a multi-token query proceeds as follows:
  1. Read .idx file: a single small async read retrieves the entire sparse index.
  2. Locate dictionary blocks: binary search (CPU-only, no I/O) finds the block offset for each token.
  3. Read dictionary blocks in parallel: one async read per block, all launched concurrently.
  4. Read posting entries in parallel: for non-inline tokens, the exact byte range of each posting list is fetched concurrently.
  5. Eager intersection: as each bitmap arrives (in completion order, not submission order), it is immediately intersected with the running result. If the intersection becomes empty, all remaining in-flight I/O is cancelled.
Each async read suspends the calling coroutine until the buffer manager completes the I/O, allowing the thread to service other coroutines in the meantime. This achieves high throughput without dedicating a thread per I/O. Local caching: When index files are fully primed in the local layered storage (disk cache), the engine switches to POSIX-based reads, bypassing the S3 path entirely. This provides the lowest possible latency for repeated queries against the same tablets.

Write path

During data ingestion (INSERT), the index is built incrementally:
  1. Tokenization: for each row, the indexed expression is evaluated and tokens are extracted. For TEXT columns, each cell value is a single token. For ARRAY(TEXT) columns, each array element is a separate token.
  2. In-memory accumulation: tokens are collected in memory with their row IDs. To bound memory, accumulated entries are flushed to Roaring bitmaps when they reach 512 entries.
  3. Bitmap optimization: before serialization, all bitmaps are run-optimized and shrunk to minimize on-disk size.
  4. Sorted serialization: tokens are sorted alphabetically, then the posting file, dictionary, and sparse index are written in a single pass.
The write path is designed so that row IDs arrive in sorted order within each vector, enabling efficient addMany batch insertions into the Roaring bitmaps without per-element overhead.

Merge and VACUUM

When tablets are merged (via VACUUM or automatic background compaction), the inverted index is rebuilt from scratch for the merged tablet. If you create an inverted index on a table that already contains data, the existing tablets do not have index data until you run VACUUM (REINDEX=TRUE), which reconstructs the index for all tablets.