Skip to main content

Overview

Full-text search indexes in Firebolt accelerate substring matching on TEXT columns. Under the hood they are n-gram inverted indexes: at ingest time the engine breaks every cell value into overlapping character sequences of a fixed length (n), and maps each n-gram to the rows that contain it. At query time the search() function decomposes the search pattern into the same n-grams, looks up their posting lists, intersects them, and returns only the rows whose text contains the pattern, often skipping entire granules without scanning them. Like the inverted index it shares its format with, full-text search runs at stage 4 of the scan pipeline. Full-text search indexes share the same on-disk format and async I/O engine as inverted indexes. The key difference is tokenization:
Inverted indexFull-text search index
SyntaxUSING INVERTED_INDEX(expr)USING FULL_TEXT(column) WITH (ngram_size = N)
TokenizationEach cell value (or array element) is a single discrete tokenSliding n-gram window over the cell value
Query functionhas_all_tokens()search() (equivalent to LIKE '%pattern%')
Typical use caseExact tag / label lookupSubstring search, log grep, free-text filtering
search() is case-sensitive. If you need case-insensitive matching, store a lowercased copy of the column and index that instead. See the case-insensitive matching example below.

Key capabilities

  • N-gram granule pruning: entire granules are skipped when their n-gram posting lists prove no row can contain the pattern.
  • Row-level filtering: within qualifying granules, only rows whose text actually contains the pattern are passed to downstream operators, cutting CPU work.
  • Automatic index selection: the planner picks the best index for each search() call. If you have multiple full-text search indexes with different ngram_size values on the same column, the planner chooses the largest n-gram size whose n-grams fit the search pattern.
  • Short-pattern fallback: when the search pattern is shorter than the n-gram size, the index cannot prune and the engine falls back to a full scan. Choose your n-gram size accordingly.
  • Unicode support: n-gram tokenization works on Unicode characters (not bytes), so Chinese, Hebrew, emoji, and other multi-byte text is indexed correctly.
  • Async cloud I/O: index lookups against cloud storage use coroutine-based parallelism to overlap multiple S3 reads, minimizing latency.

When to use

Use caseExample
Log searchWHERE search(message, 'timeout')
Free-text filteringWHERE search(description, 'blue widget')
Multi-column text searchWHERE search(name, 'foo') AND search(body, 'bar')
LIKE replacementsearch(col, 'pattern') is equivalent to col LIKE '%pattern%' with index acceleration
If your data consists of discrete tokens (tags, labels, enum values) rather than free text, an inverted index with has_all_tokens() is more efficient.

Quick start

-- 1. Create a table
CREATE TABLE articles (
  id    INT,
  title TEXT,
  body  TEXT
) PRIMARY INDEX id;

-- 2. Create a full-text search index (3-gram)
CREATE INDEX idx_title ON articles USING FULL_TEXT(title) WITH (ngram_size = 3);

-- 3. Load data
INSERT INTO articles VALUES
  (1, 'Introduction to Firebolt', 'Learn about Firebolt analytics'),
  (2, 'Query optimization tips',  'How to speed up your queries'),
  (3, 'Firebolt architecture',    'Deep dive into the engine');

-- 4. Substring search: find articles mentioning "Firebolt"
SELECT id, title
FROM articles
WHERE search(title, 'Firebolt');
id | title
---+--------------------------
 1 | Introduction to Firebolt
 3 | Firebolt architecture

SQL reference

CREATE INDEX

CREATE INDEX [IF NOT EXISTS] <index_name>
  ON <table_name>
  USING FULL_TEXT(<column>)
  WITH (ngram_size = <N>);
ParameterDescription
<index_name>Unique name for the index across all tables, indexes, and views in the database.
<table_name>The table to index.
<column>A TEXT column reference. Arbitrary expressions are not supported for full-text search indexes; use a plain column name.
<N>The n-gram size (positive integer). Smaller values match shorter patterns but produce larger indexes. Typical values range from 2 to 4.
The column must be of type TEXT. Attempting to create an index on a non-text column results in an error:
ERROR: First argument of text search index must be a text column. Got: INTEGER
Multiple indexes on the same column: You can create multiple full-text search indexes with different ngram_size values on the same column. The planner automatically selects the most selective index for each query:
CREATE INDEX idx_title_3 ON articles USING FULL_TEXT(title) WITH (ngram_size = 3);
CREATE INDEX idx_title_4 ON articles USING FULL_TEXT(title) WITH (ngram_size = 4);

-- Searching for 'fire' (4 chars): planner picks idx_title_4 (ngram=4)
-- Searching for 'fir' (3 chars): planner picks idx_title_3 (ngram=3)
-- Searching for 'fi' (2 chars): neither index can prune (pattern < smallest ngram)
Creating an index on an existing table with data: When you create a full-text search index on a table that already contains data, existing tablets do not automatically have index data. Run VACUUM with REINDEX=TRUE to build the index for pre-existing tablets:
CREATE INDEX idx_body ON articles USING FULL_TEXT(body) WITH (ngram_size = 3);
VACUUM (REINDEX=TRUE) articles;
All subsequent inserts automatically populate the index. search() tests whether a text value contains a given substring. It is semantically equivalent to column LIKE '%pattern%' but can leverage a full-text search index for granule pruning.
-- Basic usage
search(<column>, <pattern>)

-- With filter mode
search(<column>, <pattern>, filter_mode => '<mode>')
ParameterDescription
<column>A TEXT column or expression.
<pattern>A constant string literal. The function returns true if the column value contains this substring.
filter_modeOptional. 'filter' (default): prunes granules and filters rows. 'prune_only': prunes granules only. The search() predicate is removed from the filter, and false positives may reach downstream operators.
Return type: BOOLEAN (nullable if the input column is nullable). NULL handling: search(NULL, 'x') returns NULL. Empty pattern: search(col, '') matches all non-NULL rows (an empty string is a substring of everything). Case sensitivity: search() is case-sensitive. search(col, 'Hello') does not match 'hello world'. Special characters: Characters like %, _, and \ are treated as literals in the search pattern; no LIKE-style wildcards are applied. search() vs. LIKE: search(col, 'pattern') is functionally equivalent to col LIKE '%pattern%', but when a full-text search index exists on the column, search() pushes the n-gram lookup into the scan operator for granule pruning. If no matching index exists, search() falls back to the same behavior as LIKE.

DROP INDEX

DROP INDEX [IF EXISTS] <index_name>;
Dropping an index is a metadata-only operation. Run VACUUM on the table afterward to reclaim the physical storage occupied by index files.
You cannot drop a table while indexes still depend on it. Either drop the indexes first, or use DROP TABLE ... CASCADE to drop the table and all its dependent indexes in one statement.

Capabilities

N-gram granule pruning

When a search() predicate appears in a top-level AND conjunction, the planner attaches [TextSearchFilters] metadata to the read_tablets scan operator. At execution time, the engine decomposes the search pattern into n-grams, looks up each n-gram’s posting list in the index, intersects them, and skips any granule that has no matching rows.
Pushdown is only applied when search() appears in a top-level AND conjunction. The index is not used when the predicate is inside an OR, behind a NOT, or in a subquery that cannot be flattened.

Choosing an ngram_size

The n-gram size determines the shortest pattern that can leverage the index for pruning:
ngram_sizeMinimum useful pattern lengthTrade-off
22 charactersLarger index, more false positives per n-gram
33 charactersGood balance for most workloads
44 charactersSmaller index, fewer false positives, but short patterns fall back to scan
A good starting point is ngram_size = 3. If your typical search patterns are 4+ characters, consider ngram_size = 4 for a smaller index with fewer false-positive granule reads.

Case-insensitive matching

search() is case-sensitive. To support case-insensitive search, store a lowercased copy of the data in a separate column and index that:
ALTER TABLE articles ADD COLUMN title_lower TEXT
  DEFAULT LOWER(title);

CREATE INDEX idx_title_lower ON articles
  USING FULL_TEXT(title_lower) WITH (ngram_size = 3);

-- Case-insensitive search
SELECT * FROM articles WHERE search(title_lower, 'firebolt');

Using search() in JOINs and CTEs

search() works in any context where a boolean expression is accepted, including JOINs, CTEs, CASE expressions, and subqueries:
-- CTE with search filter
WITH matching_products AS (
  SELECT product_id, name
  FROM products
  WHERE search(name, 'Widget')
)
SELECT m.product_id, m.name, c.category_name
FROM matching_products m
JOIN product_categories pc ON m.product_id = pc.product_id
JOIN categories c ON pc.category_id = c.category_id;

Coexistence with other index types

Full-text search indexes coexist with all other index types on the same table: primary indexes, aggregating indexes, inverted indexes, and vector search indexes. Each index type serves a different access pattern, and the planner selects the appropriate one based on the query predicate.

Observability

information_schema.indexes

Use information_schema.indexes to inspect full-text search index metadata:
SELECT
  index_name,
  index_type,
  index_definition,
  compressed_bytes,
  number_of_tablets
FROM information_schema.indexes
WHERE index_type = 'text_search';
ColumnDescription
index_nameThe name of the index.
index_type'text_search' for full-text search indexes.
index_definitionThe full CREATE INDEX statement, including the indexed column and NGRAM_SIZE.
compressed_bytesTotal on-disk size of all index files across all tablets.
uncompressed_bytesTotal uncompressed size of all index files.
number_of_tabletsNumber of tablets that contain index data.

EXPLAIN (ANALYZE)

Use EXPLAIN (ANALYZE) to verify that the index is being used and to see how many granules were pruned:
EXPLAIN (ANALYZE)
SELECT id, title FROM articles WHERE search(title, 'Firebolt');
Look for these indicators in the output:
[TableFuncScan] ...
|   $0 = read_tablets(table_name => articles, tablet)
|   [TextSearchFilters] {column=title, index=idx_title, ngram=3, pattern=Firebolt}
|   [Execution Metrics]: output cardinality = 2, granules: 1/2,
|       inverted index pruned granules: 1
MetricMeaning
[TextSearchFilters]Confirms the planner pushed the predicate into the scan. Shows the index name, n-gram size, and search pattern.
granules: X/YX granules were read out of Y total.
inverted index pruned granules: NN granules were skipped because the n-gram index determined they contain no matching rows.
If [TextSearchFilters] does not appear, verify that:
  1. The search() column matches the indexed column exactly.
  2. The predicate is in a top-level AND conjunction (not inside OR or NOT).
  3. The search pattern length is at least as long as the index’s ngram_size.

How it works

This section explains the internal architecture for users who want to understand the performance characteristics in depth. Full-text search indexes share the same storage format and I/O engine as inverted indexes; the difference is in the tokenization strategy.

N-gram tokenization

At ingest time, the engine slides a window of size N across each text value and extracts every overlapping substring of length N (measured in Unicode characters, not bytes). For example, with ngram_size = 3 and the value "hello":
h-e-l  ->  "hel"
 e-l-l  ->  "ell"
  l-l-o  ->  "llo"
Each n-gram is stored as a token in the inverted index with a posting list of the rows that contain it. At query time, search(col, 'ello') is decomposed into its 3-grams (ell, llo) and the engine intersects their posting lists. Only rows that contain all n-grams are candidates. A final verification pass (LIKE '%ello%') confirms the match and eliminates false positives where the n-grams appear but not in the right order.

Three-file on-disk layout

Each full-text search index is stored as three files per tablet, identical to the inverted index file format:
<index_name>.idx      Sparse index (token prefixes to dict block offsets)
<index_name>.dict     Dictionary (sorted n-grams + posting list metadata)
<index_name>.posting  Posting lists (Roaring bitmaps with matching row IDs)
Lookup flow
===========

Query n-grams
  |
  v
.idx (sparse index) -- binary search --> block offset
  |
  v
.dict (dictionary block) -- sequential scan --> posting offset / inline
  |
  v
.posting (Roaring bitmap) -- deserialize --> row ID set
  |
  v
Intersect across n-grams --> final row IDs
The dictionary stores all distinct n-grams in sorted order, grouped into blocks of 512 tokens, each individually LZ4-compressed. Posting lists use the same tiered encoding (inline, SmallSorted, Roaring) as inverted indexes.

Async coroutine-based cloud I/O

When index files reside in cloud storage (S3), the engine uses C++ coroutines (folly::coro) to issue non-blocking reads. All n-gram lookups for a single search() call are launched concurrently, and as each bitmap arrives it is immediately intersected with the running result. If the intersection becomes empty, remaining in-flight I/O is cancelled early. Local caching: When index files are fully primed in the local layered storage (disk cache), the engine switches to POSIX-based reads, bypassing the S3 path entirely. This provides the lowest possible latency for repeated queries.

Partial download

By default, the engine can read only the byte ranges it needs from each index file rather than downloading the entire file. This is particularly beneficial for large indexes where only a few dictionary blocks and posting entries are relevant to the query. The sparse index (.idx) is always read in full (it is small), but dictionary blocks and posting entries are fetched with targeted byte-range requests.

Write path

During data ingestion (INSERT), the index is built incrementally:
  1. N-gram extraction: for each row, the text value is decomposed into all overlapping n-grams of the configured size.
  2. In-memory accumulation: n-grams are collected in memory with their row IDs. Accumulated entries are flushed to Roaring bitmaps when they reach 512 entries to bound memory.
  3. Sorted serialization: n-grams are sorted alphabetically, then the posting file, dictionary, and sparse index are written in a single pass.

Merge and VACUUM

When tablets are merged (via VACUUM or automatic background compaction), the index is rebuilt from scratch for the merged tablet. If you create an index on a table that already contains data, the existing tablets do not have index data until you run VACUUM (REINDEX=TRUE), which reconstructs the index for all tablets.