> ## Documentation Index
> Fetch the complete documentation index at: https://docs.firebolt.io/llms.txt
> Use this file to discover all available pages before exploring further.

> Speed up substring and pattern matching with full-text search indexes in Firebolt.

# Full-text search index

## Overview

Full-text search indexes in Firebolt accelerate substring matching on `TEXT` columns. Under the hood they are n-gram inverted indexes: at ingest time the engine breaks every cell value into overlapping character sequences of a fixed length (*n*), and maps each n-gram to the rows that contain it. At query time the `search()` function decomposes the search pattern into the same n-grams, looks up their posting lists, intersects them, and returns only the rows whose text contains the pattern, often skipping entire [granules](/performance-and-observability/storage-and-indexing#granule) without scanning them. Like the inverted index it shares its format with, full-text search runs at [stage 4 of the scan pipeline](/performance-and-observability/storage-and-indexing#stage-4-inverted-and-text-index).

```text theme={"theme":{"light":"css-variables","dark":"css-variables"}}
Firebolt storage and scan pipeline reference (self-contained)...
Architecture: object storage (S3) holds immutable tablets; each node caches blocks on local SSD (LRU) and prefetches.
Tablet = immutable file, rows sorted in PRIMARY INDEX order, auto-vacuumed toward tablet_max_size_bytes (4 GiB default).
Granule = smallest readable block, 8192 rows by default (index_granularity); sparse PI stores 1 key per granule.
Deletes/updates: no in-place rewrite. One cumulative per-tablet Roaring deletion mask (merge-on-read): a new delete starts from the current mask and adds rows, so a tablet has exactly one active mask, folded in at read time, never OR-ed; prior versions kept for snapshot isolation. Empty tablet is dropped as metadata. UPDATE = delete old row + insert new row (lands in a new tablet).
Roaring bitmaps back both deletion masks and inverted-index posting lists.
Scan pipeline (each step only removes tablets/granules/rows; exact, except approximate vector search which is a separate path):
  [1] tablet pruning     partition key + per-tablet min/max         -> drop whole tablets
  [2] primary index      binary search over sparse marks            -> drop granule ranges
  [3] data skipping      stored min/max per index granule (minmax)  -> drop granules
  [4] inverted/text      row-level Roaring posting lists            -> drop granules + produce a row filter
  [5] seed selection     deletion mask intersect inverted filter  -> initial live-row set per granule
  [6] refine + read      iterative predicate pushdown, most-selective first, lazy column reads -> narrow, read survivors
  [7] output
  [8] late materialization (plan-level, above the scan; rank on key cols, fetch wide payload for top-k finalists only)
join pruning (see query-planning/join-pruning): a join build side becomes a runtime pruning set (make_pruning_set) pushed into the probe-side scan; prunes tablets/granules like WHERE key IN (...); inner/right/semi joins; adaptive: the set is built at run time and discarded with no penalty if the build side is not selective, so it is applied freely.
This page is stage 4 (text variant): substring search via search(col, 'pattern'), equivalent to LIKE '%pattern%' with index acceleration.
Index is an n-gram inverted index (same 3-file Roaring format as inverted-index). A search pattern shorter than ngram_size cannot prune (full scan).
Pushdown only in a top-level AND. search() is case-sensitive. Column must be TEXT; expressions not supported for FULL_TEXT.
```

Full-text search indexes share the same on-disk format and async I/O engine as [inverted indexes](/performance-and-observability/storage-and-indexing/inverted-index). The key difference is *tokenization*:

|                  | Inverted index                                                           | Full-text search index                          |
| :--------------- | :----------------------------------------------------------------------- | :---------------------------------------------- |
| Syntax           | `USING INVERTED_INDEX(expr)`                                             | `USING FULL_TEXT(column) WITH (ngram_size = N)` |
| Tokenization     | Each cell value (or array element) is a single discrete token            | Sliding n-gram window over the cell value       |
| Query function   | `=`, `IN`, bounded ranges (matched automatically), or `has_all_tokens()` | `search()` (equivalent to `LIKE '%pattern%'`)   |
| Typical use case | Exact value lookup: point lookups, value lists, ranges                   | Substring search, log grep, free-text filtering |

<Note>
  `search()` is case-sensitive. If you need case-insensitive matching, store a lowercased copy of the column and index that instead. See the [case-insensitive matching](#case-insensitive-matching) example below.
</Note>

### Key capabilities

* **N-gram granule pruning**: entire granules are skipped when their n-gram posting lists prove no row can contain the pattern.
* **Row-level filtering**: within qualifying granules, only rows whose text actually contains the pattern are passed to downstream operators, cutting CPU work.
* **Automatic index selection**: the planner picks the best index for each `search()` call. If you have multiple full-text search indexes with different `ngram_size` values on the same column, the planner chooses the largest n-gram size whose n-grams fit the search pattern.
* **Short-pattern fallback**: when the search pattern is shorter than the n-gram size, the index cannot prune and the engine falls back to a full scan. Choose your n-gram size accordingly.
* **Unicode support**: n-gram tokenization works on Unicode characters (not bytes), so Chinese, Hebrew, emoji, and other multi-byte text is indexed correctly.
* **Async cloud I/O**: index lookups against cloud storage use coroutine-based parallelism to overlap multiple S3 reads, minimizing latency.

### When to use

| Use case                 | Example                                                                                  |
| :----------------------- | :--------------------------------------------------------------------------------------- |
| Log search               | `WHERE search(message, 'timeout')`                                                       |
| Free-text filtering      | `WHERE search(description, 'blue widget')`                                               |
| Multi-column text search | `WHERE search(name, 'foo') AND search(body, 'bar')`                                      |
| LIKE replacement         | `search(col, 'pattern')` is equivalent to `col LIKE '%pattern%'` with index acceleration |

<Note>
  If your data consists of discrete values (IDs, tags, labels, enum values, dates) rather than free text, an [inverted index](/performance-and-observability/storage-and-indexing/inverted-index) is more efficient: ordinary `=`, `IN`, and range predicates match it automatically, and `has_all_tokens` covers `ARRAY(TEXT)` membership.
</Note>

## Quick start

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
-- 1. Create a table
CREATE TABLE articles (
  id    INT,
  title TEXT,
  body  TEXT
) PRIMARY INDEX id;

-- 2. Create a full-text search index (3-gram)
CREATE INDEX idx_title ON articles USING FULL_TEXT(title) WITH (ngram_size = 3);

-- 3. Load data
INSERT INTO articles VALUES
  (1, 'Introduction to Firebolt', 'Learn about Firebolt analytics'),
  (2, 'Query optimization tips',  'How to speed up your queries'),
  (3, 'Firebolt architecture',    'Deep dive into the engine');

-- 4. Substring search: find articles mentioning "Firebolt"
SELECT id, title
FROM articles
WHERE search(title, 'Firebolt');
```

```
id | title
---+--------------------------
 1 | Introduction to Firebolt
 3 | Firebolt architecture
```

## SQL reference

### CREATE INDEX

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
CREATE INDEX [IF NOT EXISTS] <index_name>
  ON <table_name>
  USING FULL_TEXT(<column>)
  WITH (ngram_size = <N>);
```

| Parameter      | Description                                                                                                                             |
| :------------- | :-------------------------------------------------------------------------------------------------------------------------------------- |
| `<index_name>` | Unique name for the index across all tables, indexes, and views in the database.                                                        |
| `<table_name>` | The table to index.                                                                                                                     |
| `<column>`     | A `TEXT` column reference. Arbitrary expressions are not supported for full-text search indexes; use a plain column name.               |
| `<N>`          | The n-gram size (positive integer). Smaller values match shorter patterns but produce larger indexes. Typical values range from 2 to 4. |

<Warning>
  The column must be of type `TEXT`. Attempting to create an index on a non-text column results in an error:

  ```
  ERROR: First argument of text search index must be a text column. Got: INTEGER
  ```
</Warning>

**Multiple indexes on the same column:**

You can create multiple full-text search indexes with different `ngram_size` values on the same column. The planner automatically selects the most selective index for each query:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
CREATE INDEX idx_title_3 ON articles USING FULL_TEXT(title) WITH (ngram_size = 3);
CREATE INDEX idx_title_4 ON articles USING FULL_TEXT(title) WITH (ngram_size = 4);

-- Searching for 'fire' (4 chars): planner picks idx_title_4 (ngram=4)
-- Searching for 'fir' (3 chars): planner picks idx_title_3 (ngram=3)
-- Searching for 'fi' (2 chars): neither index can prune (pattern < smallest ngram)
```

**Creating an index on an existing table with data:**

When you create a full-text search index on a table that already contains data, existing tablets do not automatically have index data. Run [`VACUUM`](/reference-sql/commands/data-management/vacuum) with `REINDEX=TRUE` to build the index for pre-existing tablets:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
CREATE INDEX idx_body ON articles USING FULL_TEXT(body) WITH (ngram_size = 3);
VACUUM (REINDEX=TRUE) articles;
```

All subsequent inserts automatically populate the index.

### search()

`search()` tests whether a text value contains a given substring. It is semantically equivalent to `column LIKE '%pattern%'` but can leverage a full-text search index for granule pruning.

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
-- Basic usage
search(<column>, <pattern>)

-- With filter mode
search(<column>, <pattern>, filter_mode => '<mode>')
```

| Parameter     | Description                                                                                                                                                                                                        |
| :------------ | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `<column>`    | A `TEXT` column or expression.                                                                                                                                                                                     |
| `<pattern>`   | A constant string literal. The function returns `true` if the column value contains this substring.                                                                                                                |
| `filter_mode` | Optional. `'filter'` (default): prunes granules *and* filters rows. `'prune_only'`: prunes granules only. The `search()` predicate is removed from the filter, and false positives may reach downstream operators. |

**Return type:** `BOOLEAN` (nullable if the input column is nullable).

**NULL handling:** `search(NULL, 'x')` returns `NULL`.

**Empty pattern:** `search(col, '')` matches all non-NULL rows (an empty string is a substring of everything).

**Case sensitivity:** `search()` is case-sensitive. `search(col, 'Hello')` does not match `'hello world'`.

**Special characters:** Characters like `%`, `_`, and `\` are treated as literals in the search pattern; no LIKE-style wildcards are applied.

**search() vs. LIKE:**

`search(col, 'pattern')` is functionally equivalent to `col LIKE '%pattern%'`, but when a full-text search index exists on the column, `search()` pushes the n-gram lookup into the scan operator for granule pruning. If no matching index exists, `search()` falls back to the same behavior as `LIKE`.

### DROP INDEX

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
DROP INDEX [IF EXISTS] <index_name>;
```

Dropping an index is a metadata-only operation. Run `VACUUM` on the table afterward to reclaim the physical storage occupied by index files.

<Note>
  You cannot drop a table while indexes still depend on it. Either drop the indexes first, or use `DROP TABLE ... CASCADE` to drop the table and all its dependent indexes in one statement.
</Note>

## Capabilities

### N-gram granule pruning

When a `search()` predicate appears in a top-level `AND` conjunction, the planner attaches `[TextSearchFilters]` metadata to the `read_tablets` scan operator. At execution time, the engine decomposes the search pattern into n-grams, looks up each n-gram's posting list in the index, intersects them, and skips any granule that has no matching rows.

<Warning>
  Pushdown is **only** applied when `search()` appears in a top-level `AND` conjunction. The index is **not** used when the predicate is inside an `OR`, behind a `NOT`, or in a subquery that cannot be flattened.
</Warning>

### Choosing an ngram\_size

The n-gram size determines the shortest pattern that can leverage the index for pruning:

| ngram\_size | Minimum useful pattern length | Trade-off                                                                  |
| :---------- | :---------------------------- | :------------------------------------------------------------------------- |
| 2           | 2 characters                  | Larger index, more false positives per n-gram                              |
| 3           | 3 characters                  | Good balance for most workloads                                            |
| 4           | 4 characters                  | Smaller index, fewer false positives, but short patterns fall back to scan |

A good starting point is `ngram_size = 3`. If your typical search patterns are 4+ characters, consider `ngram_size = 4` for a smaller index with fewer false-positive granule reads.

### Case-insensitive matching

`search()` is case-sensitive. To support case-insensitive search, store a lowercased copy of the data in a separate column and index that:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
ALTER TABLE articles ADD COLUMN title_lower TEXT
  DEFAULT LOWER(title);

CREATE INDEX idx_title_lower ON articles
  USING FULL_TEXT(title_lower) WITH (ngram_size = 3);

-- Case-insensitive search
SELECT * FROM articles WHERE search(title_lower, 'firebolt');
```

### Using search() in JOINs and CTEs

`search()` works in any context where a boolean expression is accepted, including JOINs, CTEs, CASE expressions, and subqueries:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
-- CTE with search filter
WITH matching_products AS (
  SELECT product_id, name
  FROM products
  WHERE search(name, 'Widget')
)
SELECT m.product_id, m.name, c.category_name
FROM matching_products m
JOIN product_categories pc ON m.product_id = pc.product_id
JOIN categories c ON pc.category_id = c.category_id;
```

### Coexistence with other index types

Full-text search indexes coexist with all other index types on the same table: primary indexes, aggregating indexes, [inverted indexes](/performance-and-observability/storage-and-indexing/inverted-index), and vector search indexes. Each index type serves a different access pattern, and the planner selects the appropriate one based on the query predicate.

## Observability

### information\_schema.indexes

Use [`information_schema.indexes`](/reference-sql/information-schema/indexes) to inspect full-text search index metadata:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
SELECT
  index_name,
  index_type,
  index_definition,
  compressed_bytes,
  number_of_tablets
FROM information_schema.indexes
WHERE index_type = 'text_search';
```

| Column               | Description                                                                       |
| :------------------- | :-------------------------------------------------------------------------------- |
| `index_name`         | The name of the index.                                                            |
| `index_type`         | `'text_search'` for full-text search indexes.                                     |
| `index_definition`   | The full `CREATE INDEX` statement, including the indexed column and `NGRAM_SIZE`. |
| `compressed_bytes`   | Total on-disk size of all index files across all tablets.                         |
| `uncompressed_bytes` | Total uncompressed size of all index files.                                       |
| `number_of_tablets`  | Number of tablets that contain index data.                                        |

### EXPLAIN (ANALYZE)

Use `EXPLAIN (ANALYZE)` to verify that the index is being used and to see how many granules were pruned:

```sql theme={"theme":{"light":"css-variables","dark":"css-variables"}}
EXPLAIN (ANALYZE)
SELECT id, title FROM articles WHERE search(title, 'Firebolt');
```

Look for these indicators in the output:

```
[TableFuncScan] ...
|   $0 = read_tablets(table_name => articles, tablet)
|   [TextSearchFilters] {column=title, index=idx_title, ngram=3, pattern=Firebolt}
|   [Execution Metrics]: output cardinality = 2, granules: 1/2,
|       inverted index pruned granules: 1
```

| Metric                              | Meaning                                                                                                         |
| :---------------------------------- | :-------------------------------------------------------------------------------------------------------------- |
| `[TextSearchFilters]`               | Confirms the planner pushed the predicate into the scan. Shows the index name, n-gram size, and search pattern. |
| `granules: X/Y`                     | `X` granules were read out of `Y` total.                                                                        |
| `inverted index pruned granules: N` | `N` granules were skipped because the n-gram index determined they contain no matching rows.                    |

If `[TextSearchFilters]` does not appear, verify that:

1. The `search()` column matches the indexed column exactly.
2. The predicate is in a top-level `AND` conjunction (not inside `OR` or `NOT`).
3. The search pattern length is at least as long as the index's `ngram_size`.

## How it works

This section explains the internal architecture for users who want to understand the performance characteristics in depth. Full-text search indexes share the same storage format and I/O engine as [inverted indexes](/performance-and-observability/storage-and-indexing/inverted-index#how-it-works); the difference is in the tokenization strategy.

### N-gram tokenization

At ingest time, the engine slides a window of size *N* across each text value and extracts every overlapping substring of length *N* (measured in Unicode characters, not bytes). For example, with `ngram_size = 3` and the value `"hello"`:

```
h-e-l  ->  "hel"
 e-l-l  ->  "ell"
  l-l-o  ->  "llo"
```

Each n-gram is stored as a token in the inverted index with a posting list of the rows that contain it. At query time, `search(col, 'ello')` is decomposed into its 3-grams (`ell`, `llo`) and the engine intersects their posting lists. Only rows that contain *all* n-grams are candidates. A final verification pass (`LIKE '%ello%'`) confirms the match and eliminates false positives where the n-grams appear but not in the right order.

### Three-file on-disk layout

Each full-text search index is stored as three files per tablet, identical to the [inverted index file format](/performance-and-observability/storage-and-indexing/inverted-index#three-file-on-disk-layout):

```
<index_name>.idx      Sparse index (token prefixes to dict block offsets)
<index_name>.dict     Dictionary (sorted n-grams + posting list metadata)
<index_name>.posting  Posting lists (Roaring bitmaps with matching row IDs)
```

```
Lookup flow
===========

Query n-grams
  |
  v
.idx (sparse index) -- binary search --> block offset
  |
  v
.dict (dictionary block) -- sequential scan --> posting offset / inline
  |
  v
.posting (Roaring bitmap) -- deserialize --> row ID set
  |
  v
Intersect across n-grams --> final row IDs
```

The dictionary stores all distinct n-grams in sorted order, grouped into blocks of 512 tokens, each individually LZ4-compressed. Posting lists use the same [tiered encoding](/performance-and-observability/storage-and-indexing/inverted-index#tiered-encoding) (inline, SmallSorted, Roaring) as inverted indexes.

### Async coroutine-based cloud I/O

When index files reside in cloud storage (S3), the engine uses C++ coroutines ([folly::coro](https://github.com/facebook/folly/blob/main/folly/coro/README.md)) to issue non-blocking reads. All n-gram lookups for a single `search()` call are launched concurrently, and as each bitmap arrives it is immediately intersected with the running result. If the intersection becomes empty, remaining in-flight I/O is cancelled early.

**Local caching:** When index files are fully primed in the local layered storage (disk cache), the engine switches to POSIX-based reads, bypassing the S3 path entirely. This provides the lowest possible latency for repeated queries.

### Partial download

By default, the engine can read only the byte ranges it needs from each index file rather than downloading the entire file. This is particularly beneficial for large indexes where only a few dictionary blocks and posting entries are relevant to the query. The sparse index (`.idx`) is always read in full (it is small), but dictionary blocks and posting entries are fetched with targeted byte-range requests.

### Write path

During data ingestion (INSERT), the index is built incrementally:

1. **N-gram extraction**: for each row, the text value is decomposed into all overlapping n-grams of the configured size.
2. **In-memory accumulation**: n-grams are collected in memory with their row IDs. Accumulated entries are flushed to Roaring bitmaps when they reach 512 entries to bound memory.
3. **Sorted serialization**: n-grams are sorted alphabetically, then the posting file, dictionary, and sparse index are written in a single pass.

### Merge and VACUUM

When tablets are merged (via `VACUUM` or automatic background compaction), the index is rebuilt from scratch for the merged tablet. If you create an index on a table that already contains data, the existing tablets do not have index data until you run `VACUUM (REINDEX=TRUE)`, which reconstructs the index for all tablets.
