# Engine rollouts

> Rollout strategies and drain checks for Firebolt Engine.

## Rollout strategies

The Firebolt Operator currently supports two rollout strategies:

**Graceful** is the default strategy. The Firebolt Operator creates a new generation, switches traffic, waits for the old generation to drain, then deletes it. Use this for production.

**Recreate** creates a new generation, switches traffic, and immediately deletes the old generation. Use this for dev or test environments, or when interrupted queries are acceptable.

## Drain check

During graceful rollouts, the Firebolt Operator checks whether old-generation pods have finished serving in-flight queries before deleting them. It scrapes pod metrics from outside the pod to decide when it is safe to transition from `draining` to `cleaning` and delete the old-generation StatefulSet. Independently of that check, the engine process protects in-flight queries with `shutdown_wait_unfinished`: on SIGTERM, it waits up to `terminationGracePeriodSeconds - 5s` for queries to finish before exiting.

### Signal

Both mechanisms use the same two Prometheus gauges, exposed on the engine pod's metrics endpoint on port `9090`:

* `firebolt_running_queries`: Queries currently executing.
* `firebolt_suspended_queries`: Queries idle-waiting on a client but still holding a session.

A pod is considered drained when `firebolt_running_queries + firebolt_suspended_queries == 0`.
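
The drained condition can be sketched as a small check over the Prometheus text exposition output. This is a minimal illustration, not the Firebolt Operator's code: the function name `is_drained` is hypothetical, summing samples across label sets is an assumption, and a missing gauge is treated as "not drained".

```python
import re

# Matches a Prometheus sample line: name, optional {labels}, then the value.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s+(\S+)")

DRAIN_GAUGES = ("firebolt_running_queries", "firebolt_suspended_queries")


def is_drained(metrics_text: str) -> bool:
    """Return True only if both gauges are present and sum to zero."""
    totals = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None or match.group(1) not in DRAIN_GAUGES:
            continue
        try:
            value = float(match.group(2))
        except ValueError:
            continue  # unparsable sample: ignore it
        # Assumption: samples with different label sets are summed.
        totals[match.group(1)] = totals.get(match.group(1), 0.0) + value
    # Assumption: a missing gauge means "not drained yet".
    if any(name not in totals for name in DRAIN_GAUGES):
        return False
    return sum(totals.values()) == 0
```

With this sketch, a response that reports `0` for both gauges is drained, while any non-zero sample, or a response missing either gauge, is not.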

### Operator-side scrape

The Firebolt Operator scrapes `/metrics` through the Kubernetes API server's `pods/proxy` subresource, not the pod IP directly. Going through the API server means:

* The Firebolt Operator works identically whether it runs in-cluster or out-of-cluster (e.g. `make run` or E2E in-process), without needing to reach pod IPs directly.
* Required RBAC is `pods/proxy: get`.
* Transient scrape failures (a pod still starting, a flaky kubelet, a temporarily missing metric) are treated as "not drained yet", and the drain loop re-polls. They never fail the reconcile.
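
A rough sketch of the scrape behavior follows. It assumes the standard Kubernetes pod proxy URL shape (`/api/v1/namespaces/{namespace}/pods/{name}:{port}/proxy/{path}`). The helper names and the injected `fetch_gauges` callable are hypothetical stand-ins for an authenticated API server client, not the Firebolt Operator's implementation.

```python
from typing import Callable, Mapping


def metrics_proxy_path(namespace: str, pod: str, port: int = 9090) -> str:
    """Build the API server path that proxies to the pod's /metrics endpoint."""
    return f"/api/v1/namespaces/{namespace}/pods/{pod}:{port}/proxy/metrics"


def pod_is_drained(
    fetch_gauges: Callable[[str], Mapping[str, float]],
    namespace: str,
    pod: str,
) -> bool:
    """Return True only when the scrape succeeds and both gauges sum to zero."""
    try:
        gauges = fetch_gauges(metrics_proxy_path(namespace, pod))
        total = gauges["firebolt_running_queries"] + gauges["firebolt_suspended_queries"]
    except Exception:
        # Scrape failure or missing metric: report "not drained yet" so the
        # caller polls again instead of failing.
        return False
    return total == 0
```

Swallowing the exception keeps a single failed scrape from turning into an error: the caller sees `False` and tries again on the next poll.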

### Configuration

| Field                     | Default    | Description                                                                                                                                                               |
| ------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `spec.drainCheckEnabled`  | `true`     | Set to `false` to skip the Firebolt Operator-side drain check entirely. The engine's `shutdown_wait_unfinished` still runs on SIGTERM.                                    |
| `spec.drainCheckInterval` | `5s`       | How often the Firebolt Operator polls each pod. Only used when drain check is enabled.                                                                                    |
| `spec.rollout`            | `graceful` | Set to `recreate` to skip draining and delete old pods immediately. The engine's `shutdown_wait_unfinished` still runs on pod termination regardless of rollout strategy. |

The Firebolt Operator owns the pod's `terminationGracePeriodSeconds` and fixes it at 60 seconds, so on SIGTERM the engine waits up to 55 seconds (`terminationGracePeriodSeconds - 5s`) for in-flight queries via `shutdown_wait_unfinished`. The validating webhook rejects a user-supplied `terminationGracePeriodSeconds` on both `FireboltEngine.spec.template.spec` and `FireboltEngineClass.spec.template.spec`.
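
The grace-period arithmetic and the webhook rule can be illustrated as follows. This is a sketch only: the function names and the error message are hypothetical, and the real webhook's logic and wording may differ.

```python
TERMINATION_GRACE_PERIOD_SECONDS = 60  # fixed by the Firebolt Operator
SHUTDOWN_MARGIN_SECONDS = 5            # margin the engine leaves before the deadline


def max_shutdown_wait_seconds(grace_seconds: int = TERMINATION_GRACE_PERIOD_SECONDS) -> int:
    """Longest time the engine waits for in-flight queries after SIGTERM."""
    return max(grace_seconds - SHUTDOWN_MARGIN_SECONDS, 0)


def grace_period_violations(resource: dict) -> list:
    """Report a violation if the pod template sets terminationGracePeriodSeconds."""
    pod_spec = resource.get("spec", {}).get("template", {}).get("spec", {})
    if "terminationGracePeriodSeconds" in pod_spec:
        kind = resource.get("kind", "resource")
        # Hypothetical message; the real webhook's wording is not documented here.
        return [
            f"{kind}.spec.template.spec.terminationGracePeriodSeconds "
            "is operator-owned and must not be set"
        ]
    return []
```

With the fixed 60-second grace period, `max_shutdown_wait_seconds()` evaluates to 55.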

When `drainCheckEnabled: false`, the Firebolt Operator transitions directly from `switching` to `cleaning` without waiting. The engine's `shutdown_wait_unfinished` still gives in-flight queries a chance to finish during Kubernetes termination. `drainCheckEnabled` only controls whether the Firebolt Operator gates the rollout on top of that.
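
The phase transitions described on this page can be summarized in a small sketch. The phase names `switching`, `draining`, and `cleaning` come from the text above; the function `next_phase` and its parameters are hypothetical. Treating `recreate` as a direct `switching` to `cleaning` transition is an assumption based on the statement that it skips draining.

```python
def next_phase(
    phase: str,
    rollout: str = "graceful",
    drain_check_enabled: bool = True,
    all_pods_drained: bool = False,
) -> str:
    """Return the phase that follows `phase` under the given settings."""
    if phase == "switching":
        # Assumption: recreate rollouts and disabled drain checks both
        # skip the draining phase entirely.
        if rollout == "recreate" or not drain_check_enabled:
            return "cleaning"
        return "draining"
    if phase == "draining":
        # Stay in draining until every old-generation pod reports zero queries.
        return "cleaning" if all_pods_drained else "draining"
    # Other phases are outside the scope of this sketch.
    return phase
```

In every path, the engine's `shutdown_wait_unfinished` still applies once the old pods receive SIGTERM. The sketch covers only the Firebolt Operator's gating.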
