Rollout strategies
The Firebolt Operator currently supports two rollout strategies:

- Graceful is the default strategy. The Firebolt Operator creates a new generation, switches traffic, waits for the old generation to drain, then deletes it. Use this for production.
- Recreate creates a new generation, switches traffic, and immediately deletes the old generation. Use this for dev or test environments, or when interrupted queries are acceptable.

Drain check
During graceful rollouts, the Firebolt Operator checks whether old-generation pods have finished serving in-flight queries before deleting them. The Firebolt Operator scrapes pod metrics from outside the pod to decide when it is safe to transition from draining to cleaning and delete the old-generation StatefulSet. Independently, the engine process handles in-flight queries with shutdown_wait_unfinished: on SIGTERM it waits up to terminationGracePeriodSeconds - 5s for queries to finish before exiting.
Signal
Both callers read the same two Prometheus gauges from the engine pod’s metrics endpoint on port 9090:
- firebolt_running_queries: Queries currently executing.
- firebolt_suspended_queries: Queries idle-waiting on a client but still holding a session.

A pod is considered drained when firebolt_running_queries + firebolt_suspended_queries == 0.
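
The drained condition can be illustrated with a small sketch that evaluates it against a Prometheus text-format payload. This is not the operator's code: the helper names `parse_gauges` and `is_drained` are hypothetical, the parser is deliberately simplified (it assumes label values contain no spaces), and summing across label sets is an assumption. Treating a missing gauge as "not drained" follows the transient-failure behavior this page describes.

```python
# Illustrative sketch only; helper names are hypothetical, not operator APIs.
GAUGES = ("firebolt_running_queries", "firebolt_suspended_queries")


def parse_gauges(metrics_text: str) -> dict:
    """Collect the two drain gauges from Prometheus text-format output."""
    values = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2:
            continue
        name = parts[0].split("{", 1)[0]  # drop any label set
        if name not in GAUGES:
            continue
        try:
            # Assumption: if a gauge has several label sets, sum them.
            values[name] = values.get(name, 0.0) + float(parts[1])
        except ValueError:
            continue  # unparsable sample: ignore it
    return values


def is_drained(metrics_text: str) -> bool:
    """True only when both gauges are present and their sum is zero."""
    values = parse_gauges(metrics_text)
    if any(gauge not in values for gauge in GAUGES):
        return False  # a missing gauge counts as "not drained yet"
    return sum(values[gauge] for gauge in GAUGES) == 0
```

Requiring both gauges to be present keeps the check conservative: an incomplete scrape can delay cleanup but cannot trigger it early.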
Operator-side scrape
The Firebolt Operator scrapes /metrics through the Kubernetes API server’s pods/proxy subresource, not the pod IP directly. Going through the API server means:
- The Firebolt Operator works identically whether it runs in-cluster or out-of-cluster (e.g. make run or E2E in-process), without needing to reach pod IPs directly.
- Required RBAC is pods/proxy: get.
- Transient scrape failures (pod starting, kubelet flaky, metric temporarily missing) are treated as “not drained yet” and the drain loop simply re-polls. They never fail the reconcile.
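
As a rough sketch of the scrape path and the re-poll behavior, the snippet below builds the standard Kubernetes pod-proxy URL path for port 9090 and polls until a caller-supplied predicate reports the pod as drained. The function names, the injected `fetch` and `is_drained` callables, and the `max_polls` cap are hypothetical and exist only for illustration. A real operator would typically requeue the reconcile rather than block in a loop.

```python
import time
from typing import Callable, Optional

METRICS_PORT = 9090  # engine metrics port named on this page


def proxy_metrics_path(namespace: str, pod: str, port: int = METRICS_PORT) -> str:
    """API-server path that proxies GET /metrics to the given pod and port."""
    return f"/api/v1/namespaces/{namespace}/pods/{pod}:{port}/proxy/metrics"


def wait_until_drained(
    fetch: Callable[[], str],
    is_drained: Callable[[str], bool],
    interval_s: float = 5.0,          # mirrors the 5s drainCheckInterval default
    max_polls: Optional[int] = None,  # hypothetical cap, used here for testing
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the pod reports drained; scrape errors never raise."""
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        try:
            if is_drained(fetch()):
                return True
        except Exception:
            pass  # transient failure: treat as "not drained yet" and re-poll
        sleep(interval_s)
    return False
```

Here `fetch` stands in for an authenticated GET against the proxy path, which needs only the `pods/proxy: get` permission listed above.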
Configuration
| Field | Default | Description |
|---|---|---|
| spec.drainCheckEnabled | true | Set to false to skip the Firebolt Operator-side drain check entirely. The engine’s shutdown_wait_unfinished still runs on SIGTERM. |
| spec.drainCheckInterval | 5s | How often the Firebolt Operator polls each pod. Only used when drain check is enabled. |
| spec.rollout | graceful | Set to recreate to skip draining and delete old pods immediately. The engine’s shutdown_wait_unfinished still runs on pod termination regardless of rollout strategy. |
terminationGracePeriodSeconds is operator-owned at a fixed 60 s
(the engine then waits up to grace - 5s for in-flight queries on
SIGTERM via shutdown_wait_unfinished). The validating webhook
rejects user-supplied terminationGracePeriodSeconds on both
FireboltEngine.spec.template.spec and
FireboltEngineClass.spec.template.spec.
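
The arithmetic behind the engine-side wait is simple enough to write out. With the fixed 60 s grace period and the 5 s margin, the engine has up to 55 s to finish in-flight queries after SIGTERM. The constant and function names below are hypothetical and used only to show the calculation.

```python
TERMINATION_GRACE_PERIOD_S = 60  # fixed, operator-owned value from this page
SHUTDOWN_MARGIN_S = 5            # the "- 5s" margin from this page


def shutdown_wait_budget_s(
    grace_s: int = TERMINATION_GRACE_PERIOD_S,
    margin_s: int = SHUTDOWN_MARGIN_S,
) -> int:
    """Longest time the engine waits for in-flight queries after SIGTERM."""
    return max(grace_s - margin_s, 0)
```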
When drainCheckEnabled: false, the Firebolt Operator transitions directly from switching to cleaning without waiting. The engine’s shutdown_wait_unfinished still gives in-flight queries a chance to finish during Kubernetes termination. drainCheckEnabled only controls whether the Firebolt Operator gates the rollout on top of that.
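
The interaction between the two settings can be summarized as a small decision function. This sketch uses the phase names from this page (switching, draining, cleaning). The function name is hypothetical, and mapping the recreate strategy straight to cleaning is an inference from the table above rather than a statement about the operator's internals.

```python
def phase_after_switching(rollout: str = "graceful", drain_check_enabled: bool = True) -> str:
    """Return the phase that follows "switching" for a given configuration."""
    if rollout not in ("graceful", "recreate"):
        raise ValueError(f"unknown rollout strategy: {rollout!r}")
    if rollout == "graceful" and drain_check_enabled:
        return "draining"  # operator gates cleanup on the drain check
    # recreate, or drain check disabled: no operator-side wait.
    # The engine's shutdown_wait_unfinished still runs on SIGTERM.
    return "cleaning"
```

In every branch the engine still gets its own shutdown window during Kubernetes termination; the settings only decide whether the operator adds a gate before deleting the old generation.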