Rollout strategies

The Firebolt Operator currently supports two rollout strategies:
  • Graceful: The default strategy. The Firebolt Operator creates a new generation, switches traffic, waits for the old generation to drain, then deletes it. Use this for production.
  • Recreate: Creates a new generation, switches traffic, and immediately deletes the old generation. Use this for dev or test environments, or when interrupted queries are acceptable.

Drain check

During graceful rollouts, the Firebolt Operator checks whether old-generation pods have finished serving in-flight queries before deleting them. The Firebolt Operator scrapes pod metrics from outside the pod to decide when it is safe to transition from draining to cleaning and delete the old-generation StatefulSet. At the same time, the engine process handles in-flight queries with shutdown_wait_unfinished: on SIGTERM it waits up to terminationGracePeriodSeconds - 5s for queries to finish before exiting.

Signal

Both the Firebolt Operator and the engine process read the same two Prometheus gauges from the engine pod’s metrics endpoint on port 9090:
  • firebolt_running_queries: Queries currently executing.
  • firebolt_suspended_queries: Queries idle-waiting on a client but still holding a session.
A pod is considered drained when firebolt_running_queries + firebolt_suspended_queries == 0.
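
The drained condition can be illustrated with a small parser over the Prometheus text exposition format. This is a minimal sketch, not the operator's actual code: the function names are hypothetical, and it assumes that a gauge may appear as several labelled samples (which are summed) and that a missing gauge counts as "not drained".

```python
from typing import Optional

GAUGES = ("firebolt_running_queries", "firebolt_suspended_queries")


def gauge_total(metrics_text: str, name: str) -> Optional[float]:
    """Sum every sample of `name` in Prometheus text exposition output.

    Returns None when the metric is absent, so the caller can tell
    "missing" apart from "zero".
    """
    total = None
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Labelled sample: name{labels} value [timestamp]
            metric, _, rest = line.partition("{")
            rest = rest.rpartition("}")[2]
        else:
            # Unlabelled sample: name value [timestamp]
            metric, _, rest = line.partition(" ")
        if metric != name:
            continue
        fields = rest.split()
        if not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # ignore malformed samples
        total = (total or 0.0) + value
    return total


def is_drained(metrics_text: str) -> bool:
    """True only when both gauges are present and sum to zero."""
    values = [gauge_total(metrics_text, gauge) for gauge in GAUGES]
    if any(value is None for value in values):
        return False  # assumption: a missing gauge means "not drained yet"
    return sum(values) == 0
```

For example, a scrape containing `firebolt_running_queries 0` and `firebolt_suspended_queries 1` is not drained, because one suspended query still holds a session.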

Operator-side scrape

The Firebolt Operator scrapes /metrics through the Kubernetes API server’s pods/proxy subresource, not the pod IP directly. Going through the API server means:
  • The Firebolt Operator works identically whether it runs in-cluster or out-of-cluster (e.g. make run or E2E in-process), without needing to reach pod IPs directly.
  • Required RBAC is pods/proxy: get.
  • Transient scrape failures (pod starting, kubelet flaky, metric temporarily missing) are treated as “not drained yet” and the drain loop simply re-polls. They never fail the reconcile.
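
The scrape path and the re-poll behaviour described above might look roughly like the following sketch. The path follows the standard Kubernetes pods/proxy URL layout; the helper names, the blocking loop, and the `max_polls` bound are assumptions made for illustration, not the operator's real implementation. The per-pod check is passed in as a callable so the example stays independent of any HTTP client.

```python
import time
from typing import Callable, Iterable, Optional


def proxy_metrics_path(namespace: str, pod: str, port: int = 9090) -> str:
    """API-server path that proxies GET /metrics to a port on the pod."""
    return f"/api/v1/namespaces/{namespace}/pods/{pod}:{port}/proxy/metrics"


def _safe_check(check: Callable[[str], bool], pod: str) -> bool:
    """Treat any scrape failure as "not drained yet" instead of an error."""
    try:
        return bool(check(pod))
    except Exception:
        return False


def wait_until_drained(
    pods: Iterable[str],
    pod_is_drained: Callable[[str], bool],
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,  # hypothetical bound, handy in tests
) -> bool:
    """Poll every pod until all report drained, or max_polls is reached."""
    pods = list(pods)
    polls = 0
    while True:
        polls += 1
        if all(_safe_check(pod_is_drained, pod) for pod in pods):
            return True
        if max_polls is not None and polls >= max_polls:
            return False
        sleep(interval_s)
```

Because failures are swallowed inside `_safe_check`, a pod that is still starting or a flaky kubelet only delays the rollout by one more polling interval.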

Configuration

| Field | Default | Description |
| --- | --- | --- |
| spec.drainCheckEnabled | true | Set to false to skip the Firebolt Operator-side drain check entirely. The engine’s shutdown_wait_unfinished still runs on SIGTERM. |
| spec.drainCheckInterval | 5s | How often the Firebolt Operator polls each pod. Only used when drain check is enabled. |
| spec.rollout | graceful | Set to recreate to skip draining and delete old pods immediately. The engine’s shutdown_wait_unfinished still runs on pod termination regardless of rollout strategy. |

Pod terminationGracePeriodSeconds is operator-owned and fixed at 60s, so on SIGTERM the engine waits up to 55s (the grace period minus 5s) for in-flight queries via shutdown_wait_unfinished. The validating webhook rejects user-supplied terminationGracePeriodSeconds on both FireboltEngine.spec.template.spec and FireboltEngineClass.spec.template.spec.

When drainCheckEnabled: false, the Firebolt Operator transitions directly from switching to cleaning without waiting. The engine’s shutdown_wait_unfinished still gives in-flight queries a chance to finish during Kubernetes termination; drainCheckEnabled only controls whether the Firebolt Operator gates the rollout on top of that.
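
The interaction between these settings can be summarised as a small transition function. This is a hedged sketch: only the phase names switching, draining, and cleaning come from the text, while the function names, the argument names, and the assumption that recreate moves straight from switching to cleaning are illustrative.

```python
TERMINATION_GRACE_PERIOD_S = 60  # operator-owned value stated in the text
SHUTDOWN_MARGIN_S = 5            # margin the engine keeps before the deadline


def engine_wait_budget_s(grace_s: int = TERMINATION_GRACE_PERIOD_S) -> int:
    """Seconds the engine may wait for in-flight queries after SIGTERM."""
    return grace_s - SHUTDOWN_MARGIN_S


def next_phase(
    phase: str,
    rollout: str = "graceful",
    drain_check_enabled: bool = True,
    drained: bool = False,
) -> str:
    """Return the phase that follows `phase` for the old generation."""
    if phase == "switching":
        # Assumption: recreate and a disabled drain check both skip draining.
        if rollout == "recreate" or not drain_check_enabled:
            return "cleaning"
        return "draining"
    if phase == "draining":
        # Stay in draining until every old-generation pod reports drained.
        return "cleaning" if drained else "draining"
    raise ValueError(f"phase not covered by this sketch: {phase}")
```

With the defaults, `engine_wait_budget_s()` gives 55, and `next_phase("switching", drain_check_enabled=False)` gives "cleaning", matching the direct transition described above.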