Gateway query routing
The Envoy gateway proxy acts as the entry point for client queries and is the only entry point on which the Firebolt Operator promises zero downtime across engine lifecycle events. It uses a Lua filter to pick the target engine from the X-Firebolt-Engine request header and a dynamic forward proxy (DFP) to resolve the per-engine headless Service at request time.
Configuration
The gateway ConfigMap ({instance}-gateway-config) is a pure function of the FireboltInstance. It does not depend on the set of engines. The Firebolt Operator does not regenerate it on engine create/delete/scale/blue-green events, so those events never trigger a gateway rollout. The configuration contains:
- A Lua HTTP filter that validates X-Firebolt-Engine as a single RFC 1123 DNS label (lowercase alphanumerics and hyphens, ≤63 chars, no leading or trailing hyphen, no dots) and rewrites :authority to {engine}-service.{namespace}.svc.cluster.local:3473.
- A dynamic forward proxy in sub-cluster mode (sub_clusters_config, not dns_cache_config). Each authority synthesizes a full STRICT_DNS sub-cluster on first use, so all A-records of the headless engine Service become individual upstream hosts with normal load-balancing, outlier-detection and previous_hosts retry semantics. DNS-cache mode would have collapsed the headless Service back into a single sticky IP per cluster and made previous_hosts a no-op.
- max_requests_per_connection: 1 on the DFP cluster: every query gets a fresh TCP connect and therefore a fresh DNS lookup. This collapses the stale-IP window after a selector flip to a single TCP handshake instead of the STRICT_DNS refresh interval.
- Active HTTP health checks on /health/ready every 1s with healthy_threshold: 1 / unhealthy_threshold: 1. The engine flips this endpoint to 503 immediately on SIGTERM, so Envoy ejects a draining pod from the load-balanced set within one probe interval, independently of DNS.
- A route-level retry policy that retries on transport-level failures (connect-failure, refused-stream, reset) and on responses carrying the X-Firebolt-Drained header (retriable-headers). The header is set only by the engine’s pre-work shutdown fence (see Graceful pod shutdown), so that one specific shape of 503 is provably side-effect free and safe to retry. Bare 5xx is not retried, because once an engine has executed a request it may have applied side effects and a retry could duplicate them. num_retries: 50 combined with the previous_hosts retry-host predicate means each successive attempt lands on a pod we have not tried yet, until either the sub-cluster’s host set is exhausted or the client deadline expires.
- per_connection_buffer_limit_bytes hard-coded to 2 MiB on both the listener and the DFP cluster (kept in lockstep). This is the request-replay budget: a retry, including the X-Firebolt-Drained one, can only be issued when the full request body fits in this buffer. Requests larger than the limit are dispatched without buffering and any 503 they receive propagates to the client unretried. The value is intentionally not surfaced on the CR. See the rationale comment on gatewayPerConnectionBufferLimitBytes in instance_gateway.go. It sits at the center of two Firebolt Operator-owned invariants: retry coverage and gateway memory budget. A per-instance override would invite settings that silently break either. Memory budget per gateway pod scales as concurrent_requests * (1 + retry_factor) * 2 MiB. Size gateway.replicas and gateway.resources.limits.memory to that envelope. The standalone Helm chart (firebolt-instance-helm) renders the same 2 MiB value through its own gateway-configmap so the two deployment paths behave identically. See Gateway sizing for the operational guidance.
- Per-engine circuit breakers on the DFP cluster (circuit_breakers.thresholds[priority=DEFAULT]): max_connections=1024, max_pending_requests=1024, max_requests=1024, max_retries=256. Because each authority materializes its own STRICT_DNS sub-cluster, these thresholds apply per engine, per gateway pod. A misbehaving engine cannot consume more than its share of connection-pool slots, pending-request queue, in-flight stream budget, or retries, and therefore cannot starve sibling engines sharing the same gateway. The constants live in instance_gateway.go next to gatewayPerConnectionBufferLimitBytes. See Gateway sizing for the sizing rationale and the operational signal (synthetic 503 with response flag UO / UAEX / URX) that indicates a tripped breaker.
- An admin listener on 127.0.0.1:9901 used by the gateway pod’s own preStop hook (POST /healthcheck/fail) to fail the gateway’s readiness before the kubelet sends SIGTERM, so service load-balancers stop sending it new requests before its filters tear down.
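The memory-budget formula in the list above can be worked through numerically. The following is a minimal sketch, not operator code: the function name and the sample values (400 concurrent requests, a retry factor of 0.25) are hypothetical, and the result covers only the request-replay buffers, not the gateway pod's total memory use.

```python
import math

MIB = 1024 * 1024
BUFFER_LIMIT_BYTES = 2 * MIB  # per_connection_buffer_limit_bytes from the text


def gateway_buffer_budget_bytes(concurrent_requests: int, retry_factor: float) -> int:
    """Return concurrent_requests * (1 + retry_factor) * 2 MiB, rounded up."""
    if concurrent_requests < 0 or retry_factor < 0:
        raise ValueError("inputs must be non-negative")
    return math.ceil(concurrent_requests * (1 + retry_factor) * BUFFER_LIMIT_BYTES)


# Hypothetical workload: 400 concurrent requests, a quarter of them retried once.
budget = gateway_buffer_budget_bytes(400, 0.25)
print(f"{budget // MIB} MiB of replay buffer per gateway pod")  # 1000 MiB
```

With these assumed inputs, gateway.resources.limits.memory would need to leave room for roughly 1000 MiB of buffers per pod on top of Envoy's baseline, or gateway.replicas would need to grow so that each pod carries fewer concurrent requests.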
Because the per-engine Service is headless (ClusterIP: None), the :authority hostname resolves directly to the set of ready pod IPs. kube-proxy is not in the data path, so there is no terminating-endpoint race where a SYN would be DNAT’d to a pod whose listener has already closed.
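The header validation and :authority rewrite performed by the Lua filter can be illustrated in Python. This is a sketch of the rule as described above, not the filter's actual source: the function names and the regular expression are assumptions, and raising an error on an invalid value stands in for whatever rejection the real filter performs.

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics and hyphens, 1 to 63 chars,
# no leading or trailing hyphen, no dots.
_DNS_LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")
ENGINE_PORT = 3473  # port from the authority template in the text


def is_valid_engine_name(value: str) -> bool:
    """True if value is a single RFC 1123 DNS label."""
    return _DNS_LABEL.fullmatch(value) is not None


def upstream_authority(engine: str, namespace: str) -> str:
    """Build the :authority the gateway forwards to for a validated engine."""
    if not is_valid_engine_name(engine):
        raise ValueError(f"invalid X-Firebolt-Engine value: {engine!r}")
    return f"{engine}-service.{namespace}.svc.cluster.local:{ENGINE_PORT}"


# Hypothetical engine and namespace names.
print(upstream_authority("analytics", "prod"))
```

Restricting the header to a single label means a client cannot inject dots or a port and thereby steer the dynamic forward proxy at an arbitrary hostname.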
Traffic path
Gateway pods roll out with maxSurge: 25% and maxUnavailable: 0, ensuring zero downtime during gateway upgrades.
Graceful pod shutdown
When a blue-green cutover or scale-down deletes the old-generation StatefulSet, the kubelet sends SIGTERM to its pods while client queries may still be in flight at the gateway. Zero-downtime is preserved end-to-end by a chain of independent mechanisms. No Firebolt Operator-side gate on EndpointSlice routability is required. In order of when they fire after SIGTERM:
- Engine /health/ready flips to 503. The kubelet readiness probe sees this on its next scrape and marks the pod NotReady. The K8s endpoint controller removes the pod from the cluster Service’s EndpointSlices. CoreDNS stops returning that pod IP for the headless Service.
- Envoy active health check ejects the host. Envoy probes /health/ready directly on each pod IP every 1s. With unhealthy_threshold: 1, one failed probe is enough to remove the host from the load-balanced set. This is independent of DNS, so it does not wait on EndpointSlice / CoreDNS propagation.
- max_requests_per_connection: 1 + per-request DNS. Any query that arrives at the gateway after the host is excluded from DNS opens a fresh TCP connection (no keep-alive reuse), does a fresh DNS lookup, and never sees the dying pod.
- Engine pre-work shutdown fence. A query that did slip onto an open connection or a stale-DNS host before either of the above hides it hits the engine’s HTTP handler, which fast-fails with 503 Service Unavailable, Connection: close, and the X-Firebolt-Drained header before any executor or Storage Manager work runs. The connection drops out of any pool, and the request is provably side-effect free.
- Gateway retries the 503 on a different host. The retriable-headers rule matches X-Firebolt-Drained: present. Combined with the previous_hosts retry-host predicate, the retry never picks the same draining pod. The client sees a single 200 response, provided the request body fits within per_connection_buffer_limit_bytes (hard-coded at 2 MiB). Bodies that exceed the buffer are dispatched without buffering and the 503 is propagated to the client unretried. Workloads that send single requests above 2 MiB are out of scope for the zero-downtime contract.
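The retry rules in the last two steps can be modelled as a small decision function. This is a simplified Python sketch rather than Envoy's implementation: the Attempt type, the route and fake_send helpers, the pod IPs, and the header value "1" are all hypothetical, and trying each host at most once stands in for the previous_hosts predicate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

BUFFER_LIMIT_BYTES = 2 * 1024 * 1024  # per_connection_buffer_limit_bytes
NUM_RETRIES = 50                      # num_retries
TRANSPORT_FAILURES = {"connect-failure", "refused-stream", "reset"}
DRAINED_HEADER = "x-firebolt-drained"


@dataclass
class Attempt:
    host: str
    status: Optional[int] = None             # None when the transport failed
    transport_failure: Optional[str] = None  # e.g. "connect-failure"
    headers: Dict[str, str] = field(default_factory=dict)


def should_retry(attempt: Attempt, body_bytes: int, retries_used: int) -> bool:
    """Apply the retry rules described in the text to one finished attempt."""
    if retries_used >= NUM_RETRIES:
        return False
    if body_bytes > BUFFER_LIMIT_BYTES:
        return False  # body was not buffered, so it cannot be replayed
    if attempt.transport_failure in TRANSPORT_FAILURES:
        return True
    # Only the shutdown fence sets this header; a bare 5xx is never retried.
    return any(name.lower() == DRAINED_HEADER for name in attempt.headers)


def route(hosts: List[str], send: Callable[[str], Attempt],
          body_bytes: int) -> Optional[Attempt]:
    """Try each host at most once, stopping at the first non-retriable result."""
    last: Optional[Attempt] = None
    for retries_used, host in enumerate(hosts):
        last = send(host)
        if not should_retry(last, body_bytes, retries_used):
            break
    return last


def fake_send(host: str) -> Attempt:
    """Hypothetical upstream: 10.0.0.1 is draining, every other pod is healthy."""
    if host == "10.0.0.1":
        return Attempt(host, 503, headers={"X-Firebolt-Drained": "1"})
    return Attempt(host, 200)


result = route(["10.0.0.1", "10.0.0.2"], fake_send, body_bytes=1024)
print(result.host, result.status)  # 10.0.0.2 200
```

Calling route with a body larger than 2 MiB returns the 503 from the draining pod unchanged, which is the out-of-scope case named in the last step.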
Queries already in flight when SIGTERM arrives keep running for up to shutdown_wait_unfinished (= terminationGracePeriodSeconds - 5s). The fence only blocks new requests. It does not interrupt work already in progress.