This document describes how the Firebolt Operator exposes Prometheus metrics for the components it manages.

Metrics endpoints

| Component | Port | Port name | Path | What it exposes |
| --- | --- | --- | --- | --- |
| Engine pods | 9090 | metrics | /metrics | firebolt_running_queries, firebolt_suspended_queries, and other engine-internal gauges. The Firebolt Operator scrapes the first two via Pods/proxy to drive both the drain check and auto-stop. |
| Gateway pods (Envoy) | 9090 (default) | metrics | /stats/prometheus | Envoy connection, request, and cluster stats |
| Firebolt Operator pod | Configurable via metrics.bindAddress | https or http | /metrics | controller-runtime reconciliation, workqueue, REST client, and Go runtime metrics |

The gateway metrics port defaults to 9090 and is configurable per FireboltInstance CR via spec.gateway.metricsPort. Metadata pods do not currently expose a Prometheus metrics endpoint.
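
As a minimal sketch, the gateway metrics port override could look like the fragment below. Only the kind and the spec.gateway.metricsPort field come from this document. The instance name and the port value 9191 are illustrative assumptions, and the apiVersion and the rest of the spec are deliberately omitted.

```yaml
# Fragment only: apiVersion and all other spec fields are omitted.
kind: FireboltInstance
metadata:
  name: analytics          # hypothetical instance name
spec:
  gateway:
    metricsPort: 9191      # illustrative value; defaults to 9090 when unset
```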

Firebolt Operator metrics mode

The Firebolt Operator metrics endpoint mode is controlled by two Helm values:

| Mode | metrics.secure | metrics.bindAddress | Port name | Scheme |
| --- | --- | --- | --- | --- |
| HTTPS (default) | true | :8443 | https | https with authn/authz and self-signed TLS |
| HTTP | false | :8080 | http | plain http |

The Firebolt Operator PodMonitor template automatically adapts its port reference, scheme, bearer token, and TLS configuration based on metrics.secure.
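
For illustration, switching the operator endpoint to plain HTTP might look like the following values.yaml fragment. It assumes the dotted value names above map to nested keys in the usual Helm way, and it enables the operator PodMonitor so the template can pick up the http port and scheme.

```yaml
# values.yaml (sketch): serve operator metrics over plain HTTP
metrics:
  secure: false            # HTTP mode; the default is true (HTTPS)
  bindAddress: ":8080"     # matches the HTTP row in the table above
podMonitor:
  operator:
    enabled: true          # PodMonitor adapts its port and scheme to metrics.secure
```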

Scraping with Prometheus

The Firebolt Operator Helm chart ships optional PodMonitor resources (one per component type) that can be enabled via values.yaml:
podMonitor:
  engines:
    enabled: true
  gateway:
    enabled: true
  operator:
    enabled: true
  allNamespaces: false   # set true when the Firebolt Operator watches all namespaces
Each PodMonitor uses label selectors to match the relevant pods:
  • Engines: firebolt.io/engine (label exists), which matches all engine pods regardless of engine name
  • Gateway: firebolt.io/component=gateway
  • Firebolt Operator: control-plane=controller-manager + chart selector labels
When allNamespaces is true, namespaceSelector.any: true is added so pods in any namespace are discovered. This does not apply to the Firebolt Operator PodMonitor because the Firebolt Operator always runs in the release namespace.
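
As a rough sketch of the result, an engine PodMonitor rendered with allNamespaces set to true could resemble the manifest below. The resource name is a hypothetical placeholder, and the chart's actual template may set additional fields. The selector, port name, and path are the ones described in this document.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: firebolt-engines        # hypothetical name; the chart chooses its own
spec:
  selector:
    matchExpressions:
      - key: firebolt.io/engine # matches any pod carrying the label
        operator: Exists
  namespaceSelector:
    any: true                   # present only when podMonitor.allNamespaces is true
  podMetricsEndpoints:
    - port: metrics             # named container port, default 9090
      path: /metrics
```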

Per-instance monitoring

The chart-level PodMonitors apply uniform scrape configuration to all instances in scope. If you need per-instance control (different intervals, selective enablement, custom relabelings), disable the chart-level PodMonitors and deploy your own alongside each FireboltInstance or FireboltEngine CR. The label selectors to use are:
  • Engine pods: firebolt.io/engine: <engine-name>
  • Gateway pods: firebolt.io/instance: <instance-name>, firebolt.io/component: gateway
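
A per-instance PodMonitor for a gateway could be sketched as follows, with podMonitor.gateway.enabled set to false in the chart values. The instance name, namespace, and the 15s interval are assumptions chosen for the example. The labels, port name, and path come from this document.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: analytics-gateway       # hypothetical name
  namespace: analytics          # hypothetical namespace of the FireboltInstance
spec:
  selector:
    matchLabels:
      firebolt.io/instance: analytics   # hypothetical instance name
      firebolt.io/component: gateway
  podMetricsEndpoints:
    - port: metrics             # resolves to spec.gateway.metricsPort
      path: /stats/prometheus
      interval: 15s             # per-instance scrape interval (illustrative)
```

Without a namespaceSelector, the Prometheus Operator looks for pods only in the PodMonitor's own namespace, which suits a monitor deployed next to its CR.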

Architecture decisions

Helm chart templates, not Firebolt Operator reconciliation

PodMonitor resources are shipped as Helm templates, not created by the Firebolt Operator’s Go reconciliation loop. This follows a common pattern used by cert-manager, Strimzi, the FoundationDB operator, and others. CloudNativePG tried a reconciler-managed approach and deprecated it in v1.26 because:
  • It creates a hard dependency on the Prometheus Operator CRDs. The operator fails to reconcile on clusters where the CRDs are not installed.
  • The operator overwrites user customizations (scrape intervals, relabelings, TLS config) on every reconcile.
  • It adds RBAC complexity for monitoring.coreos.com resources.
  • Platform teams want full ownership of their monitoring configuration.

Gateway stats listener

The Envoy admin interface (port 9901) is bound to 127.0.0.1 and must stay that way. It exposes mutation endpoints (POST /healthcheck/fail, POST /quitquitquit) that the preStop hook depends on for graceful shutdown. Binding admin to 0.0.0.0 would allow any pod in the cluster to drain or kill gateway pods. Instead, a separate read-only stats listener is added on the metrics port (default 9090). This listener proxies only /stats/prometheus from the admin interface via an internal static cluster, exposing no mutation endpoints.
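
The listener layout described above can be sketched in Envoy's v3 static configuration roughly as follows. This is an illustration, not the configuration the Firebolt Operator actually renders. The listener, cluster, and stat prefix names are hypothetical, and the ports are the defaults mentioned in this document.

```yaml
admin:
  address:
    socket_address: { address: 127.0.0.1, port_value: 9901 }  # loopback only
static_resources:
  listeners:
    - name: stats_listener                    # hypothetical name
      address:
        socket_address: { address: 0.0.0.0, port_value: 9090 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: stats_listener
                route_config:
                  virtual_hosts:
                    - name: stats
                      domains: ["*"]
                      routes:
                        # Exact path match: no other admin endpoint is reachable.
                        - match: { path: /stats/prometheus }
                          route: { cluster: envoy_admin }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: envoy_admin                       # hypothetical name
      type: STATIC
      connect_timeout: 1s
      load_assignment:
        cluster_name: envoy_admin
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: 127.0.0.1, port_value: 9901 }
```

Because the only route is an exact match on /stats/prometheus, requests for /healthcheck/fail or /quitquitquit on the metrics port find no route and never reach the admin interface.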

Consistent metrics port

Engine pods and gateway pods both expose Prometheus metrics on a container port named metrics (default 9090). The gateway override lives on the FireboltInstance at spec.gateway.metricsPort. The Firebolt Operator stamps the corresponding metrics-named port on the rendered Envoy container. Engine pods carry the port via the per-FireboltEngine wiring. PodMonitors can therefore always reference port: metrics without knowing the actual port number. The metadata pod does not currently expose a Prometheus endpoint, so no metrics port is stamped there.

Cross-namespace support

When the Firebolt Operator watches all namespaces (watchNamespace is empty), engine, gateway, and metadata pods may live in namespaces other than the Firebolt Operator’s namespace. Setting podMonitor.allNamespaces: true adds namespaceSelector.any: true to the PodMonitors so Prometheus discovers pods across all namespaces.

Embedded CR status metrics

The Firebolt Operator embeds custom Prometheus metrics that expose the status of every FireboltEngine and FireboltInstance it manages. These are level-triggered gauges updated on every reconcile. There are no timers or persisted timestamps, which is consistent with the Firebolt Operator’s level-driven reconciliation model. Duration and trend analysis are left to PromQL. This follows the pattern used by ArgoCD (argocd_app_info), Flux (gotk_reconcile_condition), cert-manager (certmanager_certificate_ready_status), and Crossplane (crossplane_managed_resource_ready).

Operator metrics reference

FireboltEngine metrics

| Metric | Type | Labels | Updated | Description |
| --- | --- | --- | --- | --- |
| firebolt_engine_status_phase | Gauge | namespace, name, instance, phase | Every reconcile | StateSet-style: 1 for the current phase, 0 for all others. Phases: stable, creating, switching, draining, cleaning, stopped. |
| firebolt_engine_status_condition | Gauge | namespace, name, instance, type | Every reconcile | 1 when the condition is True, 0 when False or Unknown. Types: Ready, InstanceReady. |
| firebolt_engine_spec_replicas | Gauge | namespace, name, instance | Every reconcile | Desired replica count from spec.replicas. |
| firebolt_engine_active_generation | Gauge | namespace, name, instance | Every reconcile | Generation number currently serving traffic. |
| firebolt_engine_pods_ready | Gauge | namespace, name, instance | Every reconcile | Number of ready pods in the active generation. |
| firebolt_engine_pods_total | Gauge | namespace, name, instance | Every reconcile | Total pods in the active generation (includes non-ready). |
| firebolt_engine_draining_generation | Gauge | namespace, name, instance | Every reconcile | Generation being drained, or -1 if no drain is in progress. |
| firebolt_engine_last_reconciled_timestamp | Gauge | namespace, name, instance | Every reconcile | Unix timestamp of the last successful reconcile. |
| firebolt_engine_drain_check_errors_total | Counter | namespace, name, instance | On drain probe failure | Cumulative count of drain probe failures (pod unreachable, metrics missing). |

FireboltInstance metrics

| Metric | Type | Labels | Updated | Description |
| --- | --- | --- | --- | --- |
| firebolt_instance_status_phase | Gauge | namespace, name, phase | Every reconcile | StateSet-style: 1 for the current phase, 0 for all others. Phases: Provisioning, Ready, Degraded, Failed. |
| firebolt_instance_status_condition | Gauge | namespace, name, type | Every reconcile | 1 when the condition is True, 0 when False or Unknown. Types: Ready, MetadataReady, GatewayReady. |
| firebolt_instance_info | Gauge | namespace, name, id, postgres_mode | Every reconcile | Always 1. Carries static metadata: instance ID and postgres mode (internal or external). |
| firebolt_instance_last_reconciled_timestamp | Gauge | namespace, name | Every reconcile | Unix timestamp of the last successful reconcile. |

Label glossary

| Label | Meaning |
| --- | --- |
| namespace | Kubernetes namespace of the CR |
| name | Name of the FireboltEngine or FireboltInstance CR |
| instance | Name of the parent FireboltInstance (from spec.instanceRef on engines) |
| phase | Current lifecycle phase |
| type | Condition type (e.g., Ready, MetadataReady) |
| id | Stable instance ID (ULID) |
| postgres_mode | internal (Firebolt Operator-managed) or external (user-provided) |

Example PromQL queries

Engine not ready:
firebolt_engine_status_condition{type="Ready"} == 0
Engine stuck in draining phase for more than 10 minutes (the draining gauge stayed at 1 for the whole window):
min_over_time(firebolt_engine_status_phase{phase="draining"}[10m]) == 1
Scaling in progress (ready pods less than desired):
firebolt_engine_pods_ready < firebolt_engine_spec_replicas
Instance degraded:
firebolt_instance_status_phase{phase="Degraded"} == 1
Stuck controller (no reconcile for 5 minutes):
time() - firebolt_engine_last_reconciled_timestamp > 300
Drain probe failures occurring in the last 5 minutes:
rate(firebolt_engine_drain_check_errors_total[5m]) > 0
Fleet overview (ready engines per instance):
count by (namespace, instance) (firebolt_engine_status_condition{type="Ready"} == 1)
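
Several of these queries translate naturally into alerting rules. The sketch below uses the standard Prometheus rule file format. The group name, alert names, for durations, and severity labels are assumptions to adapt, not values shipped with the chart. Using for: 10m on the draining rule is an alternative way to express "stuck for more than 10 minutes", since the alert fires only after the expression has held continuously for that long.

```yaml
groups:
  - name: firebolt-operator            # hypothetical group name
    rules:
      - alert: FireboltEngineNotReady
        expr: firebolt_engine_status_condition{type="Ready"} == 0
        for: 5m                        # illustrative grace period
        labels:
          severity: warning
        annotations:
          summary: "Engine {{ $labels.namespace }}/{{ $labels.name }} is not Ready"
      - alert: FireboltEngineStuckDraining
        expr: firebolt_engine_status_phase{phase="draining"} == 1
        for: 10m                       # must hold continuously for 10 minutes
        labels:
          severity: warning
        annotations:
          summary: "Engine {{ $labels.namespace }}/{{ $labels.name }} has been draining for over 10 minutes"
      - alert: FireboltInstanceDegraded
        expr: firebolt_instance_status_phase{phase="Degraded"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.namespace }}/{{ $labels.name }} is Degraded"
      - alert: FireboltEngineReconcileStalled
        expr: time() - firebolt_engine_last_reconciled_timestamp > 300
        labels:
          severity: warning
        annotations:
          summary: "No successful reconcile for engine {{ $labels.namespace }}/{{ $labels.name }} in 5 minutes"
```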

Cardinality

Each FireboltEngine produces approximately 15 time series (6 phases + 2 conditions + 7 scalar gauges). Each FireboltInstance produces approximately 10 series (4 phases + 4 conditions + 1 info + 1 timestamp). For a cluster with 10 instances and 50 engines, expect roughly 850 series from the Firebolt Operator (50 × 15 + 10 × 10 = 850). This is a small number of series for a typical Prometheus deployment. Metric label sets are cleaned up when CRs are deleted, so terminated engines do not leave stale series.