Monitoring

This document describes how the Firebolt Operator exposes Prometheus metrics for the components it manages.

Metrics endpoints

| Component | Port | Port name | Path | What it exposes |
| --- | --- | --- | --- | --- |
| Engine pods | 9090 | `metrics` | `/metrics` | `firebolt_running_queries`, `firebolt_suspended_queries`, and other engine-internal gauges. The Firebolt Operator scrapes the first two via `Pods/proxy` to drive both the drain check and auto-stop. |
| Gateway pods (Envoy) | 9090 (default) | `metrics` | `/stats/prometheus` | Envoy connection, request, and cluster stats |
| Firebolt Operator pod | Configurable via `metrics.bindAddress` | `https` or `http` | `/metrics` | controller-runtime reconciliation, workqueue, REST client, and Go runtime metrics |

The gateway metrics port defaults to 9090 and is configurable per FireboltInstance CR via spec.gateway.metricsPort. Metadata pods do not currently expose a Prometheus metrics endpoint.
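As a rough illustration, overriding the gateway metrics port on a single instance might look like the fragment below. Only spec.gateway.metricsPort comes from this document. The instance name and the port value 9191 are hypothetical, and apiVersion is left out because it is not stated here.

```yaml
# Fragment of a FireboltInstance manifest (not a complete resource).
kind: FireboltInstance
metadata:
  name: example-instance   # hypothetical name
spec:
  gateway:
    metricsPort: 9191      # hypothetical value; defaults to 9090 when unset
```

Because the port is exposed under the name metrics, a PodMonitor that references the port by name should need no change after such an override.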

Firebolt Operator metrics mode

The Firebolt Operator metrics endpoint mode is controlled by two Helm values:

| Mode | `metrics.secure` | `metrics.bindAddress` | Port name | Scheme |
| --- | --- | --- | --- | --- |
| HTTPS (default) | `true` | `:8443` | `https` | `https` with authn/authz and self-signed TLS |
| HTTP | `false` | `:8080` | `http` | plain `http` |

The Firebolt Operator PodMonitor template automatically adapts its port reference, scheme, bearer token, and TLS configuration based on metrics.secure.
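For example, switching the endpoint to plain HTTP could be expressed in values.yaml roughly as follows. The two keys and their values are taken from the HTTP row of the table above. The surrounding file layout is assumed.

```yaml
# values.yaml (sketch): serve Firebolt Operator metrics over plain HTTP
metrics:
  secure: false          # plain http, without the TLS and authn/authz of the default mode
  bindAddress: ":8080"   # bind address listed for HTTP mode
```

With these values, the PodMonitor template is expected to reference the http port name and drop its bearer token and TLS settings, as described above.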

Scraping with Prometheus

The Firebolt Operator Helm chart ships optional PodMonitor resources (one per component type) that can be enabled via values.yaml:

podMonitor:
  engines:
    enabled: true
  gateway:
    enabled: true
  operator:
    enabled: true
  allNamespaces: false   # set true when the Firebolt Operator watches all namespaces

Each PodMonitor uses label selectors to match the relevant pods:

Engines: firebolt.io/engine (label exists). Matches all engine pods regardless of engine name.
Gateway: firebolt.io/component=gateway
Firebolt Operator: control-plane=controller-manager plus the chart's selector labels

When allNamespaces is true, namespaceSelector.any: true is added so pods in any namespace are discovered. This does not apply to the Firebolt Operator PodMonitor because the Firebolt Operator always runs in the release namespace.
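To make the selector and namespace behaviour concrete, the following is a hand-written sketch of what an engine PodMonitor could look like when allNamespaces is true. It is not the chart's rendered output: the resource name is hypothetical, and the chart may set additional fields.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: firebolt-engines          # hypothetical name
spec:
  selector:
    matchExpressions:
      - key: firebolt.io/engine   # matches any pod that carries this label
        operator: Exists
  namespaceSelector:
    any: true                     # added only when podMonitor.allNamespaces is true
  podMetricsEndpoints:
    - port: metrics               # named container port on engine pods
      path: /metrics
```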

Per-instance monitoring

The chart-level PodMonitors apply uniform scrape configuration to all instances in scope. If you need per-instance control (different intervals, selective enablement, custom relabelings), disable the chart-level PodMonitors and deploy your own alongside each FireboltInstance or FireboltEngine CR. The label selectors to use are:

Engine pods: firebolt.io/engine: <engine-name>
Gateway pods: firebolt.io/instance: <instance-name>, firebolt.io/component: gateway
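A per-instance PodMonitor for gateway pods might look like the sketch below, assuming the chart-level gateway PodMonitor has been disabled (podMonitor.gateway.enabled: false). The resource name, namespace, instance name, interval, and the static team label are all hypothetical. The label selectors, port name, and path come from this document.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-instance-gateway              # hypothetical name
  namespace: my-namespace                # hypothetical; namespace of the FireboltInstance
spec:
  selector:
    matchLabels:
      firebolt.io/instance: my-instance  # hypothetical instance name
      firebolt.io/component: gateway
  podMetricsEndpoints:
    - port: metrics                      # resolves to spec.gateway.metricsPort
      path: /stats/prometheus
      interval: 15s                      # example per-instance scrape interval
      relabelings:
        - action: replace                # attach a static label to every target
          targetLabel: team
          replacement: analytics         # hypothetical value
```

An engine-scoped PodMonitor would follow the same shape, with the selector firebolt.io/engine: <engine-name> and the path /metrics.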

Architecture decisions

Helm chart templates, not Firebolt Operator reconciliation

PodMonitor resources are shipped as Helm templates, not created by the Firebolt Operator’s Go reconciliation loop. This follows the dominant industry pattern used by cert-manager, Strimzi, the FoundationDB operator, and others. CloudNativePG tried a reconciler-managed approach and deprecated it in v1.26, for reasons that would apply equally to the Firebolt Operator:

It creates a hard dependency on the Prometheus Operator CRDs. The Firebolt Operator would fail to reconcile on clusters where the CRDs are not installed.
The Firebolt Operator would overwrite user customizations (scrape intervals, relabelings, TLS config) on every reconcile.
It adds RBAC complexity for monitoring.coreos.com resources.
It takes control away from platform teams, who want full ownership of their monitoring configuration.

Gateway stats listener

The Envoy admin interface (port 9901) is bound to 127.0.0.1 and must stay that way. It exposes mutation endpoints (POST /healthcheck/fail, POST /quitquitquit) that the preStop hook depends on for graceful shutdown. Binding admin to 0.0.0.0 would allow any pod in the cluster to drain or kill gateway pods. Instead, a separate read-only stats listener is added on the metrics port (default 9090). This listener proxies only /stats/prometheus from the admin interface via an internal static cluster, exposing no mutation endpoints.
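The arrangement described above can be sketched as an Envoy static configuration. This is an illustrative reconstruction, not the bootstrap the Firebolt Operator actually renders: the listener name, cluster name, and stat prefix are hypothetical, and only the ports, the loopback binding, and the single proxied path come from this document.

```yaml
admin:
  address:
    socket_address: { address: 127.0.0.1, port_value: 9901 }   # admin stays on loopback

static_resources:
  listeners:
    - name: stats                                              # hypothetical name
      address:
        socket_address: { address: 0.0.0.0, port_value: 9090 } # metrics port
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: stats
                route_config:
                  virtual_hosts:
                    - name: stats
                      domains: ["*"]
                      routes:
                        # Exact-path match: no other admin endpoint is reachable.
                        - match: { path: /stats/prometheus }
                          route: { cluster: envoy_admin }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: envoy_admin                                        # hypothetical name
      type: STATIC
      connect_timeout: 1s
      load_assignment:
        cluster_name: envoy_admin
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: 127.0.0.1, port_value: 9901 }
```

Because the route table contains a single exact-path entry, requests for any other path, including the mutation endpoints, have no matching route and never reach the admin interface.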

Consistent metrics port

Engine pods and gateway pods both expose Prometheus metrics on a container port named metrics (default 9090). The gateway override lives on the FireboltInstance at spec.gateway.metricsPort. The Firebolt Operator stamps the corresponding metrics-named port on the rendered Envoy container. Engine pods carry the port via the per-FireboltEngine wiring. PodMonitors can therefore always reference port: metrics without knowing the actual port number. The metadata pod does not currently expose a Prometheus endpoint, so no metrics port is stamped there.

Cross-namespace support

When the Firebolt Operator watches all namespaces (watchNamespace is empty), engine, gateway, and metadata pods may live in namespaces other than the Firebolt Operator’s namespace. Setting podMonitor.allNamespaces: true adds namespaceSelector.any: true to the PodMonitors so Prometheus discovers pods across all namespaces.

Embedded CR status metrics

The Firebolt Operator embeds custom Prometheus metrics that expose the status of every FireboltEngine and FireboltInstance it manages. These are level-triggered gauges updated on every reconcile. There are no timers or persisted timestamps, which is consistent with the Firebolt Operator’s level-driven reconciliation model. Duration and trend analysis are left to PromQL. This follows the pattern used by ArgoCD (argocd_app_info), Flux (gotk_reconcile_condition), cert-manager (certmanager_certificate_ready_status), and Crossplane (crossplane_managed_resource_ready).

Operator metrics reference

FireboltEngine metrics

| Metric | Type | Labels | Updated | Description |
| --- | --- | --- | --- | --- |
| `firebolt_engine_status_phase` | Gauge | `namespace`, `name`, `instance`, `phase` | Every reconcile | StateSet-style: 1 for the current phase, 0 for all others. Phases: `stable`, `creating`, `switching`, `draining`, `cleaning`, `stopped`. |
| `firebolt_engine_status_condition` | Gauge | `namespace`, `name`, `instance`, `type` | Every reconcile | 1 when the condition is True, 0 when False or Unknown. Types: `Ready`, `InstanceReady`. |
| `firebolt_engine_spec_replicas` | Gauge | `namespace`, `name`, `instance` | Every reconcile | Desired replica count from `spec.replicas`. |
| `firebolt_engine_active_generation` | Gauge | `namespace`, `name`, `instance` | Every reconcile | Generation number currently serving traffic. |
| `firebolt_engine_pods_ready` | Gauge | `namespace`, `name`, `instance` | Every reconcile | Number of ready pods in the active generation. |
| `firebolt_engine_pods_total` | Gauge | `namespace`, `name`, `instance` | Every reconcile | Total pods in the active generation (includes non-ready). |
| `firebolt_engine_draining_generation` | Gauge | `namespace`, `name`, `instance` | Every reconcile | Generation being drained, or -1 if no drain is in progress. |
| `firebolt_engine_last_reconciled_timestamp` | Gauge | `namespace`, `name`, `instance` | Every reconcile | Unix timestamp of the last successful reconcile. |
| `firebolt_engine_drain_check_errors_total` | Counter | `namespace`, `name`, `instance` | On drain probe failure | Cumulative count of drain probe failures (pod unreachable, metrics missing). |

FireboltInstance metrics

| Metric | Type | Labels | Updated | Description |
| --- | --- | --- | --- | --- |
| `firebolt_instance_status_phase` | Gauge | `namespace`, `name`, `phase` | Every reconcile | StateSet-style: 1 for the current phase, 0 for all others. Phases: `Provisioning`, `Ready`, `Degraded`, `Failed`. |
| `firebolt_instance_status_condition` | Gauge | `namespace`, `name`, `type` | Every reconcile | 1 when the condition is True, 0 when False or Unknown. Types: `Ready`, `MetadataReady`, `GatewayReady`. |
| `firebolt_instance_info` | Gauge | `namespace`, `name`, `id`, `postgres_mode` | Every reconcile | Always 1. Carries static metadata: instance ID and postgres mode (`internal` or `external`). |
| `firebolt_instance_last_reconciled_timestamp` | Gauge | `namespace`, `name` | Every reconcile | Unix timestamp of the last successful reconcile. |

Label glossary

| Label | Meaning |
| --- | --- |
| `namespace` | Kubernetes namespace of the CR |
| `name` | Name of the FireboltEngine or FireboltInstance CR |
| `instance` | Name of the parent FireboltInstance (from `spec.instanceRef` on engines) |
| `phase` | Current lifecycle phase |
| `type` | Condition type (e.g., `Ready`, `MetadataReady`) |
| `id` | Stable instance ID (ULID) |
| `postgres_mode` | `internal` (Firebolt Operator-managed) or `external` (user-provided) |

Example PromQL queries

Engine not ready:

firebolt_engine_status_condition{type="Ready"} == 0

Engine stuck in draining phase for more than 10 minutes:

min_over_time(firebolt_engine_status_phase{phase="draining"}[10m]) == 1

Scaling in progress (ready pods less than desired):

firebolt_engine_pods_ready < firebolt_engine_spec_replicas

Instance degraded:

firebolt_instance_status_phase{phase="Degraded"} == 1

Stuck controller (no reconcile for 5 minutes):

time() - firebolt_engine_last_reconciled_timestamp > 300

Drain probe failures spiking:

rate(firebolt_engine_drain_check_errors_total[5m]) > 0

Fleet overview (ready engines per instance):

count by (namespace, instance) (firebolt_engine_status_condition{type="Ready"} == 1)
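Queries like these are typically turned into alerts with a for: clause, so that a condition must hold continuously before the alert fires. The PrometheusRule below is a minimal sketch assuming the Prometheus Operator CRDs are installed. The rule names, group name, durations, and severities are hypothetical choices, and your Prometheus may require extra labels on the resource for its rule selector to pick it up.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: firebolt-operator-alerts            # hypothetical name
spec:
  groups:
    - name: firebolt.status                 # hypothetical group name
      rules:
        - alert: FireboltEngineNotReady
          expr: firebolt_engine_status_condition{type="Ready"} == 0
          for: 5m                           # hypothetical grace period
          labels:
            severity: warning
          annotations:
            summary: "Engine {{ $labels.namespace }}/{{ $labels.name }} is not Ready"
        - alert: FireboltEngineDrainStuck
          expr: firebolt_engine_status_phase{phase="draining"} == 1
          for: 10m                          # fires only if draining holds for 10 minutes
          labels:
            severity: warning
          annotations:
            summary: "Engine {{ $labels.namespace }}/{{ $labels.name }} has been draining for 10 minutes"
        - alert: FireboltInstanceDegraded
          expr: firebolt_instance_status_phase{phase="Degraded"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Instance {{ $labels.namespace }}/{{ $labels.name }} is Degraded"
```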

Cardinality

Each FireboltEngine produces approximately 15 time series (6 phases + 2 conditions + 7 scalar series, of which 6 are gauges and 1 is a counter). Each FireboltInstance produces approximately 10 series (4 phases + 4 conditions + 1 info + 1 timestamp). For a cluster with 10 instances and 50 engines, expect roughly 850 series from the Firebolt Operator (50 × 15 + 10 × 10). This is a small load for a typical Prometheus deployment. Metric label sets are cleaned up when CRs are deleted, so terminated engines do not leave stale series.
