Metrics endpoints
| Component | Port | Name | Path | What it exposes |
|---|---|---|---|---|
| Engine pods | 9090 | metrics | /metrics | firebolt_running_queries, firebolt_suspended_queries, and other engine-internal gauges. The Firebolt Operator reads the first two through the Kubernetes pods/proxy subresource to drive both the drain check and auto-stop. |
| Gateway pods (Envoy) | 9090 (default) | metrics | /stats/prometheus | Envoy connection, request, and cluster stats |
| Firebolt Operator pod | Configurable via metrics.bindAddress | https or http | /metrics | controller-runtime reconciliation, workqueue, REST client, and Go runtime metrics |
The gateway metrics port can be overridden on the FireboltInstance with spec.gateway.metricsPort. Metadata pods do not currently expose a Prometheus metrics endpoint.
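The drain check described in the engine row can be sketched as follows. This is a hypothetical Python illustration, not the Firebolt Operator's Go code. It assumes the operator fetches the engine's /metrics output through the API server's pods/proxy path, sums each gauge across all of its label sets, and treats an engine as drained once both firebolt_running_queries and firebolt_suspended_queries are zero. The helper names pod_proxy_path, sum_gauge, and is_drained are illustrative.

```python
import re

# Metric names in the Prometheus text exposition format match this pattern.
_METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def pod_proxy_path(namespace: str, pod: str, port: int = 9090, path: str = "metrics") -> str:
    """API server path that proxies a GET to a pod port (pods/proxy subresource)."""
    return f"/api/v1/namespaces/{namespace}/pods/{pod}:{port}/proxy/{path}"


def sum_gauge(exposition: str, metric: str) -> float:
    """Sum all samples of `metric` in a text-format scrape; raise KeyError if absent."""
    total, found = 0.0, False
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = _METRIC_NAME.match(line)
        if match is None or match.group(0) != metric:
            continue
        rest = line[match.end():]
        if rest.startswith("{"):
            # Skip the label set; the last "}" on the line closes it.
            rest = rest[rest.rindex("}") + 1:]
        # First token is the value; an optional timestamp may follow.
        total += float(rest.split()[0])
        found = True
    if not found:
        raise KeyError(f"{metric} missing from scrape")
    return total


def is_drained(exposition: str) -> bool:
    """Hypothetical drain rule: no running and no suspended queries."""
    running = sum_gauge(exposition, "firebolt_running_queries")
    suspended = sum_gauge(exposition, "firebolt_suspended_queries")
    return running == 0 and suspended == 0
```

A missing gauge raises KeyError instead of counting as zero, so an incomplete scrape is never mistaken for an idle engine. This mirrors the "metrics missing" case counted by firebolt_engine_drain_check_errors_total, although the real operator's handling is not specified here.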
Firebolt Operator metrics mode
The Firebolt Operator metrics endpoint mode is controlled by two Helm values, metrics.secure and metrics.bindAddress:
| Mode | metrics.secure | metrics.bindAddress | Port name | Scheme |
|---|---|---|---|---|
| HTTPS (default) | true | :8443 | https | https with authn/authz and self-signed TLS |
| HTTP | false | :8080 | http | plain http |
When switching modes, change metrics.bindAddress together with metrics.secure so the port matches the scheme.
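To check which endpoint a given combination of Helm values produces, the mode table can be encoded as a lookup. This is a minimal Python sketch that assumes the bindAddress values in the table are the defaults for each mode and that the port name always equals the scheme; operator_metrics_endpoint and MetricsEndpoint are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Mode table: metrics.secure -> (default metrics.bindAddress, port name and scheme).
_MODES = {True: (":8443", "https"), False: (":8080", "http")}


@dataclass(frozen=True)
class MetricsEndpoint:
    scheme: str     # "https" (authn/authz + self-signed TLS) or "http"
    port_name: str  # container port name a PodMonitor would reference
    port: int


def operator_metrics_endpoint(secure: bool, bind_address: Optional[str] = None) -> MetricsEndpoint:
    """Resolve the operator's metrics endpoint from the two Helm values."""
    default_bind, name = _MODES[secure]
    address = bind_address or default_bind
    _host, sep, port = address.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"metrics.bindAddress must look like [host]:port, got {address!r}")
    return MetricsEndpoint(scheme=name, port_name=name, port=int(port))
```

In HTTPS mode the scraper also needs credentials that pass the endpoint's authentication and authorization checks; that part is outside this sketch.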
Scraping with Prometheus
The Firebolt Operator Helm chart ships optional PodMonitor resources (one per component type) that can be enabled via values.yaml. Each one selects pods by label:
- Engines: firebolt.io/engine (exists). Matches all engine pods regardless of engine name.
- Gateway: firebolt.io/component=gateway
- Firebolt Operator: control-plane=controller-manager plus the chart selector labels
When podMonitor.allNamespaces is true, namespaceSelector.any: true is added so pods in any namespace are discovered. This does not apply to the Firebolt Operator PodMonitor because the Firebolt Operator always runs in the release namespace.
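The three chart-level PodMonitors can be sketched as plain manifests. The apiVersion, kind, and spec fields used here (selector, podMetricsEndpoints, namespaceSelector.any, scheme, tlsConfig.insecureSkipVerify) are standard Prometheus Operator PodMonitor fields. The resource names, the per-component path and port choices, the TLS handling, and the function chart_pod_monitors itself are assumptions for illustration, not the chart's actual templates. The chart's own selector labels are passed in rather than guessed.

```python
from typing import Dict, List

POD_MONITOR_API = "monitoring.coreos.com/v1"


def _pod_monitor(name: str, namespace: str, selector: dict, endpoint: dict, all_namespaces: bool) -> dict:
    spec = {"selector": selector, "podMetricsEndpoints": [endpoint]}
    if all_namespaces:
        # Without this, a PodMonitor only discovers pods in its own namespace.
        spec["namespaceSelector"] = {"any": True}
    return {
        "apiVersion": POD_MONITOR_API,
        "kind": "PodMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


def chart_pod_monitors(release_namespace: str, chart_selector_labels: Dict[str, str],
                       all_namespaces: bool = False, secure: bool = True) -> List[dict]:
    engines = _pod_monitor(
        "firebolt-engines", release_namespace,
        # "Exists" match: any pod carrying the label, whatever the engine name.
        {"matchExpressions": [{"key": "firebolt.io/engine", "operator": "Exists"}]},
        {"port": "metrics", "path": "/metrics"},
        all_namespaces,
    )
    gateway = _pod_monitor(
        "firebolt-gateway", release_namespace,
        {"matchLabels": {"firebolt.io/component": "gateway"}},
        {"port": "metrics", "path": "/stats/prometheus"},
        all_namespaces,
    )
    scheme = "https" if secure else "http"
    operator_endpoint = {"port": scheme, "path": "/metrics", "scheme": scheme}
    if secure:
        # The default certificate is self-signed; skipping verification is one option.
        operator_endpoint["tlsConfig"] = {"insecureSkipVerify": True}
    operator = _pod_monitor(
        "firebolt-operator", release_namespace,
        {"matchLabels": {"control-plane": "controller-manager", **chart_selector_labels}},
        operator_endpoint,
        all_namespaces=False,  # the operator always runs in the release namespace
    )
    return [engines, gateway, operator]
```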
Per-instance monitoring
The chart-level PodMonitors apply uniform scrape configuration to all instances in scope. If you need per-instance control (different intervals, selective enablement, custom relabelings), disable the chart-level PodMonitors and deploy your own alongside each FireboltInstance or FireboltEngine CR. The label selectors to use are:
- Engine pods: firebolt.io/engine: <engine-name>
- Gateway pods: firebolt.io/instance: <instance-name>, firebolt.io/component: gateway
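A per-instance setup could generate one PodMonitor for the gateway and one per monitored engine, using the selectors above. In this hypothetical sketch the caller passes a mapping from engine name to scrape interval, so engines left out of the mapping are not scraped at all (selective enablement). It assumes the gateway and the engines share one namespace and creates each PodMonitor there, so no namespaceSelector is needed: by default a PodMonitor only selects pods in its own namespace. Resource and function names are illustrative.

```python
from typing import Dict, List


def _pod_monitor(name: str, namespace: str, match_labels: Dict[str, str], path: str, interval: str) -> dict:
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PodMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # No namespaceSelector: the PodMonitor lives next to the pods it selects.
            "selector": {"matchLabels": match_labels},
            "podMetricsEndpoints": [{"port": "metrics", "path": path, "interval": interval}],
        },
    }


def per_instance_pod_monitors(namespace: str, instance: str, engine_intervals: Dict[str, str],
                              gateway_interval: str = "30s") -> List[dict]:
    """One gateway PodMonitor plus one per engine listed in engine_intervals."""
    monitors = [_pod_monitor(
        f"{instance}-gateway", namespace,
        {"firebolt.io/instance": instance, "firebolt.io/component": "gateway"},
        "/stats/prometheus", gateway_interval,
    )]
    for engine, interval in sorted(engine_intervals.items()):
        monitors.append(_pod_monitor(
            f"{engine}-engine", namespace, {"firebolt.io/engine": engine}, "/metrics", interval,
        ))
    return monitors
```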
Architecture decisions
Helm chart templates, not Firebolt Operator reconciliation
PodMonitor resources are shipped as Helm templates, not created by the Firebolt Operator’s Go reconciliation loop. This follows the dominant industry pattern used by cert-manager, Strimzi, the FoundationDB operator, and others. CloudNativePG tried a reconciler-managed approach and deprecated it in v1.26 because:
- It creates a hard dependency on the Prometheus Operator CRDs. The operator fails to reconcile on clusters where those CRDs are not installed.
- The operator overwrites user customizations (scrape intervals, relabelings, TLS config) on every reconcile.
- It adds RBAC complexity for monitoring.coreos.com resources.
- Platform teams want full ownership of their monitoring configuration.
Gateway stats listener
The Envoy admin interface (port 9901) is bound to 127.0.0.1 and must stay that way. It exposes mutation endpoints (POST /healthcheck/fail, POST /quitquitquit) that the preStop hook depends on for graceful shutdown. Binding the admin interface to 0.0.0.0 would allow any pod in the cluster to drain or kill gateway pods.
Instead, a separate read-only stats listener is added on the metrics port (default 9090). This listener proxies only /stats/prometheus from the admin interface via an internal static cluster, exposing no mutation endpoints.
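A read-only stats listener of this kind can be expressed in Envoy's v3 static_resources configuration. The sketch below builds that fragment as a Python dict that maps one-to-one onto the YAML Envoy loads. The listener name prometheus_stats, the cluster name envoy_admin, and the 0.25s connect timeout are assumptions, and the operator's rendered config may differ. The route uses an exact path match, so any other path, including the admin mutation endpoints, gets a 404 from this listener.

```python
def _socket_address(address: str, port: int) -> dict:
    return {"socket_address": {"address": address, "port_value": port}}


def stats_listener_resources(metrics_port: int = 9090, admin_port: int = 9901) -> dict:
    """Envoy v3 static_resources: a listener that exposes only /stats/prometheus."""
    http_connection_manager = {
        "@type": "type.googleapis.com/envoy.extensions.filters.network."
                 "http_connection_manager.v3.HttpConnectionManager",
        "stat_prefix": "prometheus_stats",
        "route_config": {
            "virtual_hosts": [{
                "name": "stats",
                "domains": ["*"],
                # Exact path match; every other path (admin mutations included) gets a 404.
                "routes": [{"match": {"path": "/stats/prometheus"},
                            "route": {"cluster": "envoy_admin"}}],
            }],
        },
        "http_filters": [{
            "name": "envoy.filters.http.router",
            "typed_config": {"@type": "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"},
        }],
    }
    return {
        "listeners": [{
            "name": "prometheus_stats",
            # Reachable by Prometheus from other pods.
            "address": _socket_address("0.0.0.0", metrics_port),
            "filter_chains": [{"filters": [{
                "name": "envoy.filters.network.http_connection_manager",
                "typed_config": http_connection_manager,
            }]}],
        }],
        "clusters": [{
            # Internal static cluster pointing at the loopback-only admin interface.
            "name": "envoy_admin",
            "connect_timeout": "0.25s",
            "type": "STATIC",
            "load_assignment": {
                "cluster_name": "envoy_admin",
                "endpoints": [{"lb_endpoints": [{"endpoint": {"address": _socket_address("127.0.0.1", admin_port)}}]}],
            },
        }],
    }
```

The admin listener itself stays on 127.0.0.1; only the proxied path crosses the pod boundary, which is what keeps the mutation endpoints private.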
Consistent metrics port
Engine pods and gateway pods both expose Prometheus metrics on a container port named metrics (default 9090). The gateway override lives on the FireboltInstance at spec.gateway.metricsPort. The Firebolt Operator stamps the corresponding metrics-named port on the rendered Envoy container. Engine pods carry the port via the per-FireboltEngine wiring. PodMonitors can therefore always reference port: metrics without knowing the actual port number. The metadata pod does not currently expose a Prometheus endpoint, so no metrics port is stamped there.
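The port stamping can be pictured as an idempotent transform on the container spec: drop any existing port named metrics, then append one with the resolved number, so repeated reconciles converge on the same result. This hypothetical Python sketch (function names are illustrative) assumes spec.gateway.metricsPort falls back to 9090 when unset; the operator's Go code is not shown in this document.

```python
import copy

DEFAULT_METRICS_PORT = 9090


def gateway_metrics_port(instance_spec: dict) -> int:
    """spec.gateway.metricsPort if set, otherwise the shared default."""
    gateway = instance_spec.get("gateway") or {}
    return gateway.get("metricsPort") or DEFAULT_METRICS_PORT


def stamp_metrics_port(container: dict, port: int) -> dict:
    """Return a copy of `container` carrying exactly one port named "metrics"."""
    stamped = copy.deepcopy(container)
    ports = [p for p in stamped.get("ports", []) if p.get("name") != "metrics"]
    ports.append({"name": "metrics", "containerPort": port, "protocol": "TCP"})
    stamped["ports"] = ports
    return stamped
```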
Cross-namespace support
When the Firebolt Operator watches all namespaces (watchNamespace is empty), engine, gateway, and metadata pods may live in namespaces other than the Firebolt Operator’s namespace. Setting podMonitor.allNamespaces: true adds namespaceSelector.any: true to the PodMonitors so Prometheus discovers pods across all namespaces.
Embedded CR status metrics
The Firebolt Operator embeds custom Prometheus metrics that expose the status of every FireboltEngine and FireboltInstance it manages. All but one are level-triggered gauges updated on every reconcile; the exception is firebolt_engine_drain_check_errors_total, a counter incremented on each drain probe failure. There are no timers or persisted timestamps, which is consistent with the Firebolt Operator’s level-driven reconciliation model. Duration and trend analysis are left to PromQL. This follows the pattern used by ArgoCD (argocd_app_info), Flux (gotk_reconcile_condition), cert-manager (certmanager_certificate_ready_status), and Crossplane (crossplane_managed_resource_ready).
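To illustrate the StateSet-style, level-triggered approach, the values of firebolt_engine_status_phase can be recomputed from the engine's current phase alone on every reconcile, with nothing remembered between reconciles. The Python sketch below (helper names are hypothetical, and the operator's own implementation is not shown in this document) emits one sample per known phase and renders the result in the Prometheus text exposition format.

```python
from typing import Dict, List, Tuple

ENGINE_PHASES = ("stable", "creating", "switching", "draining", "cleaning", "stopped")
Sample = Tuple[Dict[str, str], float]


def engine_phase_samples(namespace: str, name: str, instance: str, current_phase: str) -> List[Sample]:
    """StateSet encoding: one sample per known phase, 1 for the current one, 0 otherwise."""
    if current_phase not in ENGINE_PHASES:
        raise ValueError(f"unknown phase {current_phase!r}")
    return [
        ({"namespace": namespace, "name": name, "instance": instance, "phase": phase},
         1.0 if phase == current_phase else 0.0)
        for phase in ENGINE_PHASES
    ]


def _escape(value: str) -> str:
    # The text format escapes backslash, double quote, and newline in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render(metric: str, samples: List[Sample]) -> str:
    lines = []
    for labels, value in samples:
        label_text = ",".join(f'{k}="{_escape(v)}"' for k, v in labels.items())
        lines.append(f"{metric}{{{label_text}}} {value:g}")
    return "\n".join(lines)
```

Because the full set is regenerated from current status, a phase the engine has left flips back to 0 with no bookkeeping, and exactly one phase series reads 1 at any time.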
Operator metrics reference
FireboltEngine metrics
| Metric | Type | Labels | Updated | Description |
|---|---|---|---|---|
| firebolt_engine_status_phase | Gauge | namespace, name, instance, phase | Every reconcile | StateSet-style: 1 for the current phase, 0 for all others. Phases: stable, creating, switching, draining, cleaning, stopped. |
| firebolt_engine_status_condition | Gauge | namespace, name, instance, type | Every reconcile | 1 when the condition is True, 0 when False or Unknown. Types: Ready, InstanceReady. |
| firebolt_engine_spec_replicas | Gauge | namespace, name, instance | Every reconcile | Desired replica count from spec.replicas. |
| firebolt_engine_active_generation | Gauge | namespace, name, instance | Every reconcile | Generation number currently serving traffic. |
| firebolt_engine_pods_ready | Gauge | namespace, name, instance | Every reconcile | Number of ready pods in the active generation. |
| firebolt_engine_pods_total | Gauge | namespace, name, instance | Every reconcile | Total pods in the active generation (includes non-ready). |
| firebolt_engine_draining_generation | Gauge | namespace, name, instance | Every reconcile | Generation being drained, or -1 if no drain is in progress. |
| firebolt_engine_last_reconciled_timestamp | Gauge | namespace, name, instance | Every reconcile | Unix timestamp of the last successful reconcile. |
| firebolt_engine_drain_check_errors_total | Counter | namespace, name, instance | On drain probe failure | Cumulative count of drain probe failures (pod unreachable, metrics missing). |
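Alerting on these series is done in PromQL; the Python sketch below mirrors two hypothetical alert rules so their semantics are explicit: ready pods below the desired count while the engine is stable, and any increase in drain probe failures between two scrapes. The alert names, the choice to gate on the stable phase, and the simplified reset handling are assumptions. Prometheus's own increase() additionally extrapolates over the range window.

```python
from typing import Dict, List


def counter_increase(previous: float, current: float) -> float:
    """Increase between two counter readings; a drop is treated as a reset (e.g. operator restart)."""
    return current if current < previous else current - previous


def engine_alerts(previous: Dict, current: Dict) -> List[str]:
    """Hypothetical alert rules evaluated on two consecutive scrapes of one engine."""
    alerts = []
    # PromQL analogue:
    #   firebolt_engine_pods_ready < firebolt_engine_spec_replicas
    #     and on(namespace, name, instance) firebolt_engine_status_phase{phase="stable"} == 1
    if current["phase"] == "stable" and (
        current["firebolt_engine_pods_ready"] < current["firebolt_engine_spec_replicas"]
    ):
        alerts.append("EnginePodsNotReady")
    # PromQL analogue: increase(firebolt_engine_drain_check_errors_total[5m]) > 0
    errors_key = "firebolt_engine_drain_check_errors_total"
    if counter_increase(previous.get(errors_key, 0.0), current.get(errors_key, 0.0)) > 0:
        alerts.append("EngineDrainProbeFailing")
    return alerts
```

Gating on the stable phase avoids firing while an engine is intentionally stopped or mid-switch, when fewer ready pods than spec.replicas is expected.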
FireboltInstance metrics
| Metric | Type | Labels | Updated | Description |
|---|---|---|---|---|
| firebolt_instance_status_phase | Gauge | namespace, name, phase | Every reconcile | StateSet-style: 1 for the current phase, 0 for all others. Phases: Provisioning, Ready, Degraded, Failed. |
| firebolt_instance_status_condition | Gauge | namespace, name, type | Every reconcile | 1 when the condition is True, 0 when False or Unknown. Types: Ready, MetadataReady, GatewayReady. |
| firebolt_instance_info | Gauge | namespace, name, id, postgres_mode | Every reconcile | Always 1. Carries static metadata: instance ID and postgres mode (internal or external). |
| firebolt_instance_last_reconciled_timestamp | Gauge | namespace, name | Every reconcile | Unix timestamp of the last successful reconcile. |
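Because firebolt_instance_info is always 1, its labels can be attached to other instance series with an info-metric join, for example in PromQL: firebolt_instance_status_phase * on(namespace, name) group_left(postgres_mode) firebolt_instance_info. The Python sketch below reproduces that many-to-one match on (namespace, name) so the behavior is easy to check; join_info is a hypothetical helper, and series without a matching info series are dropped, as they are in PromQL.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Dict[str, str], float]


def join_info(series: List[Sample], info: List[Sample],
              on: Sequence[str] = ("namespace", "name"),
              copy_labels: Sequence[str] = ("postgres_mode",)) -> List[Sample]:
    """Python analogue of: series * on(namespace, name) group_left(postgres_mode) info."""
    index = {}
    for labels, value in info:
        key = tuple(labels[k] for k in on)
        if key in index:
            # PromQL also rejects duplicate series on the "one" side of a group_left match.
            raise ValueError(f"duplicate info series for {key}")
        index[key] = (labels, value)
    joined = []
    for labels, value in series:
        match = index.get(tuple(labels.get(k) for k in on))
        if match is None:
            continue  # unmatched series are dropped, as in PromQL vector matching
        info_labels, info_value = match
        joined.append(({**labels, **{k: info_labels[k] for k in copy_labels}}, value * info_value))
    return joined
```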
Label glossary
| Label | Meaning |
|---|---|
| namespace | Kubernetes namespace of the CR |
| name | Name of the FireboltEngine or FireboltInstance CR |
| instance | Name of the parent FireboltInstance (from spec.instanceRef on engines) |
| phase | Current lifecycle phase |
| type | Condition type (e.g., Ready, MetadataReady) |
| id | Stable instance ID (ULID) |
| postgres_mode | internal (Firebolt Operator-managed) or external (user-provided) |