# Monitoring

> Prometheus metrics exposed by the Firebolt Operator and managed components.

This document describes how the Firebolt Operator exposes Prometheus metrics for the components it manages.

## Metrics endpoints

| Component             | Port                                   | Name              | Path                | What it exposes                                                                                                                                                                                       |
| --------------------- | -------------------------------------- | ----------------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Engine pods           | 9090                                   | `metrics`         | `/metrics`          | `firebolt_running_queries`, `firebolt_suspended_queries`, and other engine-internal gauges. The Firebolt Operator scrapes the first two via `pods/proxy` to drive both the drain check and auto-stop. |
| Gateway pods (Envoy)  | 9090 (default)                         | `metrics`         | `/stats/prometheus` | Envoy connection, request, and cluster stats                                                                                                                                                          |
| Firebolt Operator pod | Configurable via `metrics.bindAddress` | `https` or `http` | `/metrics`          | controller-runtime reconciliation, workqueue, REST client, and Go runtime metrics                                                                                                                     |

The gateway metrics port defaults to 9090 and is configurable per FireboltInstance CR via `spec.gateway.metricsPort`. Metadata pods do not currently expose a Prometheus metrics endpoint.
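As a rough illustration, overriding the gateway metrics port might look like the fragment below. Only the `spec.gateway.metricsPort` field comes from this page. The instance name and the value `9191` are placeholders, and `apiVersion` is omitted because the CRD group and version are not stated here.

```yaml
# Sketch of a FireboltInstance manifest fragment (apiVersion intentionally omitted).
kind: FireboltInstance
metadata:
  name: example-instance   # hypothetical instance name
spec:
  gateway:
    metricsPort: 9191      # hypothetical override; defaults to 9090 when unset
```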

### Firebolt Operator metrics mode

The Firebolt Operator metrics endpoint mode is controlled by two Helm values:

| Mode            | `metrics.secure` | `metrics.bindAddress` | Port name | Scheme                                       |
| --------------- | ---------------- | --------------------- | --------- | -------------------------------------------- |
| HTTPS (default) | `true`           | `:8443`               | `https`   | `https` with authn/authz and self-signed TLS |
| HTTP            | `false`          | `:8080`               | `http`    | plain `http`                                 |

The Firebolt Operator PodMonitor template automatically adapts its port reference, scheme, bearer token, and TLS configuration based on `metrics.secure`.
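For example, switching the operator endpoint to plain HTTP would look roughly like this, assuming the dotted names in the table map directly onto nested keys in `values.yaml`:

```yaml
# Sketch of a values.yaml override for the HTTP row of the table above.
metrics:
  secure: false          # plain HTTP, no authn/authz or TLS
  bindAddress: ":8080"   # container port is then named `http`
```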

## Scraping with Prometheus

The Firebolt Operator Helm chart ships optional `PodMonitor` resources (one per component type) that can be enabled via `values.yaml`:

```yaml
podMonitor:
  engines:
    enabled: true
  gateway:
    enabled: true
  operator:
    enabled: true
  allNamespaces: false   # set true when the Firebolt Operator watches all namespaces
```

Each PodMonitor uses label selectors to match the relevant pods:

* **Engines**: `firebolt.io/engine` (exists). Matches all engine pods regardless of engine name.
* **Gateway**: `firebolt.io/component=gateway`
* **Firebolt Operator**: `control-plane=controller-manager` + chart selector labels

When `allNamespaces` is true, `namespaceSelector.any: true` is added so pods in any namespace are discovered. This does not apply to the Firebolt Operator PodMonitor because the Firebolt Operator always runs in the release namespace.
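To make the selector and namespace behavior concrete, here is a minimal sketch of what an engine PodMonitor with `allNamespaces: true` could look like. It is not the chart's actual template output: the resource name and namespace are hypothetical, and only the label key, port name, path, and `namespaceSelector.any` come from this page.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: firebolt-engines        # hypothetical name
  namespace: firebolt-system    # hypothetical release namespace
spec:
  namespaceSelector:
    any: true                   # added when podMonitor.allNamespaces is true
  selector:
    matchExpressions:
      - key: firebolt.io/engine
        operator: Exists        # any engine pod, regardless of engine name
  podMetricsEndpoints:
    - port: metrics             # named container port, see the endpoints table
      path: /metrics
```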

### Per-instance monitoring

The chart-level PodMonitors apply uniform scrape configuration to all instances in scope. If you need per-instance control (different intervals, selective enablement, custom relabelings), disable the chart-level PodMonitors and deploy your own alongside each FireboltInstance or FireboltEngine CR. The label selectors to use are:

* Engine pods: `firebolt.io/engine: <engine-name>`
* Gateway pods: `firebolt.io/instance: <instance-name>`, `firebolt.io/component: gateway`
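A minimal sketch of two hand-written PodMonitors using these selectors follows. The resource names, namespace, engine name `analytics`, instance name `prod`, and scrape intervals are all hypothetical; only the label keys, the `metrics` port name, and the paths come from this page.

```yaml
# Scrape a single engine more frequently than the rest of the fleet.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: engine-analytics          # hypothetical
  namespace: firebolt             # hypothetical
spec:
  selector:
    matchLabels:
      firebolt.io/engine: analytics   # hypothetical engine name
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 15s               # hypothetical per-engine interval
---
# Scrape the gateway pods of a single instance.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: gateway-prod              # hypothetical
  namespace: firebolt             # hypothetical
spec:
  selector:
    matchLabels:
      firebolt.io/instance: prod  # hypothetical instance name
      firebolt.io/component: gateway
  podMetricsEndpoints:
    - port: metrics
      path: /stats/prometheus     # Envoy stats path, not /metrics
      interval: 30s               # hypothetical
```

Because both PodMonitors reference the named port `metrics`, they keep working if the gateway port number is overridden on the FireboltInstance.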

## Architecture decisions

### Helm chart templates, not Firebolt Operator reconciliation

PodMonitor resources are shipped as Helm templates, not created by the Firebolt Operator's Go reconciliation loop. This follows the pattern used by cert-manager, Strimzi, the FoundationDB operator, and others. CloudNativePG tried a reconciler-managed approach and deprecated it in v1.26 because:

* It creates a hard dependency on the Prometheus Operator CRDs. An operator that manages PodMonitors fails to reconcile on clusters where the CRDs are not installed.
* The operator overwrites user customizations (scrape intervals, relabelings, TLS config) on every reconcile.
* It adds RBAC complexity for `monitoring.coreos.com` resources.
* Platform teams want full ownership of their monitoring configuration.

### Gateway stats listener

The Envoy admin interface (port 9901) is bound to `127.0.0.1` and must stay that way. It exposes mutation endpoints (`POST /healthcheck/fail`, `POST /quitquitquit`) that the preStop hook depends on for graceful shutdown. Binding admin to `0.0.0.0` would allow any pod in the cluster to drain or kill gateway pods.

Instead, a separate read-only stats listener is added on the metrics port (default 9090). This listener proxies only `/stats/prometheus` from the admin interface via an internal static cluster, exposing no mutation endpoints.
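The shape of such a listener can be sketched in Envoy's v3 static configuration as below. This illustrates the pattern and is not the configuration the Firebolt Operator renders: the listener name, cluster name, `stat_prefix`, and timeout are hypothetical, while the ports, the loopback admin address, and the single allowed path come from the description above.

```yaml
static_resources:
  listeners:
    - name: stats_listener                 # hypothetical name
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 9090                 # metrics port (default)
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: stats         # hypothetical
                route_config:
                  virtual_hosts:
                    - name: stats
                      domains: ["*"]
                      routes:
                        # Exact-path match: no other admin endpoint is routable,
                        # so requests for anything else get a 404 from this listener.
                        - match:
                            path: /stats/prometheus
                          route:
                            cluster: envoy_admin
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: envoy_admin                    # hypothetical name
      type: STATIC
      connect_timeout: 1s                  # hypothetical
      load_assignment:
        cluster_name: envoy_admin
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1   # admin stays bound to loopback
                      port_value: 9901
```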

### Consistent metrics port

Engine pods and gateway pods both expose Prometheus metrics on a container port named `metrics` (default 9090). The gateway override lives on the FireboltInstance at `spec.gateway.metricsPort`. The Firebolt Operator stamps the corresponding `metrics`-named port on the rendered Envoy container. Engine pods carry the port via the per-FireboltEngine wiring. PodMonitors can therefore always reference `port: metrics` without knowing the actual port number. The metadata pod does not currently expose a Prometheus endpoint, so no `metrics` port is stamped there.

### Cross-namespace support

When the Firebolt Operator watches all namespaces (`watchNamespace` is empty), engine, gateway, and metadata pods may live in namespaces other than the Firebolt Operator's namespace. Setting `podMonitor.allNamespaces: true` adds `namespaceSelector.any: true` to the PodMonitors so Prometheus discovers pods across all namespaces.

### Embedded CR status metrics

The Firebolt Operator embeds custom Prometheus metrics that expose the status of every FireboltEngine and FireboltInstance it manages. Apart from the `firebolt_engine_drain_check_errors_total` counter, these are level-triggered gauges updated on every reconcile. There are no timers or persisted timestamps, which is consistent with the Firebolt Operator's level-driven reconciliation model. Duration and trend analysis are left to PromQL.

This follows the pattern used by ArgoCD (`argocd_app_info`), Flux (`gotk_reconcile_condition`), cert-manager (`certmanager_certificate_ready_status`), and Crossplane (`crossplane_managed_resource_ready`).

## Operator metrics reference

### FireboltEngine metrics

| Metric                                      | Type    | Labels                                   | Updated                | Description                                                                                                                              |
| ------------------------------------------- | ------- | ---------------------------------------- | ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `firebolt_engine_status_phase`              | Gauge   | `namespace`, `name`, `instance`, `phase` | Every reconcile        | StateSet-style: 1 for the current phase, 0 for all others. Phases: `stable`, `creating`, `switching`, `draining`, `cleaning`, `stopped`. |
| `firebolt_engine_status_condition`          | Gauge   | `namespace`, `name`, `instance`, `type`  | Every reconcile        | 1 when the condition is True, 0 when False or Unknown. Types: `Ready`, `InstanceReady`.                                                  |
| `firebolt_engine_spec_replicas`             | Gauge   | `namespace`, `name`, `instance`          | Every reconcile        | Desired replica count from `spec.replicas`.                                                                                              |
| `firebolt_engine_active_generation`         | Gauge   | `namespace`, `name`, `instance`          | Every reconcile        | Generation number currently serving traffic.                                                                                             |
| `firebolt_engine_pods_ready`                | Gauge   | `namespace`, `name`, `instance`          | Every reconcile        | Number of ready pods in the active generation.                                                                                           |
| `firebolt_engine_pods_total`                | Gauge   | `namespace`, `name`, `instance`          | Every reconcile        | Total pods in the active generation (includes non-ready).                                                                                |
| `firebolt_engine_draining_generation`       | Gauge   | `namespace`, `name`, `instance`          | Every reconcile        | Generation being drained, or -1 if no drain is in progress.                                                                              |
| `firebolt_engine_last_reconciled_timestamp` | Gauge   | `namespace`, `name`, `instance`          | Every reconcile        | Unix timestamp of the last successful reconcile.                                                                                         |
| `firebolt_engine_drain_check_errors_total`  | Counter | `namespace`, `name`, `instance`          | On drain probe failure | Cumulative count of drain probe failures (pod unreachable, metrics missing).                                                             |

### FireboltInstance metrics

| Metric                                        | Type  | Labels                                     | Updated         | Description                                                                                                       |
| --------------------------------------------- | ----- | ------------------------------------------ | --------------- | ----------------------------------------------------------------------------------------------------------------- |
| `firebolt_instance_status_phase`              | Gauge | `namespace`, `name`, `phase`               | Every reconcile | StateSet-style: 1 for the current phase, 0 for all others. Phases: `Provisioning`, `Ready`, `Degraded`, `Failed`. |
| `firebolt_instance_status_condition`          | Gauge | `namespace`, `name`, `type`                | Every reconcile | 1 when the condition is True, 0 when False or Unknown. Types: `Ready`, `MetadataReady`, `GatewayReady`.           |
| `firebolt_instance_info`                      | Gauge | `namespace`, `name`, `id`, `postgres_mode` | Every reconcile | Always 1. Carries static metadata: instance ID and postgres mode (`internal` or `external`).                      |
| `firebolt_instance_last_reconciled_timestamp` | Gauge | `namespace`, `name`                        | Every reconcile | Unix timestamp of the last successful reconcile.                                                                  |

### Label glossary

| Label           | Meaning                                                                  |
| --------------- | ------------------------------------------------------------------------ |
| `namespace`     | Kubernetes namespace of the CR                                           |
| `name`          | Name of the FireboltEngine or FireboltInstance CR                        |
| `instance`      | Name of the parent FireboltInstance (from `spec.instanceRef` on engines) |
| `phase`         | Current lifecycle phase                                                  |
| `type`          | Condition type (e.g., `Ready`, `MetadataReady`)                          |
| `id`            | Stable instance ID (ULID)                                                |
| `postgres_mode` | `internal` (Firebolt Operator-managed) or `external` (user-provided)     |

### Example PromQL queries

Engine not ready:

```promql
firebolt_engine_status_condition{type="Ready"} == 0
```

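Engine stuck in draining phase for more than 10 minutes (compares the current sample with the one from 10 minutes ago, so a drain that stopped and restarted in between also matches):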

```promql
firebolt_engine_status_phase{phase="draining"} == 1
  unless firebolt_engine_status_phase{phase="draining"} offset 10m == 0
```

Scaling in progress (ready pods less than desired):

```promql
firebolt_engine_pods_ready < firebolt_engine_spec_replicas
```

Instance degraded:

```promql
firebolt_instance_status_phase{phase="Degraded"} == 1
```

Stuck controller (no reconcile for 5 minutes):

```promql
time() - firebolt_engine_last_reconciled_timestamp > 300
```

Drain probe failures spiking:

```promql
rate(firebolt_engine_drain_check_errors_total[5m]) > 0
```

Fleet overview (ready engines per instance):

```promql
count by (namespace, instance) (firebolt_engine_status_condition{type="Ready"} == 1)
```
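Some of these queries can be turned into alerts with a Prometheus Operator `PrometheusRule`. The sketch below is one possible arrangement, not something the chart ships: the resource name, group name, alert names, severities, and `for` durations are all hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: firebolt-operator-alerts        # hypothetical
spec:
  groups:
    - name: firebolt.rules              # hypothetical
      rules:
        - alert: FireboltEngineNotReady
          expr: firebolt_engine_status_condition{type="Ready"} == 0
          for: 5m                       # hypothetical grace period
          labels:
            severity: warning
          annotations:
            summary: "Engine {{ $labels.namespace }}/{{ $labels.name }} is not Ready"
        - alert: FireboltEngineStuckDraining
          # `for` requires the phase to hold at every evaluation in the window.
          expr: firebolt_engine_status_phase{phase="draining"} == 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Engine {{ $labels.namespace }}/{{ $labels.name }} has been draining for 10 minutes"
        - alert: FireboltEngineReconcileStale
          expr: time() - firebolt_engine_last_reconciled_timestamp > 300
          labels:
            severity: critical
          annotations:
            summary: "Engine {{ $labels.namespace }}/{{ $labels.name }} has not reconciled for 5 minutes"
```

Using `for: 10m` on the plain phase gauge is a stricter test than comparing against an `offset 10m` sample, because the alert resets whenever the engine leaves the draining phase.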

### Cardinality

Going by the tables above, each FireboltEngine produces approximately 15 time series (6 phases + 2 conditions + 7 single-series metrics) and each FireboltInstance approximately 9 (4 phases + 3 conditions + 1 info + 1 timestamp). For a cluster with 10 instances and 50 engines, expect roughly 840 series from the Firebolt Operator (50 × 15 + 10 × 9), which is negligible for a typical Prometheus deployment.

Metric label sets are cleaned up when CRs are deleted, so terminated engines do not leave stale series.
