This is a technical guide to how concurrency autoscaling works in Firebolt engines. For an overview of autoscaling, see Concurrency auto-scaling.

Basic operation

Firebolt engines can scale for concurrency by changing the number of clusters that belong to a single engine. You set the bounds with the MIN_CLUSTERS and MAX_CLUSTERS options in CREATE ENGINE and ALTER ENGINE commands. When demand rises above the current capacity, Firebolt adds one new cluster; when load drops, it removes one. After each adjustment the service observes a short stabilization window before it evaluates the next scale decision, preventing thrashing during bursty traffic.
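For example, the bounds can be set like this (the engine name and option values are illustrative, and the surrounding syntax is a sketch; consult the CREATE ENGINE and ALTER ENGINE references for the exact grammar):

```sql
-- Create an engine that may scale between 1 and 3 clusters.
-- Engine name, TYPE, and NODES values are illustrative.
CREATE ENGINE my_engine WITH
  TYPE = M
  NODES = 2
  MIN_CLUSTERS = 1
  MAX_CLUSTERS = 3;

-- Later, widen the concurrency range of the existing engine.
ALTER ENGINE my_engine SET MAX_CLUSTERS = 5;
```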

When a stopped engine starts, it comes up with its minimum number of clusters (MIN_CLUSTERS).

Auto-scaling metrics

The autoscaler bases its decisions on three workload signals:

  • CPU utilization – averaged across nodes in each cluster, then averaged across clusters
  • RAM utilization – averaged in the same way
  • Query queue time – the maximum time any waiting query in the engine has spent in the queue

The exact thresholds and the way metrics are aggregated are tuned by Firebolt and can be changed without notice. Current default thresholds are:

Metric                | Add clusters (any) | Remove clusters (all)
CPU utilization       | over 90%           | under 75%
RAM utilization       | over 70%           | under 50%
Max query queue time  | over 3 seconds     | under 1 second

The current default stabilization window is 1 minute.

To monitor engine statistics, see Monitoring Engine Usage.

Monitoring autoscaling

You can check how many clusters an engine is currently using, along with its minimum and maximum limits, via the information_schema.engines table.
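A sketch of such a check (the column names and engine name below are assumptions; verify them against the information_schema.engines reference):

```sql
SELECT engine_name,
       clusters,       -- current number of clusters (assumed column name)
       min_clusters,   -- assumed column name
       max_clusters    -- assumed column name
FROM information_schema.engines
WHERE engine_name = 'my_engine';  -- illustrative engine name
```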

You can see how an engine's cluster count has changed over time via the information_schema.engine_history table.
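For instance, a query along these lines would list scaling events in reverse chronological order (column names are assumptions; check the engine_history reference):

```sql
SELECT engine_name,
       event_type,        -- assumed column name
       event_start_time   -- assumed column name
FROM information_schema.engine_history
WHERE engine_name = 'my_engine'   -- illustrative engine name
ORDER BY event_start_time DESC;
```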

Gotchas

This is concurrency auto-scaling: it creates extra clusters so that new queries can start promptly. It does not re-plan or speed up queries that were already running. Adding more clusters will not make a single large query finish faster. If an individual query needs more processing power, consider scaling the engine up or out (larger TYPE or more NODES) instead. For general sizing advice, see the Sizing Engines guide.
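For instance, to give an individual heavy query more resources you might resize the engine instead of raising MAX_CLUSTERS (values are illustrative; see the ALTER ENGINE reference for the exact syntax):

```sql
-- Scale up: a larger node type for each cluster (value illustrative).
ALTER ENGINE my_engine SET TYPE = L;

-- Scale out: more nodes per cluster (value illustrative).
ALTER ENGINE my_engine SET NODES = 4;
```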

During scale-down, the engine waits for all queries running on a cluster that is shutting down to finish before removing it (a graceful drain).

During scale-up, an added cluster starts "cold" (it has no data cached). Because of this, Firebolt might route less traffic to the new cluster until it has warmed up, to avoid overloading it. We are working on a feature that will make a new cluster proactively fetch data so that its cache stays in sync with the other clusters.