VACUUM
Reference and syntax for the VACUUM command.
Optimizes tablets for query performance.
VACUUM
improves query efficiency by restructuring tablets for optimal performance. DML operations such as DELETE, UPDATE, INSERT, and COPY FROM might create tablets that are not optimally sized. Suboptimal tablets occur because DML efficiently utilizes resources in proportion to the cardinality of the data being inserted. In addition to standard SQL operations, tuples that are deleted by an update are not always physically removed from their table; they remain present until a VACUUM
is finished operating. In other words, tablets are not necessarily optimal for running queries; therefore, it’s necessary to run VACUUM
periodically, especially on frequently updated tables.
By default, any engine that processes a DML operation automatically assesses the health of tables’ data layout and runs the VACUUM
command when necessary to maintain the underlying table health. You can disable Auto VACUUM
for a specific engine using the ‘ALTER ENGINE’ statement:
Firebolt recommends keeping the default setting to maintain optimal layout of your tables’ underlying data.
You can also run VACUUM
manually using the syntax and options described below.
Syntax
Where <table|aggregating index>
is the name of the table or aggregating index to be optimized.
Options
Option name | Option value and description |
---|---|
INDEXES | FULL — (Default) Specifies whether to apply optimizations to both the table and all its aggregating indexes.INCREMENTAL — Similar to FULL , but will apply incremental optimizations to aggregating indexes, instead of complete reevaluation.NONE — Optimizes only the table. |
MAX_CONCURRENCY | <Number> — The maximum number of concurrent jobs to use during optimization. |
Examples
Optimize a table and its aggregating indexes
Optimizing a table along with its aggregating indexes ensures that both the data and aggregating indexes remain efficient, reducing query latency and improving overall performance.
The following code example optimizes the games
table and all its aggregating indexes:
The example above performs a complete rebuild of related aggregating indexes. Alternatively, you can specify to apply incremental optimization, which still improves index layout while using fewer resources, though it may not achieve optimal layout:
Optimize a table without its indexes
If you need to optimize a table without including its aggregating indexes to reduce resource usage, or retain efficient indexes, you can optimize only the table to prevent unnecessary computations.
The following code example optimizes the players
table without updating its aggregating indexes:
Optimize table named players
, using a single concurrent stream.
Usage Notes
The following are considerations for running the VACUUM
command:
What happens during VACUUM
VACUUM
analyzes the tablets, selects the ones that are too small or have too many deleted rows, and produces new versions that are optimized for query execution for both tablets and Aggregate Indexes.
VACUUM
runs as a non-blocking process, alongside other user-initiated operations. Consequently, some changes performed by VACUUM
may conflict with mutations run by the user. If VACUUM
and a user mutation modify the same data, the first committed operation takes precedence; see Transactions and concurrency for more details. This means that applications that run mutations in parallel with VACUUM
should gracefully handle transaction conflicts. It also means that benefits of the VACUUM
may be diminished by a mutation that committed data first.
Space and performance considerations
Users must be aware that VACUUM
consumes both compute and storage resources.
VACUUM
can consume a considerable amount of compute resources depending on the table size, number of tablets, and number of mutations in the table.
VACUUM
parallelizes its work into multiple concurrent streams, based on the number of CPU cores. While this can be beneficial for the speed of the operation, each stream consumes memory and CPU resources. Use the MAX_CONCURRENCY
option to limit the number of concurrent streams.
VACUUM
produces optimized versions of the data, while leaving behind older versions subject to the garbage collection (GC) process. These older tablets will continue to consume storage space until the GC process completes the clean-up.
If users would like to have precise control over VACUUM
, it may be preferable to run on a dedicated engine that could be sized and run just for VACUUM
operations. With VACUUM
running on a dedicated engine, it would not conflict with other queries’ execution and cache resources, and would provide operational separation from other scenarios.
As a general guidance, the smaller is the number of tablets the better Firebolt queries perform. Following this principle,
running VACUUM
on a single-node engine produces less tablets.
This is because VACUUM
can only merge tablets that are both:
- Assigned to the same node, and
- Belong to the same partition.
This principle is especially important for high-cardinality table partitions.
Finally, VACUUM
may introduce a temporary performance penalty as the newly created optimized tablets need to be synchronized to other engines operating on the same table(s).
VACUUM and Auto VACUUM observability
There are several aspects of VACUUM
that you can examine using the information schema tables.
Auto VACUUM settings for an engine can be checked using the information_schema.engines
view and looking at the column auto_vacuum
. Engines with Auto VACUUM explicitly enabled will have the value true
and when disabled will have the value false
. Another possible value is null
– this indicates default behavior of Auto VACUUM
which is enabled.
The fragmentation ratio, defined as the ratio of rows marked for deletion to the total number of rows, can be monitored using the information_schema.tables
view and looking at the fragmentation
column. When a VACUUM
operation completes for a table, you should expect to see the fragmentation ratio go down.
You can see the number of tablets for a table by querying information_schema.tables
and looking at the column number_of_tablets
. Typically, a VACUUM
operation will reduce this number as tablets are merged to an optimal state to maintain the underlying health of your tables.
Additionally, both information_schema.engine_running_queries
and information_schema.engine_query_history
show VACUUM
telemetry in the telemetry
column, which contains JSON. VACUUM
operation creates multiple jobs, each one responsible for specific portion of table or aggregating index. One can observe the progress and telemetry of both VACUUM
operation and its jobs by inspecting the vacuum_stats
object in the JSON.
For the VACUUM
operation itself, the fields in vacuum_stats
are:
Field | Description |
---|---|
type | Always set to vacuum |
objects | Number of tables and aggregating indexes the command operates on |
processed_objects | Number of tables and aggregating indexes have been already vacuumed |
success_jobs | Number of jobs completed successfully |
failed_jobs | Number of failed jobs |
For the VACUUM
job, the fields in vacuum_stats
are:
Field | Description |
---|---|
type | Always set to vacuum_job |
parent_query_id | The unique identifier of the VACUUM command this sub-job is part of |
node-ordinal | The ordinal number of the node the sub-job is running on |
tablets | Number of tablets that the sub-job is optimizing |
rows | Total number of rows (included deleted rows) in those tablets |
deleted_rows | Total number of deleted rows in those tablets |
Consider following example:
telemetry |
---|
{"vacuum_stats":{"type":"vacuum_job", "version":"v0.0.0", "parent_query_id":"c1bf80a4-005b-4063-a271-696de6906471", "node_ordinal":1, "tablets":1, "rows":2, "deleted_rows":1}} |
{"vacuum_stats":{"type":"vacuum_job", "version":"v0.0.0", "parent_query_id":"c1bf80a4-005b-4063-a271-696de6906471", "node_ordinal":1, "tablets":2, "rows":2, "deleted_rows":0}} |
{"vacuum_stats":{"type":"vacuum", "version":"v0.0.0", "objects":1, "processed_objects":1, "success_jobs":2, "failed_jobs":0}} |
Example with measuring the performance impact of VACUUM
Over time, operations such as INSERT
, DELETE
, and UPDATE
can create suboptimal tablets that decrease query performance. The VACUUM
command restructures these tablets by removing deleted rows and optimizing storage, leading to faster queries.
This example demonstrates the impact of VACUUM
by:
- Creating a large table with 10 million rows.
- Deleting 90% of the rows, leaving behind fragmented data.
- Running a query before and after
VACUUM
to compare run times.
The following code example loads data from a CSV file in an Amazon S3 bucket into the tutorial_vacuum
table with headers:
Script above loads 10 rows from S3 csv file; after that it inserts into the same table the cross product of the 10 inserted rows with 1 million integers.
Next, the following code example deletes all rows from the tutorial_vacuum
table where the LevelID
value is greater than 1, resulting in about 900,000 deleted rows:
The following code example runs a query computing checksum of the entire tutorial_vacuum
table before and after performing VACUUM
, allowing a comparison of query performance and efficiency improvements after optimization.
In the previous code example, the first SELECT
is run on data with many deleted rows, while the second runs after VACUUM
, benefiting from it. The following query history shows the performance benefit of the VACUUM
operation:
NO | STATEMENT | STATUS | DURATION |
---|---|---|---|
1 | SELECT hash_agg(*) FROM tutorial_vacuum; | Success | 4.43 s |
2 | VACUUM tutorial_vacuum; | Success | 17.53 s |
3 | SELECT hash_agg(*) FROM tutorial_vacuum; | Success | 0.82 s |
Note that the initial SELECT
query ran for over 4 seconds, while the identical SELECT
query ran for under a second, after running VACUUM
.