VACUUM

Optimizes tablets for query performance.

VACUUM optimizes tablets for query performance. DML operations (such as DELETE, UPDATE, INSERT and COPY FROM) might create tablets that are not optimally sized. Suboptimal tablets occur because DML efficiently utilizes resources in proportion to the cardinality of the data being inserted. In addition to standard SQL operations, tuples that are deleted by an update are not always physically removed from their table; they remain present until a VACUUM is done. In other words, tablets are not necessarily optimal for running queries; therefore, it’s necessary to do VACUUM periodically, especially on frequently updated tables.

Syntax

VACUUM <table>

Parameters

Parameter Description Supported input types
<table> The name of the table to be optimized FACT, DIMENSION table or AGGREGATE INDEX

Example

Optimize table named games.

VACUUM games;

Usage Notes

Here are considerations related to running the VACUUM command.

  • What happens during VACUUM
    VACUUM analyzes the tablets, selects the ones that are too small or have too many deleted rows, and produces new versions that are optimized for query execution for both tablets and Aggregate Indexes.
    VACUUM is a non-blocking process and runs alongside other operations executed by the user. Consequently, some changes performed by VACUUM may conflict with mutations run by the user. The operation that commits first wins (see Transaction and concurrency for details) . This means that applications that execute mutations in parallel with VACUUM should gracefully handle transaction conflicts. It also means that benefits of the VACUUM may be diminished by a mutation that committed data first.

  • Space and performance considerations
    Users must be aware that VACUUM consumes both compute and storage resources.
    VACUUM can consume a considerable amount of compute resources depending on the table size, number of tablets, and number of mutations in the table.
    VACUUM produces optimized versions of the data, while leaving behind older versions subject to the garbage collection (GC) process. These older tablets will continue to consume storage space until the GC process completes the clean-up.
    If users would like to have precise control over VACUUM, it may be preferable to execute on a dedicated engine that could be sized and run just for VACUUM operations. With VACUUM running on a dedicated engine, it would not conflict with other queries’ execution and cache resources, and would provide operational separation from other scenarios.
    VACUUM may introduce a performance penalty as the newly created optimized tablets need to be synchronized with other engines operating on the same table(s).

  • Automatic scheduling
    You can enable automatic scheduling of processes such as VACUUM by integrating with external tools. Please see section Integrate with Firebolt for more detail on our current support for these tools.

Example of running VACUUM to improve query performance.

This example demonstrates a use case for running VACUUM and its performance impact. Let’s create a large table with 10 million rows:

COPY tutorial_vacuum 
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH HEADER=TRUE;

INSERT INTO tutorial_vacuum
SELECT a.* FROM tutorial_vacuum a, GENERATE_SERIES(1, 1000000); -- This may run a couple of minutes

Script above loads 10 rows from S3 csv file; after that it inserts into the same table the cross product of the 10 inserted rows with 1 million integers. Next we will delete about 90% of rows from the table, that would result in about 900,000 deleted rows.

DELETE FROM tutorial_vacuum WHERE "LevelID" > 1; 

Let’s run a simple select from the tutorial_vacuum table; VACUUM it, and repeat the same select.

SELECT hash_agg(*) FROM tutorial_vacuum;
VACUUM tutorial_vacuum;
SELECT hash_agg(*) FROM tutorial_vacuum;

The first select is executed on data with a lot of deleted rows, while the second is run after VACUUM and benefits from it. Let’s examine query history and see the performance benefit of the VACUUM operation:

NO STATEMENT STATUS DURATION
1 SELECT hash_agg(*) FROM tutorial_vacuum; Success 4.43 s
2 VACUUM tutorial_vacuum; Success 17.53 s
3 SELECT hash_agg(*) FROM tutorial_vacuum; Success 0.82 s

Note that the first select was running for more than 4 seconds while exactly the same select after VACUUM completes in less than a second.