1. Introduction
Disk space alerts are a common occurrence for DBAs during database operations. During a recent inspection, one such alert fired, providing an opportunity to demonstrate how to use the pt-osc (pt-online-schema-change) tool for rapid data cleanup.
Before initiating any data cleanup, it's crucial to contact the relevant business stakeholders, understand the situation thoroughly, and only then proceed with the cleanup process.
2. Decision Making
Communication revealed that a large table in the database only needed to retain data meeting specific conditions, and this subset could constitute as little as 10% of the total data, or even less.
Can We Directly Delete Unnecessary Data?
Typically, deleting large volumes of data in one go can lead to massive transactions, adversely affecting master-slave replication. Consequently, some operations personnel opt to use pt-archiver or similar tools to delete small batches of data incrementally. Although this approach achieves the desired outcome, it proves time-consuming for large datasets and generates substantial binlogs, making it less than optimal.
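The trade-off behind batched deletion can be sketched in a few lines. The following Python snippet is purely illustrative (an in-memory list stands in for the table, and the batch size and retention rule are hypothetical); it shows why the approach stays replication-friendly yet takes many rounds when most of the table must go:

```python
# Illustrative sketch of incremental batch deletion (pt-archiver style),
# with an in-memory list standing in for the table. All names hypothetical.
rows = list(range(120_000))            # pretend table rows
keep = lambda r: r % 10 == 0           # ~10% of rows meet the retention rule

BATCH = 1000                           # small batch = small transaction
deleted_batches = 0
while True:
    # Each iteration mimics one DELETE ... LIMIT 1000 transaction: replication
    # lag stays bounded, but every deleted row is still logged individually.
    victims = [r for r in rows if not keep(r)][:BATCH]
    if not victims:
        break
    victims_set = set(victims)
    rows = [r for r in rows if r not in victims_set]
    deleted_batches += 1

print(deleted_batches, len(rows))      # 108 batches to remove 90% of the data
```

With 90% of 120,000 rows to delete, it takes 108 separate batches, each generating its own binlog events, which is exactly the inefficiency described above.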
Can We Approach This Differently?
In this case, focusing on the data to be retained is beneficial, since it constitutes only a small portion. By first importing the data to be retained into a temporary table and then swapping the two tables on completion, the old data can be cleared with a simple DROP TABLE. This method is swift, generates minimal binlogs, and offers higher efficiency. However, it requires halting the business, because new writes cannot be mirrored to the temporary table in real time.
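The copy-and-swap idea can be sketched as a three-step procedure. The Python snippet below is a minimal in-memory illustration (dictionaries stand in for tables, a key reassignment for RENAME TABLE; all names are hypothetical), not a real migration tool:

```python
# Minimal sketch of "copy retained rows, swap, drop old", with dicts
# standing in for tables. Illustrative only; names are hypothetical.
tables = {
    # 12 days x 10 rows = 120 rows in the original table
    "test": [("2025-02-%02d" % d, i) for d in range(1, 13) for i in range(10)],
}

def copy_and_swap(tables, name, predicate):
    # 1. Build a temporary table holding only the rows to retain.
    tables["_" + name + "_new"] = [r for r in tables[name] if predicate(r)]
    # 2. Swap: the temp table takes over the original name (like RENAME TABLE).
    tables["_" + name + "_old"] = tables.pop(name)
    tables[name] = tables.pop("_" + name + "_new")
    # 3. Drop the old table in one shot: fast, no per-row binlog entries.
    del tables["_" + name + "_old"]

copy_and_swap(tables, "test", lambda r: r[0] == "2025-02-12")
print(len(tables["test"]))  # only the 10 retained rows remain
```

The expensive per-row deletions are replaced by one filtered copy and one metadata-level swap, which is the source of the efficiency gain; the gap is step 2, where concurrent writes would be lost without extra synchronization.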
Why Choose pt-osc?
Looking at the execution logic of the method above, it closely resembles the Online DDL principle of pt-osc: migrate the data, then swap tables, while synchronizing newly written data in between. While those with strong hands-on skills could develop a custom tool based on this logic, the latest pt-osc version makes that unnecessary.
Percona-toolkit v3.6.0 introduces a new --where parameter for pt-osc, enabling data filtering by copying only data that meets specified conditions. This parameter is key to achieving our data cleanup objective.
3. Practical Operation
First, prepare a table with data records spanning from 2025-02-01 to 2025-02-12.
mysql> select time,count(*) from test group by time;
+---------------------+----------+
| time                | count(*) |
+---------------------+----------+
| 2025-02-01 00:00:00 |    10000 |
| 2025-02-02 00:00:00 |    10000 |
| 2025-02-03 00:00:00 |    10000 |
| 2025-02-04 00:00:00 |    10000 |
| 2025-02-05 00:00:00 |    10000 |
| 2025-02-06 00:00:00 |    10000 |
| 2025-02-07 00:00:00 |    10000 |
| 2025-02-08 00:00:00 |    10000 |
| 2025-02-09 00:00:00 |    10000 |
| 2025-02-10 00:00:00 |    10000 |
| 2025-02-11 00:00:00 |    10000 |
| 2025-02-12 00:00:00 |    10000 |
+---------------------+----------+
Assuming we only need to retain the data for 2025-02-12, we can specify --where 'time="2025-02-12"'. When using the --where parameter, unless --no-drop-new-table and --no-swap-tables are both specified, --force must be supplied for the operation to proceed. The DDL operation itself only needs to be set to engine=innodb.
[root@192-168-13-131 bin]# ./pt-online-schema-change -u root -p 123456 -h 10.186.60.84 -P 3306 --alter "engine=innodb" --where 'time="2025-02-12"' D=kk,t=test
Found 1 slaves:
10-186-60-87 -> 10.186.60.87:3306
Will check slave lag on:
10-186-60-87 -> 10.186.60.87:3306

Operation, tries, wait:
  analyze_table, 10, 1
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Altering `kk`.`test`...
Creating new table...
Created new table kk._test_new OK.
Altering new table...
Altered `kk`._test_new OK.
2025-02-13T14:47:50 Creating triggers...
2025-02-13T14:47:51 Created triggers OK.
2025-02-13T14:47:51 Copying approximately 109467 rows...
2025-02-13T14:48:02 Copied rows OK.
2025-02-13T14:48:02 Analyzing new table...
2025-02-13T14:48:02 Swapping tables...
2025-02-13T14:48:03 Swapped original and new tables OK.
2025-02-13T14:48:03 Dropping old table...
2025-02-13T14:48:03 Dropped old table `kk`._test_old OK.
2025-02-13T14:48:03 Dropping triggers...
2025-02-13T14:48:03 Dropped triggers OK.
Successfully altered `kk`.`test`.
The final table retains only the data for 2025-02-12. If you need to preserve the original data, you can use the --no-drop-old-table parameter.
mysql> select time,count(*) from test group by time;
+---------------------+----------+
| time                | count(*) |
+---------------------+----------+
| 2025-02-12 00:00:00 |    10000 |
+---------------------+----------+
It's important to note that pt-osc uses triggers to synchronize incremental data, and no filtering is applied to those rows. If data that does not meet the --where condition is written or updated during the copy, it will also be synchronized to the new table. You can retain or delete such rows afterwards, depending on your needs.
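This trigger caveat can be illustrated with a small simulation. The Python snippet below is a hypothetical in-memory model, not pt-osc's actual implementation; the point is only that the bulk copy is filtered while trigger-style replay of concurrent writes is not:

```python
# Simulates why rows written during the copy can bypass the --where filter:
# the initial bulk copy is filtered, but the trigger replays writes verbatim.
# Illustrative in-memory model only; names are hypothetical.
where = lambda row: row["time"] == "2025-02-12"

old_table = [{"time": "2025-02-11"}, {"time": "2025-02-12"}]
new_table = [r for r in old_table if where(r)]   # filtered bulk copy

# A write arrives mid-copy; the trigger copies it as-is, with no WHERE check.
incoming = {"time": "2025-02-10"}
old_table.append(incoming)
new_table.append(incoming)                       # trigger sync: unfiltered

print([r["time"] for r in new_table])  # the 2025-02-10 row slipped through
```

After the swap, a follow-up DELETE against the new table with the inverse of the --where condition is one straightforward way to remove such stragglers, if the business does not need them.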
References
[1] pt-osc: https://docs.percona.com/percona-toolkit/pt-online-schema-change.html
[2] pt-archiver: https://docs.percona.com/percona-toolkit/pt-archiver.html