Background
Recently, I handled a server crash incident in our queue.
Incident Description: A DBA killed a SQL statement modifying a partitioned table, causing a MySQL crash.
Error.log Analysis: The crash was due to an assertion failure table->get_ref_count() == 0
.
2024-06-17T04:55:36.556026Z 303 [ERROR] [MY-013183] [InnoDB] Assertion failure: dict0dict.cc:1894:table->get_ref_count() == 0 thread 139805159565056
Simplified Backtrace:
#0 in raise
#1 in abort
#2 in my_server_abort
#3 in my_abort
#4 in ut_dbg_assertion_failed
#5 in dict_table_remove_from_cache_low
#6 in dict_table_remove_from_cache
#7 in dict_partitioned_table_remove_from_cache
#8 in innobase_dict_cache_reset
#9 in mysql_inplace_alter_table
#10 in mysql_alter_table
#11 in Sql_cmd_alter_table::execute
#12 in mysql_execute_command
#13 in dispatch_sql_command
#14 in dispatch_command
#15 in do_command
#16 in handle_connection
#17 in pfs_spawn_thread
#18 in start_thread
#19 in clone
This was the third time I encountered a similar error. Previous attempts to reproduce the issue by creating a partitioned table, executing DDL, and killing the DDL query were unsuccessful. The customer provided a corefile this time, which was crucial for analysis.
Assertion Error Analysis
The assertion failure occurred in the function dict_table_remove_from_cache_low
, which removes a table object from the dictionary cache. The function ensures the object isn't in use before removal.
static void dict_table_remove_from_cache_low(
dict_table_t *table, /*!< in, own: table */
bool lru_evict) /*!< in: true if table being evicted
to make room in the table LRU list */
{
dict_foreign_t *foreign;
dict_index_t *index;
ut_ad(table);
ut_ad(dict_lru_validate());
ut_a(table->get_ref_count() == 0);
The reference count n_ref_count
was 1, indicating the table was still in use, yet InnoDB proceeded to remove it, causing the crash.
Analyzing the Reference Count
The following code segment is crucial for understanding the reference count issue:
/** Get reference count.
@return current value of n_ref_count */
inline uint64_t dict_table_t::get_ref_count() const { return (n_ref_count); }
When analyzing the corefile, it was found that n_ref_count
was not 0
, indicating that the table was still in use:
(gdb) p table->n_ref_count._M_i
$1 = 1
This non-zero reference count suggests that there was still a thread using the table, yet InnoDB proceeded to remove it, leading to the crash.
Checking the Table Object
Upon inspecting the table object, we found that the problematic table was a_1
in the test
database.
(gdb) p table->name
$3 = {m_name = 0xfffef40228a0 "test/a_1#p#p0"}
However, the SQL statement being executed was modifying test.a
, not the a_1
table we initially saw.
(gdb) p thd->m_query_string
$5 = {
str = 0x7f26c8085030 "ALTER TABLE test.a DROP PARTITION pmax",
length = 38
}
This discrepancy indicated that InnoDB might have been removing the wrong table object, leading to the crash.
The _1
Suffix Mystery
We speculated that the _1
suffix might be an internal InnoDB usage, similar to the .index_crash_safe
suffix used in binlog indexes for recovery purposes.
int MYSQL_BIN_LOG::set_crash_safe_index_file_name(const char *base_file_name) {
...
if (fn_format(crash_safe_index_file_name, base_file_name, mysql_data_home,
".index_crash_safe",
MYF(MY_UNPACK_FILENAME | MY_SAFE_PATH | MY_REPLACE_EXT)) ==
...
}
However, searches in InnoDB and mysql-test code did not yield any meaningful results.
At this point, we wondered if InnoDB had mistakenly deleted the wrong table object.
Table Object Source
By examining the call stack, we determined that the table object came from the function dict_partitioned_table_remove_from_cache
.
This function scans each object in the data dictionary cache and checks if the table name matches.
size_t name_len = strlen(name);
for (uint32_t i = 0; i < hash_get_n_cells(dict_sys->table_id_hash); ++i) {
dict_table_t *table;
table = static_cast<dict_table_t *>(HASH_GET_FIRST(dict_sys->table_hash, i));
while (table != nullptr) {
dict_table_t *prev_table = table;
table = static_cast<dict_table_t *>(HASH_GET_NEXT(name_hash, prev_table));
if ((strncmp(name, prev_table->name.m_name, name_len) == 0) &&
dict_table_is_partition(prev_table)) {
btr_drop_ahi_for_table(prev_table);
dict_table_remove_from_cache(prev_table);
}
}
}
The condition strncmp(name, prev_table->name.m_name, name_len) == 0
only compares the first name_len
bytes. If these bytes match and the table is a partitioned table, InnoDB will remove the table from the dictionary cache.
(gdb) p name
$7 = 0x7f26ecdf4070 "test/a"
(gdb) p prev_table->name.m_name
$8 = 0xfffef40228a0 "test/a_1#p#p0"
(gdb) p name_len
$9 = 6
Memory Layout
In the yellow rectangle, the first 6 bytes of test/a
match the first 6 bytes of test/a_1#p#p0
.

Reproduction Conditions
To reproduce the issue, we needed:
1. A partitioned table with a name sharing the first n characters with the DDL query.
2. The table must be in the dictionary-object-cache.
Reproduction Steps
1. Create the tables:
create database test;
create table test.a (x int)
PARTITION BY RANGE (x) (
PARTITION p0 VALUES LESS THAN (10000),
PARTITION pmax VALUES LESS THAN MAXVALUE
);
create table test.a_1 like test.a;
2. Load test.a_1
into the data dictionary cache:
select count(*) from test.a_1;
3. In one shell, continuously kill ALTER queries:
while true; do { mysql -BNe 'select concat("kill ",id ,";") from information_schema.processlist where state = "committing alter table to storage engine";' | mysql -vvv ; } ; done
4.In another shell, continuously execute ALTER queries:
while true; do { mysql -BNe "ALTER TABLE test.a ADD PARTITION (PARTITION pmax VALUES LESS THAN MAXVALUE);" ; mysql -BNe " ALTER TABLE test.a DROP PARTITION pmax;" ; }; done
Bug Reporting
As this was an upstream bug, we created a report in Percona and linked it to MySQL.

Fix
The bug stemmed from InnoDB's table name matching issue. The fix involved adding a condition to check if the table name ends with #p#
.
if ((strncmp(name, prev_table->name.m_name, name_len) == 0) &&
dict_table_is_partition(prev_table) &&
(strncmp(dict_name::PART_SEPARATOR, prev_table->name.m_name + name_len, dict_name::PART_SEPARATOR_LEN) == 0)) {
btr_drop_ahi_for_table(prev_table);
dict_table_remove_from_cache(prev_table);
}
This ensures test/a_1#p#p0
won't match test/a
.

Patch and Resolution
Percona Server 8.0.39:
index 5c1e6896638..f99114d055e 100644
--- a/storage/innobase/dict/dict0dict.cc
+++ b/storage/innobase/dict/dict0dict.cc
@@ -2006,7 +2006,8 @@ void dict_partitioned_table_remove_from_cache(const char *name) {
}
if ((strncmp(name, prev_table->name.m_name, name_len) == 0) &&
- dict_table_is_partition(prev_table)) {
+ dict_table_is_partition(prev_table) &&
+ (strncmp(dict_name::PART_SEPARATOR, prev_table->name.m_name + name_len, dict_name::PART_SEPARATOR_LEN) == 0)) {
btr_drop_ahi_for_table(prev_table);
dict_table_remove_from_cache(prev_table);
}
Oracle addressed this issue in MySQL 8.0.40.
References