March 26, 2025

MySQL Bug#115352: From Crash to Fix

This article details diagnosing and fixing a MySQL crash from incident response to code patching, covering MySQL Bug Fix, InnoDB Assertion Failure, Partitioned Table Crash, MySQL Crash Diagnosis, and Database Optimization Techniques.

Background

Recently, I handled a server crash incident in our queue.

Incident Description: A DBA killed a SQL statement modifying a partitioned table, causing a MySQL crash.

Error.log Analysis: The crash was due to an assertion failure table->get_ref_count() == 0.

2024-06-17T04:55:36.556026Z 303 [ERROR] [MY-013183] [InnoDB] Assertion failure: dict0dict.cc:1894:table->get_ref_count() == 0 thread 139805159565056

Simplified Backtrace:

#0  in raise
#1  in abort
#2  in my_server_abort
#3  in my_abort
#4  in ut_dbg_assertion_failed
#5  in dict_table_remove_from_cache_low
#6  in dict_table_remove_from_cache
#7  in dict_partitioned_table_remove_from_cache
#8  in innobase_dict_cache_reset
#9  in mysql_inplace_alter_table
#10 in mysql_alter_table
#11 in Sql_cmd_alter_table::execute
#12 in mysql_execute_command
#13 in dispatch_sql_command
#14 in dispatch_command
#15 in do_command
#16 in handle_connection
#17 in pfs_spawn_thread
#18 in start_thread
#19 in clone

This was the third time I encountered a similar error. Previous attempts to reproduce the issue by creating a partitioned table, executing DDL, and killing the DDL query were unsuccessful. The customer provided a corefile this time, which was crucial for analysis.

Assertion Error Analysis

The assertion failure occurred in the function dict_table_remove_from_cache_low, which removes a table object from the dictionary cache. The function ensures the object isn't in use before removal.

static void dict_table_remove_from_cache_low(
    dict_table_t *table, /*!< in, own: table */
    bool lru_evict)      /*!< in: true if table being evicted
                         to make room in the table LRU list */
{
  dict_foreign_t *foreign;
  dict_index_t *index;

  ut_ad(table);
  ut_ad(dict_lru_validate());
  ut_a(table->get_ref_count() == 0);

The reference count n_ref_count was 1, indicating the table was still in use, yet InnoDB proceeded to remove it, causing the crash.

Analyzing the Reference Count

The following code segment is crucial for understanding the reference count issue:

/** Get reference count.
@return current value of n_ref_count */
inline uint64_t dict_table_t::get_ref_count() const { return (n_ref_count); }

When analyzing the corefile, it was found that n_ref_count was not 0, indicating that the table was still in use:

(gdb) p table->n_ref_count._M_i
$1 = 1

This non-zero reference count suggests that there was still a thread using the table, yet InnoDB proceeded to remove it, leading to the crash.

Checking the Table Object

Upon inspecting the table object, we found that the problematic table was a_1 in the test database.

(gdb) p table->name
$3 = {m_name = 0xfffef40228a0 "test/a_1#p#p0"}

However, the SQL statement being executed was modifying test.a, not the a_1 table we initially saw.

(gdb) p thd->m_query_string
$5 = {
  str = 0x7f26c8085030 "ALTER TABLE test.a DROP PARTITION pmax",
  length = 38
}

This discrepancy indicated that InnoDB might have been removing the wrong table object, leading to the crash.

The _1 Suffix Mystery

We speculated that the _1 suffix might be an internal InnoDB usage, similar to the .index_crash_safe suffix used in binlog indexes for recovery purposes.

int MYSQL_BIN_LOG::set_crash_safe_index_file_name(const char *base_file_name) {
...
  if (fn_format(crash_safe_index_file_name, base_file_name, mysql_data_home,
                ".index_crash_safe",
                MYF(MY_UNPACK_FILENAME | MY_SAFE_PATH | MY_REPLACE_EXT)) ==
...
}

However, searches in InnoDB and mysql-test code did not yield any meaningful results.

At this point, we wondered if InnoDB had mistakenly deleted the wrong table object.

Table Object Source

By examining the call stack, we determined that the table object came from the function dict_partitioned_table_remove_from_cache.

This function scans each object in the data dictionary cache and checks if the table name matches.

size_t name_len = strlen(name);

for (uint32_t i = 0; i < hash_get_n_cells(dict_sys->table_id_hash); ++i) {
  dict_table_t *table;
  table = static_cast<dict_table_t *>(HASH_GET_FIRST(dict_sys->table_hash, i));
  while (table != nullptr) {
    dict_table_t *prev_table = table;
    table = static_cast<dict_table_t *>(HASH_GET_NEXT(name_hash, prev_table));
    if ((strncmp(name, prev_table->name.m_name, name_len) == 0) &&
        dict_table_is_partition(prev_table)) {
      btr_drop_ahi_for_table(prev_table);
      dict_table_remove_from_cache(prev_table);
    }
  }
}

The condition strncmp(name, prev_table->name.m_name, name_len) == 0 only compares the first name_len bytes. If these bytes match and the table is a partitioned table, InnoDB will remove the table from the dictionary cache.

(gdb) p name
$7 = 0x7f26ecdf4070 "test/a"
(gdb) p prev_table->name.m_name
$8 = 0xfffef40228a0 "test/a_1#p#p0"
(gdb) p name_len
$9 = 6

Memory Layout

In the yellow rectangle, the first 6 bytes of test/a match the first 6 bytes of test/a_1#p#p0.

Reproduction Conditions

To reproduce the issue, we needed:

1. A partitioned table with a name sharing the first n characters with the DDL query.

2. The table must be in the dictionary-object-cache.

Reproduction Steps

1. Create the tables:

create database test;
create table test.a (x int)
PARTITION BY RANGE (x) (
  PARTITION p0 VALUES LESS THAN (10000),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);
create table test.a_1 like test.a;

2. Load test.a_1 into the data dictionary cache:

select count(*) from test.a_1;

3. In one shell, continuously kill ALTER queries:

while true; do { mysql -BNe 'select concat("kill ",id ,";") from information_schema.processlist where state = "committing alter table to storage engine";' | mysql -vvv ; } ; done

4.In another shell, continuously execute ALTER queries:

while true; do { mysql -BNe "ALTER TABLE test.a ADD PARTITION (PARTITION pmax VALUES LESS THAN MAXVALUE);" ; mysql -BNe " ALTER TABLE test.a DROP PARTITION pmax;" ; }; done

Bug Reporting

As this was an upstream bug, we created a report in Percona and linked it to MySQL.

https://bugs.mysql.com/bug.php?id=115352

Fix

The bug stemmed from InnoDB's table name matching issue. The fix involved adding a condition to check if the table name ends with #p#.

if ((strncmp(name, prev_table->name.m_name, name_len) == 0) &&
    dict_table_is_partition(prev_table) &&
    (strncmp(dict_name::PART_SEPARATOR, prev_table->name.m_name + name_len, dict_name::PART_SEPARATOR_LEN) == 0)) {
  btr_drop_ahi_for_table(prev_table);
  dict_table_remove_from_cache(prev_table);
}

This ensures test/a_1#p#p0 won't match test/a.

Patch and Resolution

Percona Server 8.0.39:

index 5c1e6896638..f99114d055e 100644
--- a/storage/innobase/dict/dict0dict.cc
+++ b/storage/innobase/dict/dict0dict.cc
@@ -2006,7 +2006,8 @@ void dict_partitioned_table_remove_from_cache(const char *name) {
       }

       if ((strncmp(name, prev_table->name.m_name, name_len) == 0) &&
-          dict_table_is_partition(prev_table)) {
+          dict_table_is_partition(prev_table) &&
+          (strncmp(dict_name::PART_SEPARATOR, prev_table->name.m_name + name_len, dict_name::PART_SEPARATOR_LEN) == 0)) {
         btr_drop_ahi_for_table(prev_table);
         dict_table_remove_from_cache(prev_table);
       }

Oracle addressed this issue in MySQL 8.0.40.

References

Data Dictionary Object Cache

PS-9264

MySQL Bug #115352

MySQL 8.0.40 Release Notes

MySQL 8.0.40 Commit

You will get best features of ChatDBA