April 17, 2025

MySQL Hangs When Started via systemd

Analyze why MySQL hangs when started via systemd and explore solutions for process management issues in Docker containers.

As stated in the title, during automated testing scenarios, MySQL cannot be started via systemd.

The test case repeatedly uses kill -9 to terminate the instance process and then checks whether mysqld is correctly restarted after it exits.

Specific details are as follows:

Host Information: CentOS 8 (Docker Container)

Using systemd to manage the mysqld process

The systemd service is in forking mode

Startup command:

# systemd startup command
sudo -S systemctl start mysqld_11690.service

# ExecStart command in the systemd service
/opt/mysql/base/8.0.34/bin/mysqld --defaults-file=/opt/mysql/etc/11690/my.cnf --daemonize --pid-file=/opt/mysql/data/11690/mysqld.pid --user=actiontech-mysql --socket=/opt/mysql/data/11690/mysqld.sock --port=11690
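The full unit file is not shown in the article; a hypothetical forking-mode unit, reconstructed only from the commands and logs that do appear here (the `Description`, `ExecStartPost` hook, and `[Install]` section are assumptions), might look like:

```ini
# Hypothetical sketch of /etc/systemd/system/mysqld_11690.service.
# Type=forking and the ExecStart command are from the article; the rest
# is illustrative, not the real template.
[Unit]
Description=MySQL Server

[Service]
Type=forking
PIDFile=/opt/mysql/data/11690/mysqld.pid
ExecStart=/opt/mysql/base/8.0.34/bin/mysqld --defaults-file=/opt/mysql/etc/11690/my.cnf --daemonize --pid-file=/opt/mysql/data/11690/mysqld.pid --user=actiontech-mysql --socket=/opt/mysql/data/11690/mysqld.sock --port=11690
# Post-start hook seen later in the journal; $MAINPID is expanded by systemd.
ExecStartPost=/etc/systemd/system/mysqld_11690.service.d/u_set_iops.sh -p $MAINPID

[Install]
WantedBy=multi-user.target
```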

1 Phenomenon Description

The startup command hangs indefinitely, neither succeeding nor returning any output. After several attempts, I was unable to reproduce the scenario manually.

The MySQL error log shows no information. Checking the systemd service status reveals that the startup script fails due to the missing MAIN PID parameter.

The last output from systemd is: New main PID 31036 does not exist or is a zombie.

2 Root Cause Summary

During the systemd startup of mysqld, the following steps are executed based on the service template configuration:

1. ExecStart: starts mysqld

2. mysqld creates a pid file during startup

3. ExecStartPost: custom scripts (adjust permissions, write the pid into a cgroup, etc.)

Between steps 2 and 3, just after the pid file has been created, the host receives an automated testing command: sudo -S kill -9 $(cat /opt/mysql/data/11690/mysqld.pid).

Since the pid file and the process both exist at that moment (if either were missing, cat or kill would report an error), the automated test case considers the kill operation successful. However, the mysqld.pid file is maintained by MySQL itself; from systemd's perspective, the startup is not successful until step 3 completes.
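The race between steps 2 and 3 can be simulated in plain shell. This is a minimal sketch, not the real service: a background `sleep` stands in for the forked mysqld child, and `/tmp/demo.pid` is a hypothetical pid-file path.

```shell
# "mysqld" child forked by ExecStart (stand-in process)
sleep 60 &
pid=$!
echo "$pid" > /tmp/demo.pid        # step 2: the daemon writes its pid file

# The automated test fires in the window between steps 2 and 3:
kill -9 "$(cat /tmp/demo.pid)"
wait "$pid" 2>/dev/null            # reap, so the pid truly disappears

# Step 3 (ExecStartPost) now runs against a pid that no longer exists:
if kill -0 "$pid" 2>/dev/null; then
  echo "main pid alive"
else
  echo "main pid gone"
fi
rm -f /tmp/demo.pid
```

The pid file outlives the process it names, which is exactly the inconsistency the test case and systemd each interpret differently.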

When systemd is in forking mode, it determines whether the service has started successfully based on the child process's PID.

If the child process starts successfully and does not exit unexpectedly, systemd considers the service started and uses the child process's PID as the MAIN PID.

If the child process fails to start or exits unexpectedly, systemd considers the service startup failed.

3 Conclusion

By the time ExecStartPost ran, the child process with PID 31036 had already been killed, so the subsequent shell script was invoked without its MAIN PID argument. However, the ExecStart step had already completed, leaving systemd with a MAIN PID of 31036 that existed only in its own bookkeeping: the process itself was gone or a zombie.

4 Investigation Process

When encountering this issue, I was initially confused. I checked basic memory and disk information, which were within expected ranges and did not indicate resource shortages.

First, I examined the MySQL Error Log for any clues. The results were as follows:

...Irrelevant content omitted...
2024-02-05T05:08:42.538326+08:00 0 [Warning] [MY-010539] [Repl] Recovery from source pos 3943309 and file mysql-bin.000001 for channel ''. Previous relay log pos and relay log file had been set to 4, /opt/mysql/log/relaylog/11690/mysql-relay.000004 respectively.
2024-02-05T05:08:42.548513+08:00 0 [System] [MY-010931] [Server] /opt/mysql/base/8.0.34/bin/mysqld: ready for connections. Version: '8.0.34'  socket: '/opt/mysql/data/11690/mysqld.sock'  port: 11690  MySQL Community Server - GPL.
2024-02-05T05:08:42.548633+08:00 0 [System] [MY-013292] [Server] Admin interface ready for connections, address: '127.0.0.1'  port: 6114
2024-02-05T05:08:42.548620+08:00 5 [Note] [MY-010051] [Server] Event Scheduler: scheduler thread started with id 5

Comparing this with the status information under normal circumstances, I gathered the following useful information:

The post-start shell script failed due to the missing -p parameter (the -p parameter is the MAIN PID, which is the PID of the forked child process).

systemd could not locate PID 31036, which either did not exist or was a zombie process.

I then checked the process ID against the mysqld.pid file:
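The manual checks can be wrapped in a small helper. The function name and messages below are hypothetical, not from the article; the pid-file path is the article's.

```shell
# check_pidfile: read a pid file and report whether the pid it names is alive.
check_pidfile() {
  local pidfile=$1
  [ -s "$pidfile" ] || { echo "no pid file"; return 1; }
  local pid
  pid=$(cat "$pidfile")
  if kill -0 "$pid" 2>/dev/null; then
    echo "pid $pid is alive"
  else
    echo "pid $pid does not exist (stale pid file)"
  fi
}

# On the affected container this reported 31036 as stale:
check_pidfile /opt/mysql/data/11690/mysqld.pid || true
```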

Key findings:

PID 31036 did not exist.

The mysqld.pid file existed and contained the value 31036.

The top command showed no zombie processes.

To gather more clues, I examined the journalctl -u logs:

sh-4.4# journalctl -u mysqld_11690.service
-- Logs begin at Mon 2024-02-05 04:00:35 CST, end at Mon 2024-02-05 17:08:01 CST. --
Feb 05 05:07:54 udp-11 systemd[1]: Starting MySQL Server...
Feb 05 05:07:56 udp-11 systemd[1]: Started MySQL Server.
Feb 05 05:08:31 udp-11 systemd[1]: mysqld_11690.service: Main process exited, code=killed, status=9/KILL
Feb 05 05:08:31 udp-11 systemd[1]: mysqld_11690.service: Failed with result 'signal'.
Feb 05 05:08:32 udp-11 systemd[1]: Starting MySQL Server...
Feb 05 05:08:36 udp-11 systemd[1]: Started MySQL Server.
Feb 05 05:08:37 udp-11 systemd[1]: mysqld_11690.service: Main process exited, code=killed, status=9/KILL
Feb 05 05:08:37 udp-11 systemd[1]: mysqld_11690.service: Failed with result 'signal'.
Feb 05 05:08:39 udp-11 systemd[1]: Starting MySQL Server...
Feb 05 05:08:42 udp-11 u_set_iops.sh[31507]: /etc/systemd/system/mysqld_11690.service.d/u_set_iops.sh: option requires an argument -- p
Feb 05 05:08:42 udp-11 systemd[1]: mysqld_11690.service: New main PID 31036 does not exist or is a zombie.

The journalctl -u logs only described the symptoms and did not provide specific causes, similar to the systemctl status output.

I then checked the /var/log/messages system logs and found repeated memory-related error messages. After searching online, I suspected potential hardware issues. However, after consulting with the automation testing team, we concluded:

The issue was intermittent, with 2 successes and 2 failures out of 4 test cases.

All tests were executed on the same host machine and container image.

The container that hung was always the same one.

Since there were successful executions, I temporarily ruled out hardware issues.

Considering the container environment, I wondered if there were issues with the cgroup mapping to the host. From the systemctl status output, the cgroup mapping to the host directory was: CGroup: /docker/3a72b2cdc7bd9beb1c7b2abec24763046604602a38f0fcb7406d17f5d33353d2/system.slice/mysqld_11690.service.

I checked the read/write permissions of the parent folder system.slice and found no abnormalities. I temporarily ruled out cgroup mapping issues (as other systemd services on the host were using the same cgroup without problems).

I attempted to use pstack to trace where the startup was hanging. The PID of the hanging systemctl start command was 3048143:

sh-4.4# pstack 3048143
#0  0x00007fdfaef33ade in ppoll () from /lib64/libc.so.6
#1  0x00007fdfaf7768ee in bus_poll () from /usr/lib/systemd/libsystemd-shared-239.so
#2  0x00007fdfaf6a8f3d in bus_wait_for_jobs () from /usr/lib/systemd/libsystemd-shared-239.so
#3  0x000055b4c2d59b2e in start_unit ()
#4  0x00007fdfaf7457e3 in dispatch_verb () from /usr/lib/systemd/libsystemd-shared-239.so
#5  0x000055b4c2d4c2b4 in main ()

The start_unit function seemed suspicious at first, but it is simply part of systemctl, the executable used to start systemd units: the stack only shows systemctl waiting for the start job to complete, which provided little help.

Based on the available clues, I deduced:

The existence of the mysqld.pid file indicated that a mysqld process with PID 31036 was indeed started.

The process was terminated by the automation test case using kill -9.

systemd obtained a MAIN PID that had already been terminated, and the post-start shell script failed because its MAIN PID argument was missing, so the forking startup could never be declared successful.

By reviewing the systemd startup workflow, I concluded that the MySQL instance was likely terminated unexpectedly after the mysqld.pid file was generated.

5 Reproduction Method

With no further leads, I decided to attempt to reproduce the issue based on my deductions.

5.1 Adjust the systemd MySQL Service Template

Edit the template file /etc/systemd/system/mysqld_11690.service to include a sleep 10 command after starting mysqld, creating a time window to simulate killing the instance process.
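One way to open that window is an extra ExecStartPost delay. A hypothetical edit to the [Service] section (other directives as in the original template, which is not shown in full here) might be:

```ini
[Service]
Type=forking
PIDFile=/opt/mysql/data/11690/mysqld.pid
ExecStart=/opt/mysql/base/8.0.34/bin/mysqld --defaults-file=/opt/mysql/etc/11690/my.cnf --daemonize --pid-file=/opt/mysql/data/11690/mysqld.pid --user=actiontech-mysql --socket=/opt/mysql/data/11690/mysqld.sock --port=11690
# Added for reproduction: hold the startup in ExecStartPost for 10 seconds,
# leaving time to kill the instance after the pid file appears.
ExecStartPost=/usr/bin/sleep 10
```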

5.2 Reload Configuration

Execute systemctl daemon-reload to apply the changes.

5.3 Reproduce the Scenario

[SSH Session A] Prepare a new container, configure it, and run sudo -S systemctl start mysqld_11690.service to start a mysqld process. The session will hang due to the sleep command.

[SSH Session B] In another session, once the start command hangs, watch for the mysqld.pid file and execute sudo -S kill -9 $(cat /opt/mysql/data/11690/mysqld.pid) as soon as the file is created.
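Session B's watch-and-kill step can be scripted so the kill always lands inside the window. The helper name is hypothetical; the pid-file path is the article's.

```shell
# kill_on_pidfile: poll until the pid file appears, then kill -9 the pid in it.
kill_on_pidfile() {
  local pidfile=$1
  until [ -s "$pidfile" ]; do
    sleep 0.1                  # poll until the daemon writes its pid file
  done
  kill -9 "$(cat "$pidfile")"
}

# In session B (with sudo, as in the article):
#   kill_on_pidfile /opt/mysql/data/11690/mysqld.pid
```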

Observe the systemctl status, which should match the expected behavior.

6 Resolution Method

First, terminate the hanging systemctl start command, then execute systemctl stop mysqld_11690.service so that systemd actively cleans up the defunct MAIN PID. Although the stop command may report an error, this does not affect the recovery.

After the stop command completes, restart the service using start, and it should return to normal operation.
