As the title states, MySQL could not be started via systemd during an automated testing scenario. The test case repeatedly used kill -9 to terminate the instance process and then checked whether mysqld was correctly restarted after exiting.
The specific details are as follows:
Host Information: CentOS 8 (Docker Container)
Using systemd to manage the mysqld process
The systemd service is in forking mode
Startup command:
# systemd startup command
sudo -S systemctl start mysqld_11690.service
# ExecStart command in the systemd service
/opt/mysql/base/8.0.34/bin/mysqld --defaults-file=/opt/mysql/etc/11690/my.cnf --daemonize --pid-file=/opt/mysql/data/11690/mysqld.pid --user=actiontech-mysql --socket=/opt/mysql/data/11690/mysqld.sock --port=11690
1 Phenomenon Description
The startup command hangs indefinitely, neither succeeding nor returning any output, and the scenario could not be reproduced manually after several attempts.
The MySQL error log shows no information. Checking the systemd service status reveals that the post-start script fails because the MAIN PID parameter is missing.

The last output from systemd is: New main PID 31036 does not exist or is a zombie.

2 Root Cause Summary
During the systemd startup of mysqld, the following steps are executed based on the service template configuration:
1. ExecStart: starts mysqld
2. mysqld creates a pid file during startup
3. ExecStartPost: runs custom scripts (adjusting permissions, writing the pid to a cgroup, etc.)
Between steps 2 and 3, just as the pid file has been created, the host receives the automated testing command: sudo -S kill -9 $(cat /opt/mysql/data/11690/mysqld.pid).
Since the pid file and the process both exist (if they did not, cat or kill would report an error), the automated test case considers the kill operation successful. However, the mysqld.pid file is maintained by MySQL itself; from systemd's perspective, step 3 must still complete before the startup is considered successful.
When the service is in forking mode, systemd determines whether it has started successfully based on the child process's PID:
If the child process starts successfully and does not exit unexpectedly, systemd considers the service started and uses the child process's PID as the MAIN PID.
If the child process fails to start or exits unexpectedly, systemd considers the service startup failed.
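The workflow above corresponds to a forking-mode unit roughly like the sketch below. The ExecStart line is the article's own startup command; the PIDFile directive and the ExecStartPost line (script path and -p wiring) are assumptions reconstructed from the journal output shown later, not the actual template:

```ini
# Illustrative sketch of /etc/systemd/system/mysqld_11690.service
# (only the directives relevant to the failure; NOT the real template)
[Service]
Type=forking
# Step 1: ExecStart launches mysqld, which daemonizes and
# step 2: writes /opt/mysql/data/11690/mysqld.pid
ExecStart=/opt/mysql/base/8.0.34/bin/mysqld --defaults-file=/opt/mysql/etc/11690/my.cnf --daemonize --pid-file=/opt/mysql/data/11690/mysqld.pid --user=actiontech-mysql --socket=/opt/mysql/data/11690/mysqld.sock --port=11690
PIDFile=/opt/mysql/data/11690/mysqld.pid
# Step 3: the post-start script receives the MAIN PID via -p
# (script name taken from the journal; the -p wiring is an assumption)
ExecStartPost=/etc/systemd/system/mysqld_11690.service.d/u_set_iops.sh -p $MAINPID
```

A kill -9 delivered between steps 2 and 3 hits exactly the window this sketch makes visible: the pid file already exists, but systemd has not yet finished ExecStartPost.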
3 Conclusion
By the time ExecStartPost runs, the child process with PID 31036 has already been terminated by kill, so the subsequent shell script is missing its startup parameter (the MAIN PID). However, the ExecStart step has already completed, leaving MAIN PID 31036 as a "zombie" that exists only inside systemd.
4 Investigation Process
When encountering this issue, I was initially confused. I checked basic memory and disk information, which were within expected ranges and did not indicate resource shortages.
First, I examined the MySQL Error Log for any clues. The results were as follows:
...Irrelevant content omitted...
2024-02-05T05:08:42.538326+08:00 0 [Warning] [MY-010539] [Repl] Recovery from source pos 3943309 and file mysql-bin.000001 for channel ''. Previous relay log pos and relay log file had been set to 4, /opt/mysql/log/relaylog/11690/mysql-relay.000004 respectively.
2024-02-05T05:08:42.548513+08:00 0 [System] [MY-010931] [Server] /opt/mysql/base/8.0.34/bin/mysqld: ready for connections. Version: '8.0.34' socket: '/opt/mysql/data/11690/mysqld.sock' port: 11690 MySQL Community Server - GPL.
2024-02-05T05:08:42.548633+08:00 0 [System] [MY-013292] [Server] Admin interface ready for connections, address: '127.0.0.1' port: 6114
2024-02-05T05:08:42.548620+08:00 5 [Note] [MY-010051] [Server] Event Scheduler: scheduler thread started with id 5

Below is the status information under normal circumstances:

By comparing the two, I gathered the following useful information:
The post-start shell script failed due to the missing -p parameter (the -p parameter carries the MAIN PID, i.e. the PID of the forked child process).
systemd could not locate PID 31036, which either did not exist or was a zombie process.
I then checked the process ID against the mysqld.pid file:


Key findings:
PID 31036 did not exist.
The mysqld.pid file existed and contained the value 31036.
The top command showed no zombie processes.
To gather more clues, I examined the journalctl -u logs:
sh-4.4# journalctl -u mysqld_11690.service
-- Logs begin at Mon 2024-02-05 04:00:35 CST, end at Mon 2024-02-05 17:08:01 CST. --
Feb 05 05:07:54 udp-11 systemd[1]: Starting MySQL Server...
Feb 05 05:07:56 udp-11 systemd[1]: Started MySQL Server.
Feb 05 05:08:31 udp-11 systemd[1]: mysqld_11690.service: Main process exited, code=killed, status=9/KILL
Feb 05 05:08:31 udp-11 systemd[1]: mysqld_11690.service: Failed with result 'signal'.
Feb 05 05:08:32 udp-11 systemd[1]: Starting MySQL Server...
Feb 05 05:08:36 udp-11 systemd[1]: Started MySQL Server.
Feb 05 05:08:37 udp-11 systemd[1]: mysqld_11690.service: Main process exited, code=killed, status=9/KILL
Feb 05 05:08:37 udp-11 systemd[1]: mysqld_11690.service: Failed with result 'signal'.
Feb 05 05:08:39 udp-11 systemd[1]: Starting MySQL Server...
Feb 05 05:08:42 udp-11 u_set_iops.sh[31507]: /etc/systemd/system/mysqld_11690.service.d/u_set_iops.sh: option requires an argument -- p
Feb 05 05:08:42 udp-11 systemd[1]: mysqld_11690.service: New main PID 31036 does not exist or is a zombie.
Like the systemctl status output, the journalctl -u logs only described the symptoms and did not point to a specific cause.
I then checked the /var/log/messages system logs and found repeated memory-related error messages. After searching online, I suspected potential hardware issues. However, after consulting with the automation testing team, we concluded:
The issue was intermittent, with 2 successes and 2 failures out of 4 test cases.
All tests were executed on the same host machine and container image.
The container that hung was always the same one.
Since there were successful executions, I temporarily ruled out hardware issues.
Considering the container environment, I wondered whether there were issues with the cgroup mapping to the host. From the systemctl status output, the cgroup mapping to the host directory was: CGroup: /docker/3a72b2cdc7bd9beb1c7b2abec24763046604602a38f0fcb7406d17f5d33353d2/system.slice/mysqld_11690.service.
I checked the read/write permissions of the parent folder system.slice and found no abnormalities, so I temporarily ruled out cgroup mapping issues (other systemd services on the host were using the same cgroup without problems).
I attempted to use pstack to trace where the startup was hanging. The PID of the systemctl start process was 3048143:
sh-4.4# pstack 3048143
#0 0x00007fdfaef33ade in ppoll () from /lib64/libc.so.6
#1 0x00007fdfaf7768ee in bus_poll () from /usr/lib/systemd/libsystemd-shared-239.so
#2 0x00007fdfaf6a8f3d in bus_wait_for_jobs () from /usr/lib/systemd/libsystemd-shared-239.so
#3 0x000055b4c2d59b2e in start_unit ()
#4 0x00007fdfaf7457e3 in dispatch_verb () from /usr/lib/systemd/libsystemd-shared-239.so
#5 0x000055b4c2d4c2b4 in main ()
The start_unit function looked suspicious, but it is simply part of the systemctl code path that starts units, so it provided little help.
Based on the available clues, I deduced:
The existence of the mysqld.pid file indicated that a mysqld process with PID 31036 had indeed been started.
The process was terminated by the automation test case using kill -9.
systemd obtained a MAIN PID that had already been terminated, and the post-start shell script failed, so the forking startup failed.
By reviewing the systemd startup workflow, I concluded that the MySQL instance was most likely terminated unexpectedly right after the mysqld.pid file was generated.
5 Reproduction Method
With no further leads, I decided to attempt to reproduce the issue based on my deductions.
5.1 Adjust the systemd MySQL Service Template
Edit the template file /etc/systemd/system/mysqld_11690.service to include a sleep 10 command after starting mysqld, creating a time window in which the instance process can be killed.
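One way to create that window, sketched below, is to add the delay as an extra post-start step (systemd runs multiple ExecStartPost lines in order, so a sleep listed before any real post-start script delays completion of the startup); where exactly the real template places the sleep is an assumption here:

```ini
# Added to the [Service] section of mysqld_11690.service (a sketch;
# the template's existing directives stay unchanged)
[Service]
ExecStartPost=/bin/sleep 10
```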
5.2 Reload Configuration
Execute systemctl daemon-reload to apply the changes.
5.3 Reproduce the Scenario
[SSH Session A] Prepare and configure a new container, then run sudo -S systemctl start mysqld_11690.service to start a mysqld process. The session will hang because of the sleep command.
[SSH Session B] In another session, once the start command hangs, watch for the mysqld.pid file and, as soon as it is created, execute sudo -S kill -9 $(cat /opt/mysql/data/11690/mysqld.pid).
Observe the systemctl status, which should now show the same failure described above.
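Session B's timing-sensitive step can be scripted so the kill lands the moment the pid file appears. A sketch (the script name is hypothetical; in the real scenario the kill needs sudo):

```shell
#!/bin/sh
# kill_on_pidfile.sh (hypothetical): poll for the pid file, then
# immediately kill -9 the recorded process, landing in the window
# between pid-file creation and ExecStartPost completion.
PID_FILE="${1:-/opt/mysql/data/11690/mysqld.pid}"
while [ ! -s "$PID_FILE" ]; do
    sleep 0.1   # poll until mysqld writes its pid file
done
PID=$(cat "$PID_FILE")
kill -9 "$PID" && echo "killed $PID"
```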
6 Resolution Method
First, terminate the hanging systemctl start command, then execute systemctl stop mysqld_11690.service so that systemd actively clears the stale "zombie" MAIN PID. Although the stop command may report an error, it does not affect the outcome.
After the stop command completes, restart the service with start, and it should return to normal operation.