I have been working in MySQL database operations and maintenance for over 10 years, during which routine inspections have become an essential and serious responsibility. As my experience has grown, I’ve developed a set of highly effective best practices for conducting these inspections.
Why are best practices important? Not everyone is deeply familiar with MySQL, and having documented best practices helps avoid errors and ensures that the quality of inspections remains consistent, even in the face of staff turnover.
In my view, inspections can be categorized based on different areas of focus.
1 Classified by Inspection Method
Inspections can be categorized into manual inspection, scripted inspection, and platform-based inspection based on the method used.
- Manual inspection, or handcrafted inspection, is feasible when the number of managed database management system(dbms) is small, and the frequency of inspections is low. This method involves logging into the server and executing commands such as
df -h
ormysql -u
to check the status of the server and database, and using monitoring tools like Zabbix or Prometheus for analysis. However, this method is inefficient and heavily relies on the skill level of the personnel conducting the inspection. Different engineers may reach different conclusions during the inspection process. - Scripted inspection: At this stage, the inspection commands are consolidated into a script. Engineers log into the servers one by one to run the script. If allowed, batch operation and maintenance tools like Ansible can be used to execute the script on multiple servers at once. The script generates an HTML report or data in CSV format for analysis and summarization. Based on our experience, one engineer can easily inspect 2,000 instances and complete the report within a day using this method. This approach represents a significant leap in efficiency compared to manual inspection. However, it still requires human intervention and partially depends on the skill level of the personnel to analyze and summarize the reports.
- Platform-based inspection: Database Management Platform (DMP) product not only meets the core requirements of tenants—such as high availability management, backup and recovery management, and monitoring and alarm management—but also provides platform-based inspection functionality. The platform performs inspections at regular intervals, generating reports and using a health score model, which incorporates our operational experience to score the database instances. Instances with scores below 60 are flagged for immediate attention, automatically notifying the database administrator and providing intelligent analysis of existing issues. This system greatly reduces the need for database operation and maintenance personnel to log into the production environment and eliminates reliance on individual skill levels, achieving full standardization of inspections.
2 Classified by Time Dimension
Database inspections can be categorized into daily inspections and pre-festival inspections based on the time dimension.
- Daily Inspection: This is a routine inspection performed every day. The simplest approach is to check the monitoring systems, primarily focusing on warnings or critical alarms. Monitoring tools can typically cover more than 95% of the daily inspection requirements.
- Pre-Festival Inspection: These inspections are conducted on special occasions and are more thorough than daily inspections. In a narrow sense, they refer to checks done before high-traffic events such as Black Friday, Cyber Monday, or Valentine's Day. More broadly, they include inspections before major events requiring heightened reliability. For our database operation and maintenance team, pre-festival inspections are critical because these events often bring traffic surges, and staffing may be limited during holidays. The goal is to detect and resolve potential issues in advance. The main focus of these inspections is on database availability.
The core areas of focus for pre-festival inspection include:
System Level
- CPU
- RAM
- Disk Space
Application Level (MySQL Instance)
- Instance Status
- High Availability Status
- Replication Status
- Monitoring Status
- VIP Status
Detailed Explanation:
- CPU: Monitor instances with high CPU usage and check if the usage patterns align with normal daily trends. If any deviations or anomalies are observed, further investigation or optimization may be required.
- RAM: Identify instances where RAM usage exceeds 80%, which could indicate insufficient memory. If necessary, expand the memory to prevent an out-of-memory (OOM) situation. Additionally, check if any instances are using swap memory. If swap is being used despite having sufficient RAM, this could point to incorrect settings like NUMA configuration or improper
vm.swappiness
values. - Disk Space: Disk usage is critical in pre-festival inspections. Although the usual disk space alarm threshold is 80%, it's advisable to focus on instances with disk usage above 70%. Expanding disk capacity in advance helps avoid critical disk alarms during long holidays, sparing everyone the trouble of overtime during this period.
- Instance Status: Typically, this involves checking whether
mysqld
is running. If possible, perform a deeper health analysis of the instance. (Details on how to assess instance health are beyond the scope of this discussion.) - High Availability Status: Ensure that the database is in a "switchable" state. For instance, in an MHA (Master High Availability) architecture, the following scripts can be used to verify high availability:
masterha_check_ssh --conf=/etc/masterha/app.cnfmasterha_check_repl --conf=/etc/masterha/app.cnfmasterha_check_status --conf=/etc/masterha/app.cnf
- Replication Status: Although replication status is often checked as part of high availability, some replication setups fall outside of this scope. It is essential to ensure that asynchronous, semi-synchronous, delayed, bidirectional, and cascaded replications are functioning correctly. Additionally, verify whether disaster recovery DTS (Data Transmission Service) replication and high availability are operating as expected.
- VIP Status: For instances with dual-network redundant links and dual VIPs, loss of the VIP on a redundant link may not impact operations. However, it's still important to monitor this via an inspection mechanism. A liveness detection task can be developed to check and raise alarms at regular intervals.
This approach ensures a thorough and structured pre-festival inspection, significantly reducing the risk of issues during critical periods.
3 Classified by inspection level
Database inspections can be divided into ordinary inspections and in-depth inspections based on the inspection level. While pre-festival inspections (as mentioned earlier) are more comprehensive than daily inspections, they still fall under ordinary inspections and are not considered in-depth inspections.
So, what qualifies as an in-depth inspection? While ordinary inspections focus on operations and maintenance, primarily ensuring database availability, in-depth inspections go beyond this scope to focus on user experience and performance. In-depth inspections aim to supplement daily and pre-festival inspections, ensuring the database is not only available but also reliable and performant for the future.
In my opinion, an in-depth inspection comprises four key components:
- Availability Inspection
- Reliability Inspection
- Performance Inspection
- Analysis and Suggestions
1. Availability Inspection
In pre-festival inspections, availability is primarily evaluated from the perspective of operations and maintenance, focusing on service and instance levels. In-depth inspections, however, expand this view to consider application-level and business-level availability.
For example, during an in-depth inspection, we would examine the usage of auto-increment keys in each tenant's table. Many developers define int
without specifying whether it is signed or unsigned, which by default is unsigned. This reduces the available range of auto-increment keys by half. Additionally, auto-increment keys may be wasted if insertion fails and transactions are rolled back. As a result, even though the theoretical range may seem high (over 2.1 billion), the keys can be exhausted with as few as 1 billion rows. This scenario can lead to business-level unavailability if the table can no longer accept writes.
We once encountered a situation where a business analyzing transaction bill data exhausted its auto-increment keys within nine months due to heavy data insertion (5 million records daily, peaking at 9 million). The solution involved changing the key type from int unsigned
to bigint signed
. However, modifying the primary key column in MySQL requires locking the table, which affected the business and took 6 hours to complete.
Thus, in-depth inspections must extend availability checks to cover potential issues like these. For more availability inspections, readers can explore additional areas based on their experience.
2. Reliability Inspection
In addition to availability, in-depth inspections also address reliability. The pre-festival inspection already includes checks such as "switchability" and replication status, but in-depth inspections introduce the concept of core parameter checks, which cover:
- Verifying that database parameters comply with core handover and maintenance specifications.
- Ensuring parameter consistency between primary and standby databases.
- Confirming that running parameters align with the configuration file (
my.cnf
).
Core Parameter Inspection:This is crucial because most databases are handed over from various business departments rather than being deployed internally. It is necessary to ensure parameters like binlog_format = row
, gtid_mode = on
, sync_binlog = 1
, and others (about 80 in total) are correctly configured for data integrity and consistency.
For example, mismatched settings between primary and standby databases or inconsistent parameters between runtime and configuration files can lead to failures. In one instance, the failure to persist a critical parameter (innodb_buffer_pool_size) to the configuration file led to poor performance after a crash, despite a high-availability component restoring the instance.
In MySQL 5.7 or earlier, changing parameters involved two steps: modifying the global parameter and manually updating the configuration file. Human errors in this process can cause discrepancies, leading to reliability issues. In-depth inspections should focus on identifying and correcting such inconsistencies.
3. Performance Inspection
Performance inspections involve a range of checks, including:
- Primary Key Absence: MySQL works best with a primary key, preferably an
int signed
auto-increment key unrelated to the business. Tables without primary keys can negatively affect performance. - SQL Performance Optimization: Identify the top 10 slow queries and provide optimization recommendations to tenants. Similarly, capture SQL statements with the most frequent full table scans and assess their execution plans. Poor execution plans might not be immediately problematic but can become bottlenecks as data volumes grow.
- Additionally, review indexes:
- Redundant indexes (e.g., an index
idx_a
whenidx_a_b_c
already covers it) should be removed. - Invalid indexes (indexes that are never used) should be identified and removed after evaluation.
- Index discrimination: Evaluate whether columns have sufficient diversity in values for indexing. Columns with low discrimination (e.g., a gender column with only two values) may not benefit from indexing.
- Redundant indexes (e.g., an index
- MyISAM Storage Engine: MyISAM offers few advantages, and in 99% of cases, it should be disabled. Parameters can be set to prevent the creation of MyISAM tables:
disabled_storage_engines=ARCHIVE,BLACKHOLE,EXAMPLE,FEDERATED,MEMORY,MERGE,NDB,MyISAM
- If MyISAM tables are detected, they should be converted to InnoDB, the recommended storage engine for MySQL.
- Top 10 Large Tables: Tables larger than 100GB are highly resource-intensive during full table scans and DDL operations. Tables should be evaluated for potential splitting, archiving, or horizontal/vertical partitioning.
- A previous guideline recommended limiting a single MySQL instance to 500GB of data, with individual tables under 30GB and 10 million rows. With improved hardware, we now consider instances up to 2TB and tables up to 100GB, depending on the performance needs of the business.
4. Analysis and Suggestions
The goal of performance inspections is to provide data that tenants can use for self-analysis and optimization. Our reports include SQL-related data (slow queries, full table scans) and non-SQL-related information (index usage, table sizes), along with recommendations.
While we are not experts in SQL optimization, in the age of AI, tools like ChatDBA may provide faster and more precise insights into SQL performance issues.
ChatDBA can help you complete MySQL inspection
After having an in-depth understanding of MySQL database inspection, I believe everyone has realized the importance of efficient inspection for the stable operation of the database. And here, we would like to introduce to you a powerful tool that can help database management reach a higher level - ChatDBA.
ChatDBA is a database copilot Powered by LLM. It significantly improves the efficiency of database professionals by providing features such as database fault diagnosis, database professional knowledge learning, SQL generation, and SQL optimization through interactive dialogue. Whether for small and medium enterprises or large organizations, ChatDBA offers powerful database support to ensure smooth and efficient business operations.
If you're struggling with database management and troubleshooting, give ChatDBA a try. It offers a whole new database management experience, making your work more efficient and less stressful. Try it now and see the difference!