MySQL Performance Benchmarking: From Manual Tests to Production-Ready, Multi-Environment Analysis
What core problem does this article solve? It provides a complete, repeatable workflow for benchmarking MySQL performance using sysbench and tsar, transforming raw numbers into actionable insights for infrastructure decisions.
Performance testing is often treated as an afterthought—run a few commands, glance at the QPS, and call it a day. But when you’re choosing between cloud providers, validating new hardware, or tuning critical database parameters, gut feelings aren’t enough. You need precise, reproducible data aligned with system metrics. This guide walks through an integrated benchmarking suite that automates testing, captures second-level system performance, and generates comparable reports across multiple environments, giving you the confidence to make data-driven architecture choices.
What Makes This Tooling Different from Running Sysbench Manually?
Why not just run sysbench directly? Because manual testing creates fragmented data, misaligned monitoring, and results that can’t be compared across time or environments.
Point-select queries, read-only transactions, read-write mixes, and write-only workloads each stress different subsystems. Running them manually means preparing data inconsistently, remembering to start monitoring at the right time, and collating results by hand. This toolkit eliminates those gaps by enforcing a standardized sequence: controlled data preparation, execution during precisely monitored windows, and automatic report generation. The result is a single source of truth you can archive, revisit, and compare.
Application Scenario: Validating a New Storage Engine Before Migration
Imagine your team plans to migrate from MySQL 5.7 to 8.0 to leverage the new data dictionary and improved locking. Leadership wants proof that performance won’t degrade. Using this suite, you configure identical test parameters on both versions—same table count (16), row count (10 million), thread sequence (1 to 128), and test duration (30 seconds). After running ./mysql_benchmark.sh on each, you use merge_reports.py to produce a side-by-side report. The markdown output reveals that 8.0’s oltp_read_write TPS is 12% higher at 64 threads, but 95th percentile latency increased by 0.3 ms. This data lets you decide whether the throughput gain justifies the latency cost, and you have the CPU/IO utilization curves to back up your recommendation.
Author’s Reflection: The Cost of Ignoring Monitoring Alignment
Early in my benchmarking work, I ran sysbench while manually capturing iostat output in a separate terminal. The results looked impressive until I realized my iostat snapshots didn’t match the actual test intervals—there was a 15-second gap where the system was idle before I started watching. The QPS numbers were correct, but the CPU and IO metrics were meaningless. That mistake taught me that temporal precision is non-negotiable for credible analysis. This tool’s forced coupling of test execution and tsar logging ensures you never waste time on uncorrelated data.
How Do You Build a Trustworthy Testing Environment?
What are the non-negotiable requirements for a valid benchmark environment? Both the load generator and the database server must be correctly provisioned, and any deviation invalidates your results.
The client machine must be on the same low-latency network as the MySQL server. Testing across a WAN or through a VPN introduces network jitter that masks database performance. The server must have tsar running with the correct disk device name—this is where people fail most often.
Setting Up the Load-Generating Client
The client needs sysbench 1.0+ and Python 3.6+. On CentOS/RHEL/AlmaLinux, install with yum. On Debian/Ubuntu, use apt-get. Verify versions before proceeding.
# CentOS/RHEL/AlmaLinux
sudo yum install -y sysbench python3
# Debian/Ubuntu
sudo apt-get install -y sysbench python3
# Verify
sysbench --version # Should be 1.0.x or higher
python3 --version # Should be 3.6.x or higher
Application Scenario: Your CI pipeline needs to run nightly benchmarks against a staging database. You create a dedicated EC2 instance in the same VPC and Availability Zone as the RDS instance. By keeping the client and server within a 1 ms RTT, you ensure that network latency doesn’t become the bottleneck, allowing you to detect true database regressions from code deployments.
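A quick way to sanity-check that proximity before any run is a simple RTT probe from the client; the ~1 ms threshold is a rule of thumb, not a hard requirement of the toolkit.
# Rough RTT check from the load-generating client (replace the IP with your MySQL host)
ping -c 10 192.168.1.50 | tail -2
# Look at the avg value in the rtt line; anything well above ~1 ms suggests the
# network, not the database, may dominate your latency numbers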
Preparing the MySQL Server and Initiating Monitoring
Install tsar on the MySQL host. If it’s not in your package manager, compile from source. Then, identify your data disk device name—this is critical.
# On the MySQL server, find the data volume
lsblk | grep -E '(nvme|sda|vda)'
# Example output:
# nvme0n1 259:0 0 1T 0 disk /data/mysql
# Launch tsar with the exact device name (no full path)
nohup tsar --cpu --io -I nvme0n1 -l -i1 >/tmp/tsar.log 2>&1 &
# Verify it's collecting data
tail -f /tmp/tsar.log # Press Ctrl+C to exit
Set the prepared statement limit high enough for sysbench’s concurrency model:
-- Connect to MySQL and execute
SET GLOBAL max_prepared_stmt_count = 1048576;
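It is worth confirming the change took effect before launching a long run; a minimal check from the client, assuming the connection details used elsewhere in this guide:
# Verify the limit (sysbench opens many prepared statements per thread)
mysql -h 192.168.1.50 -P 3306 -u sbtest -p -e "SHOW GLOBAL VARIABLES LIKE 'max_prepared_stmt_count';"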
Author’s Reflection: The Disk Device Name Trap
I once spent two hours debugging empty tsar logs, only to discover I used nvme0n1p1 (a partition) instead of nvme0n1 (the block device). Tsar silently failed to write data. Now, my first step is always lsblk without filters to see the raw device topology, then mount points with df -hT. Matching the device name exactly is a tiny detail that determines whether your entire test has systemic metrics or none at all.
What Do Configuration Parameters Actually Control?
How does each line in benchmark_config.conf translate into test behavior and data volume? Every parameter shapes workload realism and comparability.
The configuration file is the blueprint. Changing table count or row size alters the working set, which directly impacts whether your test hits disk or lives in the buffer pool. Thread count progression determines if you find the system’s true saturation point or skip past it.
Database Connection and Test Data Scale
# Connection settings
MYSQL_HOST=192.168.1.50 # Use internal IP, not public
MYSQL_PORT=3306
MYSQL_USER=sbtest
MYSQL_PASSWORD=your_password_here
MYSQL_DB=sbtest
# Data scale
TABLES=16 # Number of test tables
TABLE_SIZE=10000000 # Rows per table (10 million)
With these defaults, you generate 160 million rows. At roughly 250 bytes per row with indexes, that’s about 40 GB of data. For a server with 30 GB of memory and a 20 GB buffer pool, this ensures not all data fits, forcing realistic I/O and eviction behavior.
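If you change TABLES or TABLE_SIZE, redo that arithmetic before committing to a long prepare. A rough estimate plus a post-prepare check against information_schema (the 250 bytes/row figure is an approximation, and the host/credentials are the example values from the config):
# Back-of-envelope estimate: tables x rows x ~250 bytes per row (incl. indexes)
TABLES=16; TABLE_SIZE=10000000
echo "$(( TABLES * TABLE_SIZE * 250 / 1024 / 1024 / 1024 )) GB (approx.)"
# After the prepare stage, measure the real footprint of the sbtest schema
mysql -h 192.168.1.50 -u sbtest -p -e "SELECT ROUND(SUM(data_length+index_length)/1024/1024/1024,1) AS size_gb FROM information_schema.tables WHERE table_schema='sbtest';"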
Application Scenario: You’re evaluating two AWS RDS instance types: db.r5.2xlarge (64 GB RAM) and db.r5.4xlarge (128 GB RAM). For the first test, you keep TABLE_SIZE=10000000 and TABLES=16. The 40 GB dataset fits comfortably in both buffer pools, so QPS is similar. You then increase to TABLES=64 while keeping TABLE_SIZE=10000000, growing the dataset to roughly 160 GB. Now the 2xlarge shows 40% lower QPS due to increased I/O wait, while the 4xlarge maintains performance, justifying its higher cost for your workload size.

Test Duration and Data Preparation Strategy
TEST_TIME=30 # Seconds per thread count
NEED_PREPARE=true # Boolean: generate fresh data?
A 30-second duration per thread count is the sweet spot: long enough to stabilize metrics, short enough for rapid iteration. For quick smoke tests, drop this to 10 seconds. The NEED_PREPARE flag saves enormous time—data generation for 16 tables × 10M rows can take 30 minutes. Once prepared, reuse the dataset by setting this to false for parameter tuning tests.
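For reference, the prepare step that NEED_PREPARE=true drives boils down to a standard sysbench prepare invocation; a sketch with the config values above (the wrapper script's exact flags may differ):
# Data generation only; this is the ~30-minute step you want to skip on reruns
sysbench oltp_read_write \
  --mysql-host=192.168.1.50 --mysql-port=3306 \
  --mysql-user=sbtest --mysql-password=your_password_here --mysql-db=sbtest \
  --tables=16 --table-size=10000000 --threads=16 \
  prepare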
Author’s Reflection: The Reproducibility Rule
I learned to version-control my config files. In one project, I tuned innodb_flush_log_at_trx_commit but forgot I had also adjusted TABLE_SIZE during a “quick test.” The performance delta I attributed to the parameter was actually due to the smaller dataset fitting entirely in memory. Changing only one variable at a time is scientific method 101, but it’s easy to forget when you’re moving fast. Git-committing configs for each test run saved me from repeating that mistake.
Selecting Scenarios and Thread Concurrency
SCENARIOS="oltp_point_select oltp_read_only oltp_read_write oltp_write_only"
THREADS="1 8 16 32 64 128"
Choose scenarios that mirror your production load. A read-heavy SaaS app might only need point_select and read_only. A write-intensive analytics pipeline should focus on write_only. The thread sequence must be gradual. Jumping from 1 to 128 threads hides the saturation curve—you might miss that performance plateaus at 48 threads and degrades beyond 64.
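For example, a read-heavy profile might trim the config like this while keeping the thread ramp intact (illustrative values):
# Read-heavy SaaS profile: drop the write scenarios, keep the full thread ramp
SCENARIOS="oltp_point_select oltp_read_only"
THREADS="1 8 16 32 64 128"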
What Happens Step-by-Step During a Test Run?
How does the mysql_benchmark.sh script orchestrate sysbench and tsar to produce a unified result? The script handles sequencing, timing, and data collection that would be error-prone to manage manually.
When executed, the script performs a linear, repeatable pipeline: validate connections, prepare data if needed, loop through scenarios, loop through threads within each scenario, run the test, capture timestamps, and extract the matching tsar slice.
The Execution Pipeline
# Launch the full test sequence
./mysql_benchmark.sh benchmark_config.conf
# Equivalent to running:
# sysbench oltp_point_select ... --threads=1 --time=30 run
# sysbench oltp_point_select ... --threads=8 --time=30 run
# ... up to 128 threads
# then repeats for oltp_read_only, oltp_read_write, oltp_write_only
For each combination, sysbench outputs intermediate results every second. The script captures the final summary line (total queries, latency percentiles) and records the exact start and end timestamps.
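Conceptually, the timing side of each run looks like the sketch below; this is illustrative rather than the script's actual code, but it shows why each tsar slice can be matched one-to-one to its test window.
# Illustrative timing pattern for a single scenario/thread combination
START_TS=$(date +%s)
sysbench oltp_read_write \
  --mysql-host=192.168.1.50 --mysql-user=sbtest --mysql-password=your_password_here \
  --mysql-db=sbtest --tables=16 --table-size=10000000 \
  --threads=64 --time=30 --report-interval=1 run > oltp_read_write_64.log
END_TS=$(date +%s)
# With tsar sampling at -i1, the window maps to roughly (END_TS - START_TS) samples
echo "test window: $((END_TS - START_TS)) seconds"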
Application Scenario: You need to generate a performance baseline every night for regression detection. You cron the script at 2 AM. Behind the scenes, it prepares data once, then runs 24 sequential tests (4 scenarios × 6 thread counts). Each test’s output is saved to a separate log file (e.g., oltp_point_select_64.log). After completion, generate_report.py reads all logs, parses the summary lines, and correlates timestamps with tsar data to build the final HTML and Markdown reports, ready for review at 9 AM.
Handling Custom and Incremental Tests
# Run a 10-second smoke test
./mysql_benchmark.sh benchmark_config.conf 10
# Skip data preparation, reuse existing tables
./mysql_benchmark.sh benchmark_config.conf 30 false
These options enable rapid iteration. When tuning innodb_buffer_pool_instances, you don’t want to wait 30 minutes for data prep each time. After the initial NEED_PREPARE=true run, subsequent iterations use false to test different parameter values against a stable dataset.
Author’s Reflection: The Value of Timestamp Precision
Early scripts I wrote relied on approximate timing, which caused mismatched metrics. Once, a 30-second test ran from 14:00:05 to 14:00:35, but my monitoring script captured 14:00:00 to 14:00:30. The CPU spike from the test was cut off. This toolkit’s use of second-level timestamps and explicit extraction of the exact window eliminated that class of errors. Precision isn’t pedantic; it’s what separates signal from noise.
How Do You Extract Actionable Insights from Generated Reports?
What does performance_report.html actually tell you beyond raw QPS numbers? The report correlates throughput with resource utilization, revealing whether you’re CPU-bound, I/O-bound, or contention-limited.
The report presents a matrix: scenarios as rows, thread counts as columns, with QPS/TPS, latency percentiles, average CPU%, and average IO% in each cell. This layout immediately shows if performance scales linearly with threads and which resource saturates first.
Reading the Performance Matrix
| Scenario | Threads | QPS | 95% Latency (ms) | CPU% | IO% |
|---|---|---|---|---|---|
| oltp_point_select | 1 | 17,109 | 0.06 | 1.2 | 0.1 |
| oltp_point_select | 128 | 321,410 | 0.41 | 12.9 | 0.1 |
| oltp_write_only | 1 | 12,740 | 0.08 | 0.8 | 2.1 |
| oltp_write_only | 128 | 139,564 | 1.83 | 6.4 | 52.3 |
Key Patterns to Identify:
- Linear Scaling: QPS doubles when threads double, CPU% rises, IO% stays flat → healthy scaling.
- Saturation: QPS plateaus while CPU% hits 95% → CPU bottleneck. Consider faster CPUs or query optimization.
- IO Starvation: QPS ceiling with IO% at 90%+ and low CPU% → disk bottleneck. Evaluate NVMe upgrades or innodb_io_capacity tuning.
- Latency Explosion: 95% latency jumps from 1 ms to 20 ms as threads increase → lock contention or thread scheduling overhead. Investigate innodb_spin_wait_delay or reduce active threads.
Application Scenario: Your team observes that production MySQL hits latency spikes during peak hours. You run oltp_read_write and see that at 64 threads, QPS is acceptable but 95% latency is 8 ms. At 128 threads, QPS barely increases, but latency balloons to 45 ms and CPU% is only 70%. This suggests lock contention, not CPU exhaustion. You adjust innodb_lock_wait_timeout and innodb_rollback_on_timeout, rerun the test, and see latency drop to 12 ms at 128 threads—confirming the fix before production deployment.
Understanding the Monitoring Sample Count
The report includes “Sample Count,” which should equal TEST_TIME. If you see 30 seconds but only 28 samples, tsar missed 2 seconds—likely due to system load or disk hiccups. Consistently low sample counts indicate unstable monitoring and invalidate the CPU/IO averages.
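If you want to verify this outside the report, count the lines of the per-test tsar slice (the file name here is illustrative; match whatever naming your run produced):
# Each tsar line is one second of data; expect a count close to TEST_TIME
wc -l < tsar_oltp_read_write_64.log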
Author’s Reflection: The Story Behind the Numbers
I once presented a report showing 300,000 QPS at 128 threads. The VP asked, “But what’s the catch?” The catch was CPU% at 98%. The number looked great, but there was no headroom for traffic bursts. The report’s value isn’t the peak QPS; it’s the resource cost to achieve it. Now I always highlight utilization alongside throughput to frame realistic capacity planning.
How Do You Compare Performance Across Different Environments?
How can you objectively compare your data center hardware against cloud instances? The merge_reports.py script aggregates individual reports into a single comparative document, normalizing configurations and highlighting deltas.
Running tests in isolation is useful; comparing them is transformative. Whether you’re evaluating cloud providers, CPU generations, or storage backends, you need an apples-to-apples summary. The merge script reads each environment’s performance_report.md, extracts key metrics, and constructs unified comparison tables.
The Multi-Environment Workflow
Assume you’ve benchmarked three setups:
- idc/: Self-hosted, Intel Xeon 6248R, 64 GB, SATA SSD
- hw_cloud/: Huawei Cloud, Kunpeng 920, 64 GB, NVMe SSD
- ali_cloud/: Alibaba Cloud, Intel 8269CY, 64 GB, ESSD
Directory structure:
sysbench_report/
├── merge_reports.py
├── idc/
│ └── performance_report.md
├── hw_cloud/
│ └── performance_report.md
└── ali_cloud/
└── performance_report.md
Execute the merge:
python3 merge_reports.py idc,hw_cloud,ali_cloud
The resulting mysql_sysbench.md includes:
- Environment Config Table: Lists CPU model, memory, innodb_buffer_pool_size (auto-converted to GB)
- Performance Summary: Side-by-side QPS/TPS for all scenarios and thread counts
- Delta Analysis: Percentage difference calculations
- Latency Comparison: 95% latency across environments
- Executive Summary: Automatic highlights like “hw_cloud shows 35% higher write throughput at 52% lower CPU”
Application Scenario: Your CFO wants to move from CapEx (self-hosted) to OpEx (cloud) but demands proof of cost-efficiency. You run identical benchmarks on your current hardware and two cloud providers. The merged report shows that while Ali Cloud costs 30% more annually, it delivers 25% higher throughput with 20% lower latency, allowing you to consolidate three database nodes into two. The TCO analysis, backed by hard benchmark data, secures approval for the migration.
Advanced Analysis with Detailed Merging
For deeper analysis, use merge_reports_v2.py:
python3 merge_reports_v2.py idc,hw_cloud,ali_cloud
This generates mysql_sysbench_v2.md with:
- Performance Scaling Curves: QPS vs. Threads charts per scenario
- Resource Efficiency Ratios: QPS-per-CPU% and TPS-per-IO% metrics
- Smart Recommendations: Suggests optimal thread counts and instance types based on workload patterns
Author’s Reflection: The Political Power of Neutral Data
In a budget meeting, I once faced pushback against a cloud migration: “Our servers are fine.” The merged report from this suite showed that our oltp_write_only performance was 40% lower than a comparable cloud instance, and our CPUs were saturated at 95% during peak, causing queue pile-ups. The data wasn’t opinion—it was neutral, reproducible, and undeniable. Benchmarking tools don’t just measure performance; they provide objective ammunition for critical business decisions.
What Are the Most Common Pitfalls and How Do You Tune Performance?
Why do benchmarks sometimes produce misleading or unusable results? The root causes are usually misconfiguration, resource exhaustion, or parameter mismatch. The troubleshooting guide addresses the top failure modes.
Even with automation, environmental issues arise. The key is to recognize patterns quickly and have a checklist to isolate root causes without rerunning hours of tests.
Diagnosing and Resolving Top Issues
Problem: Empty tsar logs
# On the MySQL host, verify process and device
ps aux | grep tsar
ssh root@MYSQL_HOST "cat /tmp/tsar.log"
# If empty, restart with correct device name
ssh root@MYSQL_HOST "pkill tsar; nohup tsar --cpu --io -I nvme0n1 -l -i1 >/tmp/tsar.log 2>&1 &"
Root Cause: 99% of the time, the device name after -I is wrong. It must be the block device, not a partition.
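A quick way to see only whole disks and avoid the partition trap is lsblk's no-dependencies mode:
# -d hides partitions; the NAME column is what goes after tsar's -I
lsblk -d -o NAME,TYPE,SIZE,MOUNTPOINT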
Problem: MySQL connection failures
# Test connectivity from the client
mysql -h MYSQL_HOST -P 3306 -u sbtest -p
# Check if MySQL is bound to localhost
SHOW VARIABLES LIKE 'bind_address';
Root Cause: Firewall, security group, or MySQL’s bind_address restricting access.
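A few checks on the server side usually narrow this down quickly (the commands assume a RHEL-family host with firewalld; adjust for your distribution):
# Is mysqld listening, and on which address? 127.0.0.1 means remote clients are locked out
ss -ltnp | grep 3306
# Is the port open in the host firewall?
firewall-cmd --list-ports
# Where is bind-address configured?
grep -rn 'bind.address' /etc/my.cnf /etc/my.cnf.d/ 2>/dev/null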
Problem: max_prepared_stmt_count exceeded
Fix: Permanent configuration in my.cnf:
[mysqld]
max_prepared_stmt_count = 1048576
Then restart MySQL, or apply it immediately at runtime with the SET GLOBAL statement shown earlier.
Problem: Prepare stage fails with disk full
Prevention: Check df -h before starting. For 16 tables × 10M rows, ensure 50 GB free.
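A small pre-flight guard can catch this before the 30-minute prepare starts; a sketch assuming the data directory from the earlier lsblk example:
# Abort early if the data partition has less than 50 GB available
FREE_GB=$(df -BG --output=avail /data/mysql | tail -1 | tr -dc '0-9')
if [ "$FREE_GB" -lt 50 ]; then
  echo "Only ${FREE_GB} GB free on /data/mysql; the prepare stage is likely to fail"
fi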
Performance Tuning Based on Benchmark Feedback
MySQL Parameter Tuning:
-- If point_select QPS is low and IO% is high
SET GLOBAL innodb_buffer_pool_size = 51539607552; -- 48 GB in bytes (~70% of RAM); the variable takes bytes, not '48G'
-- If write_only latency is high and IO% is moderate
SET GLOBAL innodb_flush_log_at_trx_commit = 2; -- Trade durability for speed
-- If you see "Too many connections" errors
SET GLOBAL max_connections = 1000;
System-Level Checks:
# Check for CPU throttling (e.g., in VMs)
grep -i throttled /var/log/messages
# Monitor for swap usage (performance killer)
vmstat 1 20 # 20 one-second samples; piping a continuous vmstat into tail never returns
# Validate disk performance independently
fio --name=test --ioengine=libaio --direct=1 --rw=randwrite --bs=16k --size=1G --numjobs=8 --runtime=60 --group_reporting
Application Scenario: Your oltp_read_write test shows high latency but CPU% and IO% are both under 60%. This suggests lock contention. You query performance_schema:
SELECT * FROM performance_schema.events_waits_summary_global_by_event_name
WHERE event_name LIKE '%innodb%lock%';
Finding high waits on innodb_row_lock_waits, you increase innodb_lock_wait_timeout from 50 to 120 seconds, rerun the test, and see latency drop 30%—confirming the fix before applying it to production.
Author’s Reflection: The Incremental Tuning Principle
I used to change five MySQL parameters at once after a benchmark, hoping for a magic combination. When performance improved, I didn’t know which change helped. When it worsened, I had no way to roll back systematically. Now, I change one parameter per test run, document it in the config file’s comments, and treat each benchmark as a controlled experiment. This slower approach yields faster learning because cause and effect are clear.
Action Checklist / Implementation Steps
Ready to run your first benchmark? Follow this sequence exactly:
- [ ] Client Preparation: Provision a Linux host in the same data center/VPC as MySQL. Install sysbench and python3. Verify versions.
- [ ] Server Preparation: Install tsar on the MySQL host. Identify the exact block device name for the data disk using lsblk (e.g., nvme0n1, sda).
- [ ] MySQL Configuration: Create a dedicated test user with full privileges on a test database. Set max_prepared_stmt_count=1048576 globally.
- [ ] SSH Access: Configure passwordless SSH from the client to the MySQL server’s root user. Test with ssh root@MYSQL_HOST "hostname".
- [ ] Disk Space: On the MySQL host, run df -h on the data partition. Ensure at least 50 GB free for the default dataset.
- [ ] Configuration File: Copy benchmark_config.conf to prod_test.conf. Fill in MYSQL_HOST, PORT, USER, PASSWORD, DB, and confirm TABLES=16, TABLE_SIZE=10000000.
- [ ] Launch Monitoring: SSH into the MySQL host and start tsar: nohup tsar --cpu --io -I YOUR_DEVICE_NAME -l -i1 >/tmp/tsar.log 2>&1 &. Verify logging with tail.
- [ ] Execute Test: On the client, run ./mysql_benchmark.sh prod_test.conf. Do not interrupt; let the full sequence complete (approx. 2 hours for all scenarios and threads).
- [ ] Validate Results: Check the generated directory mysql_benchmark_YYYYMMDD_HHMMSS/. Open performance_report.html and confirm sample counts match TEST_TIME.
- [ ] Archive and Compare: For multi-environment analysis, rename the output directory to something descriptive (e.g., mysql_benchmark_hw_cloud_nvme). Run tests on other environments, then use merge_reports.py to create a unified comparison.
One-Page Overview
Tool Purpose: Automated MySQL benchmarking using sysbench with synchronized system monitoring via tsar, producing reproducible HTML and Markdown reports suitable for single-run analysis and multi-environment comparison.
Core Workflow:
- Install: sysbench and python3 on the client; tsar on the MySQL server.
- Configure: Edit benchmark_config.conf with connection details and test scale (16 tables × 10M rows ≈ 40 GB dataset).
- Monitor: Start tsar on the server with the correct data disk device name.
- Execute: Run ./mysql_benchmark.sh your_config.conf. The script prepares data, runs four scenarios across six thread counts, and collects logs.
- Report: Generates performance_report.html and .md with QPS/TPS, latency, CPU%, IO%.
- Compare: Use python3 merge_reports.py env1,env2,env3 to create side-by-side performance summaries.
Key Configuration Parameters:
- SCENARIOS: oltp_point_select, oltp_read_only, oltp_read_write, oltp_write_only
- THREADS: 1 8 16 32 64 128 (a gradual progression is critical for finding the saturation point)
- TEST_TIME: 30 seconds recommended for stability
- max_prepared_stmt_count: Must be 1048576 to prevent sysbench errors
Output: Timestamped directory containing reports and raw logs. Merge tool produces mysql_sysbench.md for multi-env analysis.
Frequently Asked Questions (FAQ)
Q1: Can I run the benchmark against a remote cloud database where I don’t have SSH root access?
A: No. The tool requires SSH root access to the MySQL server to retrieve tsar logs. For cloud-managed databases (e.g., RDS), you cannot install tsar or access the underlying host. Use self-hosted instances or VMs where you control the OS.
Q2: How long does a full benchmark take with default settings?
A: Approximately 2 hours: 30 minutes for data preparation (sysbench prepare) plus 1.5 hours for running all tests (4 scenarios × 6 thread counts × 30 seconds each, plus overhead for log processing and report generation).
Q3: What is the minimum hardware requirement for the MySQL server?
A: The default dataset is 40 GB. The server should have at least 8 GB of RAM to run MySQL and tsar without swapping, but 32 GB is recommended to observe realistic buffer pool behavior. Disk space must have 50 GB free.
Q4: Why is the merge script sensitive to directory naming?
A: merge_reports.py expects each environment’s directory to contain a file named exactly performance_report.md. If you rename the report files, the script will fail. Keep the original filenames and place them in descriptive directories (e.g., aliyun_rds/).
Q5: Can I test MySQL 5.6 or MariaDB?
A: The tool targets MySQL 5.7+ and 8.0+. While sysbench itself may work with older versions, the parameter max_prepared_stmt_count and certain performance_schema assumptions align with 5.7+. Use at your own risk with older versions.
Q6: How do I interpret a high QPS but also high 95% latency?
A: This usually indicates a “long tail” problem—most queries are fast, but a fraction is very slow (locks, stalls). For user-facing applications, the 95% latency matters more than average QPS. Tune for latency consistency even if it means slightly lower peak QPS.
Q7: Is it safe to run this on a production MySQL instance?
A: No. The NEED_PREPARE=true stage drops and recreates the test database, which is destructive. Even with false, the high concurrency and load can starve resources from production traffic. Always run on dedicated staging or benchmarking instances that mirror production specs.
Q8: The generated HTML report is blank or malformed. What went wrong?
A: This typically means Python’s jinja2 or pandas modules are missing. Install them via pip. Also, check that the log directory path passed to generate_report.py is correct and contains .log files from a completed test run.
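If that is the cause, installing the two modules for the same Python 3 interpreter you run the scripts with usually resolves it:
# Install the report dependencies for the current user
python3 -m pip install --user jinja2 pandas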
