How to Troubleshoot 100% Server Load and CPU Usage: Expert Solutions for High Traffic and Resource Overload

高效码农

3 months ago

A Practical Guide to Troubleshooting 100% Server Load and CPU Usage

When a server shows 100% load and 100% CPU usage, it means the system has reached its maximum capacity. At this point, websites and applications may become extremely slow or completely unavailable. Many administrators think of restarting the server immediately, but that usually only offers temporary relief. This guide walks you through the causes, diagnosis, and actionable solutions in a structured way, ensuring you not only fix the issue but also prevent it from happening again.

1. Understanding Server Load and CPU Usage

Although often mentioned together, server load and CPU usage are not identical:

Load Average
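Represents the average number of processes that are running, waiting for a CPU, or (on Linux) blocked in uninterruptible I/O, reported over the last 1, 5, and 15 minutes. If the load consistently exceeds the number of CPU cores, the system is overburdened.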
CPU Usage
Shows how CPU time is split: user space, system (kernel) space, and I/O wait, with the remainder idle. When usage reaches 100%, the processor has no idle time left.
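To make the comparison between load and core count concrete, the sketch below divides a load value by the number of cores. The function name load_per_core is hypothetical, and the commented usage assumes a Linux system that provides /proc/loadavg and nproc.

```shell
# load_per_core AVG CORES
# Prints AVG divided by CORES to two decimals. Values that stay above
# 1.00 suggest more runnable work than the CPUs can absorb.
load_per_core() {
    awk -v avg="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", avg / cores }'
}

# Example on Linux (1-minute load average, all online cores):
# load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A load of 8.00 on a 4-core machine gives 2.00, meaning roughly twice as much work is queued as the CPUs can run at once.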

Common causes include:

Slow or complex database queries overloading MySQL.
PHP-FPM worker processes maxed out due to heavy request handling.
Traffic spikes or malicious requests draining resources.
Disk I/O bottlenecks making the CPU wait on storage operations.
Background tasks or malicious processes consuming resources.

2. Emergency Measures (First 1–5 Minutes)

When the system is already maxed out, the first priority is to stabilize the situation quickly.

2.1 Check Real-Time Status

uptime
top -b -n1 | head -n20

uptime shows the current load average.
top highlights which processes consume the most CPU.

2.2 Identify Resource-Heavy Processes

ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n20
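A per-process list can hide the real culprit when a service runs many workers that each use a little CPU. As a rough sketch, the hypothetical helper cpu_by_command below sums CPU percentage per command name from ps output.

```shell
# cpu_by_command: reads "<pcpu> <command>" lines on stdin and prints the
# total CPU percentage per command name, highest first.
cpu_by_command() {
    awk '{
        pct = $1
        $1 = ""
        sub(/^ +/, "")          # what remains is the command name
        total[$0] += pct
    }
    END { for (name in total) printf "%.1f %s\n", total[name], name }' |
        sort -rn
}

# Example: ps -e -o pcpu= -o comm= | cpu_by_command | head -n 10
```

Putting the percentage first keeps the parsing correct even when a command name contains spaces.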

2.3 Restart Services Under Stress

If MySQL or PHP is the culprit:

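systemctl restart php-fpm   # unit name varies by distro, e.g. php8.2-fpm
systemctl restart mysql     # may be mysqld or mariadb; drops all active connections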

2.4 Kill Problematic Processes

kill -15 <PID>   # Graceful stop
kill -9 <PID>    # Force stop
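The two signals are best used in sequence: ask the process to stop, give it time to clean up, and force it only if it ignores the request. A minimal sketch of that escalation follows; the name stop_process and the default grace period of 10 seconds are assumptions.

```shell
# stop_process PID [GRACE_SECONDS]
# Sends SIGTERM, waits up to GRACE_SECONDS for the process to exit,
# then falls back to SIGKILL.
stop_process() {
    _pid=$1
    _grace=${2:-10}
    kill -15 "$_pid" 2>/dev/null || return 0      # already gone
    _i=0
    while [ "$_i" -lt "$_grace" ]; do
        kill -0 "$_pid" 2>/dev/null || return 0   # exited gracefully
        sleep 1
        _i=$((_i + 1))
    done
    kill -9 "$_pid" 2>/dev/null                   # last resort
}

# Example: stop_process 12345 5
```

Forcing a database or PHP worker with SIGKILL can leave partial writes or stale locks behind, so the grace period is worth the few extra seconds.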

2.5 Block Malicious IPs

iptables -I INPUT -s 1.2.3.4 -j DROP   # -I puts the rule first, ahead of existing ACCEPT rules; 1.2.3.4 is a placeholder

3. Quick Diagnosis (5–30 Minutes)

Once the bleeding is stopped, the next step is finding the root cause.

3.1 CPU and Process Analysis

top    # Sort by CPU with P
htop   # If installed, more interactive

3.2 MySQL Process Check

mysql -u root -p -e "SHOW FULL PROCESSLIST\G"

Look for long-running queries.
Pay attention to State values like Sending data or Copying to tmp table.

3.3 Enable Slow Query Logging

mysql -u root -p -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 1;"

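These settings last until MySQL restarts, and the new long_query_time applies only to connections opened afterwards. The slow query log then records every statement that runs longer than 1 second, which usually exposes the queries behind sustained CPU load.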
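Once the log has collected some entries, a quick summary shows whether the problem is a few very slow statements or a steady stream of moderately slow ones. The sketch below relies on the "# Query_time:" header line that MySQL writes before each logged statement; the function name slow_log_summary and the log path in the example are assumptions.

```shell
# slow_log_summary: reads a MySQL slow query log on stdin and prints the
# number of logged statements and the longest Query_time in seconds.
slow_log_summary() {
    awk '/^# Query_time:/ {
             n++
             if ($3 > max) max = $3
         }
         END { printf "entries=%d max_query_time=%.2f\n", n, max }'
}

# Example (path is an assumption; check the slow_query_log_file variable):
# slow_log_summary < /var/lib/mysql/hostname-slow.log
```

For a full breakdown per query pattern, a dedicated tool such as pt-query-digest is the better choice.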

3.4 PHP-FPM Process Status

ps -eo pid,cmd,%cpu,%mem --sort=-%cpu | grep '[p]hp-fpm' | head   # the brackets keep grep from matching itself

A worker count at or near pm.max_children means the pool is saturated and new requests are queuing.
Enable the PHP-FPM slowlog to capture problematic scripts.

3.5 Disk and I/O Monitoring

iostat -x 1 3
vmstat 1 5

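High %iowait in iostat (the wa column in vmstat) signals a storage bottleneck. The first report from each tool is an average since boot, so judge by the later samples.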
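Reading the wa column by eye across several samples is error-prone under pressure. As a sketch, the hypothetical helper avg_iowait below averages it from vmstat output, locating the column by its header name and skipping the first sample because it reflects the average since boot.

```shell
# avg_iowait: reads vmstat output on stdin and prints the mean of the
# "wa" column, ignoring the first data row (average since boot).
avg_iowait() {
    awk '
        $1 == "r" {                               # column header row
            for (i = 1; i <= NF; i++) if ($i == "wa") col = i
            next
        }
        col && $1 ~ /^[0-9]+$/ {                  # data row
            if (seen++) { sum += $col; n++ }
        }
        END { if (n) printf "%.1f\n", sum / n }
    '
}

# Example: vmstat 1 5 | avg_iowait
```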

3.6 Network Connections

ss -tunp | head
netstat -anp | grep ESTABLISHED | wc -l

Check for floods of suspicious connections.
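A total connection count does not show whether a handful of clients is responsible. The sketch below groups connections by remote address; peers_by_count is a hypothetical name, and the commented usage assumes the iproute2 version of ss, where the peer address is expected in column 4 when a state filter is used.

```shell
# peers_by_count: reads peer "address:port" values on stdin, strips the
# port, and prints "<connections> <address>" sorted by count.
peers_by_count() {
    sed 's/:[0-9]*$//' | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Example:
# ss -Htn state established | awk '{ print $4 }' | peers_by_count | head
```

An address holding hundreds of connections while typical clients hold a few is a reasonable candidate for closer inspection before any firewall rule is added.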

4. Targeted Solutions

Case A: MySQL Using Excessive CPU

Identify long queries and terminate them:

KILL QUERY <id>;

Analyze with EXPLAIN and adjust indexes.
Watch for JOIN or ORDER BY clauses on columns that have no index.
Increase innodb_buffer_pool_size if memory allows.

Case B: PHP-FPM Worker Overload

Adjust worker limits in www.conf. The value 30 below is only an example; a safe ceiling depends on how much memory each worker uses:

pm.max_children = 30
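A common rule of thumb, offered here as an assumption rather than an official formula, is to divide the memory you can spare for PHP by the average size of one worker. The helper name estimate_max_children and the numbers in the example are illustrative.

```shell
# estimate_max_children BUDGET_MB WORKER_MB
# Prints how many workers of WORKER_MB fit into BUDGET_MB (rounded down).
estimate_max_children() {
    echo $(( $1 / $2 ))
}

# Average worker size in MB can be sampled roughly like this (the process
# name php-fpm is an assumption; it may carry a version suffix):
# ps -C php-fpm -o rss= | awk '{ s += $1; n++ } END { if (n) printf "%d\n", s / n / 1024 }'

# Example: 2048 MB reserved for PHP, workers averaging 64 MB
# estimate_max_children 2048 64
```

Setting the limit higher than memory allows trades a CPU problem for swapping, which is usually worse.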

Enable slowlog for diagnosis:

request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm/www-slow.log
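After the slowlog has been active for a while, counting entries per script points at the code paths worth optimizing first. The sketch below assumes each slowlog entry contains a "script_filename = ..." line; the function name slow_scripts is hypothetical.

```shell
# slow_scripts: reads a PHP-FPM slowlog on stdin and prints
# "<occurrences> <script path>", most frequent first.
slow_scripts() {
    awk -F ' = ' '/^script_filename = / { count[$2]++ }
        END { for (s in count) print count[s], s }' | sort -rn
}

# Example: slow_scripts < /var/log/php-fpm/www-slow.log
```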

Gracefully end runaway workers:

kill -15 <PID>

Case C: Traffic Spikes or Crawlers

Limit request rates in Nginx:

# In the http block:
limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
# In the server or location block to protect:
limit_req zone=one burst=20 nodelay;

Add a CDN or WAF for protection.
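Before settling on a rate such as 10r/s, it helps to see how requests are actually distributed across clients. The sketch below counts requests per client from an access log; it assumes each line starts with the client address, as in the default Nginx combined format, and both the function name top_ips and the log path are assumptions.

```shell
# top_ips [N]: reads access log lines on stdin and prints the N busiest
# client addresses as "<requests> <address>" (default N = 10).
top_ips() {
    awk '{ print $1 }' | sort | uniq -c | sort -rn |
        head -n "${1:-10}" | awk '{ print $1, $2 }'
}

# Example: tail -n 50000 /var/log/nginx/access.log | top_ips 5
```

If one address dominates the list, a targeted block or a stricter limit for that client may be enough, without throttling legitimate visitors.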

Case D: Disk I/O Overload

Check for large backups or log writes.
Move scheduled jobs to off-peak hours.
Upgrade to SSDs or higher IOPS storage.

Case E: Malicious or Rogue Processes

Use ps aux --sort=-%cpu to spot unknown processes.
If cryptominers or backdoors are detected, isolate the server, back up data, and redeploy securely.

5. Limiting Resource Usage Per Process

When optimization is not immediately possible, limit resource consumption to keep the system alive.

5.1 Adjust Process Priority

renice +10 -p <PID>

5.2 Limit CPU Usage

cpulimit -p <PID> -l 40   # cap the process at roughly 40% of one core

5.3 Apply CPU Quota via systemd

systemctl set-property --runtime php-fpm CPUQuota=70%

6. Long-Term Optimization

6.1 Database Improvements

Analyze slow query logs regularly.
Use tools like pt-query-digest.
Introduce caching layers (e.g., Redis).
Consider read/write splitting.

6.2 Application Layer Enhancements

Implement Nginx FastCGI caching.
Reduce unnecessary dynamic rendering.
Avoid blocking API calls.

6.3 Infrastructure Scaling

Add load balancers.
Deploy multiple web servers.
Use replication or clustering for databases.

6.4 Monitoring and Alerts

Set up Prometheus, Grafana, or Zabbix.
Configure alerts for CPU, memory, I/O, and MySQL connections.
Detect anomalies early.

6.5 Security Measures

Regularly scan for malicious cron jobs or rootkits.
Deploy fail2ban for brute-force prevention.
Enable cloud provider DDoS protection.

7. Actionable Checklist

Confirm usage: Use top / ps to find heavy processes.
Inspect MySQL: Run SHOW FULL PROCESSLIST for slow queries.
Check PHP-FPM: Reload service and monitor slowlog.
Review disk and network: Use iostat and ss tools.
Emergency actions: Enable maintenance mode, firewall rules, or kill rogue processes.
Optimize long term: Index tuning, caching, scaling, monitoring.

8. Conclusion

When servers hit 100% load and CPU usage, the solution is not a blind reboot. The proper workflow is:

Stop the bleeding — restart or limit critical services.
Diagnose the root cause — through process lists, slow queries, and logs.
Apply targeted fixes — optimize SQL, tune PHP-FPM, throttle traffic.
Plan for the long term — caching, scaling, monitoring, and security.

Server stability comes from combining immediate actions, careful analysis, and proactive optimization.