A Practical Guide to Troubleshooting 100% Server Load and CPU Usage
When a server shows 100% load and 100% CPU usage, it means the system has reached its maximum capacity. At this point, websites and applications may become extremely slow or completely unavailable. Many administrators think of restarting the server immediately, but that usually only offers temporary relief. This guide walks you through the causes, diagnosis, and actionable solutions in a structured way, ensuring you not only fix the issue but also prevent it from happening again.
1. Understanding Server Load and CPU Usage
Although often mentioned together, server load and CPU usage are not identical:
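- Load Average: the average number of processes that are running or waiting to run, reported over the last 1, 5, and 15 minutes. On Linux it also counts processes in uninterruptible sleep, which are typically waiting on disk I/O. If the load consistently exceeds the number of CPU cores, the system is overburdened.
- CPU Usage: the breakdown of how CPU time is being consumed across user space, system space, and I/O wait. When it reaches 100%, the processor has no idle time left.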
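To make the comparison between load and core count concrete, here is a minimal sketch that divides the 1-minute load average by the number of cores. The helper name load_per_core is made up for this example, and the live reading assumes a Linux host with /proc/loadavg and nproc available.

```shell
#!/usr/bin/env bash
# Compare the 1-minute load average with the number of CPU cores.

# load_per_core LOAD CORES -> prints LOAD divided by CORES with two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Read live values only when the Linux-specific sources are available.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _ < /proc/loadavg    # first field is the 1-minute average
    cores=$(nproc)
    echo "load1=${load1} cores=${cores} per_core=$(load_per_core "$load1" "$cores")"
fi
```

A per-core value that stays above 1.00 suggests there is more runnable work than the cores can absorb.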
Common causes include:
- Slow or complex database queries overloading MySQL.
- PHP-FPM worker processes maxed out due to heavy request handling.
- Traffic spikes or malicious requests draining resources.
- Disk I/O bottlenecks making the CPU wait on storage operations.
- Background tasks or malicious processes consuming resources.
2. Emergency Measures (First 1–5 Minutes)
When the system is already maxed out, the first priority is to stabilize the situation quickly.
2.1 Check Real-Time Status
uptime
top -b -n1 | head -n20
- uptime shows the current load average.
- top highlights which processes consume the most CPU.
2.2 Identify Resource-Heavy Processes
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n20
2.3 Restart Services Under Stress
If MySQL or PHP is the culprit:
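systemctl restart php-fpm   # unit may be versioned on Debian/Ubuntu, e.g. php8.2-fpm
systemctl restart mysql     # unit may be mysqld or mariadb, depending on the distribution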
2.4 Kill Problematic Processes
kill -15 <PID> # Graceful stop
kill -9 <PID> # Force stop
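The two signals are best used in sequence: ask the process to stop, give it a moment, and force it only if it is still alive. The sketch below shows one way to do that. The function name stop_gracefully and the 10-second default timeout are assumptions for this example, not part of any standard tool.

```shell
# stop_gracefully PID [TIMEOUT_SECONDS]
# Sends SIGTERM, waits up to TIMEOUT_SECONDS for the process to exit,
# and only then falls back to SIGKILL.
stop_gracefully() {
    local pid="$1" timeout="${2:-10}" waited=0
    # Fails if the process does not exist or we may not signal it.
    kill -TERM "$pid" 2>/dev/null || return 1
    while kill -0 "$pid" 2>/dev/null; do    # kill -0 only tests for existence
        if [ "$waited" -ge "$timeout" ]; then
            kill -KILL "$pid" 2>/dev/null || true
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

Calling stop_gracefully <PID> 5 would allow five seconds for a clean shutdown before forcing the stop.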
2.5 Block Malicious IPs
iptables -I INPUT -s 1.2.3.4 -j DROP # -I inserts at the top, so an earlier ACCEPT rule cannot match first
3. Quick Diagnosis (5–30 Minutes)
Once the bleeding is stopped, the next step is finding the root cause.
3.1 CPU and Process Analysis
top # Sort by CPU with P
htop # If installed, more interactive
3.2 MySQL Process Check
mysql -u root -p -e "SHOW FULL PROCESSLIST\G"
- Look for long-running queries.
- Pay attention to State values like Sending data or Copying to tmp table.
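On a busy server the process list can run to hundreds of rows, so it can help to filter it. The following sketch keeps only statements that have been running longer than a threshold. The function name long_queries is hypothetical, and the column order (Id, User, Host, db, Command, Time, State, Info) is assumed to match the standard SHOW FULL PROCESSLIST output.

```shell
# long_queries THRESHOLD_SECONDS
# Reads tab-separated SHOW FULL PROCESSLIST rows on stdin
# (Id, User, Host, db, Command, Time, State, Info) and prints
# "Id<TAB>Time<TAB>State" for statements running longer than the threshold.
long_queries() {
    awk -F'\t' -v limit="$1" \
        '$5 == "Query" && $6 + 0 > limit + 0 { print $1 "\t" $6 "\t" $7 }'
}

# Assumed invocation; -B gives tab-separated batch output, -N drops the header:
#   mysql -u root -p -B -N -e "SHOW FULL PROCESSLIST" | long_queries 10
```

Idle connections (Command = Sleep) are ignored, so the output lists only work that is actually executing.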
3.3 Enable Slow Query Logging
mysql -u root -p -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 1;"
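The log records every statement that runs longer than long_query_time (1 second here), which exposes the queries most likely to be driving CPU usage. The new threshold applies only to connections opened after the change.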
3.4 PHP-FPM Process Status
ps -eo pid,cmd,%cpu,%mem --sort=-%cpu | grep php-fpm | head
- Too many child processes indicate backlog.
- Enable the PHP-FPM slowlog to capture problematic scripts.
3.5 Disk and I/O Monitoring
iostat -x 1 3
vmstat 1 5
High %iowait signals storage bottlenecks.
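Reading the wait column by eye across several samples is error-prone. As a rough sketch, the helper below averages the wa column of vmstat output. The name avg_iowait is made up for this example, and the column is located by its header label rather than by a fixed position, since vmstat layouts differ between versions.

```shell
# avg_iowait
# Reads `vmstat` output on stdin, locates the "wa" column from the header row,
# and prints the mean iowait percentage. The first sample is skipped because
# vmstat reports averages since boot in its first row.
avg_iowait() {
    awk '
        col == 0 {                          # still looking for the header row
            for (i = 1; i <= NF; i++) if ($i == "wa") col = i
            next
        }
        ++seen > 1 { sum += $col; n++ }     # ignore the first data row
        END { if (n) printf "%.1f\n", sum / n; else print "n/a" }
    '
}

# Usage: vmstat 1 5 | avg_iowait
```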
3.6 Network Connections
ss -tunp | head
netstat -anp | grep ESTABLISHED | wc -l
Check for floods of suspicious connections.
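A total connection count does not show who is connecting. One possible way to spot a flood is to group established connections by remote address, as sketched below. The function name conns_per_peer is hypothetical, and the parsing assumes ss is run without -p, so that the peer address is the last column.

```shell
# conns_per_peer [N]
# Reads `ss -tn state established` output on stdin and prints the N peers
# (default 10) with the most connections as "COUNT ADDRESS".
conns_per_peer() {
    awk '$1 ~ /^[0-9]+$/ {               # skip the header row
             peer = $NF                  # last column is Peer Address:Port
             sub(/:[0-9]+$/, "", peer)   # drop the port, keep the address
             print peer
         }' |
        sort | uniq -c | sort -k1,1nr -k2,2 | head -n "${1:-10}" |
        awk '{ print $1, $2 }'
}

# Usage: ss -tn state established | conns_per_peer 5
```

A single address holding a large share of the connections is a candidate for closer inspection before any blocking decision.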
4. Targeted Solutions
Case A: MySQL Using Excessive CPU
- Identify long queries and terminate them:
KILL QUERY <id>;
- Analyze with EXPLAIN and adjust indexes.
- Watch for JOIN or ORDER BY without indexes.
- Increase innodb_buffer_pool_size if memory allows.
Case B: PHP-FPM Worker Overload
- Adjust worker limits in www.conf:
pm.max_children = 30
- Enable slowlog for diagnosis:
request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm/www-slow.log
- Gracefully end runaway workers:
kill -15 <PID>
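The value 30 above is only an example. A common rule of thumb is to divide the memory you are willing to give PHP by the average memory of one worker. The sketch below illustrates that arithmetic. Both function names and the 2048 MB budget are assumptions for illustration, and the result is a starting point to validate under real traffic, not a guaranteed safe value.

```shell
# avg_rss_mb: reads RSS values in KiB (one per line) and prints the mean in MiB.
avg_rss_mb() {
    awk '{ sum += $1 } END { if (NR) printf "%d\n", sum / NR / 1024; else print 0 }'
}

# suggest_max_children BUDGET_MB WORKER_MB
# Integer division: how many workers of WORKER_MB fit into BUDGET_MB.
suggest_max_children() {
    [ "$2" -gt 0 ] || return 1    # no workers measured, nothing to suggest
    echo $(( $1 / $2 ))
}

# Assumed usage (the process name may be php-fpm8.2 or similar on some distros):
#   worker_mb=$(ps -C php-fpm -o rss= | avg_rss_mb)
#   suggest_max_children 2048 "$worker_mb"
```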
Case C: Traffic Spikes or Crawlers
- Limit request rates in Nginx (limit_req_zone belongs in the http block, limit_req in a server or location block):
limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
limit_req zone=one burst=20 nodelay;
- Add a CDN or WAF for protection.
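Before choosing a rate limit, it helps to know which clients generate the most requests. As a rough sketch, the helper below ranks client addresses in an access log. The function name top_clients and the log path in the usage line are assumptions, and the parsing relies on the client address being the first field of each line.

```shell
# top_clients LOGFILE [N]
# Prints the N most frequent client addresses (default 10) as "COUNT ADDRESS",
# assuming the address is the first whitespace-separated field of each line,
# as in the Nginx "combined" log format.
top_clients() {
    awk '{ print $1 }' "$1" | sort | uniq -c | sort -k1,1nr -k2,2 |
        head -n "${2:-10}" | awk '{ print $1, $2 }'
}

# Usage (path is an assumption): top_clients /var/log/nginx/access.log 20
```

If the server sits behind a proxy or CDN, the first field may be the proxy address rather than the real client, so the ranking should be read with that in mind.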
Case D: Disk I/O Overload
- Check for large backups or log writes.
- Move scheduled jobs to off-peak hours.
- Upgrade to SSDs or higher IOPS storage.
Case E: Malicious or Rogue Processes
- Use ps aux --sort=-%cpu to spot unknown processes.
- If cryptominers or backdoors are detected, isolate the server, back up data, and redeploy securely.
5. Limiting Resource Usage Per Process
When optimization is not immediately possible, limit resource consumption to keep the system alive.
5.1 Adjust Process Priority
renice +10 -p <PID>
5.2 Limit CPU Usage
cpulimit -p <PID> -l 40
5.3 Apply CPU Quota via systemd
systemctl set-property --runtime php-fpm CPUQuota=70%
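Note that systemd expresses CPUQuota relative to a single CPU, so 70% means 70% of one core, and values above 100% span multiple cores. The sketch below converts "a share of the whole machine" into that unit. The function name cpu_quota_for is made up for this example.

```shell
# cpu_quota_for CORES SHARE_PERCENT
# Converts "SHARE_PERCENT of the whole machine" into a CPUQuota value,
# which systemd expresses relative to a single CPU.
cpu_quota_for() {
    echo "$(( $1 * $2 ))%"
}

# Assumed usage: cap php-fpm at 70% of all cores on this host.
#   systemctl set-property --runtime php-fpm CPUQuota="$(cpu_quota_for "$(nproc)" 70)"
```

On a 4-core host this yields 280%, which leaves roughly one core and a bit for everything else.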
6. Long-Term Optimization
6.1 Database Improvements
- Analyze slow query logs regularly.
- Use tools like pt-query-digest.
- Introduce caching layers (e.g., Redis).
- Consider read/write splitting.
6.2 Application Layer Enhancements
- Implement Nginx FastCGI caching.
- Reduce unnecessary dynamic rendering.
- Avoid blocking API calls.
6.3 Infrastructure Scaling
- Add load balancers.
- Deploy multiple web servers.
- Use replication or clustering for databases.
6.4 Monitoring and Alerts
- Set up Prometheus, Grafana, or Zabbix.
- Configure alerts for CPU, memory, I/O, and MySQL connections.
- Detect anomalies early.
6.5 Security Measures
- Regularly scan for malicious cron jobs or rootkits.
- Deploy fail2ban for brute-force prevention.
- Enable cloud provider DDoS protection.
7. Actionable Checklist
- Confirm usage: Use top/ps to find heavy processes.
- Inspect MySQL: Run SHOW FULL PROCESSLIST for slow queries.
- Check PHP-FPM: Reload service and monitor slowlog.
- Review disk and network: Use iostat and ss tools.
- Emergency actions: Enable maintenance mode, firewall rules, or kill rogue processes.
- Optimize long term: Index tuning, caching, scaling, monitoring.
8. Conclusion
When servers hit 100% load and CPU usage, the solution is not a blind reboot. The proper workflow is:
- Stop the bleeding: restart or limit critical services.
- Diagnose the root cause: through process lists, slow queries, and logs.
- Apply targeted fixes: optimize SQL, tune PHP-FPM, throttle traffic.
- Plan for the long term: caching, scaling, monitoring, and security.
Server stability comes from combining immediate actions, careful analysis, and proactive optimization.