Mastering Native Linux Monitoring: No Add-ons, No Bloat
Performance troubleshooting always starts local. Before adding Prometheus exporters or configuring remote Grafana dashboards, diagnose baseline system health with the native tools nearly every Linux environment already provides. These commands are resource-light, available out of the box, and still usable on a degraded system: exactly what's needed during critical incidents.
CPU and Process Inspection: top and htop
Scenario: Application latency spikes. Where’s the contention?
Start with top. Versions ≥3.3.12 provide improved batch mode and better metrics sorting. Launch:
top
Watch for:
- %us/%sy (user/system CPU), %wa (I/O wait)
- Load averages, especially if the 15-minute load creeps above the CPU core count
- Processes sorted by CPU (P) or memory (M)
Killing a runaway process fast: press k, enter the PID, and confirm the signal.
Note: an accidental kill aimed at the wrong PID (not least init, PID 1) does far more damage than the runaway process; double-check before confirming.
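For scripted sampling or pasting into an incident ticket, the batch mode mentioned above can capture a one-shot, CPU-sorted snapshot. A minimal sketch, assuming a procps-ng top recent enough to support -o:
top -b -n 1 -o %CPU | head -n 20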
htop (needs installing, e.g. apt-get install htop on Ubuntu 20.04+) adds:
- Colorized CPU/RAM bars
- Tree view (F5), highlighting parent/child process groups
- Interactive filtering and mouse controls
Trade-off: htop enumerates every process's threads; on heavily loaded boxes (>2000 threads), startup lag is noticeable.
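One way to blunt that startup cost is to narrow the view up front. A hedged example, where appuser is a placeholder and -d is given in tenths of a second:
htop -d 20 -u appuser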
Disk I/O Profiling: iostat
iostat (provided by sysstat ≥12.0.3) often finds the root cause when /var/log fills up or database response times deteriorate.
Quick look at device utilization and latency:
iostat -xz 2 5
- -x: extended metrics; -z: suppress devices with no activity; repeat every 2 s, stop after 5 reports (a variant that skips the since-boot report follows this list)
- Key fields: r/s, w/s (read/write IOPS); await, svctm (average wait/service time in ms); %util (saturation, which should rarely hit 100%)
- Elevated await with low r/s signals underlying hardware trouble
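As noted above, the first report iostat prints covers averages since boot, which can mask the current incident. A hedged variant skips it, assuming a sysstat recent enough to support -y:
iostat -xzy 2 5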
Practical example (abridged extended output):
Device: r/s w/s await svctm %util
nvme0n1 13.0 8.2 43.21 2.00 79.3
Above, a period of heavy journaling is causing significant write latency.
Side note: SSD degradation or misconfigured virtual block layers (e.g. mdadm, LVM) manifest here first; don't trust diagnostics from the storage array exclusively.
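Before blaming the drive itself, a quick sanity check of the software RAID layer mentioned above; the file exists only on hosts with md arrays configured:
cat /proc/mdstat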
Memory and Paging: vmstat
Sporadic performance, yet CPU and disk seem clear? vmstat exposes memory pressure that is invisible to free or top.
Sample with two-second intervals:
vmstat 2 10
Pay attention to:
- si/so: swap-in/swap-out (KB/s). Anything persistent here means swap thrashing; performance tanks as a result.
- r: runnable queue. If it consistently exceeds the CPU thread count, the system is overcommitted.
- bi/bo: blocks in/out, reflecting disk I/O via the VM subsystem.
Example:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 0 105432 13520 984232 0 0 1 5 247 423 12 3 80 5 0
Here, ~5% I/O wait and an r column rarely above 3 suggest a healthy load.
Known issue: on cgroup v2 constrained services (e.g. containers), system-level vmstat may underreport per-cgroup memory pressure. Always cross-check with container stats (docker stats, kubectl top pod, etc.).
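When a service runs under cgroup v2 on a kernel with PSI enabled, its own memory pressure can also be read straight from the unified hierarchy. A sketch, with myapp.service as a placeholder path:
cat /sys/fs/cgroup/system.slice/myapp.service/memory.pressure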
Network Status: /proc/net/dev, ss, and netstat
Investigating packet drops or network stalls? Start simple:
cat /proc/net/dev
This surfaces per-interface counters: bytes, packets, errors, drops.
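To pull just the error and drop counters out of those columns, a small sketch; field positions assume the standard two header lines of /proc/net/dev:
awk 'NR > 2 {print $1, "rx_errs=" $4, "rx_drop=" $5, "tx_errs=" $12, "tx_drop=" $13}' /proc/net/dev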
Spike in RX errors? Check cabling and duplex settings (ethtool).
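For the duplex check itself, a hedged one-liner; eth0 is a placeholder, and ethtool may need installing on minimal images:
ethtool eth0 | grep -Ei 'speed|duplex|link detected'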
For live connection/state analysis:
ss -tupan
-t (TCP), -u (UDP), -p (process), -a (all sockets), -n (numeric)
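The same flags combine with ss's built-in filter language to narrow the view. A hedged example isolating established sessions on an assumed port 443:
ss -tn state established '( sport = :443 )'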
netstat (from the deprecated net-tools package) has largely been superseded by ss, but remains familiar in legacy workflows.
Typical diagnostic output:
tcp LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=631,fd=3))
Looking for ephemeral connection spikes? ss -s provides quick summary statistics, invaluable during SYN flood incidents.
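A rough one-liner for that situation tallies connections per TCP state; the column layout assumes the default ss header:
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn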
Non-obvious tip: under mass load, ss sometimes omits socket state details. For a fuller dump, add -eipon.
Workflow Enhancers
Automated periodic sampling:
watch -n 3 'iostat -xz'
Tip: Default terminal widths may truncate output on small screens. Resize accordingly for clarity.
Long-term logging for event correlation:
vmstat 10 >> /var/log/vmstat-$(date +%F).log &
Then tail reactively:
tail -F /var/log/vmstat-2024-06-06.log
Integrate these logs with logrotate as needed; left unchecked, they grow quickly.
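A minimal logrotate stanza for those files, assuming the date-stamped naming used above and placed under /etc/logrotate.d/:
/var/log/vmstat-*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}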
Summary
No monitoring stack replaces foundational command-line proficiency.
With top/htop for live CPU/process diagnostics, iostat for I/O bottlenecks, vmstat for memory trends, and /proc/net/dev plus ss for networking, most production incidents can be traced without external tooling, which is critical when drift or dependency issues arise.
Almost every operational team has opinions on which tool to trust; sometimes even a quick look at /proc/meminfo reveals what others miss. As ever, combine outputs, think in context, and tune periodic sampling for your environment. Data without interpretation is just noise.
Want automated anomaly detection using Bash scripting and these primitives? Some approaches exist, but never trust alerts fed from a single metric. Multi-signal correlation remains key.
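As a rough illustration of that multi-signal idea, a hypothetical Bash sketch that alerts only when swap-out activity and load agree; the thresholds and the 8-core assumption are placeholders, not recommendations:
#!/usr/bin/env bash
# Hypothetical sketch, not a production monitor: require two signals to agree.
LOAD_MAX=8        # 1-minute load threshold; assumes an 8-core box
SWAPOUT_MAX=100   # sustained swap-out threshold, KB/s

# Use the second vmstat sample; the first line reports since-boot averages.
so=$(vmstat 1 2 | tail -n 1 | awk '{print $8}')   # "so" column: swap-out KB/s
load=$(awk '{print int($1)}' /proc/loadavg)       # 1-minute load, truncated

if [ "$load" -gt "$LOAD_MAX" ] && [ "$so" -gt "$SWAPOUT_MAX" ]; then
  echo "$(date -Is) ALERT: load=${load}, swap-out=${so} KB/s (both signals elevated)"
fi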