What To Do On Linux

#Linux#Monitoring#SysAdmin#htop#iostat#vmstat

Mastering Native Linux Monitoring: No Add-ons, No Bloat

Performance troubleshooting always starts local. Before adding Prometheus exporters or configuring remote Grafana dashboards, diagnose baseline system health using the native tools nearly every Linux environment already provides. These commands are resource-light, available out of the box, and usable even on a degraded system: exactly what's needed during critical incidents.


CPU and Process Inspection: top and htop

Scenario: Application latency spikes. Where’s the contention?
Start with top. Versions ≥3.3.12 provide improved batch mode and better metrics sorting. Launch:

top

Watch for:

  • %us/%sy (user/system CPU), %wa (I/O wait)
  • Load averages—especially if 15min load creeps above CPU core count
  • Processes sorted by CPU (P) or memory (M)

Killing a runaway process fast:
Press k, enter the PID, then choose a signal (top defaults to SIGTERM/15).
Note: Double-check the PID before confirming; a signal sent to the wrong process (init, a critical daemon, your own shell) can take out far more than the offending task.
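
For incident notes, the same view can be captured non-interactively with top's batch mode. A minimal sketch, assuming a procps-ng top new enough to support -o field sorting (the PID below is a placeholder):

top -b -n 1 -o %CPU | head -n 20    # one-shot snapshot of the heaviest CPU consumers
kill -TERM 12345                    # graceful termination of the identified offender
kill -KILL 12345                    # escalate only if SIGTERM is ignored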

htop (needs install, e.g. apt-get install htop on Ubuntu 20.04+) adds:

  • Colorized CPU/RAM bars
  • Tree view (F5), highlighting parent/child process groups
  • Interactive filtering, mouse controls

Trade-off: htop enumerates every process and thread at startup; on heavily loaded boxes (>2000 threads), the lag is noticeable.
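
A quick launch sketch for Debian-family systems; swap the package manager for your distribution:

sudo apt-get install -y htop    # dnf/yum/zypper on other distros
htop --tree                     # start directly in tree view, same as pressing F5 inside the UI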


Disk I/O Profiling: iostat

iostat (provided by sysstat ≥12.0.3) often pinpoints the root cause when /var/log fills up or database response times deteriorate.

Quick look at device utilization and latency:

iostat -xz 2 5
  • -x: extended metrics; -z: omit devices with no activity; sample every 2s, stop after 5 reports
  • Key fields:
    • r/s, w/s: read/write IOPS
    • await, svctm: average wait/service time in ms (svctm is deprecated in newer sysstat releases)
    • %util: saturation (should rarely hit 100%)
  • Elevated await at low r/s often points to trouble in the underlying hardware or storage path

Practical example:

Device:         r/s     w/s   await  svctm  %util
nvme0n1       13.0     8.2   43.21   2.00   79.3

Above: a period of heavy journaling is causing significant write latency.

Side note: SSD degradation or misconfigured virtual block layers (e.g. mdadm, LVM) manifest here first—don't trust diagnostics from the storage array exclusively.
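
Before blaming a disk, it helps to map logical volumes back to their physical devices so the iostat numbers are attributed correctly. A brief sketch; nvme0n1 is an example device name:

lsblk -o NAME,TYPE,SIZE,MOUNTPOINT    # full block-device topology, including LVM/mdadm layers
iostat -xz nvme0n1 2 5                # restrict reporting to a single device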


Memory and Paging: vmstat

Performance is sporadic, yet CPU and disk look clean? vmstat exposes memory pressure invisible to free or top.

Sample with two-second intervals:

vmstat 2 10

Pay attention to:

  • si/so: swap-in/swap-out (KB/s). Anything persistent here means swap thrashing; performance tanks as a result.
  • r: runnable queue. If consistently exceeding CPU thread count, system is overcommitted.
  • bi/bo: blocks in/out, reflecting disk I/O via the VM subsystem.

Example:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 105432  13520 984232    0    0     1     5  247  423 12  3 80  5  0

Here, ~5% I/O wait and r rarely above 3 suggest a healthy load.

Known issue: For services constrained by cgroup v2 (e.g. containers), vmstat reports host-wide figures and can mask per-cgroup memory pressure. Always cross-check with container stats (docker stats, kubectl top pod, etc.).
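
One way to cross-check, sketched below under the assumption of a PSI-enabled kernel (≥4.20) and an illustrative cgroup path; adjust the slice name to match the service in question:

cat /proc/pressure/memory                          # host-wide memory pressure (PSI)
cat /sys/fs/cgroup/system.slice/memory.pressure    # per-cgroup pressure, example path
docker stats --no-stream                           # container-level counters, if Docker is in use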


Network Status: /proc/net/dev, ss, and netstat

Investigating packet drops or network stalls? Start simple:

cat /proc/net/dev

This surfaces per-interface counters: bytes, packets, errors, drops.
Spike in RX errors? Check cabling, duplex settings (ethtool).
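
The same counters read more comfortably through iproute2, and ethtool confirms link settings; eth0 below is an example interface name:

ip -s link show eth0    # per-interface RX/TX bytes, errors, drops
ethtool eth0            # negotiated speed, duplex, link state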

For live connection/state analysis:

ss -tupan
  • -t (TCP), -u (UDP), -p (process), -a (all), -n (numeric)

netstat (from the long-deprecated net-tools package) has been superseded by ss, but remains familiar in legacy workflows.

Typical diagnostic output:

tcp  LISTEN  0 128 0.0.0.0:22  0.0.0.0:*  users:(("sshd",pid=631,fd=3))

Looking for ephemeral connection spikes?
ss -s provides quick summary stats; invaluable during SYN flood incidents.

Non-obvious tip: Under heavy load, ss output can appear incomplete. For a fuller dump, add -eipon (-e extended info, -i internal TCP state, -p owning process, -o timer details, -n numeric addresses).
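
ss also accepts TCP state filters, which makes it easy to quantify half-open connections during a suspected SYN flood. A brief sketch:

ss -tan state syn-recv | wc -l    # count half-open TCP connections (output includes one header line)
ss -s                             # overall socket summary for comparison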


Workflow Enhancers

Automated periodic sampling:

watch -n 3 'iostat -xz'

Tip: Default terminal widths may truncate output on small screens. Resize accordingly for clarity.

Long-term logging for event correlation:

vmstat 10 >> /var/log/vmstat-$(date +%F).log &

Then tail reactively:

tail -F /var/log/vmstat-2024-06-06.log

Integrate these logs with logrotate as needed—unchecked, they grow quickly.
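
A minimal logrotate sketch for these sampling logs; the file name and retention policy are illustrative only:

# /etc/logrotate.d/vmstat-sampling (example path)
/var/log/vmstat-*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}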


Summary

No monitoring stack replaces foundational command-line proficiency.
With top/htop for live CPU/process diagnostics, iostat for I/O bottlenecks, vmstat for memory trends, and /proc/net/dev plus ss for networking, most production incidents can be traced without external tooling—critical when drift or dependency issues arise.

Almost every operational team has opinions on which tool to trust; sometimes, even a quick look at /proc/meminfo reveals what others miss. As ever, combine outputs, think in context, and tune periodic sampling for your environment. Data without interpretation is just noise.


Want automated anomaly detection using Bash scripting and these primitives? Some approaches exist, but never trust alerts fed from a single metric. Multi-signal correlation remains key.