In-Depth Analysis of Linux System Performance Tuning
Optimizing system performance on Linux requires a deep understanding of kernel internals and systemd’s resource management. This article explores the most important tuning vectors that allow for precise adjustments tailored to specific workloads – particularly in demanding scenarios such as virtualization, high-performance databases, or HPC environments.
1. Virtual Memory (VM) Tuning with sysctl
The file /etc/sysctl.conf – or preferably drop-in files under /etc/sysctl.d/*.conf – serves as the main interface for modifying kernel runtime parameters.
Swapping & Memory Pressure
vm.swappiness (0–100)
Controls the kernel’s tendency to swap out memory pages.
- Low (1–10): Ideal for databases and in-memory caches.
- 0 does not completely disable swapping but prevents it until memory pressure is extreme.
- Combination for minimal swapping:
vm.swappiness = 0
vm.min_free_kbytes = 524288   # 512 MiB reserved (~5% on a 10 GB system)
This prevents unnecessary swapping while still keeping emergency memory reserves for the kernel before invoking the OOM killer.
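These values can also be applied at runtime for testing before persisting them (a minimal sketch mirroring the snippet above):
sudo sysctl -w vm.swappiness=0
sudo sysctl -w vm.min_free_kbytes=524288
sysctl vm.swappiness vm.min_free_kbytes   # verify the active settings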
Dirty Pages & Write I/O
- vm.dirty_ratio (default: 20): Max percentage of memory allowed as dirty pages.
- vm.dirty_background_ratio (default: 10): When exceeded, the kernel starts background flushing.
- vm.dirty_expire_centisecs (default: 3000 = 30 s): How long data may stay dirty before forced writeback.
Example for write-heavy workloads (e.g., logging):
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 6000
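To watch these thresholds at work, the volume of dirty memory can be observed while the write-heavy job runs (a simple sketch using standard procfs fields):
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'   # dirty and in-flight writeback pages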
Additional Memory Parameters
vm.overcommit_memory
- 0: Default heuristic
- 1: Aggressive overcommit (used in HPC/scientific computing)
- 2: Strict mode (commit limit enforced)
vm.overcommit_ratio: Percentage of physical RAM that, added to swap, forms the commit limit when vm.overcommit_memory = 2.
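All of the VM parameters above can be persisted in a drop-in file and loaded without a reboot (the file name is an illustrative choice):
# /etc/sysctl.d/90-vm-tuning.conf
vm.swappiness = 10
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 6000
Then reload all fragments:
sudo sysctl --system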
2. Block I/O and Filesystem Tuning
I/O Scheduler Selection
- none / noop – best for VMs and SSD/NVMe (host or controller handles scheduling).
- mq-deadline – good default for SATA/SAS disks, balanced latency vs throughput.
- bfq – fairness-oriented, suited for desktops.
- kyber – low-latency scheduler designed for fast multi-queue devices (SSD/NVMe).
Check scheduler:
cat /sys/block/sda/queue/scheduler
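Switching the scheduler at runtime, and persisting the choice via udev, can look like this (sda, mq-deadline, and the rule file name are placeholders):
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
# /etc/udev/rules.d/60-iosched.rules – reapplied on every boot
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"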
Filesystem Mount Options
- noatime – never update access times (saves I/O).
- relatime (default) – only update atime when older than mtime.
- discard – inline TRIM for SSDs (better via fstrim.timer).
- ext4: data=writeback → reduces journaling overhead (metadata-only), faster but riskier.
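In /etc/fstab these options are appended to the mount entry; the UUID and mount point below are placeholders:
UUID=xxxx-xxxx  /var/lib/mysql  ext4  noatime,data=writeback  0  2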
3. Resource Management with systemd & cgroups
Modern systemd releases rely on the unified cgroups v2 hierarchy for fine-grained process resource control.
Example: mysql.service unit:
[Service]
CPUAffinity=0 1
MemoryMax=8G
MemoryHigh=6G
# cgroups v2 directives (CPUShares and BlockIO* are deprecated v1 names)
CPUWeight=100
IOWeight=100
IOReadBandwidthMax=/dev/sda 10M
IOWriteBandwidthMax=/dev/sdb 50M
TasksMax=10000
Apply changes:
sudo systemctl daemon-reload
sudo systemctl restart mysql
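The same limits can also be applied as a drop-in override, which survives package updates (a minimal sketch):
sudo systemctl edit mysql   # creates /etc/systemd/system/mysql.service.d/override.conf
systemctl show mysql -p MemoryMax -p CPUWeight   # inspect the effective values
For a live per-cgroup view of CPU, memory, and I/O consumption, systemd-cgtop is the tool of choice.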
4. CPU and NUMA Optimization
CPU Frequency Scaling
sudo cpupower frequency-set -g performance
- performance → fixed max frequency (recommended for servers).
- schedutil → dynamic scaling based on scheduler load info (good for modern CPUs).
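The active governor and the hardware frequency limits can be inspected per CPU via standard sysfs paths:
cpupower frequency-info
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # single core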
Process Affinity
- CPU pinning:
taskset -cp 0,2,4-6 <pid>
- NUMA binding:
numactl --cpunodebind=0 --membind=0 /usr/bin/app
Helpful tools: numastat, hwloc.
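Before binding, it is worth checking the node layout and per-node memory usage:
numactl --hardware   # nodes, their CPUs, and free memory per node
numastat -p <pid>    # per-node allocation statistics of a running process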
5. Additional Optimization Techniques
Transparent Huge Pages (THP)
Often harmful for databases → disable:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
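This sysfs write is lost on reboot. A common way to persist it is the kernel command line (GRUB paths vary by distribution; the "..." stands for the existing parameters):
# /etc/default/grub
GRUB_CMDLINE_LINUX="... transparent_hugepage=never"
sudo grub-mkconfig -o /boot/grub/grub.cfg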
Classical HugePages
Beneficial for DBs (Postgres, Oracle, MySQL) → reduces TLB misses:
vm.nr_hugepages = 2048
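Whether the reservation succeeded (contiguous memory can be scarce on long-running systems) is visible in /proc/meminfo; with the default 2 MiB page size, 2048 pages equal 4 GiB:
grep -i huge /proc/meminfo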
IRQ Affinity
Distribute interrupts across CPUs:
cat /proc/interrupts
echo 2 > /proc/irq/32/smp_affinity   # hexadecimal CPU bitmask: 2 = CPU 1
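The list interface is often more readable than the hex bitmask (IRQ numbers here are illustrative):
echo 1 > /proc/irq/32/smp_affinity_list    # same effect: pin IRQ 32 to CPU 1
echo 2-3 > /proc/irq/45/smp_affinity_list  # pin a NIC queue IRQ to CPUs 2-3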
Tools: irqbalance, tuned-adm.
6. Monitoring & Analysis
Optimization without measurement is pointless. Key tools:
- perf – CPU profiler (cache misses, cycles).
- ftrace / trace-cmd – kernel tracing.
- iostat -xzm 1 – per-device I/O utilization & latency.
- pidstat -d – per-process I/O stats.
- vmstat 1 – virtual memory activity.
- bcc/eBPF (biolatency, execsnoop, tcpconnect) – modern deep-dive analysis.
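A typical deep-dive session combines several of these tools (on some distributions the bcc tools carry a -bpfcc suffix):
sudo biolatency 10 1   # one 10-second block-I/O latency histogram
sudo execsnoop         # trace every new process execution system-wide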
7. Best-Practice Tables
Recommended Kernel Settings by Workload
| Workload | Swappiness | I/O Scheduler | CPU Governor | atime | Extra Notes |
|---|---|---|---|---|---|
| Database | 1–10 | noop/none (SSD), mq-deadline (HDD) | performance | noatime | THP off, HugePages on |
| Webserver | 10–30 | none (SSD) | performance/schedutil | relatime | tcp_tw_reuse=1 |
| Virtualization | 10–20 | none/noop | performance | relatime | IRQ balancing |
| HPC/Scientific | 0–5 | noop | performance | noatime | overcommit_memory=1 |
| Desktop | 30–60 | bfq | schedutil | relatime | fairness over raw perf |
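The tcp_tw_reuse setting referenced for webservers is likewise a sysctl:
sudo sysctl -w net.ipv4.tcp_tw_reuse=1   # reuse TIME_WAIT sockets for outgoing connections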
8. The TOTE Model: Systematic Optimization
Tuning should follow a feedback loop based on the TOTE principle (Test–Operate–Test–Exit):
- Test: Measure baseline (perf, iostat, vmstat).
- Operate: Apply one targeted change (e.g., swappiness=10).
- Test: Measure again, compare with baseline.
- Exit: Keep changes only if results are positive.
If results fall short, refine the hypothesis and repeat – avoid blind trial-and-error.
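A minimal before/after harness for one loop iteration might look like this (file names are illustrative; the same workload must run during both measurements):
iostat -xzm 1 30 > baseline.txt   # Test: 30 one-second samples under load
sudo sysctl -w vm.swappiness=10   # Operate: one targeted change
iostat -xzm 1 30 > after.txt      # Test: re-measure under the same load
diff baseline.txt after.txt       # Exit or iterate based on the comparison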
9. Conclusion
Linux provides an extensive toolkit for performance tuning – from VM parameters, I/O schedulers, and cgroups to NUMA binding and IRQ affinity. The key is not random parameter tweaking but a measurement-driven approach. By applying the TOTE model systematically, administrators can achieve sustainable and verifiable performance gains across diverse workloads.