Module 9: Performance Monitoring and Optimization in Linux
Learning Objectives
After completing this module, you will be able to:
- Use system monitoring tools to observe and analyze system performance
- Understand and manage system logs effectively
- Implement performance tuning techniques for CPU, memory, and disk I/O
- Configure and manage cron jobs for scheduled tasks
- Analyze resource utilization to identify system bottlenecks
- Apply practical troubleshooting methods to resolve common performance issues
1. System Monitoring Tools
Understanding System Monitoring Fundamentals
Linux provides a rich set of tools for monitoring system performance. These tools access information from the /proc filesystem, a virtual filesystem that provides an interface to kernel data structures. When you use a monitoring tool, it is reading files in this filesystem that the kernel constantly updates with current system information.
The /proc Filesystem Under the Hood
The /proc filesystem doesn't exist on physical storage. It is created in memory by the kernel when the system boots. Key files that monitoring tools typically read include:
- /proc/cpuinfo: Information about the CPU
- /proc/meminfo: Memory usage statistics
- /proc/loadavg: System load averages
- /proc/[pid]/stat: Process-specific information
Let's explore some essential monitoring tools:
top: The Classic Performance Monitor
top provides a dynamic, real-time view of running processes. It displays a summary of system information and a list of processes or threads currently managed by the Linux kernel.
$ top
When you run top, it reads from multiple /proc files and presents the data in an organized format. Under the hood, top periodically reads files such as /proc/stat, /proc/meminfo, and /proc/[pid]/* to gather its information.
Key information in the top display:
- Uptime and load averages (1, 5, and 15-minute averages)
- Tasks statistics (total, running, sleeping, stopped, zombie)
- CPU usage breakdown (user, system, nice, idle, wait, hardware interrupts, software interrupts)
- Memory usage (total, free, used, buffers/cache)
- Swap usage (total, used, free)
- Process list sorted by configurable criteria
htop: An Enhanced Alternative
htop is an interactive process viewer that improves upon top with color-coding, visual meters for CPU, memory, and swap usage, and a more user-friendly interface.
$ htop
Unlike top, htop presents information in a more intuitive way, with horizontal bars indicating resource usage and color-coding for different types of processes. It is built on the same fundamental data sources as top but processes and displays that information differently.
glances: Comprehensive Monitoring
glances provides an even more comprehensive overview of system resources. It can monitor not just CPU and memory but also network interfaces, disk I/O, sensors, and more.
$ glances
What makes glances special is its ability to adapt to the terminal size and display as much information as possible. It also includes alerting features that change the color of metrics when they exceed defined thresholds.
sar: System Activity Reporter for Historical Data
While the previous tools show the current system state, sar (System Activity Reporter) can display historical performance data and collect data for later analysis.
# Display CPU utilization for the current day
$ sar
# Display memory utilization
$ sar -r
# Display disk I/O statistics
$ sar -b
Under the hood, sar relies on the sysstat data collection service. This service periodically (typically every 10 minutes) collects system statistics and stores them in files under /var/log/sa/ (or /var/log/sysstat/ on some distributions). When you run sar, it reads and processes these historical data files.
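If the sysstat collector is enabled, you can also restrict a report to a time window or read an older data file; the file name sa15 below is just an example for the 15th day of the month:
# CPU utilization between 09:00 and 12:00 today
$ sar -u -s 09:00:00 -e 12:00:00
# Memory statistics from a saved daily file
$ sar -r -f /var/log/sa/sa15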
Understanding Performance Metrics
When using these tools, it's crucial to understand what various metrics indicate:
Load Average: Represents the average number of processes that are runnable or in uninterruptible sleep (typically waiting on disk I/O). As a rule of thumb, a load average consistently exceeding the number of CPU cores suggests a CPU bottleneck, though heavy I/O wait can inflate it as well.
Memory Usage: Modern Linux systems use most available memory for disk caching, so high memory usage isn't necessarily bad. What's more concerning is high swap usage, which indicates that the system is paging memory to disk.
I/O Wait: High I/O wait percentages in CPU statistics indicate that the CPU is waiting for disk operations to complete, suggesting a disk I/O bottleneck.
2. Log Analysis and Management
The Linux Logging System Architecture
Linux logging is primarily handled by the system logging daemon (traditionally syslogd, but modern systems often use rsyslogd or journald). These daemons receive log messages from applications and the kernel, then write them to various log files based on configuration rules.
Understanding syslog Protocol
The syslog protocol defines a message format with:
- Facility: Indicates the source of the message (kernel, mail system, user-level messages, etc.)
- Severity: Indicates the importance of the message (emergency, alert, critical, error, warning, notice, info, debug)
- Timestamp and hostname
- Message content
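You can see the facility/severity model in action by emitting a test message with the standard logger utility and then looking for it in the system log (the exact log path varies by distribution):
# Send a warning-level message using the "user" facility
$ logger -p user.warning "disk latency test message"
# Check that it was written by the logging daemon
$ tail -n 5 /var/log/syslog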
Key Log Files and Their Purposes
Important log files you'll typically find in a Linux system:
/var/log/syslog or /var/log/messages # General system messages
/var/log/auth.log or /var/log/secure # Authentication-related messages
/var/log/kern.log # Kernel messages
/var/log/dmesg # Boot-time device messages
/var/log/[application-specific] # Logs for specific applications
systemd Journal for Modern Systems
If your system uses systemd, the journal provides a centralized, structured, and indexed logging system:
# View all journal entries
$ journalctl
# View entries for a specific service
$ journalctl -u apache2.service
# View entries since last boot
$ journalctl -b
The journal stores log data in a binary format under /var/log/journal/ (when persistent storage is enabled) that enables faster searching and filtering. Under the hood, it uses an indexing system that allows efficient queries across many fields.
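A few additional journalctl filters that are handy during performance investigations:
# Only messages of priority "err" or worse from the current boot
$ journalctl -p err -b
# Messages from the last hour
$ journalctl --since "1 hour ago"
# Kernel messages only (similar to dmesg)
$ journalctl -k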
Log Rotation and Management
Log files can grow large and consume disk space. The logrotate utility automatically manages log file rotation, compression, and deletion based on configured policies:
# View logrotate configuration
$ cat /etc/logrotate.conf
# View application-specific configurations
$ ls /etc/logrotate.d/
A typical log rotation configuration might archive logs weekly, keep four weeks of archives, and compress them to save space.
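As a sketch of such a policy, a drop-in file for a hypothetical application log might contain:
# Example /etc/logrotate.d/myapp (the log path is a placeholder)
/var/log/myapp.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}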
Advanced Log Analysis Techniques
For performance troubleshooting, logs are invaluable. Here are some techniques to extract meaningful information:
# Find all error messages in syslog
$ grep -i error /var/log/syslog
# Count occurrences of different HTTP status codes in Apache logs
$ awk '{print $9}' /var/log/apache2/access.log | sort | uniq -c | sort -rn
# Extract response times from custom application logs
$ grep "response_time" /var/log/application.log | awk '{print $NF}' | sort -n
For complex log analysis, tools like logwatch, goaccess, or the ELK stack (Elasticsearch, Logstash, Kibana) provide more sophisticated capabilities.
3. Performance Tuning
CPU Performance Tuning
Understanding CPU Scheduling in Linux
Linux uses the Completely Fair Scheduler (CFS) by default. This scheduler tries to allocate CPU time fairly among all processes based on their weight (derived from the nice value). Under the hood, CFS maintains a red-black tree of runnable processes ordered by virtual runtime, a weighted measure of the CPU time each process has already received.
Process Priorities with nice and renice
You can adjust the priority of processes using the nice and renice commands:
# Start a process with lower priority (higher nice value)
$ nice -n 10 command
# Change the priority of a running process
$ renice +10 -p PID
Nice values range from -20 (highest priority) to 19 (lowest priority). Only root can set negative nice values.
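To see the nice values processes are currently running with, ps can print them directly:
# PID, nice value, and command name, sorted by nice value
$ ps -eo pid,ni,comm --sort=ni | head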
CPU Governor and Frequency Scaling
Modern CPUs can adjust their frequency to balance performance and power consumption. Linux provides CPU governors to control this behavior:
# View current CPU frequency information
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
# View available governors
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
# Set the performance governor
$ echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Common governors include:
- performance: Maintains maximum frequency
- powersave: Maintains minimum frequency
- ondemand: Dynamically adjusts frequency based on current load (default on many systems)
- conservative: Similar to ondemand but changes frequency more gradually
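The sysfs write shown above is lost at reboot. If the cpupower utility is available (it usually ships in a separate package), it provides a friendlier front end to the same settings:
# Show the current policy, limits, and available governors
$ sudo cpupower frequency-info
# Set the performance governor on all CPUs
$ sudo cpupower frequency-set -g performance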
Memory Performance Tuning
Understanding Linux Memory Management
Linux manages memory using a virtual memory system that makes efficient use of physical RAM and disk-based swap space. The kernel tries to keep as much data in RAM as possible, using caching and buffering to improve performance.
Key Memory-Related Kernel Parameters
You can tune memory behavior by adjusting kernel parameters under /proc/sys/vm/:
# Control swappiness (tendency to swap out memory)
$ sudo sysctl vm.swappiness=10
# Control dirty page writeback behavior
$ sudo sysctl vm.dirty_ratio=20
$ sudo sysctl vm.dirty_background_ratio=10
Lower swappiness values (0-10) make the kernel less likely to swap out memory to disk, which can improve performance for memory-intensive applications. The default is typically 60.
The dirty ratio parameters control when the kernel writes dirty (modified) memory pages to disk. Lower values protect against data loss but may increase I/O load.
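Values set with sysctl on the command line do not survive a reboot. To persist them, place them in /etc/sysctl.conf or a file under /etc/sysctl.d/ (the file name below is illustrative) and reload:
# Example /etc/sysctl.d/90-memory-tuning.conf
vm.swappiness = 10
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10

# Apply all configured kernel parameters
$ sudo sysctl --system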
Optimizing Application Memory Usage
For application-specific memory tuning:
- Use ulimit to set resource limits for processes
- Configure application-specific memory settings (e.g., Java heap size, database buffer pools)
- Consider using cgroups to limit memory usage for specific services (see the sketch below)
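On systemd-based systems, cgroup limits can be applied without editing cgroup files by hand. This sketch assumes cgroup v2; the command path and service name are placeholders:
# Run a one-off command in a transient scope capped at 512 MB of memory
$ sudo systemd-run --scope -p MemoryMax=512M /path/to/command
# Apply a memory cap to an existing (hypothetical) service
$ sudo systemctl set-property myapp.service MemoryMax=1G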
Disk I/O Performance Tuning
Understanding Linux I/O Schedulers
Linux uses I/O schedulers to determine the order in which disk I/O operations are performed. Different schedulers are optimized for different workloads:
# View current I/O scheduler for a device
$ cat /sys/block/sda/queue/scheduler
# Change the scheduler (newer multi-queue kernels expose mq-deadline instead of deadline)
$ echo deadline | sudo tee /sys/block/sda/queue/scheduler
Common I/O schedulers include:
- cfq (Completely Fair Queuing): Attempts to distribute I/O fairly among processes
- deadline: Optimized for low latency by imposing deadlines on I/O operations
- noop: Minimal scheduling, good for SSDs and virtual environments
- mq-deadline: Multi-queue version of deadline for modern devices
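Scheduler changes written to /sys are also lost at reboot. One common way to persist them is a udev rule; the rule below, matching rotational disks and assigning mq-deadline, is a sketch rather than a recommendation:
# Example /etc/udev/rules.d/60-iosched.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"

# Reload the rules and re-trigger block devices
$ sudo udevadm control --reload
$ sudo udevadm trigger --subsystem-match=block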
File System Selection and Mount Options
The file system you choose and its mount options can significantly impact performance:
# View current mount options
$ mount | grep sda
# Remount a filesystem with different options
$ sudo mount -o remount,noatime /dev/sda1 /mnt
Performance-related mount options include:
- noatime: Disables updating access times, reducing unnecessary writes
- data=writeback: For ext4, enables faster but slightly less safe write behavior
- commit=30: Changes the interval between filesystem commits
- discard: For SSDs, enables TRIM support, which can help maintain SSD performance
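To make mount options permanent, put them in the relevant /etc/fstab entry; the device, mount point, and options below are placeholders:
# Example /etc/fstab line using noatime and a 30-second commit interval
/dev/sda1   /data   ext4   defaults,noatime,commit=30   0 2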
Tuning with the blockdev Command
The blockdev command allows you to adjust block device parameters:
# Set read-ahead buffer size (in 512-byte sectors)
$ sudo blockdev --setra 4096 /dev/sda
# View current read-ahead setting
$ sudo blockdev --getra /dev/sda
Increasing the read-ahead buffer can improve sequential read performance but may waste memory for random access patterns.
4. Cron Jobs and Scheduled Tasks
The Linux Task Scheduling Architecture
Linux provides multiple mechanisms for scheduling tasks:
- cron: For recurring tasks at specific times
- at: For one-time scheduled tasks
- systemd timers: A modern alternative to cron on systemd-based systems
Understanding Cron Syntax and Configuration
Cron jobs are defined in crontab files using a specific syntax:
# Minute Hour Day Month Weekday Command
0 2 * * * /path/to/backup/script.sh
This example runs a backup script at 2:00 AM every day.
# Edit your personal crontab
$ crontab -e
# List your crontab entries
$ crontab -l
# Edit the system crontab (as root)
$ sudo crontab -e
Under the hood, the cron daemon reads these configuration files and executes commands at the specified times. It checks for updated crontabs every minute.
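A few more schedule expressions to illustrate the five time fields (the script paths are placeholders):
# Every 15 minutes
*/15 * * * * /path/to/check.sh
# At 03:30 on weekdays (Monday through Friday)
30 3 * * 1-5 /path/to/report.sh
# At midnight on the first day of each month
0 0 1 * * /path/to/monthly-cleanup.sh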
Special Cron Directories
Linux systems typically include special directories for common scheduling needs:
/etc/cron.hourly/
/etc/cron.daily/
/etc/cron.weekly/
/etc/cron.monthly/
Any executable script placed in these directories will run at the corresponding interval. The exact run times are defined in /etc/crontab (or handled by anacron on many distributions).
systemd Timers as a Modern Alternative
If your system uses systemd, timers provide a more flexible scheduling mechanism:
# Create a systemd service unit
$ sudo vim /etc/systemd/system/backup.service
# Create a corresponding timer unit
$ sudo vim /etc/systemd/system/backup.timer
# Enable and start the timer
$ sudo systemctl enable --now backup.timer
# List active timers
$ systemctl list-timers
A typical timer unit might look like:
[Unit]
Description=Daily backup timer
[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
[Install]
WantedBy=timers.target
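The timer activates a service unit of the same name, so a minimal backup.service to pair with it could look like this (the script path is a placeholder):
[Unit]
Description=Daily backup job
[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh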
Systemd timers offer advantages over cron, including:
- Better logging integration with the journal
- Handling of missed executions
- More sophisticated scheduling options
- Dependencies on other systemd units
Best Practices for Scheduled Tasks
When creating scheduled tasks:
- Include comprehensive error handling in your scripts
- Redirect output to log files or use the journal
- Consider using locking mechanisms to prevent overlapping executions (see the flock sketch after this list)
- Plan execution times to minimize impact on system performance
- Test thoroughly before deploying to production
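As a sketch of the locking and logging recommendations, a crontab entry can wrap the job with flock from util-linux; the lock file, script, and log paths are placeholders:
# Skip this run if a previous instance still holds the lock (-n = don't wait),
# and append all output to a log file
0 2 * * * flock -n /var/lock/backup.lock /path/to/backup.sh >> /var/log/backup.log 2>&1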
5. Resource Utilization Analysis
Comprehensive Resource Assessment
While individual tools give you insights into specific aspects of your system, comprehensive resource analysis requires considering all components together. Performance issues often arise from interactions between subsystems rather than from a single component.
CPU Utilization Analysis
Beyond the basic metrics from top, deeper CPU analysis includes:
# View detailed CPU statistics
$ mpstat -P ALL 2
# Check current CPU clock speeds (frequencies well below the rated maximum can indicate thermal throttling)
$ grep MHz /proc/cpuinfo
# Examine what specific processes are doing
$ pidstat -u 2
Interpreting CPU utilization requires understanding different states:
- %user: Time spent in user space
- %system: Time spent in kernel space
- %iowait: Time spent waiting for I/O operations
- %idle: Time the CPU is idle
- %steal: Time lost to the hypervisor (in virtualized environments)
High user time typically indicates busy applications, while high system time might suggest kernel issues or excessive system calls.
Memory Utilization Analysis
For detailed memory analysis:
# View detailed memory statistics
$ vmstat -s
# Examine memory usage by process
$ ps aux --sort=-%mem | head
# Check for memory fragmentation
$ cat /proc/buddyinfo
When analyzing memory, consider:
- Physical memory usage vs. virtual memory
- Buffer and cache usage (which can be reclaimed if needed)
- Swap usage (high swap activity can severely impact performance)
- Memory leak detection (steadily increasing usage over time)
Disk I/O Analysis
Detailed disk analysis tools include:
# View disk I/O statistics
$ iostat -x 2
# See which processes are performing I/O
$ iotop
# Check filesystem space usage (run periodically to track growth)
$ df -h
Key metrics to monitor:
- %util: Percentage of time the device was busy
- r/s and w/s: Reads and writes per second
- await: Average time (in milliseconds) for I/O requests to be serviced
- svctm: Average service time of I/O requests
A high utilization percentage combined with high wait times suggests a disk I/O bottleneck.
Network Utilization Analysis
For network performance:
# View network traffic statistics
$ sar -n DEV 2
# Examine connection statistics (ss -tunap is the modern equivalent)
$ netstat -tunapl
# Monitor network traffic in real-time
$ iftop
Important network metrics include:
- Bandwidth utilization (bytes/packets per second)
- Connection states and counts
- Packet loss and retransmissions
- Latency and response times
Correlating Resource Usage Across Subsystems
The most valuable insights often come from correlating metrics across different subsystems. For example, high CPU I/O wait combined with high disk utilization confirms a disk bottleneck, while high CPU user time with high network traffic might indicate a network-intensive application.
Tools like collectl and dstat can gather and display data from multiple subsystems simultaneously, making correlation easier.
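For example, dstat (provided by the pcp package on some newer distributions) can stream CPU, disk, network, and memory counters side by side; the 5-second interval is arbitrary:
# CPU, disk, network, and memory statistics every 5 seconds
$ dstat -cdnm 5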
6. Identifying Bottlenecks and Performance Issues
Systematic Bottleneck Identification
Identifying performance bottlenecks requires a methodical approach:
- Establish a baseline: Know what "normal" performance looks like for your system.
- Identify symptoms: Slow response time? High CPU usage? Excessive disk I/O?
- Isolate the affected components: Is it a single application or the entire system?
- Analyze resource utilization: Use the tools discussed earlier to identify which resources are constrained.
- Test hypotheses: Make controlled changes to verify your understanding of the issue.
Common Bottleneck Scenarios and Solutions
Let's examine some common performance bottlenecks and their typical solutions:
CPU Bottlenecks
Symptoms:
- Load average consistently higher than the number of CPU cores
- High CPU utilization with little idle time
- Processes spending time in the "runnable" state (R)
Potential Solutions:
- Optimize application code
- Distribute workload across multiple processes/threads
- Adjust process priorities for critical applications
- Consider vertical scaling (adding more CPU resources)
- Implement load balancing across multiple servers
Memory Bottlenecks
Symptoms:
- High swap usage and frequent swap activity
- Out-of-memory killer activations in logs
- Increased page fault rates
- Available memory consistently near zero
Potential Solutions:
- Add more physical memory
- Optimize application memory usage
- Adjust kernel swappiness parameter
- Implement memory limits for processes
- Close unnecessary applications or services
Disk I/O Bottlenecks
Symptoms:
- High I/O wait times in CPU statistics
- High disk utilization percentage
- Long service times for disk operations
- Processes in uninterruptible sleep state (D)
Potential Solutions:
- Use faster storage (SSDs instead of HDDs)
- Implement RAID for better I/O distribution
- Optimize application I/O patterns
- Adjust filesystem and I/O scheduler settings
- Consider caching solutions
Network Bottlenecks
Symptoms:
- High network utilization
- Increasing packet loss or retransmissions
- Long network latency
- Applications waiting for network responses
Potential Solutions:
- Increase network bandwidth
- Optimize network traffic (compression, caching)
- Reduce unnecessary network operations
- Consider content delivery networks for public services
- Implement quality of service (QoS) for critical traffic
Profiling Applications for Performance Issues
For application-specific performance issues, profiling tools can identify hot spots in code:
# Profile a process using perf
$ perf record -p PID
$ perf report
# Profile memory allocations
$ valgrind --tool=massif ./application
# Trace system calls
$ strace -c -p PID
These tools help identify which specific functions or system calls are consuming the most resources.
Hands-on Exercises
Exercise 1: Identifying a CPU Bottleneck
Scenario: A web server is experiencing slow response times during peak hours.
Tasks:
- Use top or htop to observe CPU utilization and load averages.
- Identify the processes consuming the most CPU time.
- Use pidstat -u 1 to analyze CPU usage patterns for these processes.
- Check whether the processes are CPU-bound or waiting for other resources.
- Look for opportunities to optimize or distribute the workload.
Solution Example:
- top shows the load average is 8.2 on a 4-core system, indicating CPU saturation.
- The Apache web server processes are consuming most CPU time.
- pidstat reveals these processes are primarily in user mode, not waiting for I/O.
- The processes appear to be CPU-bound, possibly performing complex calculations.
- Potential solutions include:
- Implementing caching to reduce CPU-intensive calculations
- Distributing load across more server instances
- Optimizing the most resource-intensive code paths
Exercise 2: Diagnosing and Resolving a Memory Leak
Scenario: A system's available memory is gradually decreasing over time, even during periods of low activity.
Tasks:
- Use free -m to check memory usage patterns over time (run it periodically).
- Identify which processes are consuming increasing amounts of memory using ps aux --sort=-%mem.
- For specific processes, use pmap -x PID to examine memory mappings.
- Check system logs for out-of-memory incidents.
- Develop a strategy to address the memory leak.
Solution Example:
- free -m shows available memory decreasing by approximately 100 MB per hour.
- A custom application process is steadily increasing its memory usage.
- pmap shows growth primarily in heap memory, suggesting a classic memory leak.
- The logs show the OOM killer activated twice in the past week.
- Short-term solution: implement a cron job to restart the application nightly (see the example entry below). Long-term solution: use memory profiling tools to identify and fix the leak in the application code.
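The short-term workaround could be a root crontab entry like the following; the service name myapp.service is hypothetical:
# Restart the leaking application every night at 03:00
0 3 * * * systemctl restart myapp.service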
Exercise 3: Optimizing Disk I/O Performance
Scenario: Database queries are running slowly, and system monitoring indicates high disk I/O wait times.
Tasks:
- Use iostat -x 1 to monitor disk activity and identify potential bottlenecks.
- Check the current I/O scheduler using cat /sys/block/sda/queue/scheduler.
- Analyze the types of disk operations with iotop.
- Review filesystem mount options with mount | grep sda.
- Implement and test performance improvements.
Solution Example:
- iostat shows high %util (95%) and await times (50 ms) for the database disk.
- The current scheduler is "cfq", which might not be optimal for database workloads.
- iotop reveals many small, random read operations typical of database indexes.
- The filesystem is using default mount options with no performance optimizations.
- Implement these changes and measure the improvement:
# Change to deadline scheduler (better for databases)
echo deadline | sudo tee /sys/block/sda/queue/scheduler
# Increase read-ahead for potentially better sequential read performance
sudo blockdev --setra 1024 /dev/sda
# Remount filesystem with noatime to reduce unnecessary writes
sudo mount -o remount,noatime /dev/sda1 /database
Common Pitfalls and Troubleshooting Tips
Performance Analysis Pitfalls
- Misinterpreting Load Average: High load doesn't always indicate CPU bottlenecks; it could be I/O or other resource contention.
- Focusing on the Wrong Metric: For instance, total memory usage in Linux can be misleading because the kernel uses available memory for caching.
- Addressing Symptoms Rather Than Causes: For example, adding more memory when the real issue is a memory leak will only delay problems, not solve them.
- Making Multiple Changes Simultaneously: This makes it difficult to determine which change was effective.
- Ignoring Baseline Performance: Without knowing what "normal" looks like, it's challenging to identify abnormal behavior.
Troubleshooting Methodology
Follow a systematic approach to troubleshooting:
- Define the Problem Clearly: What exactly is slow or not working as expected?
- Gather Information: Use the tools discussed in this module to collect relevant data.
- Form a Hypothesis: Based on the data, what do you think is causing the issue?
- Test the Hypothesis: Make a single change and observe the results.
- Implement a Solution: Once confirmed, implement the fix properly.
- Document the Issue and Solution: This helps with future troubleshooting.
Proactive Monitoring Tips
- Set Up Ongoing Monitoring: Tools like Prometheus, Nagios, or Zabbix can continuously monitor system performance.
- Establish Alerting Thresholds: Receive notifications before problems become critical.
- Keep Historical Performance Data: This helps identify trends and predict future issues.
- Perform Regular Health Checks: Don't wait for problems; periodically review system performance.
- Schedule Maintenance Windows: Some issues require downtime to address properly.
Quick Reference Summary
Essential Commands
System Monitoring
- top / htop: Interactive process viewers
- vmstat: Virtual memory statistics
- iostat: I/O statistics for devices and partitions
- sar: Collect, report, or save system activity information
- dmesg: Display kernel ring buffer messages
Log Analysis
- journalctl: Query the systemd journal
- grep, awk, sed: Text processing tools for log analysis
- tail -f /var/log/file: Follow log files in real time
Performance Tuning
- nice / renice: Adjust process priorities
- sysctl: Configure kernel parameters
- blockdev: Configure block device parameters
- mount -o remount: Change mount options without unmounting
Scheduled Tasks
- crontab -e: Edit user cron jobs
- systemctl list-timers: List active systemd timers
Key Files and Directories
- /proc/*: Kernel and process information
- /sys/devices/system/cpu/: CPU-related parameters
- /sys/block/*/queue/: Block device queue parameters
- /var/log/: System and application logs
- /etc/sysctl.conf: Persistent kernel parameter configuration
- /etc/cron.d/, /etc/cron.daily/: Cron job configuration
Critical Performance Metrics
- CPU: Load average, user time, system time, I/O wait
- Memory: Free memory, buffer/cache usage, swap activity
- Disk: Utilization percentage, service time, await time
- Network: Bytes/packets per second, retransmissions, connection states
By mastering the tools and techniques covered in this module, you'll be well-equipped to monitor, analyze, and optimize Linux system performance in a variety of environments.