Module 9: Performance Monitoring and Optimization in Linux

Learning Objectives

After completing this module, you will be able to:

  1. Use monitoring tools such as top, htop, glances, and sar to observe system performance
  2. Locate, analyze, and manage system and application logs
  3. Tune CPU, memory, and disk I/O behavior with kernel and device parameters
  4. Schedule recurring tasks with cron and systemd timers
  5. Analyze resource utilization across CPU, memory, disk, and network subsystems
  6. Identify and resolve common performance bottlenecks

1. System Monitoring Tools

Understanding System Monitoring Fundamentals

Linux provides a rich set of tools for monitoring system performance. These tools access information from the /proc filesystem, which is a virtual filesystem that provides an interface to kernel data structures. When you use a monitoring tool, it's reading from files in this filesystem that the kernel constantly updates with current system information.

The /proc Filesystem Under the Hood

The /proc filesystem doesn't exist on your physical storage. It's created in memory by the kernel when the system boots. Key files that monitoring tools typically read include:

/proc/stat       # Aggregate CPU and scheduling statistics
/proc/meminfo    # Memory and swap usage
/proc/loadavg    # Load averages and run-queue information
/proc/uptime     # Uptime and idle time
/proc/[pid]/*    # Per-process state, memory, and I/O details
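You can read these files directly with ordinary tools; a quick illustration (the exact fields vary slightly between kernel versions):

# Current 1-, 5-, and 15-minute load averages
$ cat /proc/loadavg

# First lines of memory statistics (values in kB)
$ head -n 5 /proc/meminfo

# Aggregate CPU time counters since boot
$ head -n 1 /proc/stat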

Let's explore some essential monitoring tools:

top: The Classic Performance Monitor

top provides a dynamic real-time view of running processes. It displays a summary of system information and a list of processes or threads currently managed by the Linux kernel.

$ top

When you run top, it reads from multiple /proc files and presents the data in an organized format. Under the hood, top periodically reads from files like /proc/stat, /proc/meminfo, and /proc/[pid]/* to gather its information.

Key information in the top display:

• Load averages for the last 1, 5, and 15 minutes
• Task counts (running, sleeping, stopped, zombie)
• CPU time breakdown (us, sy, ni, id, wa, hi, si, st)
• Physical memory and swap usage
• Per-process columns such as PID, USER, PR, NI, %CPU, %MEM, TIME+, and COMMAND

htop: An Enhanced Alternative

htop is an interactive process viewer that improves upon top with color-coding, visual indicators for CPU, memory, and swap usage, and a more user-friendly interface.

$ htop

Unlike top, htop presents information in a more intuitive way, with horizontal bars indicating resource usage and color-coding for different types of processes. It's built on the same fundamental data sources as top but processes and displays that information differently.

glances: Comprehensive Monitoring

glances provides an even more comprehensive overview of system resources. It can monitor not just CPU and memory but also network interfaces, disk I/O, sensors, and more.

$ glances

What makes glances special is its capability to adapt to the terminal size and display as much information as possible. It also includes alerting features that change the color of metrics when they exceed defined thresholds.

sar: System Activity Reporter for Historical Data

While the previous tools show current system state, sar (System Activity Reporter) can display historical performance data and collect data for later analysis.

# Display CPU utilization for the current day
$ sar

# Display memory utilization
$ sar -r

# Display disk I/O statistics
$ sar -b

Under the hood, sar is part of the sysstat package. A collector (sadc) runs periodically, typically every 10 minutes via cron or a systemd timer, and stores system statistics in files under /var/log/sa/. When you run sar, it reads and processes these historical data files.
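As a sketch of querying historical data, assuming the collector has been running (file names follow the day of the month, so the paths below are illustrative):

# Memory statistics from the data file for the 15th of the month
$ sar -r -f /var/log/sa/sa15

# CPU statistics between 09:00 and 12:00 from the same file
$ sar -f /var/log/sa/sa15 -s 09:00:00 -e 12:00:00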

Understanding Performance Metrics

When using these tools, it's crucial to understand what various metrics indicate:

Load Average: Represents the average number of processes that are runnable (running or waiting for CPU) or in uninterruptible sleep, reported over the last 1, 5, and 15 minutes. As a rule of thumb, a load average that consistently exceeds the number of CPU cores suggests a CPU bottleneck, though on Linux I/O-bound processes also contribute to the figure.

Memory Usage: Modern Linux systems use most available memory for disk caching, so high memory usage isn't necessarily bad. What's more concerning is high swap usage, which indicates that the system is paging memory to disk.

I/O Wait: High I/O wait percentages in CPU statistics indicate that the CPU is waiting for disk operations to complete, suggesting a disk I/O bottleneck.
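A quick way to sanity-check these metrics from the shell (a minimal sketch; reasonable thresholds depend on your workload):

# Compare the load averages against the number of CPU cores
$ cat /proc/loadavg
$ nproc

# Watch iowait (wa) alongside free memory and swap in/out (si/so)
$ vmstat 2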

2. Log Analysis and Management

The Linux Logging System Architecture

Linux logging is primarily handled by the system logging daemon (traditionally syslogd, but modern systems often use rsyslogd or journald). These daemons receive log messages from applications and the kernel, then write them to various log files based on configuration rules.

Understanding syslog Protocol

The syslog protocol defines a message format with:

• A facility identifying the source subsystem (e.g. kern, auth, daemon, local0-local7)
• A severity level, from debug (7) up to emergency (0)
• A timestamp and the originating hostname
• The message text itself

Key Log Files and Their Purposes

Important log files you'll typically find in a Linux system:

/var/log/syslog or /var/log/messages  # General system messages
/var/log/auth.log or /var/log/secure  # Authentication-related messages
/var/log/kern.log                     # Kernel messages
/var/log/dmesg                        # Boot-time device messages
/var/log/[application-specific]       # Logs for specific applications

systemd Journal for Modern Systems

If your system uses systemd, the journal provides a centralized, structured, and indexed logging system:

# View all journal entries
$ journalctl

# View entries for a specific service
$ journalctl -u apache2.service

# View entries since last boot
$ journalctl -b

The journal stores log data in a binary format in /var/log/journal/ that enables faster searching and filtering. Under the hood, it uses a complex indexing system that allows for efficient queries across various fields.
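Some common filtering patterns (the service name is just an example):

# Show only errors and worse since the last boot
$ journalctl -p err -b

# Show entries from the last hour
$ journalctl --since "1 hour ago"

# Follow a service's log in real time
$ journalctl -u apache2.service -f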

Log Rotation and Management

Log files can grow large and consume disk space. The logrotate utility automatically manages log file rotation, compression, and deletion based on configured policies:

# View logrotate configuration
$ cat /etc/logrotate.conf

# View application-specific configurations
$ ls /etc/logrotate.d/

A typical log rotation configuration might archive logs weekly, keep four weeks of archives, and compress them to save space.
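A minimal sketch of such a policy for a hypothetical application log, using standard logrotate directives:

/var/log/myapp/*.log {
    weekly
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
}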

Advanced Log Analysis Techniques

For performance troubleshooting, logs are invaluable. Here are some techniques to extract meaningful information:

# Find all error messages in syslog
$ grep -i error /var/log/syslog

# Count occurrences of different HTTP status codes in Apache logs
$ awk '{print $9}' /var/log/apache2/access.log | sort | uniq -c | sort -rn

# Extract response times from custom application logs
$ grep "response_time" /var/log/application.log | awk '{print $NF}' | sort -n

For complex log analysis, tools like logwatch, goaccess, or the ELK stack (Elasticsearch, Logstash, Kibana) provide more sophisticated capabilities.

3. Performance Tuning

CPU Performance Tuning

Understanding CPU Scheduling in Linux

Linux uses the Completely Fair Scheduler (CFS) by default. This scheduler tries to allocate CPU time fairly among all processes based on their "weight" (derived from the nice value). Under the hood, CFS maintains a red-black tree of runnable processes sorted by virtual runtime, the weighted amount of CPU time each process has already received.

Process Priorities with nice and renice

You can adjust the priority of processes using the nice and renice commands:

# Start a process with lower priority (higher nice value)
$ nice -n 10 command

# Change the priority of a running process
$ renice +10 -p PID

Nice values range from -20 (highest priority) to 19 (lowest priority). Only root can set negative nice values.
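To see the nice values currently in effect, one option is (column sets vary slightly between ps versions):

# List processes sorted by nice value, lowest (highest priority) first
$ ps -eo pid,ni,pri,comm --sort=ni | head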

CPU Governor and Frequency Scaling

Modern CPUs can adjust their frequency to balance performance and power consumption. Linux provides CPU governors to control this behavior:

# View current CPU frequency information
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

# View available governors
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

# Set the performance governor
$ echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Common governors include:

• performance: keeps the CPU at its maximum frequency
• powersave: keeps the CPU at its minimum frequency
• ondemand: raises the frequency quickly when load appears
• conservative: raises the frequency more gradually than ondemand
• schedutil: uses scheduler utilization data; the default on many recent systems

Memory Performance Tuning

Understanding Linux Memory Management

Linux manages memory using a virtual memory system that makes efficient use of physical RAM and disk-based swap space. The kernel tries to keep as much data in RAM as possible, using caching and buffering to improve performance.

Key Memory-Related Kernel Parameters

You can tune memory behavior by adjusting kernel parameters in /proc/sys/vm/:

# Control swappiness (tendency to swap out memory)
$ sudo sysctl vm.swappiness=10

# Control dirty page writeback behavior
$ sudo sysctl vm.dirty_ratio=20
$ sudo sysctl vm.dirty_background_ratio=10

Lower swappiness values (0-10) make the kernel less likely to swap out memory to disk, which can improve performance for memory-intensive applications. The default is typically 60.

The dirty ratio parameters control when the kernel writes dirty (modified) memory pages to disk. Lower values protect against data loss but may increase I/O load.
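Settings changed with sysctl at runtime are lost on reboot. A common way to persist them is a drop-in file (the file name below is an arbitrary choice):

# Persist the values across reboots
$ echo "vm.swappiness = 10" | sudo tee /etc/sysctl.d/99-performance.conf
$ echo "vm.dirty_ratio = 20" | sudo tee -a /etc/sysctl.d/99-performance.conf

# Apply all configured sysctl settings immediately
$ sudo sysctl --system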

Optimizing Application Memory Usage

For application-specific memory tuning:

• Limit per-process or per-service memory with ulimit, cgroups, or systemd directives such as MemoryMax=
• Profile allocation behavior with tools such as valgrind --tool=massif
• Consider huge pages for databases and other workloads with large, long-lived allocations

Disk I/O Performance Tuning

Understanding Linux I/O Schedulers

Linux uses I/O schedulers to determine the order in which disk I/O operations are performed. Different schedulers are optimized for different workloads:

# View current I/O scheduler for a device
$ cat /sys/block/sda/queue/scheduler

# Change the scheduler
$ echo deadline | sudo tee /sys/block/sda/queue/scheduler

Common I/O schedulers include:

• none (formerly noop): minimal reordering; often best for NVMe and other fast SSDs
• mq-deadline (formerly deadline): imposes per-request deadlines to keep read latency low
• bfq: enforces fairness between processes; well suited to interactive desktop workloads
• cfq (legacy): the older fairness-oriented default on pre-multiqueue kernels

File System Selection and Mount Options

The file system you choose and its mount options can significantly impact performance:

# View current mount options
$ mount | grep sda

# Remount a filesystem with different options
$ sudo mount -o remount,noatime /dev/sda1 /mnt

Performance-related mount options include:

• noatime / relatime: reduce metadata writes by disabling or limiting access-time updates
• nodiratime: skip access-time updates for directory reads
• discard: enable online TRIM on SSDs (periodic fstrim is often preferred)
• data=writeback (ext4): relaxes journaling ordering for speed at some risk to data consistency

Tuning with the blockdev Command

The blockdev command allows you to adjust block device parameters:

# Set read-ahead buffer size (in 512-byte sectors)
$ sudo blockdev --setra 4096 /dev/sda

# View current read-ahead setting
$ sudo blockdev --getra /dev/sda

Increasing the read-ahead buffer can improve sequential read performance but may waste memory for random access patterns.

4. Cron Jobs and Scheduled Tasks

The Linux Task Scheduling Architecture

Linux provides multiple mechanisms for scheduling tasks:

• cron: recurring jobs at fixed times
• anacron: recurring jobs on machines that are not powered on continuously
• at: one-off jobs at a specified time
• systemd timers: scheduling integrated with systemd service units

Understanding Cron Syntax and Configuration

Cron jobs are defined in crontab files using a specific syntax:

# Minute Hour Day Month Weekday Command
0 2 * * * /path/to/backup/script.sh

This example runs a backup script at 2:00 AM every day.
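A few more illustrative schedules (the script paths are placeholders):

# Every 15 minutes
*/15 * * * * /path/to/health-check.sh

# At 08:30 on weekdays (Monday through Friday)
30 8 * * 1-5 /path/to/report.sh

# On the first day of every month at midnight
0 0 1 * * /path/to/monthly-cleanup.sh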

# Edit your personal crontab
$ crontab -e

# List your crontab entries
$ crontab -l

# Edit the system crontab (as root)
$ sudo crontab -e

Under the hood, the cron daemon reads these configuration files and executes commands at the specified times. It checks for updated crontabs every minute.

Special Cron Directories

Linux systems typically include special directories for common scheduling needs:

/etc/cron.hourly/
/etc/cron.daily/
/etc/cron.weekly/
/etc/cron.monthly/

Any executable script placed in these directories will run at the corresponding interval. On most distributions the exact run times are defined in /etc/crontab (or handled by anacron on systems that are not always powered on).
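For example, to add a hypothetical cleanup script to the daily directory (on many distributions run-parts skips file names containing dots, so the installed copy drops the extension):

# Install an executable copy without the .sh extension
$ sudo install -m 755 cleanup.sh /etc/cron.daily/cleanup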

systemd Timers as a Modern Alternative

If your system uses systemd, timers provide a more flexible scheduling mechanism:

# Create a systemd service unit
$ sudo vim /etc/systemd/system/backup.service

# Create a corresponding timer unit
$ sudo vim /etc/systemd/system/backup.timer

# Enable and start the timer
$ sudo systemctl enable --now backup.timer

# List active timers
$ systemctl list-timers

A typical timer unit might look like:

[Unit]
Description=Daily backup timer

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
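A minimal matching backup.service might look like this (the script path is an assumption); by default a timer activates the service unit with the same name:

[Unit]
Description=Daily backup job

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh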

Systemd timers offer advantages over cron, including:

• Each run is logged in the journal and can be inspected with journalctl -u
• Persistent=true catches up on runs missed while the system was powered off
• Jobs inherit the dependency handling and resource controls of their service unit
• RandomizedDelaySec= can spread out jobs to avoid load spikes
• Both calendar (OnCalendar=) and monotonic (e.g. OnBootSec=) triggers are supported

Best Practices for Scheduled Tasks

When creating scheduled tasks:

• Use absolute paths; cron and timers run with a minimal environment
• Redirect output to a log file, or rely on the journal when using systemd timers
• Guard against overlapping runs, for example with flock
• Test the job manually under the same user account before scheduling it
• Stagger heavy jobs rather than scheduling everything at the same popular time

5. Resource Utilization Analysis

Comprehensive Resource Assessment

While individual tools give you insights into specific aspects of your system, comprehensive resource analysis requires considering all components together. Performance issues often arise from interactions between subsystems rather than from a single component.

CPU Utilization Analysis

Beyond the basic metrics from top, deeper CPU analysis includes:

# View detailed CPU statistics
$ mpstat -P ALL 2

# Check for CPU throttling due to temperature
$ grep MHz /proc/cpuinfo

# Examine what specific processes are doing
$ pidstat -u 2

Interpreting CPU utilization requires understanding different states:

• user (us): time running application code
• system (sy): time running kernel code
• nice (ni): time running low-priority (niced) user processes
• idle (id): time doing nothing
• iowait (wa): idle time spent waiting for disk I/O to complete
• irq/softirq (hi/si): time servicing hardware and software interrupts
• steal (st): time a virtual CPU waited for the hypervisor to schedule it

High user time typically indicates busy applications, while high system time might suggest kernel issues or excessive system calls.

Memory Utilization Analysis

For detailed memory analysis:

# View detailed memory statistics
$ vmstat -s

# Examine memory usage by process
$ ps aux --sort=-%mem | head

# Check for memory fragmentation
$ cat /proc/buddyinfo

When analyzing memory, consider:

• "Available" memory rather than "free"; cache and buffers are reclaimable
• Swap activity (the si/so columns in vmstat), not just the amount of swap in use
• Per-process resident size (RSS) versus virtual size (VSZ)
• OOM-killer events recorded in the kernel log

Disk I/O Analysis

Detailed disk analysis tools include:

# View disk I/O statistics
$ iostat -x 2

# See which processes are performing I/O
$ iotop

# Check file system space usage
$ df -h

Key metrics to monitor:

• r/s and w/s: read and write requests per second
• await: average time (ms) a request spends queued plus being serviced
• avgqu-sz / aqu-sz: average length of the request queue
• %util: percentage of time the device was busy servicing requests

A high utilization percentage combined with high wait times suggests a disk I/O bottleneck.

Network Utilization Analysis

For network performance:

# View network traffic statistics
$ sar -n DEV 2

# Examine connection statistics
$ netstat -tunapl

# Monitor network traffic in real-time
$ iftop

Important network metrics include:

• Throughput (rxkB/s and txkB/s) relative to link capacity
• Packets per second, along with errors and drops
• TCP retransmissions and connection counts by state
• Latency to the endpoints that matter for your applications

Correlating Resource Usage Across Subsystems

The most valuable insights often come from correlating metrics across different subsystems. For example, high CPU I/O wait combined with high disk utilization confirms a disk bottleneck, while high CPU user time with high network traffic might indicate a network-intensive application.

Tools like collectl and dstat can gather and display data from multiple subsystems simultaneously, making correlation easier.
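For example, assuming dstat is installed, the following prints a timestamped line every five seconds with CPU, memory, disk, and network columns side by side:

$ dstat -tcmdn 5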

6. Identifying Bottlenecks and Performance Issues

Systematic Bottleneck Identification

Identifying performance bottlenecks requires a methodical approach:

  1. Establish a baseline: Know what "normal" performance looks like for your system.
  2. Identify symptoms: Slow response time? High CPU usage? Excessive disk I/O?
  3. Isolate the affected components: Is it a single application or the entire system?
  4. Analyze resource utilization: Use the tools discussed earlier to identify which resources are constrained.
  5. Test hypotheses: Make controlled changes to verify your understanding of the issue.

Common Bottleneck Scenarios and Solutions

Let's examine some common performance bottlenecks and their typical solutions:

CPU Bottlenecks

Symptoms:

• Load average consistently above the number of CPU cores
• High %user or %system with little idle time
• Processes spending long periods waiting in the run queue

Potential Solutions:

• Optimize or cache the most CPU-intensive code paths
• Adjust priorities with nice/renice, or spread work across more cores or hosts
• Upgrade to faster or additional CPUs

Memory Bottlenecks

Symptoms:

• Little available memory combined with heavy swap in/out activity
• OOM-killer messages in the kernel log
• A particular process whose resident size grows steadily over time

Potential Solutions:

• Find and fix memory leaks, or restart the offending service as a stopgap
• Tune vm.swappiness and per-application memory limits
• Add RAM if the working set genuinely exceeds physical memory

Disk I/O Bottlenecks

Symptoms:

• High %iowait in CPU statistics
• High %util and await values in iostat output
• Applications blocked on reads or writes

Potential Solutions:

• Choose an I/O scheduler suited to the workload and adjust read-ahead
• Use noatime and other performance-oriented mount options
• Add caching, spread I/O across devices, or move to faster storage (SSD/NVMe)

Network Bottlenecks

Symptoms:

• Throughput approaching link capacity, or rising latency and packet loss
• Large numbers of TCP retransmissions or dropped packets
• Applications spending most of their time waiting on network responses

Potential Solutions:

• Reduce traffic through compression or caching
• Tune kernel network buffers and interface offloading settings
• Upgrade link capacity or distribute traffic across interfaces or hosts

Profiling Applications for Performance Issues

For application-specific performance issues, profiling tools can identify hot spots in code:

# Profile a process using perf
$ perf record -p PID
$ perf report

# Profile memory allocations
$ valgrind --tool=massif ./application

# Trace system calls
$ strace -c -p PID

These tools help identify which specific functions or system calls are consuming the most resources.

Hands-on Exercises

Exercise 1: Identifying a CPU Bottleneck

Scenario: A web server is experiencing slow response times during peak hours.

Tasks:

  1. Use top or htop to observe CPU utilization and load averages.
  2. Identify the processes consuming the most CPU time.
  3. Use pidstat -u 1 to analyze CPU usage patterns for these processes.
  4. Check if the processes are CPU-bound or waiting for other resources.
  5. Look for opportunities to optimize or distribute the workload.

Solution Example:

  1. top shows the load average is 8.2 on a 4-core system, indicating CPU saturation.
  2. The Apache web server processes are consuming most CPU time.
  3. pidstat reveals these processes are primarily in user mode, not waiting for I/O.
  4. The processes appear to be CPU-bound, possibly performing complex calculations.
  5. Potential solutions include:
    • Implementing caching to reduce CPU-intensive calculations
    • Distributing load across more server instances
    • Optimizing the most resource-intensive code paths

Exercise 2: Diagnosing and Resolving a Memory Leak

Scenario: A system's available memory is gradually decreasing over time, even during periods of low activity.

Tasks:

  1. Use free -m to check memory usage patterns over time (run periodically).
  2. Identify which processes are consuming increasing amounts of memory using ps aux --sort=-%mem.
  3. For specific processes, use pmap -x PID to examine memory mappings.
  4. Check system logs for out-of-memory incidents.
  5. Develop a strategy to address the memory leak.

Solution Example:

  1. free -m shows available memory decreasing by approximately 100MB per hour.
  2. A custom application process is steadily increasing in memory usage.
  3. pmap shows growth primarily in heap memory, suggesting a classic memory leak.
  4. The log file shows the OOM killer activated twice in the past week.
  5. Short-term solution: Implement a cron job to restart the application nightly. Long-term solution: Use memory profiling tools to identify and fix the leak in the application code.
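A sketch of that stopgap restart as a crontab entry (the service name is hypothetical):

# Restart the leaking application every night at 03:00
0 3 * * * /usr/bin/systemctl restart myapp.service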

Exercise 3: Optimizing Disk I/O Performance

Scenario: Database queries are running slowly, and system monitoring indicates high disk I/O wait times.

Tasks:

  1. Use iostat -x 1 to monitor disk activity and identify potential bottlenecks.
  2. Check the current I/O scheduler using cat /sys/block/sda/queue/scheduler.
  3. Analyze the types of disk operations with iotop.
  4. Review filesystem mount options with mount | grep sda.
  5. Implement and test performance improvements.

Solution Example:

  1. iostat shows high %util (95%) and await times (50ms) for the database disk.
  2. The current scheduler is "cfq" which might not be optimal for database workloads.
  3. iotop reveals many small, random read operations typical of database indexes.
  4. The filesystem is using default mount options with no performance optimizations.
  5. Implement these changes and measure the improvement:
# Change to deadline scheduler (better for databases)
echo deadline | sudo tee /sys/block/sda/queue/scheduler

# Increase read-ahead for potentially better sequential read performance
sudo blockdev --setra 1024 /dev/sda

# Remount filesystem with noatime to reduce unnecessary writes
sudo mount -o remount,noatime /dev/sda1 /database

Common Pitfalls and Troubleshooting Tips

Performance Analysis Pitfalls

  1. Misinterpreting Load Average: High load doesn't always indicate CPU bottlenecks; it could be I/O or other resource contention.
  2. Focusing on the Wrong Metric: For instance, total memory usage in Linux can be misleading because the kernel uses available memory for caching.
  3. Addressing Symptoms Rather Than Causes: For example, adding more memory when the real issue is a memory leak will only delay problems, not solve them.
  4. Making Multiple Changes Simultaneously: This makes it difficult to determine which change was effective.
  5. Ignoring Baseline Performance: Without knowing what "normal" looks like, it's challenging to identify abnormal behavior.

Troubleshooting Methodology

Follow a systematic approach to troubleshooting:

  1. Define the Problem Clearly: What exactly is slow or not working as expected?
  2. Gather Information: Use the tools discussed in this module to collect relevant data.
  3. Form a Hypothesis: Based on the data, what do you think is causing the issue?
  4. Test the Hypothesis: Make a single change and observe the results.
  5. Implement a Solution: Once confirmed, implement the fix properly.
  6. Document the Issue and Solution: This helps with future troubleshooting.

Proactive Monitoring Tips

  1. Set Up Ongoing Monitoring: Tools like Prometheus, Nagios, or Zabbix can continuously monitor system performance.
  2. Establish Alerting Thresholds: Receive notifications before problems become critical.
  3. Keep Historical Performance Data: This helps identify trends and predict future issues.
  4. Perform Regular Health Checks: Don't wait for problems; periodically review system performance.
  5. Schedule Maintenance Windows: Some issues require downtime to address properly.

Quick Reference Summary

Essential Commands

System Monitoring

top / htop / glances            # Interactive real-time overviews
sar [-r|-b|-n DEV]              # Historical CPU, memory, disk, and network statistics
mpstat -P ALL 2                 # Per-CPU utilization
pidstat -u 2                    # Per-process CPU usage
vmstat 2                        # Memory, swap, and run-queue statistics
iostat -x 2                     # Extended disk I/O statistics
iotop / iftop                   # Per-process I/O and real-time network traffic

Log Analysis

journalctl [-u UNIT] [-b]       # Query the systemd journal
grep / awk                      # Filter and summarize plain-text logs
logrotate                       # Rotate, compress, and expire log files

Performance Tuning

sysctl vm.swappiness=10         # Adjust kernel parameters at runtime
nice / renice                   # Set or change process priorities
blockdev --setra N /dev/sdX     # Adjust block device read-ahead
mount -o remount,noatime ...    # Change filesystem mount options

Scheduled Tasks

crontab -e / crontab -l         # Edit or list cron jobs
systemctl list-timers           # Show active systemd timers
systemctl enable --now NAME.timer   # Enable and start a timer

Key Files and Directories

/proc and /sys                  # Kernel and device runtime information
/var/log/                       # System and application logs
/var/log/sa/                    # Historical sysstat data files
/etc/logrotate.conf, /etc/logrotate.d/   # Log rotation policies
/etc/crontab, /etc/cron.*/      # System-wide cron configuration
/etc/systemd/system/            # Custom service and timer units

Critical Performance Metrics

• Load average relative to the number of CPU cores
• CPU state breakdown, especially %iowait and %steal
• Available memory and swap in/out activity
• Disk %util, await, and queue length
• Network throughput, packet errors, and drops

By mastering the tools and techniques covered in this module, you'll be well-equipped to monitor, analyze, and optimize Linux system performance in a variety of environments.