Module 9: Performance Monitoring and Optimization in Linux

Learning Objectives

After completing this module, you will be able to:

  1. Use monitoring tools such as top, htop, glances, and sar to observe system performance
  2. Locate, analyze, and manage system and application logs
  3. Tune CPU, memory, and disk I/O behavior with kernel and device parameters
  4. Schedule recurring tasks with cron and systemd timers
  5. Analyze resource utilization across CPU, memory, disk, and network subsystems
  6. Identify and resolve common performance bottlenecks

1. System Monitoring Tools

Understanding System Monitoring Fundamentals

Linux provides a rich set of tools for monitoring system performance. These tools access information from the /proc filesystem, which is a virtual filesystem that provides an interface to kernel data structures. When you use a monitoring tool, it's reading from files in this filesystem that the kernel constantly updates with current system information.

The /proc Filesystem Under the Hood

The /proc filesystem doesn't exist on your physical storage. It's created in memory by the kernel when the system boots. Key files that monitoring tools typically read include:

/proc/stat       # Aggregate CPU and scheduling statistics
/proc/meminfo    # Memory and swap usage
/proc/loadavg    # Load averages and run-queue information
/proc/uptime     # Uptime and idle time
/proc/[pid]/*    # Per-process state, memory, and I/O details
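You can read these files directly with ordinary tools; a quick illustration (the exact fields vary slightly between kernel versions):

# Current 1-, 5-, and 15-minute load averages
$ cat /proc/loadavg

# First lines of memory statistics (values in kB)
$ head -n 5 /proc/meminfo

# Aggregate CPU time counters since boot
$ head -n 1 /proc/stat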

Let's explore some essential monitoring tools:

top: The Classic Performance Monitor

top provides a dynamic real-time view of running processes. It displays a summary of system information and a list of processes or threads currently managed by the Linux kernel.

$ top

When you run top, it reads from multiple /proc files and presents the data in an organized format. Under the hood, top periodically reads from files like /proc/stat, /proc/meminfo, and /proc/[pid]/* to gather its information.

Key information in the top display:

• Load averages for the last 1, 5, and 15 minutes
• Task counts (running, sleeping, stopped, zombie)
• CPU time breakdown (us, sy, ni, id, wa, hi, si, st)
• Physical memory and swap usage
• Per-process columns such as PID, USER, PR, NI, %CPU, %MEM, TIME+, and COMMAND

htop: An Enhanced Alternative

htop is an interactive process viewer that improves upon top with color-coding, visual indicators for CPU, memory, and swap usage, and a more user-friendly interface.

$ htop

Unlike top, htop presents information in a more intuitive way, with horizontal bars indicating resource usage and color-coding for different types of processes. It's built on the same fundamental data sources as top but processes and displays that information differently.

glances: Comprehensive Monitoring

glances provides an even more comprehensive overview of system resources. It can monitor not just CPU and memory but also network interfaces, disk I/O, sensors, and more.

$ glances

What makes glances special is its capability to adapt to the terminal size and display as much information as possible. It also includes alerting features that change the color of metrics when they exceed defined thresholds.

sar: System Activity Reporter for Historical Data

While the previous tools show current system state, sar (System Activity Reporter) can display historical performance data and collect data for later analysis.

# Display CPU utilization for the current day
$ sar

# Display memory utilization
$ sar -r

# Display disk I/O statistics
$ sar -b

Under the hood, sar is part of the sysstat package. A collector (sadc) runs periodically, typically every 10 minutes via cron or a systemd timer, and stores system statistics in files under /var/log/sa/. When you run sar, it reads and processes these historical data files.
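As a sketch of querying historical data, assuming the collector has been running (file names follow the day of the month, so the paths below are illustrative):

# Memory statistics from the data file for the 15th of the month
$ sar -r -f /var/log/sa/sa15

# CPU statistics between 09:00 and 12:00 from the same file
$ sar -f /var/log/sa/sa15 -s 09:00:00 -e 12:00:00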

Understanding Performance Metrics

When using these tools, it's crucial to understand what various metrics indicate:

Load Average: Represents the average number of processes that are runnable (running or waiting for CPU) or in uninterruptible sleep, reported over the last 1, 5, and 15 minutes. As a rule of thumb, a load average that consistently exceeds the number of CPU cores suggests a CPU bottleneck, though on Linux I/O-bound processes also contribute to the figure.

Memory Usage: Modern Linux systems use most available memory for disk caching, so high memory usage isn't necessarily bad. What's more concerning is high swap usage, which indicates that the system is paging memory to disk.

I/O Wait: High I/O wait percentages in CPU statistics indicate that the CPU is waiting for disk operations to complete, suggesting a disk I/O bottleneck.
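A quick way to sanity-check these metrics from the shell (a minimal sketch; reasonable thresholds depend on your workload):

# Compare the load averages against the number of CPU cores
$ cat /proc/loadavg
$ nproc

# Watch iowait (wa) alongside free memory and swap in/out (si/so)
$ vmstat 2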

2. Log Analysis and Management

The Linux Logging System Architecture

Linux logging is primarily handled by the system logging daemon (traditionally syslogd, but modern systems often use rsyslogd or journald). These daemons receive log messages from applications and the kernel, then write them to various log files based on configuration rules.

Understanding syslog Protocol

The syslog protocol defines a message format with:

• A facility identifying the source subsystem (e.g. kern, auth, daemon, local0-local7)
• A severity level, from debug (7) up to emergency (0)
• A timestamp and the originating hostname
• The message text itself

Key Log Files and Their Purposes

Important log files you'll typically find in a Linux system:

/var/log/syslog or /var/log/messages  # General system messages
/var/log/auth.log or /var/log/secure  # Authentication-related messages
/var/log/kern.log                     # Kernel messages
/var/log/dmesg                        # Boot-time device messages
/var/log/[application-specific]       # Logs for specific applications

systemd Journal for Modern Systems

If your system uses systemd, the journal provides a centralized, structured, and indexed logging system:

# View all journal entries
$ journalctl

# View entries for a specific service
$ journalctl -u apache2.service

# View entries since last boot
$ journalctl -b

The journal stores log data in a binary format in /var/log/journal/ that enables faster searching and filtering. Under the hood, it uses a complex indexing system that allows for efficient queries across various fields.
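Some common filtering patterns (the service name is just an example):

# Show only errors and worse since the last boot
$ journalctl -p err -b

# Show entries from the last hour
$ journalctl --since "1 hour ago"

# Follow a service's log in real time
$ journalctl -u apache2.service -f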

Log Rotation and Management

Log files can grow large and consume disk space. The logrotate utility automatically manages log file rotation, compression, and deletion based on configured policies:

# View logrotate configuration
$ cat /etc/logrotate.conf

# View application-specific configurations
$ ls /etc/logrotate.d/

A typical log rotation configuration might archive logs weekly, keep four weeks of archives, and compress them to save space.
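A minimal sketch of such a policy for a hypothetical application log, using standard logrotate directives:

/var/log/myapp/*.log {
    weekly
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
}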

Advanced Log Analysis Techniques

For performance troubleshooting, logs are invaluable. Here are some techniques to extract meaningful information:

# Find all error messages in syslog
$ grep -i error /var/log/syslog

# Count occurrences of different HTTP status codes in Apache logs
$ awk '{print $9}' /var/log/apache2/access.log | sort | uniq -c | sort -rn

# Extract response times from custom application logs
$ grep "response_time" /var/log/application.log | awk '{print $NF}' | sort -n

For complex log analysis, tools like logwatch, goaccess, or the ELK stack (Elasticsearch, Logstash, Kibana) provide more sophisticated capabilities.

3. Performance Tuning

CPU Performance Tuning

Understanding CPU Scheduling in Linux

Linux uses the Completely Fair Scheduler (CFS) by default. This scheduler tries to allocate CPU time fairly among all processes based on their "weight" (derived from the nice value). Under the hood, CFS maintains a red-black tree of runnable processes sorted by virtual runtime, the weighted amount of CPU time each process has already received.

Process Priorities with nice and renice

You can adjust the priority of processes using the nice and renice commands:

# Start a process with lower priority (higher nice value)
$ nice -n 10 command

# Change the priority of a running process
$ renice +10 -p PID

Nice values range from -20 (highest priority) to 19 (lowest priority). Only root can set negative nice values.
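To see the nice values currently in effect, one option is (column sets vary slightly between ps versions):

# List processes sorted by nice value, lowest (highest priority) first
$ ps -eo pid,ni,pri,comm --sort=ni | head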

CPU Governor and Frequency Scaling

Modern CPUs can adjust their frequency to balance performance and power consumption. Linux provides CPU governors to control this behavior:

# View current CPU frequency information
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

# View available governors
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

# Set the performance governor
$ echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Common governors include:

• performance: keeps the CPU at its maximum frequency
• powersave: keeps the CPU at its minimum frequency
• ondemand: raises the frequency quickly when load appears
• conservative: raises the frequency more gradually than ondemand
• schedutil: uses scheduler utilization data; the default on many recent systems

Memory Performance Tuning

Understanding Linux Memory Management

Linux manages memory using a virtual memory system that makes efficient use of physical RAM and disk-based swap space. The kernel tries to keep as much data in RAM as possible, using caching and buffering to improve performance.

Key Memory-Related Kernel Parameters

You can tune memory behavior by adjusting kernel parameters in /proc/sys/vm/:

# Control swappiness (tendency to swap out memory)
$ sudo sysctl vm.swappiness=10

# Control dirty page writeback behavior
$ sudo sysctl vm.dirty_ratio=20
$ sudo sysctl vm.dirty_background_ratio=10

Lower swappiness values (0-10) make the kernel less likely to swap out memory to disk, which can improve performance for memory-intensive applications. The default is typically 60.

The dirty ratio parameters control when the kernel writes dirty (modified) memory pages to disk. Lower values protect against data loss but may increase I/O load.
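Settings changed with sysctl at runtime are lost on reboot. A common way to persist them is a drop-in file (the file name below is an arbitrary choice):

# Persist the values across reboots
$ echo "vm.swappiness = 10" | sudo tee /etc/sysctl.d/99-performance.conf
$ echo "vm.dirty_ratio = 20" | sudo tee -a /etc/sysctl.d/99-performance.conf

# Apply all configured sysctl settings immediately
$ sudo sysctl --system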

Optimizing Application Memory Usage

For application-specific memory tuning:

• Limit per-process or per-service memory with ulimit, cgroups, or systemd directives such as MemoryMax=
• Profile allocation behavior with tools such as valgrind --tool=massif
• Consider huge pages for databases and other workloads with large, long-lived allocations

Disk I/O Performance Tuning

Understanding Linux I/O Schedulers

Linux uses I/O schedulers to determine the order in which disk I/O operations are performed. Different schedulers are optimized for different workloads:

# View current I/O scheduler for a device
$ cat /sys/block/sda/queue/scheduler

# Change the scheduler
$ echo deadline | sudo tee /sys/block/sda/queue/scheduler

Common I/O schedulers include:

• none (formerly noop): minimal reordering; often best for NVMe and other fast SSDs
• mq-deadline (formerly deadline): imposes per-request deadlines to keep read latency low
• bfq: enforces fairness between processes; well suited to interactive desktop workloads
• cfq (legacy): the older fairness-oriented default on pre-multiqueue kernels

File System Selection and Mount Options

The file system you choose and its mount options can significantly impact performance:

# View current mount options
$ mount | grep sda

# Remount a filesystem with different options
$ sudo mount -o remount,noatime /dev/sda1 /mnt

Performance-related mount options include:

• noatime / relatime: reduce metadata writes by disabling or limiting access-time updates
• nodiratime: skip access-time updates for directory reads
• discard: enable online TRIM on SSDs (periodic fstrim is often preferred)
• data=writeback (ext4): relaxes journaling ordering for speed at some risk to data consistency

Tuning with the blockdev Command

The blockdev command allows you to adjust block device parameters:

# Set read-ahead buffer size (in 512-byte sectors)
$ sudo blockdev --setra 4096 /dev/sda

# View current read-ahead setting
$ sudo blockdev --getra /dev/sda

Increasing the read-ahead buffer can improve sequential read performance but may waste memory for random access patterns.

4. Cron Jobs and Scheduled Tasks

The Linux Task Scheduling Architecture

Linux provides multiple mechanisms for scheduling tasks:

• cron: recurring jobs at fixed times
• anacron: recurring jobs on machines that are not powered on continuously
• at: one-off jobs at a specified time
• systemd timers: scheduling integrated with systemd service units

Understanding Cron Syntax and Configuration

Cron jobs are defined in crontab files using a specific syntax:

# Minute Hour Day Month Weekday Command
0 2 * * * /path/to/backup/script.sh

This example runs a backup script at 2:00 AM every day.
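A few more illustrative schedules (the script paths are placeholders):

# Every 15 minutes
*/15 * * * * /path/to/health-check.sh

# At 08:30 on weekdays (Monday through Friday)
30 8 * * 1-5 /path/to/report.sh

# On the first day of every month at midnight
0 0 1 * * /path/to/monthly-cleanup.sh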

# Edit your personal crontab
$ crontab -e

# List your crontab entries
$ crontab -l

# Edit the system crontab (as root)
$ sudo crontab -e

Under the hood, the cron daemon reads these configuration files and executes commands at the specified times. It checks for updated crontabs every minute.

Special Cron Directories

Linux systems typically include special directories for common scheduling needs:

/etc/cron.hourly/
/etc/cron.daily/
/etc/cron.weekly/
/etc/cron.monthly/

Any executable script placed in these directories will run at the corresponding interval. On most distributions the exact run times are defined in /etc/crontab (or handled by anacron on systems that are not always powered on).
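For example, to add a hypothetical cleanup script to the daily directory (on many distributions run-parts skips file names containing dots, so the installed copy drops the extension):

# Install an executable copy without the .sh extension
$ sudo install -m 755 cleanup.sh /etc/cron.daily/cleanup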

systemd Timers as a Modern Alternative

If your system uses systemd, timers provide a more flexible scheduling mechanism:

# Create a systemd service unit
$ sudo vim /etc/systemd/system/backup.service

# Create a corresponding timer unit
$ sudo vim /etc/systemd/system/backup.timer

# Enable and start the timer
$ sudo systemctl enable --now backup.timer

# List active timers
$ systemctl list-timers

A typical timer unit might look like:

[Unit]
Description=Daily backup timer

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
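A minimal matching backup.service might look like this (the script path is an assumption); by default a timer activates the service unit with the same name:

[Unit]
Description=Daily backup job

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh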

Systemd timers offer advantages over cron, including:

• Each run is logged in the journal and can be inspected with journalctl -u
• Persistent=true catches up on runs missed while the system was powered off
• Jobs inherit the dependency handling and resource controls of their service unit
• RandomizedDelaySec= can spread out jobs to avoid load spikes
• Both calendar (OnCalendar=) and monotonic (e.g. OnBootSec=) triggers are supported

Best Practices for Scheduled Tasks

When creating scheduled tasks:

• Use absolute paths; cron and timers run with a minimal environment
• Redirect output to a log file, or rely on the journal when using systemd timers
• Guard against overlapping runs, for example with flock
• Test the job manually under the same user account before scheduling it
• Stagger heavy jobs rather than scheduling everything at the same popular time

5. Resource Utilization Analysis

Comprehensive Resource Assessment

While individual tools give you insights into specific aspects of your system, comprehensive resource analysis requires considering all components together. Performance issues often arise from interactions between subsystems rather than from a single component.

CPU Utilization Analysis

Beyond the basic metrics from top, deeper CPU analysis includes:

# View detailed CPU statistics
$ mpstat -P ALL 2

# Check for CPU throttling due to temperature
$ grep MHz /proc/cpuinfo

# Examine what specific processes are doing
$ pidstat -u 2

Interpreting CPU utilization requires understanding different states:

• user (us): time running application code
• system (sy): time running kernel code
• nice (ni): time running low-priority (niced) user processes
• idle (id): time doing nothing
• iowait (wa): idle time spent waiting for disk I/O to complete
• irq/softirq (hi/si): time servicing hardware and software interrupts
• steal (st): time a virtual CPU waited for the hypervisor to schedule it

High user time typically indicates busy applications, while high system time might suggest kernel issues or excessive system calls.

Memory Utilization Analysis

For detailed memory analysis:

# View detailed memory statistics
$ vmstat -s

# Examine memory usage by process
$ ps aux --sort=-%mem | head

# Check for memory fragmentation
$ cat /proc/buddyinfo

When analyzing memory, consider:

• "Available" memory rather than "free"; cache and buffers are reclaimable
• Swap activity (the si/so columns in vmstat), not just the amount of swap in use
• Per-process resident size (RSS) versus virtual size (VSZ)
• OOM-killer events recorded in the kernel log

Disk I/O Analysis

Detailed disk analysis tools include:

# View disk I/O statistics
$ iostat -x 2

# See which processes are performing I/O
$ iotop

# Check file system space usage
$ df -h

Key metrics to monitor:

• r/s and w/s: read and write requests per second
• await: average time (ms) a request spends queued plus being serviced
• avgqu-sz / aqu-sz: average length of the request queue
• %util: percentage of time the device was busy servicing requests

A high utilization percentage combined with high wait times suggests a disk I/O bottleneck.

Network Utilization Analysis

For network performance:

# View network traffic statistics
$ sar -n DEV 2

# Examine connection statistics
$ netstat -tunapl

# Monitor network traffic in real-time
$ iftop

Important network metrics include:

• Throughput (rxkB/s and txkB/s) relative to link capacity
• Packets per second, along with errors and drops
• TCP retransmissions and connection counts by state
• Latency to the endpoints that matter for your applications

Correlating Resource Usage Across Subsystems

The most valuable insights often come from correlating metrics across different subsystems. For example, high CPU I/O wait combined with high disk utilization confirms a disk bottleneck, while high CPU user time with high network traffic might indicate a network-intensive application.

Tools like collectl and dstat can gather and display data from multiple subsystems simultaneously, making correlation easier.
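For example, assuming dstat is installed, the following prints a timestamped line every five seconds with CPU, memory, disk, and network columns side by side:

$ dstat -tcmdn 5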

6. Identifying Bottlenecks and Performance Issues

Systematic Bottleneck Identification

Identifying performance bottlenecks requires a methodical approach:

  1. Establish a baseline: Know what "normal" performance looks like for your system.
  2. Identify symptoms: Slow response time? High CPU usage? Excessive disk I/O?
  3. Isolate the affected components: Is it a single application or the entire system?
  4. Analyze resource utilization: Use the tools discussed earlier to identify which resources are constrained.
  5. Test hypotheses: Make controlled changes to verify your understanding of the issue.

Common Bottleneck Scenarios and Solutions

Let's examine some common performance bottlenecks and their typical solutions:

CPU Bottlenecks

Symptoms:

• Load average consistently above the number of CPU cores
• High %user or %system with little idle time
• Processes spending long periods waiting in the run queue

Potential Solutions:

• Optimize or cache the most CPU-intensive code paths
• Adjust priorities with nice/renice, or spread work across more cores or hosts
• Upgrade to faster or additional CPUs

Memory Bottlenecks

Symptoms:

• Little available memory combined with heavy swap in/out activity
• OOM-killer messages in the kernel log
• A particular process whose resident size grows steadily over time

Potential Solutions:

• Find and fix memory leaks, or restart the offending service as a stopgap
• Tune vm.swappiness and per-application memory limits
• Add RAM if the working set genuinely exceeds physical memory

Disk I/O Bottlenecks

Symptoms:

• High %iowait in CPU statistics
• High %util and await values in iostat output
• Applications blocked on reads or writes

Potential Solutions:

• Choose an I/O scheduler suited to the workload and adjust read-ahead
• Use noatime and other performance-oriented mount options
• Add caching, spread I/O across devices, or move to faster storage (SSD/NVMe)

Network Bottlenecks

Symptoms:

• Throughput approaching link capacity, or rising latency and packet loss
• Large numbers of TCP retransmissions or dropped packets
• Applications spending most of their time waiting on network responses

Potential Solutions:

• Reduce traffic through compression or caching
• Tune kernel network buffers and interface offloading settings
• Upgrade link capacity or distribute traffic across interfaces or hosts

Profiling Applications for Performance Issues

For application-specific performance issues, profiling tools can identify hot spots in code:

# Profile a process using perf
$ perf record -p PID
$ perf report

# Profile memory allocations
$ valgrind --tool=massif ./application

# Trace system calls
$ strace -c -p PID

These tools help identify which specific functions or system calls are consuming the most resources.

Hands-on Exercises

Exercise 1: Identifying a CPU Bottleneck

Scenario: A web server is experiencing slow response times during peak hours.

Tasks:

  1. Use top or htop to observe CPU utilization and load averages.
  2. Identify the processes consuming the most CPU time.
  3. Use pidstat -u 1 to analyze CPU usage patterns for these processes.
  4. Check if the processes are CPU-bound or waiting for other resources.
  5. Look for opportunities to optimize or distribute the workload.

Solution Example:

  1. top shows the load average is 8.2 on a 4-core system, indicating CPU saturation.
  2. The Apache web server processes are consuming most CPU time.
  3. pidstat reveals these processes are primarily in user mode, not waiting for I/O.
  4. The processes appear to be CPU-bound, possibly performing complex calculations.
  5. Potential solutions include:
    • Implementing caching to reduce CPU-intensive calculations
    • Distributing load across more server instances
    • Optimizing the most resource-intensive code paths

Exercise 2: Diagnosing and Resolving a Memory Leak

Scenario: A system's available memory is gradually decreasing over time, even during periods of low activity.

Tasks:

  1. Use free -m to check memory usage patterns over time (run periodically).
  2. Identify which processes are consuming increasing amounts of memory using ps aux --sort=-%mem.
  3. For specific processes, use pmap -x PID to examine memory mappings.
  4. Check system logs for out-of-memory incidents.
  5. Develop a strategy to address the memory leak.

Solution Example:

  1. free -m shows available memory decreasing by approximately 100MB per hour.
  2. A custom application process is steadily increasing in memory usage.
  3. pmap shows growth primarily in heap memory, suggesting a classic memory leak.
  4. The log file shows the OOM killer activated twice in the past week.
  5. Short-term solution: Implement a cron job to restart the application nightly. Long-term solution: Use memory profiling tools to identify and fix the leak in the application code.
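A sketch of that stopgap restart as a crontab entry (the service name is hypothetical):

# Restart the leaking application every night at 03:00
0 3 * * * /usr/bin/systemctl restart myapp.service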

Exercise 3: Optimizing Disk I/O Performance

Scenario: Database queries are running slowly, and system monitoring indicates high disk I/O wait times.

Tasks:

  1. Use iostat -x 1 to monitor disk activity and identify potential bottlenecks.
  2. Check the current I/O scheduler using cat /sys/block/sda/queue/scheduler.
  3. Analyze the types of disk operations with iotop.
  4. Review filesystem mount options with mount | grep sda.
  5. Implement and test performance improvements.

Solution Example:

  1. iostat shows high %util (95%) and await times (50ms) for the database disk.
  2. The current scheduler is "cfq" which might not be optimal for database workloads.
  3. iotop reveals many small, random read operations typical of database indexes.
  4. The filesystem is using default mount options with no performance optimizations.
  5. Implement these changes and measure the improvement:
# Change to deadline scheduler (better for databases)
echo deadline | sudo tee /sys/block/sda/queue/scheduler

# Increase read-ahead for potentially better sequential read performance
sudo blockdev --setra 1024 /dev/sda

# Remount filesystem with noatime to reduce unnecessary writes
sudo mount -o remount,noatime /dev/sda1 /database

Common Pitfalls and Troubleshooting Tips

Performance Analysis Pitfalls

  1. Misinterpreting Load Average: High load doesn't always indicate CPU bottlenecks; it could be I/O or other resource contention.
  2. Focusing on the Wrong Metric: For instance, total memory usage in Linux can be misleading because the kernel uses available memory for caching.
  3. Addressing Symptoms Rather Than Causes: For example, adding more memory when the real issue is a memory leak will only delay problems, not solve them.
  4. Making Multiple Changes Simultaneously: This makes it difficult to determine which change was effective.
  5. Ignoring Baseline Performance: Without knowing what "normal" looks like, it's challenging to identify abnormal behavior.

Troubleshooting Methodology

Follow a systematic approach to troubleshooting:

  1. Define the Problem Clearly: What exactly is slow or not working as expected?
  2. Gather Information: Use the tools discussed in this module to collect relevant data.
  3. Form a Hypothesis: Based on the data, what do you think is causing the issue?
  4. Test the Hypothesis: Make a single change and observe the results.
  5. Implement a Solution: Once confirmed, implement the fix properly.
  6. Document the Issue and Solution: This helps with future troubleshooting.

Proactive Monitoring Tips

  1. Set Up Ongoing Monitoring: Tools like Prometheus, Nagios, or Zabbix can continuously monitor system performance.
  2. Establish Alerting Thresholds: Receive notifications before problems become critical.
  3. Keep Historical Performance Data: This helps identify trends and predict future issues.
  4. Perform Regular Health Checks: Don't wait for problems; periodically review system performance.
  5. Schedule Maintenance Windows: Some issues require downtime to address properly.

Quick Reference Summary

Essential Commands

System Monitoring

top / htop / glances            # Interactive real-time overviews
sar [-r|-b|-n DEV]              # Historical CPU, memory, disk, and network statistics
mpstat -P ALL 2                 # Per-CPU utilization
pidstat -u 2                    # Per-process CPU usage
vmstat 2                        # Memory, swap, and run-queue statistics
iostat -x 2                     # Extended disk I/O statistics
iotop / iftop                   # Per-process I/O and real-time network traffic

Log Analysis

journalctl [-u UNIT] [-b]       # Query the systemd journal
grep / awk                      # Filter and summarize plain-text logs
logrotate                       # Rotate, compress, and expire log files

Performance Tuning

sysctl vm.swappiness=10         # Adjust kernel parameters at runtime
nice / renice                   # Set or change process priorities
blockdev --setra N /dev/sdX     # Adjust block device read-ahead
mount -o remount,noatime ...    # Change filesystem mount options

Scheduled Tasks

crontab -e / crontab -l         # Edit or list cron jobs
systemctl list-timers           # Show active systemd timers
systemctl enable --now NAME.timer   # Enable and start a timer

Key Files and Directories

/proc and /sys                  # Kernel and device runtime information
/var/log/                       # System and application logs
/var/log/sa/                    # Historical sysstat data files
/etc/logrotate.conf, /etc/logrotate.d/   # Log rotation policies
/etc/crontab, /etc/cron.*/      # System-wide cron configuration
/etc/systemd/system/            # Custom service and timer units

Critical Performance Metrics

• Load average relative to the number of CPU cores
• CPU state breakdown, especially %iowait and %steal
• Available memory and swap in/out activity
• Disk %util, await, and queue length
• Network throughput, packet errors, and drops

By mastering the tools and techniques covered in this module, you'll be well-equipped to monitor, analyze, and optimize Linux system performance in a variety of environments.