I have discussed atop previously but concentrated primarily on how to run it and how to collect data. Now I’d like to spend some time talking about ways to analyze the data collected with atop.

Included with the atop package is atopsar – a utility designed to extract data from the atop logs and massage it in various ways. Let’s take a look at a few simple examples, and a few not-so-simple ones as well.

Processes

Identify top-five most frequently executed processes during specified time frame
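A ranking like this usually comes down to counting names. The following is a minimal sketch, not the original command: it assumes the log has already been reduced to one command name per line, and the helper name `top_commands` is made up for illustration.

```shell
# Hypothetical helper: reads one command name per line on stdin and prints
# the N most frequent names with their counts, most frequent first.
# Assumes command names contain no spaces.
top_commands() {
    n="${1:-5}"
    sort | uniq -c | sort -k1,1nr -k2,2 | head -n "$n" | awk '{ print $2, $1 }'
}

# Sample input standing in for names extracted from an atop log.
printf '%s\n' cron sshd awk awk sed awk sshd | top_commands 2
```

Ties are broken alphabetically by the second sort key, so the ranking stays stable between runs.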

Count the number of times a particular process has been detected during specified time frame

Generate a chart of the number of instances of a particular process over time

Similar to the previous example, but output chart to PNG

Generate an overall process count chart for the entire available time frame

Identify top-ten most frequently executed binaries from /sbin or /usr/sbin during specified time frame

Find out when the process creation rate, the total number of processes, or the number of processes in uninterruptible sleep exceeded certain limits during the specified time frame

Disks and Volumes

Identify disks with over 90% activity during specified time frame
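A filter of this kind can be sketched with awk. The example below is an assumption-laden illustration rather than the original command: it expects disk report lines with a timestamp in column 1, the disk name in column 2, and the busy percentage (such as 95%) in column 3, and the helper name `busy_disks` is hypothetical. Check the header of your own atopsar disk report before relying on these column positions.

```shell
# Hypothetical helper: prints time, disk, and busy% for every line whose
# busy percentage is above the given limit (default 90).
# Assumed input layout: <time> <disk> <busy%> ... with a timestamp on every line.
busy_disks() {
    limit="${1:-90}"
    awk -v limit="$limit" '
        $3 ~ /^[0-9]+%$/ {          # skips headers and anything without a busy%
            busy = $3
            sub(/%/, "", busy)
            if (busy + 0 > limit) print $1, $2, $3
        }'
}

# Sample lines standing in for a real disk report.
printf '%s\n' \
    '10:00:01 disk busy read/s KB/read writ/s KB/writ avque avserv' \
    '10:10:01 sda 95% 120.0 8.0 300.5 16.2 4.1 2.30 ms' \
    '10:10:01 sdb 12% 3.0 4.0 10.1 8.0 1.0 0.40 ms' \
    '10:20:01 sda 91% 110.2 8.0 280.0 16.0 3.8 2.10 ms' | busy_disks 90
```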

Identify processes responsible for most disk I/O during specified time frame

Identify periods of heavy swap activity during specified time frame

Make a chart of pagescan, swapin, and swapout

Identify logical volumes with high activity or high average queue during specified time frame

Processors

Identify processes consuming more than half of all available CPUs during specified time frame

Memory

Identify time of peak memory utilization during the specified time frame

Comparing apples and oysters

When it comes to system performance analysis, being able to see the complete picture is crucial. Just looking at memory, or disk I/O, or CPU activity separately may only tell you that something happened. But you already knew this.

Here is a basic example of joining the output of two separate instances of atopsar based on the timestamp. In this example we can view the time, the number of running processes, the processes generating the most I/O, and the percentage of I/O they’re responsible for:

The join command allows you to merge two files (or outputs of two commands) based on a common matching field – a timestamp in our case.
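To make the mechanics concrete, here is a small sketch using made-up data in place of real atopsar output. The helper name `join_by_time` and the sample values are hypothetical. Because join expects both inputs sorted on the join field, the helper sorts them first.

```shell
# Hypothetical helper: joins two files on their first field (a timestamp).
join_by_time() {
    tmp=$(mktemp -d) || return 1
    sort -k1,1 "$1" > "$tmp/a"
    sort -k1,1 "$2" > "$tmp/b"
    join "$tmp/a" "$tmp/b"      # field 1 is the default join field
    rm -rf "$tmp"
}

# Made-up sample data: process counts and top I/O consumers per interval.
demo=$(mktemp -d)
printf '%s\n' '10:00:01 312' '10:10:01 318' '10:20:01 305' > "$demo/procs.txt"
printf '%s\n' '10:00:01 rsync 61%' '10:10:01 mysqld 47%' '10:30:01 tar 80%' > "$demo/io.txt"
join_by_time "$demo/procs.txt" "$demo/io.txt"
rm -rf "$demo"
```

Timestamps that appear in only one input (10:20:01 and 10:30:01 here) are dropped, so the result contains only intervals present in both.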

If you need to merge more than two files, things can get a bit tricky. Essentially, you merge the first two and the output becomes file #1. Then you pipe that output plus your new file #2 into the second merge. And so on.
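The chaining described above can be sketched as a loop in which join reads the running result from standard input (the `-` operand). The helper name `join_all` and the sample data are hypothetical, and the inputs are assumed to be already sorted on the timestamp.

```shell
# Hypothetical helper: joins any number of timestamp-keyed files, left to right.
# All files must already be sorted on their first field.
join_all() {
    result=$(cat "$1")
    shift
    for f in "$@"; do
        # "-" tells join to read the running result from stdin
        result=$(printf '%s\n' "$result" | join - "$f")
    done
    printf '%s\n' "$result"
}

# Made-up sample data for three metrics.
demo=$(mktemp -d)
printf '%s\n' '10:00:01 312' '10:10:01 318' > "$demo/procs.txt"
printf '%s\n' '10:00:01 rsync 61%' '10:10:01 mysqld 47%' > "$demo/io.txt"
printf '%s\n' '10:00:01 0.42' '10:10:01 1.37' > "$demo/load.txt"
join_all "$demo/procs.txt" "$demo/io.txt" "$demo/load.txt"
rm -rf "$demo"
```

Each pass adds the columns of one more file to the right-hand side of every matching line.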

Here’s an example of five nested joins providing a side-by-side view of the timestamp, the number of processes, the application with heavy I/O, the amount of I/O by this application, memory/disk swapping in and out, activity on the system disk, the amount of free memory, and the average system load.

Sometimes you see a drastic change in system behavior and wonder if some new process starting up on the server could’ve been the reason. However, with thousands of running processes, figuring out what started and what exited can be difficult.

Here’s a small script that will go through your atop log frame by frame; get a list of running process names; and compare that list to the previous frame. It will then show you what appeared and what disappeared. There are a few variables in the script. The obvious ones have to do with the starting time, the frame interval (should probably match the interval used to record the log), and the number of frames to analyze.
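The core comparison step can be sketched with comm. This is an illustration of the idea only, not the script itself: it assumes each frame has already been dumped to a file with one process name per line, and the helper name `frame_diff` is hypothetical.

```shell
# Hypothetical helper: compares two frames of process names.
# $1 = previous frame, $2 = current frame; one process name per line.
# Prints "+ name" for names that appeared and "- name" for names that disappeared.
frame_diff() {
    tmp=$(mktemp -d) || return 1
    sort -u "$1" > "$tmp/prev"
    sort -u "$2" > "$tmp/curr"
    comm -13 "$tmp/prev" "$tmp/curr" | sed 's/^/+ /'   # only in current frame
    comm -23 "$tmp/prev" "$tmp/curr" | sed 's/^/- /'   # only in previous frame
    rm -rf "$tmp"
}

# Made-up sample frames.
demo=$(mktemp -d)
printf '%s\n' sshd cron rsync > "$demo/frame1"
printf '%s\n' sshd cron mysqld mysqld > "$demo/frame2"
frame_diff "$demo/frame1" "$demo/frame2"
rm -rf "$demo"
```

Running this over consecutive frames, with the current frame becoming the previous one on each step, gives the frame-by-frame view described above.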

There is also the ${cutoff} variable. When looking at the running processes, you may sometimes see a great many instances of the same command (say, a couple thousand instances of awk, sed, and grep that come from some hyperactive script). You probably want to filter those out and concentrate on more important things. The script will only look at processes that have no more than ${cutoff} running instances.
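The cutoff logic on its own might look like the sketch below. It is an illustration under assumptions, not the script's actual code: the helper name `filter_by_cutoff` is hypothetical, and the input is assumed to be one process name per line with no spaces in the names.

```shell
# Hypothetical helper: reads one process name per line on stdin and prints
# only the names that occur no more than the given number of times.
filter_by_cutoff() {
    cutoff="${1:-100}"
    sort | uniq -c | awk -v cutoff="$cutoff" '$1 <= cutoff { print $2 }'
}

# Made-up sample: awk runs three times, cron twice, sshd once; cutoff is 2.
printf '%s\n' awk sshd awk cron awk cron | filter_by_cutoff 2
```

With a cutoff of 2, the three awk instances are dropped while cron and sshd remain.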

And here’s sample output: