Originally published January 2, 2018 @ 10:18 am

When something is going down on a server, the first thing most sysadmins will run is the venerable top utility. This happens automatically: if you suspect the server is being sluggish, your fingers just type top without you even thinking about it. Unfortunately, top and many similar tools will only show you the current state of the system. So if a problem came and went before you even logged into the server, you’re out of luck.

It doesn’t help that most centralized system performance monitoring tools (OpenView, Solarwinds, Observium, Big Brother, etc.), while collecting tons of historical performance data, do not monitor systems on a per-process basis. And this can be very important when troubleshooting application issues. On the historical performance charts you can see that disk I/O was high and system load went through the roof, but the data about the misbehaving process is long gone.

The atop utility has one killer feature: the ability to write everything it sees to a compressed log file. You can later replay this log file, skip to the time index of interest and see exactly what you would have seen if you had been sitting at the console window at that exact moment. Below is a script I wrote to make this logging process a little easier to schedule and run when you want and for as long as you want.
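As a rough illustration of the record-and-replay cycle using atop directly (this is not the script described in this post): atop -w writes raw samples to a file, and atop -r reads them back, with -b selecting the start time. The log path, sample counts and times below are made-up placeholders, and the helpers only print the commands instead of running them. Newer atop releases may accept additional time formats for -b, so check your local man page.

```shell
#!/bin/sh
# Hypothetical raw log location; substitute your own.
LOG=/var/log/atop_log/atop_example.raw

# Print the command that records a number of samples at a fixed interval.
record_cmd() {
    # $1 = raw log file, $2 = interval in seconds, $3 = number of samples
    printf 'atop -w %s %s %s\n' "$1" "$2" "$3"
}

# Print the command that replays a raw log starting at a given time.
replay_cmd() {
    # $1 = raw log file, $2 = start time as hh:mm
    printf 'atop -r %s -b %s\n' "$1" "$2"
}

record_cmd "$LOG" 15 240   # one hour of 15-second samples
replay_cmd "$LOG" 14:30    # start the replay at 14:30
```

While replaying, t steps forward one sample, T steps back one, and b prompts for a time to jump to.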

A few things to keep in mind:
  1. Never kill -9 an atop process. From within the utility, use q to exit. From the console, use kill -15 or pkill atop; pkill sends signal 15 (SIGTERM) by default.
  2. While atop creates a compressed log file, it can still get pretty big, so be mindful of available disk space. The rule of thumb is: every hour of atop logging will consume about 50MB of filesystem space at a one-second sampling interval.
  3. The script below requires the atd service to be installed and active. On RHEL/CentOS 5/6: yum -y install at ; /sbin/chkconfig atd on; /sbin/service atd restart. Some versions of CentOS/RHEL had a buggy atd, so, even if you have it installed, it never hurts to update: yum -y update at ; /sbin/service atd restart
  4. You should run the script as root, so make sure /etc/at.allow contains the root username and /etc/at.deny doesn’t.
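The disk space rule of thumb from the list above can be turned into a quick pre-flight check. This is only a sketch: it assumes log size scales linearly with the number of samples (so a 15-second interval uses roughly one fifteenth of the one-second figure), which is a simplification, and the function names and default directory are made up for illustration.

```shell
#!/bin/sh
# Rough size estimate in MB, based on the 50 MB/hour figure at 1-second sampling.
# ASSUMPTION: log size scales linearly with the number of samples taken.
estimate_atop_mb() {
    # $1 = hours of logging, $2 = sampling interval in seconds
    echo $(( $1 * 50 / $2 ))
}

# Free space in MB on the filesystem holding a directory (POSIX df output).
free_mb() {
    df -Pk "$1" | awk 'NR==2 { print int($4 / 1024) }'
}

# Hypothetical default target; override by setting TARGET in the environment.
TARGET=${TARGET:-/var/log}
need=$(estimate_atop_mb 8 1)   # eight hours at one-second sampling
have=$(free_mb "$TARGET")

if [ "$have" -lt "$need" ]; then
    echo "Not enough space: need about ${need} MB, have ${have} MB"
else
    echo "OK: need about ${need} MB, have ${have} MB"
fi
```

Treat the result as a lower bound and leave generous headroom, since a busy system with many processes may log more per sample.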

The syntax is fairly simple:

This will start atop at 7:30 tomorrow morning and keep it going for eight hours, writing a sample every 15 seconds to the /var/log/atop_log directory.
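For comparison, roughly the same schedule can be expressed with at and atop directly. This is a hand-rolled sketch, not the script's own syntax: the helper names and the atop_sched.raw file name are invented for illustration, and the commands are printed rather than executed. The arithmetic is the useful part: eight hours at a 15-second interval works out to 8 × 3600 / 15 = 1920 samples.

```shell
#!/bin/sh
# Number of samples needed to cover a duration at a given interval.
samples_for() {
    # $1 = duration in hours, $2 = interval in seconds
    echo $(( $1 * 3600 / $2 ))
}

# Print (not run) a pipeline that queues an atop recording with at(1).
schedule_cmd() {
    # $1 = at time spec, $2 = hours, $3 = interval in seconds, $4 = log directory
    # The output file name is a made-up placeholder.
    printf 'echo "atop -w %s/atop_sched.raw %s %s" | at %s\n' \
        "$4" "$3" "$(samples_for "$2" "$3")" "$1"
}

schedule_cmd "07:30 tomorrow" 8 15 /var/log/atop_log
```

Because atop is given an explicit sample count, it exits on its own once the eight hours are up, so nothing needs to be killed afterwards.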

And here’s the script. Syntax and examples are included. You can download it here: atop_log. Uncompress and save it to, say, /var/adm/bin and create a convenient link: ln -s /var/adm/bin/atop_log.sh /usr/bin/atoplog