Originally published January 2, 2018 @ 10:18 am

When something is going down on a server, the first thing most sysadmins will run is the venerable top utility. This happens automatically: if you suspect the server is being sluggish, your fingers just type top without you even thinking about it. Unfortunately, top and many similar tools will only show you the current state of the system. So if a problem came and went before you even logged into the server, you’re out of luck.

It doesn’t help that most centralized system performance monitoring tools (OpenView, SolarWinds, Observium, Big Brother, etc.) collect tons of historical performance data but do not monitor systems on a per-process basis. This matters when troubleshooting application issues: on the historical performance charts you can see that disk I/O was high and system load went through the roof, but the data about the misbehaving process is long gone.

The atop utility has one killer feature: the ability to write everything it sees to a compressed log file. You can later replay this log file, skip to the time index of interest, and see exactly what you would have seen had you been sitting at the console at that moment. Below is a script I wrote to make this logging easier to schedule, so it runs when you want and for as long as you want.
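
For reference, recording and replaying by hand looks roughly like the sketch below. The file path, interval, sample count, and start time are illustrative assumptions, not values from the script. The commands are only printed unless you pass --run.

```shell
#!/bin/bash
# Minimal sketch: record with atop by hand, then replay the raw file.
# The path and the numbers below are illustrative assumptions.

rawfile="/tmp/atop_demo.raw"   # hypothetical location for the raw log
interval=5                     # seconds between samples
samples=12                     # 12 samples x 5 s = one minute of data

# Build the commands as strings so they can be reviewed before running.
record_cmd="atop -w ${rawfile} ${interval} ${samples}"
replay_cmd="atop -r ${rawfile} -b 14:30"   # -b starts the replay at hh:mm

echo "Record: ${record_cmd}"
echo "Replay: ${replay_cmd}"

# Record only when explicitly asked to and when atop is installed.
if [ "${1:-}" = "--run" ] && command -v atop >/dev/null 2>&1; then
	${record_cmd}
fi
```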

A few things to keep in mind:
  1. Never kill -9 an atop process. From within the utility, press q to exit. From the console, use kill -15 or pkill atop; pkill sends signal 15 (SIGTERM) by default.
  2. While atop creates a compressed log file, it can still get pretty big, so be mindful of available disk space. As a rule of thumb, every hour of atop logging at a one-second sampling interval consumes about 50 MB of filesystem space.
  3. The script below requires the atd service to be installed and active. On RHEL/CentOS 5/6: yum -y install at ; /sbin/chkconfig atd on; /sbin/service atd restart. Some versions of CentOS/RHEL had a buggy atd, so, even if you have it installed, it never hurts to update: yum -y update at ; /sbin/service atd restart
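  4. You should run the script as root. The superuser can normally use at regardless of /etc/at.allow and /etc/at.deny; if you run the script as another user, make sure that user is listed in /etc/at.allow or, when that file does not exist, is absent from /etc/at.deny.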
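
The disk-space rule of thumb above can be turned into a quick pre-flight check. This is a rough sketch: it assumes the log size scales linearly with the number of samples, which the rule of thumb does not guarantee. The function names estimate_atop_mb and free_mb are made up for this example.

```shell
#!/bin/bash
# Rough disk-space estimate for an atop recording.
# Assumption: size scales linearly with the sample count, starting from
# the rule of thumb of ~50 MB per hour at a one-second interval.

estimate_atop_mb() {
	local duration_minutes="$1" interval_seconds="$2"
	local mb_per_hour_at_1s=50
	# Integer arithmetic, so the result is rounded down.
	echo $(( mb_per_hour_at_1s * duration_minutes / (60 * interval_seconds) ))
}

# Free space, in MB, on the filesystem holding the given directory.
free_mb() {
	df -Pm "$1" 2>/dev/null | awk 'NR == 2 { print $4 }'
}

needed="$(estimate_atop_mb 480 15)"
available="$(free_mb /var/log)"
echo "Estimated log size: ~${needed} MB; free in /var/log: ${available:-unknown} MB"
if [ "${available:-0}" -lt "${needed}" ]; then
	echo "Warning: possibly not enough free space for this recording" >&2
fi
```

Treat the estimate as a lower bound on busy systems, since a server with many processes writes more data per sample.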

The syntax is fairly simple:

atoplog -t "7:30am tomorrow" -d 480 -i 15 -w /var/log/atop_log

This schedules atop to start at 7:30 tomorrow morning and run for eight hours, taking a sample every 15 seconds and writing the log to the /var/log/atop_log directory.
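
Under the hood, the duration and interval are converted into a sample count for atop. The snippet below repeats that arithmetic for the example above. It mirrors the script's configure() function but is only an illustration.

```shell
#!/bin/bash
# How the example's options translate into the number of atop samples.

duration_minutes=480   # -d 480
interval_seconds=15    # -i 15

duration_seconds=$(( duration_minutes * 60 ))                 # 28800
duration_samples=$(( duration_seconds / interval_seconds ))   # 1920

echo "atop will take ${duration_samples} samples, one every ${interval_seconds} seconds"
```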

And here’s the script. Syntax and examples are included. You can download it here: atop_log. Uncompress and save it to, say, /var/adm/bin and create a convenient link: ln -s /var/adm/bin/atop_log.sh /usr/bin/atoplog

#!/bin/bash
#
#                                      |
#                                  ___/"\___
#                          __________/ o \__________
#                            (I) (G) \___/ (O) (R)
#                                   Igor Os
#                           igor@comradegeneral.com
#                             www.krazyworks.com
#                                 2016-08-03
# ----------------------------------------------------------------------------
# Record atop output in the background for future analysis
# ----------------------------------------------------------------------------

usage() {
cat << EOF
Syntax:
---------------------
atoplog -d <duration_minutes> [-t "<time when to run>" Default: now] [-i <interval_seconds> Default: 5] [-w <target_directory> Default: /var/log/atop]

Example:
---------------------
atoplog -t "2:30pm today" -d 30 -i 2 -w /var/tmp/atop
EOF
exit 1
}

atop_check() {
	if [ ! -x /usr/bin/atop ]
	then
		echo "Can't find /usr/bin/atop. Exiting..."
		exit 1
	fi
	
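	# Look for an already running "atop <interval> ... log" recording.
	# [[:space:]] matches the blank after "atop"; \w would not.
	if [ "$(ps -ef | grep -Ec "[a]top[[:space:]]+[1-9].*log")" -ne 0 ]
	then
		echo "Just FYI, there's another atop already running:"
		ps -ef | grep -E "[a]top[[:space:]]+[1-9].*log"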
	fi
}

while getopts ":d:t:i:w:" OPTION; do
	case "${OPTION}" in
		d)
			duration_minutes="${OPTARG}"
			;;
		t)
			when_to_run="${OPTARG}"
			;;
		i)
			interval_seconds="${OPTARG}"
			;;
		w)
			logdir="${OPTARG}"
			;;
		\?) echo "Unknown option: -${OPTARG}" >&2; usage ;;
		:) echo "Missing option argument for -${OPTARG}" >&2; usage ;;
		*) echo "Unimplemented option: -${OPTARG}" >&2; usage ;;
	esac
done

configure() {
	if [ -z "${duration_minutes}" ] ; then usage ; fi
	if [ -z "${when_to_run}" ] ; then when_to_run="now" ; fi
	datetime="$(date -d "${when_to_run}" +'%Y-%m-%d_%H%M%S')"
	if [ -z "${interval_seconds}" ] ; then interval_seconds=5 ; fi
	if [ -z "${logdir}" ] ; then logdir="/var/log/atop" ; fi
	if [ ! -d "${logdir}" ] ; then mkdir -p "${logdir}" ; fi
	outfile="${logdir}/atop_${datetime}.log"
	if [ -f "${outfile}" ] ; then /bin/rm -f "${outfile}" ; fi
	(( duration_seconds = duration_minutes * 60 ))
	(( duration_samples = duration_seconds / interval_seconds ))
	
}

atop_do() {
	at ${when_to_run} <<<"atop ${interval_seconds} ${duration_samples} -w ${outfile}"
	echo "Running atop at $(atq 2>/dev/null | tail -1 | awk '{print $2,$3}') for ${duration_minutes} minutes at ${interval_seconds}-second intervals with output saved to ${outfile}"
}

atop_help() {
cat << EOF

  You can read this file like so: atop -r ${outfile}
 --------------------------------------------------------------------------------------------------
|                                                                                                  |
| You can access this file at any time: no need to wait for recording to finish.                   |
|                                                                                                  |
| Here are some of the useful filtering options:                                                   |
|                                                                                                  |
|  t - Skip forward in time to next snapshot                                                       |
|  T - Skip back in time to previous snapshot                                                      |
|  P - Filter by process name regex                                                                |
|  U - Filter by username regex                                                                    |
|  b - [hh:mm] - jump to specified timestamp                                                       |
|  r - skip back to start of file with current filter applied                                      |
|                                                                                                  |
| For more help, press "?" in atop                                                                 |
|                                                                                                  |
 --------------------------------------------------------------------------------------------------
 
EOF
}

# RUNTIME
atop_check
configure
atop_do
atop_help
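
Once a few recordings have piled up, a small helper can pick the newest one for replay. This is a sketch, not part of the original script: latest_atop_log is a hypothetical name, and it relies on the atop_<date>_<time>.log naming used above, which sorts chronologically by name. The 09:00 start time is just an example.

```shell
#!/bin/bash
# Sketch: find the newest recording made by the script and print a replay command.

latest_atop_log() {
	local logdir="${1:-/var/log/atop}"
	local f latest=""
	for f in "${logdir}"/atop_*.log; do
		[ -e "${f}" ] || continue   # skip the literal pattern when nothing matches
		latest="${f}"               # glob results are sorted, so the last one wins
	done
	[ -n "${latest}" ] && echo "${latest}"
}

if logfile="$(latest_atop_log /var/log/atop)"; then
	echo "Replay with: atop -r ${logfile} -b 09:00"
else
	echo "No atop logs found in /var/log/atop"
fi
```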