Sometimes knowing when something didn’t happen can be just as valuable as knowing when it did. For reasons yet to be determined, my Salt Master server needs to be bounced occasionally. Not the whole server – just the salt-master service. I needed to find a good time to do this.

The restart itself is quick, but for a few minutes after the restart, the server goes into a hyperactive state while it tries to re-establish communication with thousands of minions. I felt it would be a good idea to stop my very busy cron daemon until salt-master settles down. The challenge was finding a good time for this.

Previously, I discussed the awesome Perl script called croncal that presents your crontab as a schedule. Within this schedule, I needed to find thirty-minute gaps – at least – and then identify the best time for the service restart. And so I wrote this awk one-liner:

# Define the timeframe for croncal (I use one week)
s=$(date -d 'last Monday' +'%Y-%m-%d 00:00')
e=$(date -d 'next Monday' +'%Y-%m-%d 00:00')

# Set the 30-minute threshold
threshold=1800

# Run croncal and parse the output using awk
croncal -s "${s}" -e "${e}" -f /var/spool/cron/root 2>/dev/null | \
awk -F'|' -v threshold=${threshold} 'BEGIN{prev=0;} {split($1, datetime, " "); split(datetime[1], date, "-"); split(datetime[2], time, ":"); current=mktime(date[1] " " date[2] " " date[3] " " time[1] " " time[2] " 00"); if (prev!=0 && current-prev > threshold) {print strftime("%Y-%m-%d %H:%M", prev) "\n--- " threshold " ---\n" strftime("%Y-%m-%d %H:%M", current);} prev=current;}'

You can apply the same idea to analyzing log files. For example, you may have a process that runs every hour and creates a log entry. But on occasion, the process fails to run, and no log entry is created. If you could identify those gaps, you may be able to figure out the cause.

Here’s a real-life example. One of my house’s Wemo “smart” power switches decides to drop off the network seemingly randomly. It eventually reconnected, but this was one of those small problems that I found far more annoying than the situation warranted. Looking at the DHCP server log, I noticed these entries:

Apr 23 06:32:44 ncc1711 dhcpd: DHCPREQUEST for ... from ... (Wemo Plug 1962) via eth1
Apr 23 12:32:56 ncc1711 dhcpd: DHCPREQUEST for ... from ... (Wemo Plug 1962) via eth1
Apr 23 18:33:09 ncc1711 dhcpd: DHCPREQUEST for ... from ... (Wemo Plug 1962) via eth1
Apr 24 06:33:36 ncc1711 dhcpd: DHCPREQUEST for ... from ... (Wemo Plug 1962) via eth1
Apr 24 12:33:49 ncc1711 dhcpd: DHCPREQUEST for ... from ... (Wemo Plug 1962) via eth1
Apr 24 18:34:01 ncc1711 dhcpd: DHCPREQUEST for ... from ... (Wemo Plug 1962) via eth1
Apr 25 00:34:14 ncc1711 dhcpd: DHCPREQUEST for ... from ... (Wemo Plug 1962) via eth1
Apr 25 06:34:26 ncc1711 dhcpd: DHCPREQUEST for ... from ... (Wemo Plug 1962) via eth1

Apparently, the switch renews its DHCP lease every six hours. Now I needed to identify gaps in these log entries when the switch failed to check in with DHCP on time:

# Set the threshold to 6 hours + 5 minutes:
threshold=21900

grep "DHCPREQUEST for.*Wemo Plug 1962" /var/log/messages | \
awk -v threshold=${threshold} 'BEGIN { prev = 0; FS="[ :]" } { current_year = strftime("%Y"); current = mktime(current_year " " (index("JanFebMarAprMayJunJulAugSepOctNovDec", $1) + 2) / 3 " " $2 " " $3 " " $4 " 0 0"); if (current - prev > threshold && NR > 1) { print prev_line "\nMore than "threshold" seconds\n" $0 } prev = current; prev_line = $0 }'

# Sample output
Apr 23 18:33:09 ncc1711 dhcpd: DHCPREQUEST for ... from ... (Wemo Plug 1962) via eth1
More than 21900 seconds
Apr 24 06:33:36 ncc1711 dhcpd: DHCPREQUEST for ... from ... (Wemo Plug 1962) via eth1

It so turned out that the switch would fail to renew its DHCP lease at midnight every Friday. The switch would then reconnect at 6:33 am on Saturday. It had something to do with its automatic update process that I disabled on all the other switches but somehow missed this one. There, the problem was solved. Did this make me happy? Absolutely.