Originally published October 31, 2020 @ 8:19 pm
Cron is an indispensable tool for system administration. The difficulties in working with cron in a large environment stem from its decentralized nature. Cron jobs multiply like rabbits, and keeping track of them is not a trivial task.
I recently spent two weeks racking my brain over an especially stubborn file sync problem, only to discover that a cleanup cron job on an unrelated server, which mounted the same NFS share, was interfering with my process.
From the logs, I suspected that something had been deleting files sometime between 1 and 3 am. Unfortunately, the NFS share containing these files was a popular one, mounted on quite a few servers. Given the file ownership and permissions, it had to be an automated task running as either root or WebLogic, possibly a cron job. And so I set out to find it using the Salt CLI and croncal.
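Ownership and permissions narrow the suspects because deleting a file requires write and execute permission on the directory that holds it, not on the file itself. If the directory also has the sticky bit set, only the file's owner, the directory's owner, or root can delete it. Below is a minimal sketch for checking this on a directory in the share. It assumes GNU coreutils stat, and the function name dir_delete_perms is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the owner, group, and octal mode of a directory.
# Unlinking a file needs write + execute on the parent directory, so these
# three values show which users could have deleted files in it.
# A leading 1 in the mode (e.g. 1775) is the sticky bit: only the file's
# owner, the directory's owner, or root may then delete a file.
# GNU stat -c format: %U = owner name, %G = group name, %a = octal mode.
dir_delete_perms() {
  stat -c '%U %G %a' -- "$1"
}
```

If neither the group nor others have write permission on the directory where files went missing, only its owner and root remain as suspects.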
croncal is a handy Perl script written by GitHub user Waldner. Given a time frame, either in the past or in the future, croncal generates a list of all cron jobs that will be (or were) executed for any given system user. You can find the installation details and basic usage examples here.
The idea was simple: write a script that will install croncal on all servers mounting the NFS share; then write a script that will use croncal to determine which cron jobs are executed between 1 and 3 am; finally, write a script that will check all those cron jobs.
The following commands list all root and weblogic cron jobs that ran between 1 and 3 am today:
s=$(date +'%F 01:00'); e=$(date +'%F 03:00')
croncal -s "${s}" -e "${e}" -f /var/spool/cron/{root,weblogic}
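The window above always covers today. When the deletions happened on an earlier night, a small wrapper around GNU date's -d option can build the same start and end strings for any day. This is a sketch: the function name window_for is hypothetical, and it sets the same s and e variables the commands above use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set s and e to 01:00 and 03:00 on a given day.
# $1 is any date string GNU date -d understands, e.g. "yesterday" or
# "2020-10-30"; it defaults to today.
window_for() {
  local day=${1:-today}
  s=$(date -d "$day" +'%F 01:00')
  e=$(date -d "$day" +'%F 03:00')
}
```

For example, running window_for yesterday before the same croncal command lists the jobs scheduled in the previous night's window.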
I was looking for either a script or something along the lines of find ... -delete or find ... -exec ... rm, the usual file cleanup commands. Here's an example of running croncal and then extracting either the names of scripts or anything that looks like a file cleanup task.
croncal -s "${s}" -e "${e}" -f /var/spool/cron/root 2>/dev/null | \
grep -oP '(?<=(\s|\|))(((/[^/ ]*)+/?\.((b|d)?a|(m|pd)?k|z)?sh)|(find.*(rm|delete).*$))(?=(\s|$))' | \
sort -u
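To see what that filter actually picks out, you can wrap it in a function and feed it a few hand-written lines. The sample lines below are invented. Their date|schedule|command layout is an assumption, based on the pipe character that the lookbehind (?<=(\s|\|)) expects in croncal's output. The function name extract_candidates is hypothetical, and grep -P requires GNU grep built with PCRE support.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the grep | sort stage from the pipeline above, reading
# croncal-style lines on stdin and printing shell-script paths and
# find ... rm/delete commands, one per line, de-duplicated.
extract_candidates() {
  grep -oP '(?<=(\s|\|))(((/[^/ ]*)+/?\.((b|d)?a|(m|pd)?k|z)?sh)|(find.*(rm|delete).*$))(?=(\s|$))' |
    LC_ALL=C sort -u   # C locale: byte-wise, locale-independent sort order
}

# Invented sample input; paths and schedules are hypothetical.
sample='2020-10-31 01:15|15 1 * * *|/usr/bin/logrotate /etc/logrotate.conf
2020-10-31 01:30|30 1 * * *|/opt/scripts/cleanup_tmp.sh >/dev/null 2>&1
2020-10-31 02:00|0 2 * * *|find /nfs/share -type f -mtime +7 -delete'

printf '%s\n' "$sample" | extract_candidates
```

On this sample the function should print the cleanup script's path and the find ... -delete command. It prints nothing for the logrotate entry, because none of its paths end in a shell-script extension.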
I saw a few find ... -delete jobs, but none of them looked like the culprit. That left the scripts, and I needed to check every one of them. By changing the previous command to extract only the scripts from the croncal listing, we can then loop through each script, looking for much the same find ... delete commands.
croncal -s "${s}" -e "${e}" -f /var/spool/cron/root 2>/dev/null | \
grep -oP '(?<=(\s|\|))(/[^/ ]*)+/?\.((b|d)?a|(m|pd)?k|z)?sh(?=(\s|$))' | \
sort -u | while read -r i; do
  # grep -q only tests for a match; errors for missing scripts are discarded
  if grep -qP 'find.*(rm|delete)' "${i}" 2>/dev/null; then
    grep -P 'find.*(rm|delete)' "${i}" | awk -v i="${i}" '{print i"\t"$0}'
  fi
done
And that did it: a file cleanup script, created who knows when, had been diligently doing its thing and driving sysadmins, application support, and management up the wall.
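The inner loop can also be kept as a standalone function, which makes it easy to rerun against the script list from another host or another user's crontab. This is a sketch, and the name scan_scripts is hypothetical. It reads one script path per line on stdin, skips paths it cannot read, and prints path, a tab, and the matching line for every line that matches the same find.*(rm|delete) pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the while loop from the pipeline above as a function.
# stdin: one script path per line (e.g. the grep -oP | sort -u output).
# stdout: "<path><TAB><matching line>" for each find ... rm/delete line.
scan_scripts() {
  local script
  while IFS= read -r script; do
    [ -r "$script" ] || continue   # skip scripts that are gone or unreadable
    grep -P 'find.*(rm|delete)' "$script" |
      awk -v i="$script" '{print i"\t"$0}'
  done
}
```

To use it, pipe the croncal | grep -oP | sort -u output into scan_scripts in place of the while loop.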
An understandable limitation of croncal involves cron jobs that run at arbitrary time intervals or introduce a random delay. croncal reports when cron starts a job, not when the command behind the delay actually runs. An example is below. Luckily, such jobs are relatively rare, and there's only a slight chance that one of them would run entirely outside the time window you gave croncal.
27 */4 * * * sleep $(expr $RANDOM \% 900); sudo su - root -c "/var/adm/bin/iptables_rules_update.sh >/dev/null 2>&1" >/dev/null 2>&1
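If you do want to catch such jobs, one option is to move the start of the croncal window back by the maximum delay. In the example above, sleep $(expr $RANDOM \% 900) waits at most 899 seconds. That means a job whose command ran at 01:00 could have been started by cron as early as 00:45. The sketch below assumes GNU date. widen_start is a hypothetical name, and the default of 900 simply comes from that example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move a croncal start time back by the maximum random
# delay (in seconds) a job may sleep before doing its real work. Only the
# start needs widening, because a delay pushes the work later, never earlier.
widen_start() {
  local start=$1 max_delay=${2:-900}
  local epoch
  epoch=$(date -d "$start" +%s) || return 1
  # Subtract the delay in epoch seconds, then format back to croncal's style.
  date -d "@$(( epoch - max_delay ))" +'%F %H:%M'
}
```

For example, you could run s=$(widen_start "${s}") before the croncal command and leave "${e}" unchanged.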
Experienced Unix/Linux System Administrator with a 20-year background in Systems Analysis, Problem Resolution, and Engineering Application Support in a large distributed Unix and Windows server environment. Strong problem determination skills. Good knowledge of networking, remote diagnostic techniques, firewalls, and network security. Extensive experience with engineering application and database servers, high-availability systems, high-performance computing clusters, and process automation.