Originally published November 26, 2020 @ 1:25 pm

Over the years I’ve been using the Salt CLI extensively for day-to-day system administration tasks and, in my opinion, there’s nothing quite like it: faster, more capable, and easier to learn than the rest of the CAPS (Chef, Ansible, Puppet, Salt).

I’m getting old, that’s the thing! What’s in me now won’t be there anymore. – Leo Tolstoy, 'War and Peace'

My interest in SaltStack is largely confined to its remote execution capabilities. As a sysadmin with an HPC cluster background, I know the challenges of managing a large number of dissimilar servers. Using old tools like pdsh is an option, but even the simplest of tasks – such as running a basic command on all servers – immediately presents the first challenge: keeping the list of servers up to date.

Salt’s advanced targeting, remote command, and script execution capabilities are very attractive, even if you have no interest in using Salt for server deployment and centralized configuration management. Much of what you will find on this site will be limited to remote execution and monitoring tasks, and I will not get much into the Puppet-vs-Salt-vs-Ansible-vs-Chef debate.

If you support a lot of servers, the trivial effort it takes to set up a Salt master server and to deploy the agents is well justified. Similar to Ansible, Salt can work without the agents, via SSH, but why deny yourself the convenience of near-instantaneous response from thousands of managed systems? And Salt agents work on Windows as well, if you’re into that sort of thing.

Unlike agentless server provisioning tools, with Salt, you don’t need to maintain lists and complex group hierarchies of managed nodes. You can target the systems you need on the fly using a myriad of different options you get with Salt. None of that Ansible slowness while “gathering facts”. Of course, maybe if you get paid by the hour…

In an attempt to be concise, I will go straight to troubleshooting the four most common problems I’ve encountered running Salt in a large environment (and none of these problems are really with Salt itself).

  1. Salt agents maintain a runtime cache in /var/cache/salt. The /var filesystem tends to run out of space because that’s where /var/log lives, and when that happens, the salt-minion stops working. I wish it had an option of designating a secondary cache location. For now, the solution is to use salt-ssh to access the problem nodes and clean up /var. I mean, you would have to do this anyway. And, by the way, pretty much all of the salt .* cmd.run ... commands you will see below also work as salt-ssh .* cmd.run ..., as long as you have passwordless SSH configured the same way you would with, say, Ansible.
  2. From time to time virtualization clusters lose access to storage. Why? It’s complicated and right now not really important. What is important, however, is when this happens, local filesystems tend to become read-only. This may include /var. I am sure that now you can guess how this affects the Salt agent.
  3. Some people have a bad habit of cloning VMs instead of using whatever VM deployment process they should’ve been using. When they clone VMs, they invariably forget to update the /etc/salt/minion_id file that usually contains the node’s FQDN. This, understandably, causes some confusion. Once again, you can use salt-ssh to automate a quick process that will periodically scan your environment to identify and fix this issue: just delete the minion_id file and bounce the salt-minion service (see the sketch after this list). And then you can deal with the real culprits personally.
  4. Finally, and I should’ve mentioned this first, make sure your firewall rules are in place to work with Salt. I can’t tell you how many times this has happened: a new VLAN is created, and the standard firewall rules are never implemented. The minions talk to the master over an encrypted ZMQ connection on ports 4505 and 4506, so these need to be open from the minions to the master (or to the Salt proxies, if you use those). Additionally, port 22 should be open from the master to the minions if you plan on using salt-ssh (and you should have that option).
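
Here is a minimal sketch of the minion_id fix from item 3, assuming the node is already in your salt-ssh roster (the target name cloned-vm-01 is just an example):

salt-ssh 'cloned-vm-01' cmd.run 'rm -f /etc/salt/minion_id && systemctl restart salt-minion'

On restart, the minion regenerates minion_id from the node’s current FQDN; you will then need to accept its new key on the master.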

I have to admit that, when it comes to scripting, I have a soft spot for the convoluted. Here’s a characteristic example:

Imagine you have four environments – Dev, QA, UAT, and Prod – and you need to compare the CPU utilization of all Tomcat servers by the environment. Let’s also imagine that your host-naming convention looks something like this: devl-tomcat-01 or qal-tomcat-01, where “l” after the environment abbreviation stands for “Linux”. This is just to help you understand the mess below.

j=tomcat ; for i in prodl devl uatl qal ; do echo "" ; echo "CPU utilization summary for ${i}*${j}* nodes" ;  salt --timeout=30 --output=txt "${i}*${j}*" cmd.run "top -b -n 1" | egrep ': Cpu' | printf "%s %s %s %s %s %s %s %s\n" `grep -oE '[0-9]{1,}\.[0-9]{1,2}'` | awk '{ c++ ; sum_us += $1 ; sum_sy += $2 ; sum_ni += $3 ; sum_id += $4 ; sum_wa += $5 ; sum_hi += $6 ; sum_si += $7 ; sum_st += $8 ; a_us[c] = $1 ; a_sy[c] = $2 ; a_ni[c] = $3 ; a_id[c] = $4 ; a_wa[c] = $5 ; a_hi[c] = $6 ; a_si[c] = $7 ; a_st[c] = $8 } END { c=asort(a_us) ; d=asort(a_sy) ; e=asort(a_ni) ; f=asort(a_id) ; g=asort(a_wa) ; h=asort(a_hi) ; k=asort(a_si) ; l=asort(a_st) ; printf "avg_us\tavg_sy\tavg_ni\tavg_id\tavg_wa\tavg_hi\tavg_si\tavg_st\tmax_us\tmax_sy\tmax_ni\tmax_id\tmax_wa\tmax_hi\tmax_si\tmax_st\n" ; printf ( "%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\n", sum_us/c, sum_sy/c, sum_ni/c, sum_id/c, sum_wa/c, sum_hi/c, sum_si/c, sum_st/c, a_us[c], a_sy[d], a_ni[e], a_id[f], a_wa[g], a_hi[h], a_si[k], a_st[l] ) }' | column -t ; echo "" ; done

CPU utilization summary for prodl*tomcat* nodes
avg_us  avg_sy  avg_ni  avg_id  avg_wa  avg_hi  avg_si  avg_st  max_us  max_sy  max_ni  max_id  max_wa  max_hi  max_si  max_st
2%      0%      0%      97%     0%      0%      0%      0%      33%     1%      1%      99%     2%      0%      0%      0%
 
CPU utilization summary for devl*tomcat* nodes
avg_us  avg_sy  avg_ni  avg_id  avg_wa  avg_hi  avg_si  avg_st  max_us  max_sy  max_ni  max_id  max_wa  max_hi  max_si  max_st
3%      0%      1%      96%     0%      0%      0%      0%      18%     1%      2%      100%    1%      0%      0%      0%
 
CPU utilization summary for uatl*tomcat* nodes
avg_us  avg_sy  avg_ni  avg_id  avg_wa  avg_hi  avg_si  avg_st  max_us  max_sy  max_ni  max_id  max_wa  max_hi  max_si  max_st
1%      0%      1%      97%     0%      0%      0%      0%      3%      1%      2%      100%    1%      0%      0%      0%
 
CPU utilization summary for qal*tomcat* nodes
avg_us  avg_sy  avg_ni  avg_id  avg_wa  avg_hi  avg_si  avg_st  max_us  max_sy  max_ni  max_id  max_wa  max_hi  max_si  max_st
2%      0%      1%      96%     0%      0%      0%      0%      14%     0%      1%      99%     1%      0%      0%      0%

Here the simple truth of what I do with Salt becomes evident: out of everything above, this is the only Salt command:

salt --timeout=30 --output=txt "${i}*${j}*" cmd.run "top -b -n 1"

That’s it. The rest is just shell scripting. Having said that, accomplishing the same task with pdsh would have been quite a bit more complicated.
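
For comparison, here is a rough sketch of the pdsh starting point, assuming a manually maintained host list file (the hypothetical /etc/pdsh/prod-tomcat-hosts), which is exactly the upkeep Salt targeting eliminates:

pdsh -w ^/etc/pdsh/prod-tomcat-hosts "top -b -n 1" | dshbak -c

And you would still need all of the same awk post-processing on top of that, once per environment.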

So why don’t I use Salt to the fullest of its abilities? To make a short story long: back in the earlier days of HPC clusters I used Scali, which became Platform Manager in 2007 and, five years later, was acquired by IBM. Scali was an interesting but unstable and poorly-documented tool. It had excellent core functionality, but this advantage was entirely undermined by many buggy features of dubious value.

I am certain I spent more time fixing issues with Scali itself than it would have taken me to deploy and manage my HPC clusters using Clonezilla and pdsh. And I would have done exactly that, had it not been for my management’s determination to continue using Scali/PM/whatever, since they had already spent the money on licensing.

I can deploy servers faster with scripts and FTP than most DevOps folks can with Puppet and Ansible. I can provide much more responsive and flexible configuration management using pdsh and flat files than most automation guys can with SaltStack or Chef. So, if I can do all this right now knowing what I already know, why bother with anything else, unless there’s some clear advantage?

There are many DevOps engineers out there who have never heard of Scali or even pdsh. They truly believe they’re onto something new with their Ansible, Jenkins, OpenShift, and endless layers of virtualization. As a Russian saying goes, everything new is just well-forgotten old. Having said that, I do recognize a superior tool when I see one, and so here we go.


Basic Operations

Run a command on QA nodes

salt --output=txt --timeout=30 "qa*" cmd.run "service ntpd status 2>/dev/null" 2>/dev/null

Identify QA nodes with local filesystem utilization above 95%

salt --output=txt --timeout=30 "qa*" cmd.run "df -hPl | grep -E '9[5-9]%|100%'" 2>/dev/null | column -t

Identify QA nodes that haven’t been rebooted in the past week

salt --output=txt --timeout=30 "qa*" cmd.run "uptime" | grep -E " ([0-9]{2,}|[7-9]) days" | awk -F, '{print $1}' | column -t

Identify QA nodes with security advisories (RHEL)

salt --output=txt --timeout=30 "qa*" cmd.run "yum updateinfo summary 2>/dev/null" | grep Security | column -t

Get RHEL version of QA nodes

salt --output=txt --timeout=30 "qa*" cmd.run "grep -oE '[0-9]{1,}\.[0-9]{1,}' /etc/redhat-release 2>/dev/null" | column -t

Running commands as another user on QA Tomcat servers

salt --timeout=30 --output=txt "qa-tomcat*" cmd.run "su - tomcat -c 'whoami'"
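
Alternatively, cmd.run accepts a runas argument, which sidesteps the su quoting games altogether:

salt --timeout=30 --output=txt "qa-tomcat*" cmd.run "whoami" runas=tomcat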

Get a list of physical servers and their hardware models, sorted by generation (HPE)

salt --timeout=30 --output=txt -G "virtual:physical" cmd.run "dmidecode 2>/dev/null | grep -m1 'Product Name:'"  2>/dev/null | awk -F: '{print $1": "$NF}' | sed 's/Gen/G/g' | sort -k4 | column -t

Get a list of unique subnets used by Salt minions

salt --output=txt "qa*" cmd.run "ifconfig 2>/dev/null" 2>/dev/null | grep -oE "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" | grep -E "(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)" | sort -u

Advanced Operations

Check for Puppet errors on QA nodes

salt --timeout=30 --output=txt "qa*" cmd.run "grep -c 'Puppet::Error' /var/log/messages" | grep [1-9]$ | column -t

Get the total number of CPU cores across all 32-bit nodes

salt --timeout=30 --output=txt -G "osarch:i686" cmd.run "cat /proc/cpuinfo" | cut -d ' ' -f2- | grep -c ^processor | awk '{ SUM += $1} END { print ( SUM )" cores" }'
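
A lighter-weight alternative, sketched here, is to read Salt’s own num_cpus grain instead of making every node dump /proc/cpuinfo (you would still sum the values locally):

salt --timeout=30 --output=txt -G "osarch:i686" grains.item num_cpus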

Compound matching, count VMs:

salt --output=txt -C "* and not G@virtual:physical" test.ping 2>/dev/null | wc -l

Get the total LVM size across all DEV WebLogic nodes

salt --timeout=30 --output=txt "dev*logic*" cmd.run "vgs --units=k 2>/dev/null" | cut -d ' ' -f3- | awk '{print $6}' | grep -oE "[0-9]{1,100}\.[0-9]{2}" | awk '{ SUM += $1} END { print ( SUM/1024/1024 )" GB" }'

Show total memory allocated to all DEV Tomcat nodes

salt --timeout=30 --output=txt "dev*tomcat*" cmd.run "free -k | grep ^Mem:" 2>/dev/null | cut -d ' ' -f3- | awk '{ SUM += $1} END { print ( SUM/1024/1024 )" GB" }'

See which DEV nodes are favored by a particular user

salt --timeout=30 --output=txt "dev*" cmd.run "last jdoe | grep -c ^jdoe" 2>/dev/null | sort -k2 -rn | head -10 | column -t

Find RHEL 6 nodes with the highest swap utilization

salt --timeout=30 --output=txt -G "osfinger:Red*6" cmd.run "free -k | grep ^Swap:" | awk '{print $1"\t"$4}' | grep -vE "\s0$" 2>/dev/null | sort -k2 -rn | column -t

Salt understands PCRE-style regular expressions

salt --timeout=30 --output=txt -E 'qa-web.*(0?[0-9]{1,2}).domain.*' cmd.run "uptime" 2>/dev/null

Salt can read a list of nodes from a file

salt --output=txt --timeout=30 -L "`cat /var/tmp/one-per-line-fqdns`" cmd.run "grep -m1 release /etc/issue 2>/dev/null" 2>/dev/null

Similar to above, but the hostnames are not FQDN

salt --output=txt --timeout=30 -L "`sed -e 's/$/\.domain\.local/' /tmp/one-per-line-short-hostnames`" cmd.run "grep release /etc/issue 2>/dev/null" 2>/dev/null

Salt can read a list of nodes from CLI

salt --output=txt --timeout=30 -L qa-tomcat-01.domain.local,dev-tomcat-01.domain.local cmd.run "grep release /etc/issue 2>/dev/null" 2>/dev/null

Salt can use Boolean operators

salt --output=txt -C "[pdqu]l-tomcat* or [pdqu]l-weblogic*" cmd.run "logrotate -f /etc/logrotate.conf 2>/dev/null" 2>/dev/null

Targeting minions using “salt grains”

salt --timeout=30 --output=txt -G 'virtual:physical' cmd.run "uname -a"
salt --timeout=30 --output=txt -G 'manufacturer:HP' cmd.run "uname -a"
salt --timeout=30 --output=txt -G 'cpuarch:x86_64' cmd.run "uname -a"
salt --timeout=30 --output=txt -G 'os:RedHat' cmd.run "uname -a"
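
If you’re not sure which grains are available for targeting, you can dump them all for a minion (the target pattern here is just an example):

salt "qa-tomcat-01*" grains.items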

Target minions by subnet

salt -S 10.92.136.0/24 cmd.run "uptime"

Target minions by subnet and Salt grains

salt -C 'S@10.92.136.0/24 and G@os:RedHat' cmd.run "uptime"

Identify PROD servers with 15-min load average in double-digits:

salt --timeout=30 --output=txt "prod*" cmd.run "uptime | grep -E '[0-9]{2}\.[0-9]{2}$'" 2>/dev/null

Get a list of NFS mounts on QA nodes

salt --timeout=30 --output=txt "qa*" cmd.run "grep ' nfs ' /etc/mtab 2>/dev/null" 2>/dev/null | awk -F',' '{print $1}' | awk '{print $1" "$2" "$3" "$5}' | column -t

Install Wireshark on minions that don’t already have it

salt --output=txt "qltc*" cmd.run "which tshark 2>/dev/null 1>&2 || yum -y install wireshark 2>/dev/null 1>&2" 2>/dev/null

Find the “/opt” filesystem on “prod-db*” servers with utilization of 80-100%

salt --output=txt "prod-db*" cmd.run "df -hlP /opt 2>/dev/null" 2>/dev/null | grep -E "(([8-9][0-9])|100)%" | awk '{print $1,$2,$5,$6}' | sort -u | column -t

Get allocated LVM size of minions in the “/root/list” containing short hostnames

salt --timeout=30 --output=txt -L "`sed -e 's/$/\.domain\.local/' /root/list`" cmd.run "vgs --units=k 2>/dev/null" 2>/dev/null | cut -d ' ' -f3- | awk '{print $6}' | grep -oE "[0-9]{1,100}\.[0-9]{2}" | awk '{ SUM += $1} END { print ( SUM/1024/1024 )" GB" }'

Get total memory size of minions in the “/root/list” containing short hostnames

salt --timeout=30 --output=txt -L "`sed -e 's/$/\.domain\.local/' /root/list`" cmd.run "free -k 2>/dev/null | grep ^Mem:" 2>/dev/null | cut -d ' ' -f3- | awk '{ SUM += $1} END { print ( SUM/1024/1024 )" GB" }'

Get the total number of CPU cores of minions in the “/root/list” containing short hostnames

salt --timeout=30 --output=txt -L "`sed -e 's/$/\.domain\.local/' /root/list`" cmd.run "cat /proc/cpuinfo" 2>/dev/null | cut -d ' ' -f2- | grep -c ^processor | awk '{ SUM += $1} END { print ( SUM )" cores" }'

Get a list of mounted NFS shares for minions in the “/root/list” containing short hostnames

salt --timeout=30 --output=txt -L "`sed -e 's/$/\.domain\.local/' /root/list`" cmd.run "cat /etc/mtab" 2>/dev/null | grep " nfs " | awk -F',' '{print $1}' | awk '{print $1" "$2" "$3" "$5}' | sed -e 's/:\s/\t/g' | column -t | sort -u

Display HP ILO IP address on all physical hosts using ipmitool

salt --output=txt -G "virtual:physical" cmd.run "ipmitool lan print 2>/dev/null | grep -E '^IP Address\s\s' | awk '{print \$NF}'" 2>/dev/null

Show top ten UAT nodes by CPU utilization for the “java” process running under “weblogic” username

salt --output=txt "uat*" cmd.run "top -b -n 1 -d 1 -u weblogic 2>/dev/null | grep [j]ava" 2>/dev/null | column -t | sort -k10 -rn | head -10

Similar to the previous example, but sorted by memory utilization

salt --output=txt "uat*" cmd.run "top -b -n 1 -d 1 -u weblogic 2>/dev/null | grep [j]ava" 2>/dev/null | column -t | sort -k11 -rh | head -10

Copying Files and Folders

Copy a single file from the Salt master to minions

The source location must be inside the Salt file server root (i.e., /srv/salt/). Also, the target folder must already exist on the minions (in this example, /var/tmp). And keep in mind that, for security reasons, any file copied from the Salt master to a minion will have its permissions set to 644. You will need to run another Salt command to set the desired permissions on the file, as shown after the copy command below.

salt "prod-weblogic*" cp.get_file salt://scripts/app_migrate.sh /var/tmp/app_migration.sh

Copy a directory with contents from master to minions

This example will copy the contents of /srv/salt/myfiles/ to /tmp/myfiles on the minions:

salt 'prod-weblogic*' cp.get_dir salt://myfiles /tmp

Copy contents of a directory from minions to the master

The data you copy will end up on the Salt master in /var/cache/salt/master/minions/<minion_id>/files/folder/on/minions (for the example below), so be mindful of the available disk space.

salt "*" cp.push_dir /folder/on/minions/

Running Scripts via Salt

The scripts are usually located on the Salt master server in /srv/salt (on RHEL). They don’t need to be executable.

Basic Syntax

Run the yum_health_check.sh script on all of the QA servers:

salt --timeout=30 --output=txt "qa*" cmd.script "salt://scripts/yum_health_check.sh"

Cleaning up output

Text output of cmd.script is very busy and not very readable. I found it useful to create the following /usr/bin/clean helper script that will sanitize the output of cmd.script:

#!/bin/bash
sed -r "s/: \{u'/^/g" | sed -r "s/u'stdout': u'/^/g" | sed -r "s/'}$/^/g" | \
sed -r "s/\\t/ /g" | sed -r "s/\\n/{/g" | grep -v ', ^^' | \
awk -F'^' '{print $1" "$3}' | while read line; do
    if [ $(echo "${line}" | grep -c '{') -gt 0 ]; then
        h="$(echo ${line} | awk '{print $1}')"
        echo "${line}" | \
        awk '{ s = ""; for (i = 2; i <= NF; i++) s = s $i " "; print s }' | \ sed 's/{/\n/g' | while read line2; do echo "${h} ${line2}"; done else echo "${line}" fi done | sort -k1V | column -t 2>/dev/null

And here’s how to use it with a script:

salt "qa*" cmd.script "salt://scripts/yum_health_check.sh" | clean

Passing arguments to a script

In this example, the app_find.sh script will try to locate an application called “webstore01” on various WebLogic servers:

salt --timeout=30 --output=txt "[pdqu]l*weblogic*" cmd.script "salt://scripts/app_find.sh" args="webstore01" | clean

Salt-Cloud Operations

This is a brief listing of the more common salt-cloud commands, with some VMWare-specific examples. You can find the full list of salt-cloud commands in the official SaltStack documentation.

Working with VMWare

The primary configuration file that allows salt-cloud to interact with VMWare is usually located in /etc/salt/cloud.providers.d/vmware.conf and looks something like this:

vcenter01:
  driver: vmware
  user: 'DOMAIN\vCenterSvcAccount'
  password: 'P@ssw0rd1'
  url: 'vcenter01.domain.local'
  protocol: 'https'
  port: 443
  esxi_host_user: 'root'
  esxi_host_password: 'P@ssw0rd2'

vcenter02:
  driver: vmware
  user: 'DOMAIN\vCenterSvcAccount'
  password: 'P@ssw0rd1'
  url: 'vcenter02.domain.local'
  protocol: 'https'
  port: 443
  esxi_host_user: 'root'
  esxi_host_password: 'P@ssw0rd2'

There is one important detail to keep in mind: with VMWare, salt-cloud has two modes of operation – vmware and vsphere. The two are similar for the most part; however, in the vsphere mode you need to specify the name of the vCenter, while in the vmware mode the desired operation will be performed on all configured vCenters.

Let’s say you want to create a snapshot of a VM called prod-tomcat-test-01. The following command will do this for you:

salt-cloud -a create_snapshot prod-tomcat-test-01 snapshot_name="prod-tomcat-test-01_$(date +'%F')" description="Test snapshot made by $(whoami)@$(hostname) on $(date +'%F')"

Notice how I did not specify the name of the vCenter. Salt already knows which vCenter has this VM. However, if more than one vCenter has a VM with that name, snapshots will be created for all VMs matching the name.

List configured cloud providers

salt-cloud --out=json --list-providers

List clusters in a vCenter

salt-cloud --out=compact -f list_clusters <vcenter>

List VM build profiles

salt-cloud --out=nested --list-profiles <vcenter>

List VMs in a vCenter

salt-cloud --out=nested --query 

List all VM snapshots in a vCenter

salt-cloud --out=json -f list_snapshots <vcenter>

Use jq to parse output

Similar to the previous example, but we extract only the names of VMs that have snapshots.

salt-cloud --out=json -f list_snapshots <vcenter> | jq -r '.[]|.[]|keys' | grep -oP "(?<=\").*(?=\",)"

List snapshots for a particular VM

salt-cloud --out=json -f list_snapshots <vcenter> name="<vm_name>"

Create a snapshot

Note: by default, salt-cloud will prompt you for confirmation before modifying a VM’s configuration state, a snapshot, an ESX host, a cluster, or a vCenter. You can use – carefully – the --assume-yes or -y option to assume an affirmative response and proceed with the operation.

salt-cloud --out=json -y -a create_snapshot <vm_name> snapshot_name="before_patching_$(date +'%F')" description="Test snapshot made by $(whoami)@$(hostname) on $(date +'%F')"

Revert to a snapshot

salt-cloud --out=json -y -a revert_to_snapshot <vm_name> snapshot_name="<snapshot_name>"

Merge all snapshots for a VM

salt-cloud --out=json -y -a remove_all_snapshots <vm_name> merge_snapshots=True