Originally published January 27, 2019 @ 5:01 pm

For some reason I haven’t used zip much on Linux, sticking to the standard tar/gzip combo. But zip seems to be a viable alternative. While not as space-efficient, it is definitely faster; syntax is simple; and, if you need to share the archive with a Windows user, he doesn’t have to Google what the heck is “tar.gz”.

Having said that, this post is not so much about zip and how it compares to more commonly-used Linux CLI archiving tools. This post is more about cleverish use of Bash arrays holding PIDs and parallel, so that’s really what you want to pay attention to in the script below. (And you can also get it here.)

#!/bin/bash
echo "$(grep -c ^proc /proc/cpuinfo) x$(grep -m1 ^model.name /proc/cpuinfo | awk -F: '{print $2}')"

echo "Making base folder for our test and create a temporary file"
d=/var/tmp/test
mkdir -p $d
cd $d
f=$(mktemp)

echo "Downloading a large text file into the temporary file"
curl -k -s0 -q https://norvig.com/big.txt > $f

echo "Creating a folder structure populated with files, each containing 128KB of random text"
for i in $(seq -w 01 10)
do
  mkdir -p dir_${i}
  echo "Populating dir_${i}"
  for j in $(seq -w 001 100)
  do
    { head -c 128KB <(shuf -n 10000 $f) > ./dir_${i}/file_${j} & } 2>/dev/null 1>&2
    pids+=($!)
  done
done

for pid in ${pids[*]}
do
  wait ${pid} 2>/dev/null 1>&2
done

echo -n "Determine the number of parallel threads based on the available cores: "
p=$(grep -c proc /proc/cpuinfo)
echo $p

echo ""
echo "Running a test with zip"
echo "Before: $(du -s . | awk '{print $1}')"

find . -maxdepth 1 -mindepth 1 -type d -print0 | \
{ time parallel --will-cite --gnu --null -j $p 'zip -r -q {}{.zip,} && /bin/rm -r {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to compress: "$2}'

echo "After: $(du -s . | awk '{print $1}')"
ls *zip | \
{ time parallel --will-cite --gnu -j $(grep -c proc /proc/cpuinfo) 'unzip -q {} && /bin/rm {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to uncompress: "$2}'

echo ""
echo "Running a test with tar/gzip"
echo "Before: $(du -s . | awk '{print $1}')"

find . -maxdepth 1 -mindepth 1 -type d -print0 | \
{ time parallel --will-cite --gnu --null -j $p 'GZIP=-9 tar cfz {}{.tgz,} && /bin/rm -r {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to compress: "$2}'

echo "After: $(du -s . | awk '{print $1}')"
ls *tgz | \
{ time parallel --will-cite --gnu -j $(grep -c proc /proc/cpuinfo) 'tar xfz {} && /bin/rm {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to uncompress: "$2}'

echo ""
echo "Running a test with tar/bzip2"
echo "Before: $(du -s . | awk '{print $1}')"
find . -maxdepth 1 -mindepth 1 -type d -print0 | \
{ time parallel --will-cite --gnu --null -j $p 'BZIP=-9 tar cfj {}{.tbz,} && /bin/rm -r {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to compress: "$2}'

echo "After: $(du -s . | awk '{print $1}')"
ls *tbz | \
{ time parallel --will-cite --gnu -j $(grep -c proc /proc/cpuinfo) 'tar xfj {} && /bin/rm {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to uncompress: "$2}'

echo ""
echo "Running a test with tar/pigz"
echo "Before: $(du -s . | awk '{print $1}')"
find . -maxdepth 1 -mindepth 1 -type d -print0 | \
{ time parallel --will-cite --gnu --null -j $p 'tar cf - {} | pigz -9 -p $p > {}.tar.gz }' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to compress: "$2}'

echo "After: $(du -s . | awk '{print $1}')"
ls *tar.gz | \
{ time parallel --will-cite --gnu -j $(grep -c proc /proc/cpuinfo) 'tar xfz {} && /bin/rm {}' >/dev/null; } 2>&1 | \
grep real | awk '{print "Time to uncompress: "$2}'

echo ""
echo "Removing test folder"
/bin/rm -r $d