This is an update of the script I originally wrote five years ago and used to migrate many terabytes of production data between two NAS systems. What’s new: more efficient subfolder crawling, a more effective way of launching rsync threads, and the ability to specify options on the command line.

Here’s the problem with rsync: it is a single-threaded process that needs to crawl the source and destination directories in their entirety, build lists of folders and files, compare them, and then transfer the discovered items one by one.

This is not an issue when files are few and large. However, when files are many, small, and spread throughout a deep directory structure, rsync grinds to a virtual halt. You may have a 10-gigabit network and rsync appears to be busy moving files, yet your network utilization is a tiny fraction of the available bandwidth. The reason lies in rsync's single-threaded nature.

The workaround I am suggesting is essentially to launch a separate rsync for every subfolder down to a certain depth, and then one more rsync to pick up whatever files were left above that depth. The script includes flow control that checks how many cores your system has and keeps the number of rsync processes running at any given time to a reasonable limit, so it doesn't overwhelm your machine.
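To make the idea concrete, here is a minimal sketch of the approach. It is not the actual script: the depth, the paths, and the crude job throttling below are assumptions for illustration only.

```bash
#!/bin/bash
# Minimal sketch of the per-subfolder approach (illustrative, not the real script).
# SRC, DST and DEPTH are placeholders.
SRC=/tmp/rsync-test-src
DST=/tmp/rsync-test-dst
DEPTH=2                 # split point: one rsync per directory at this depth
MAXJOBS=$(nproc)        # keep the number of concurrent rsyncs near the core count

# Launch one background rsync per subfolder at the chosen depth.
while IFS= read -r -d '' dir; do
  rel=${dir#"$SRC"/}
  mkdir -p "$DST/$rel"
  rsync -a "$dir/" "$DST/$rel/" &
  # Crude flow control: pause while MAXJOBS rsyncs are already running.
  while [ "$(jobs -rp | wc -l)" -ge "$MAXJOBS" ]; do
    sleep 1
  done
done < <(find "$SRC" -mindepth "$DEPTH" -maxdepth "$DEPTH" -type d -print0)
wait

# One more rsync for the files that live above the split point
# (empty directories above the split are not recreated in this sketch).
find "$SRC" -maxdepth "$DEPTH" -type f -printf '%P\0' |
  rsync -a --files-from=- --from0 "$SRC/" "$DST/"
```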

The script is below, and you can also download it here. Save it and create a convenient link at /usr/bin/rsync-parallel. Here’s how to use it:
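The option names below are placeholders rather than the script's actual flags (check the script's header for the real ones); the general shape of an invocation mirrors rsync itself, source first, then destination:

```bash
# Illustrative invocation; -l and -o are placeholder option names, not
# necessarily the script's real flags.
#   -l  depth at which to split into per-subfolder rsync processes
#   -o  options passed through to each underlying rsync
rsync-parallel -l 2 -o "-a" /mnt/source-nas/data/ /mnt/target-nas/data/
```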

One thing to remember: the --delete option, or any of its variations, will not work with this script. The purpose of this script is initial synchronization. However, you can use the following rsync syntax to delete items from the destination only if they were removed from the source. This will only delete; it will not copy anything new:
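A commonly used form for such a delete-only pass combines --delete with --existing and --ignore-existing, along these lines:

```bash
# Delete-only pass: --existing skips creating new files on the target,
# --ignore-existing skips updating files already there, so the only
# remaining effect is --delete removing items that are gone from the source.
rsync -a --delete --existing --ignore-existing /mnt/source-nas/data/ /mnt/target-nas/data/
```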

As a test, I created a dummy folder structure (1111 folders) with some files (110200 files) using this script:
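A generator along these lines builds a three-level tree of small files; the layout and exact counts of my script differ somewhat, so treat this as a sketch with placeholder paths:

```bash
#!/bin/bash
# Sketch of a dummy-tree generator (the original script's exact layout and
# counts differ). Builds a 10x10x10 directory tree with 100 small files
# in each leaf folder.
BASE=/tmp/rsync-test-src
for i in $(seq 1 10); do
  for j in $(seq 1 10); do
    for k in $(seq 1 10); do
      d="$BASE/dir$i/dir$j/dir$k"
      mkdir -p "$d"
      for f in $(seq 1 100); do
        head -c 1024 /dev/urandom > "$d/file$f.dat"
      done
    done
  done
done
```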

First, using just rsync:
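A plain single-process baseline, using the placeholder paths from the sketch above:

```bash
time rsync -a /tmp/rsync-test-src/ /tmp/rsync-test-dst/
```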

Making sure everything is there:
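A simple check is to compare folder and file counts on both sides:

```bash
find /tmp/rsync-test-src -type d | wc -l ; find /tmp/rsync-test-dst -type d | wc -l
find /tmp/rsync-test-src -type f | wc -l ; find /tmp/rsync-test-dst -type f | wc -l
```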

Now I remove everything from the target and repeat the process, this time using the script:
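Something along these lines, again with placeholder paths and option names:

```bash
rm -rf /tmp/rsync-test-dst/*
# Option names are placeholders, as noted above.
time rsync-parallel -l 2 -o "-a" /tmp/rsync-test-src/ /tmp/rsync-test-dst/
```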

Again, to make sure everything is there:
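For example, a recursive diff of the two trees should come back empty:

```bash
diff -qr /tmp/rsync-test-src /tmp/rsync-test-dst && echo "source and target match"
```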

So the script was about ten times faster than rsync by itself. Keep in mind that in this example both source and target were local filesystems. If even one of them had been NFS-mounted, the script's time advantage would have been even greater.

 
