This is an update of the script I originally wrote five years ago and used to migrate many terabytes of production data between two NAS systems. What’s new: more efficient subfolder crawling, a more effective way of launching rsync threads, and the ability to specify options on the command line.

Here’s the problem with rsync: it is a single-threaded process that needs to crawl the source and destination directories in their entirety, build lists of folders and files, compare them, and then transfer the discovered items one by one.

This is not an issue when files are few and large. However, when files are many, small, and spread throughout a deep directory structure, rsync grinds to a virtual halt. You may have a 10-gigabit network and rsync appears to be busy moving files, yet your network utilization is a tiny fraction of the available bandwidth. The reason lies in rsync's single-threaded nature.

The workaround I am suggesting is essentially to launch a separate rsync for every subfolder down to a certain depth, and then one more rsync to pick up whatever files were left above that depth. The script includes flow control that checks how many cores your system has and keeps the number of rsync processes running at any given time to a reasonable limit, so it doesn't overwhelm your machine.
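To make the idea concrete, here is a minimal sketch of the approach. It is not the actual script: the depth, the paths, and the crude job throttling below are assumptions for illustration only.

```bash
#!/bin/bash
# Minimal sketch of the per-subfolder approach (illustrative, not the real script).
# SRC, DST and DEPTH are placeholders.
SRC=/tmp/rsync-test-src
DST=/tmp/rsync-test-dst
DEPTH=2                 # split point: one rsync per directory at this depth
MAXJOBS=$(nproc)        # keep the number of concurrent rsyncs near the core count

# Launch one background rsync per subfolder at the chosen depth.
while IFS= read -r -d '' dir; do
  rel=${dir#"$SRC"/}
  mkdir -p "$DST/$rel"
  rsync -a "$dir/" "$DST/$rel/" &
  # Crude flow control: pause while MAXJOBS rsyncs are already running.
  while [ "$(jobs -rp | wc -l)" -ge "$MAXJOBS" ]; do
    sleep 1
  done
done < <(find "$SRC" -mindepth "$DEPTH" -maxdepth "$DEPTH" -type d -print0)
wait

# One more rsync for the files that live above the split point
# (empty directories above the split are not recreated in this sketch).
find "$SRC" -maxdepth "$DEPTH" -type f -printf '%P\0' |
  rsync -a --files-from=- --from0 "$SRC/" "$DST/"
```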

The script is below, and you can also download it here. Save it and create a convenient link at /usr/bin/rsync-parallel. Here’s how to use it:
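The option names below are placeholders rather than the script's actual flags (check the script's header for the real ones); the general shape of an invocation mirrors rsync itself, source first, then destination:

```bash
# Illustrative invocation; -l and -o are placeholder option names, not
# necessarily the script's real flags.
#   -l  depth at which to split into per-subfolder rsync processes
#   -o  options passed through to each underlying rsync
rsync-parallel -l 2 -o "-a" /mnt/source-nas/data/ /mnt/target-nas/data/
```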

One thing to remember: the --delete option, or any of its variations, will not work with this script. The purpose of this script is initial synchronization. However, you can use the following rsync syntax to delete items from the destination only if they were removed from the source. This will only delete; it will not copy anything new:
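A commonly used form for such a delete-only pass combines --delete with --existing and --ignore-existing, along these lines:

```bash
# Delete-only pass: --existing skips creating new files on the target,
# --ignore-existing skips updating files already there, so the only
# remaining effect is --delete removing items that are gone from the source.
rsync -a --delete --existing --ignore-existing /mnt/source-nas/data/ /mnt/target-nas/data/
```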

As a test, I created a dummy folder structure (1111 folders) with some files (110200 files) using this script:
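A generator along these lines builds a three-level tree of small files; the layout and exact counts of my script differ somewhat, so treat this as a sketch with placeholder paths:

```bash
#!/bin/bash
# Sketch of a dummy-tree generator (the original script's exact layout and
# counts differ). Builds a 10x10x10 directory tree with 100 small files
# in each leaf folder.
BASE=/tmp/rsync-test-src
for i in $(seq 1 10); do
  for j in $(seq 1 10); do
    for k in $(seq 1 10); do
      d="$BASE/dir$i/dir$j/dir$k"
      mkdir -p "$d"
      for f in $(seq 1 100); do
        head -c 1024 /dev/urandom > "$d/file$f.dat"
      done
    done
  done
done
```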

First, using just rsync:
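A plain single-process baseline, using the placeholder paths from the sketch above:

```bash
time rsync -a /tmp/rsync-test-src/ /tmp/rsync-test-dst/
```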

Making sure everything is there:
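A simple check is to compare folder and file counts on both sides:

```bash
find /tmp/rsync-test-src -type d | wc -l ; find /tmp/rsync-test-dst -type d | wc -l
find /tmp/rsync-test-src -type f | wc -l ; find /tmp/rsync-test-dst -type f | wc -l
```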

Now I remove everything from the target and repeat the process, this time using the script:
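Something along these lines, again with placeholder paths and option names:

```bash
rm -rf /tmp/rsync-test-dst/*
# Option names are placeholders, as noted above.
time rsync-parallel -l 2 -o "-a" /tmp/rsync-test-src/ /tmp/rsync-test-dst/
```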

Again, to make sure everything is there:
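For example, a recursive diff of the two trees should come back empty:

```bash
diff -qr /tmp/rsync-test-src /tmp/rsync-test-dst && echo "source and target match"
```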

So the script was about ten times faster than rsync by itself. Keep in mind that in this example both source and target were local filesystems. If even one of them had been NFS-mounted, the script's time advantage would have been even greater.

 
