Being a shutterbug and a digital hoarder can get expensive. A couple of days ago, my
TerraMaster D5-300 5-bay DAS crashed. Again. This time it was the power supply. I replaced it, only to discover that my RAID 5 volume was toast: the array was still rebuilding the drive I had replaced the other week when the power supply died and took another disk in the set down with it.

Every TerraMaster doohickey I ever bought was garbage, and I should’ve returned this one the day it arrived from Amazon, when I discovered it had a wobbly USB-C port. At least I was smart enough not to put anything important there. The “replacement” I just ordered – a QNAP TVS-872X 8-bay NAS with 8x16TB disks – is not exactly cheap. So why do I need this much storage anyway?

Well, just one iPhone ProRAW photo is over 25MB on average. RAW format photos from my Nikon and Fujifilm are at least twice that. A single minute of 4K video is roughly 5.2GB (that’s about 312GB for an hour, if you’re counting). And, as I said, I don’t like deleting things (which is really the main problem here).

With a resolution of 100MP, the Fujifilm GFX 100 creates 230MB uncompressed RAW files.

Then you need to triple or quadruple the storage requirements to accommodate all the versions of the original photos and videos processed in Photoshop, Premiere Pro, After Effects, etc. Finally, I have too many years of sysadmin experience to keep all my stuff on a single NAS, even if it’s RAID 10, so I need at least two. Plus, an offsite system.

I inevitably end up with many duplicate photos floating around. Stuff that I processed on my laptop, then edited a copy on my iPad, then made a small correction on my other laptop and saved an extra copy, and so on.

Some commercial tools can help you hunt down and eliminate duplicate photos. I recently purchased and tested one such utility – Duplicate Photos Fixer Pro. It was OK, but it had a hard time telling an original photo apart from a version that had its levels adjusted in Photoshop. That’s an obvious problem for me, so I decided to write a script instead.

Before I was willing to unleash the script on my photo collection, I figured I should try something less ambitious as a test. I have over a thousand photos posted on my Instagram. On occasion, I stumble onto a cool photo I took some time ago, but I am unsure if I already posted it and just forgot. Scrolling through a thousand tiny thumbnails is not my idea of fun.

The solution was to scrape the Instagram feed and then use my script to see if there is a match between the photo I want to post and the ones I already posted. For downloading my own Instagram photos, I used the instagram-scraper on my WSL Ubuntu:

pip3 install instagram-scraper

The syntax is simple:

instagram-scraper "igor_os777" --cookiejar /mnt/c/zip/instagram_downloads/cookiejar.txt -t image -u "igor_os777" -p "*****************" -d /mnt/c/zip/instagram_downloads/igor_os777/

Then I created another folder (/mnt/c/zip/instagram_downloads/reference, in the example below) containing a few photos I wanted to post and used my script to check whether I had posted any of them before:

image-compare /mnt/c/zip/instagram_downloads/reference /mnt/c/zip/instagram_downloads/igor_os777

Here’s what the script does: the first step is to create thumbnails of all photos in both folders. Resizing the images somewhat lowers comparison accuracy but dramatically speeds up processing. The thumbnails are sorted by modification timestamp (the original timestamp is preserved when each thumbnail is created), with the newest photos examined first. Since a photo I’m about to post was usually taken recently, checking the newest posts first can sometimes find a match sooner.
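The script itself isn’t listed here, but the thumbnailing step can be sketched roughly like this, assuming ImageMagick’s `convert` is installed. The source images and their dates are generated inside the sketch purely for illustration:

```shell
#!/usr/bin/env bash
# Sketch of the thumbnail step: shrink each photo to 150x150 and copy the
# original modification time onto the thumbnail so sorting still works.
src=$(mktemp -d)   # stand-in for the real photo folder
out=$(mktemp -d)   # thumbnails go here

# Fabricated sample photos with different modification times
convert -size 640x480 xc:skyblue "$src/old.jpg"
touch -d '2020-01-01 10:00:00' "$src/old.jpg"
convert -size 640x480 xc:salmon "$src/new.jpg"
touch -d '2021-09-19 13:39:00' "$src/new.jpg"

for f in "$src"/*.jpg; do
    thumb="$out/$(basename "$f")"
    convert "$f" -thumbnail 150x150 "$thumb"   # shrink for fast comparison
    touch -r "$f" "$thumb"                     # preserve the original mtime
done

# Newest first, so recent photos are compared before old ones
ls -t "$out"
```

The `touch -r` trick is what keeps the newest-first ordering meaningful after the thumbnails are created.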

The script goes to work, and, in this example, it compares 12 photos I would like to post to 1036 photos I posted in the past. This process will take some time.

[root:/tmp] # image-compare /mnt/c/zip/instagram_downloads/reference /mnt/c/zip/instagram_downloads/igor_os777

Resizing photos in /mnt/c/zip/instagram_downloads/reference
Resizing photos in /mnt/c/zip/instagram_downloads/igor_os777
Comparing photo 09/12 to 0005/1036

Once all the photos have been processed, the script’s findings are saved to a temporary file and printed to the console. In this case, the script found two photos that I haven’t posted to Instagram yet:

Matching photos in /tmp/tmp.uot4VN0OKB:

/mnt/c/zip/instagram_downloads/reference/2021-09-19_13-39-00_IMG_1840-Edit-Edit-Edit-Edit-Edit.jpg  /mnt/c/zip/instagram_downloads/igor_os777/242367166_617039812793689_4728770883374369021_n.jpg
/mnt/c/zip/instagram_downloads/reference/DSCF3198.jpg                                               /mnt/c/zip/instagram_downloads/igor_os777/241711773_220123510091396_3639455831665287351_n.jpg
/mnt/c/zip/instagram_downloads/reference/DSCF3226.jpg                                               /mnt/c/zip/instagram_downloads/igor_os777/241534051_546450486470600_1562021441681046307_n.jpg
/mnt/c/zip/instagram_downloads/reference/DSCF3470-HDR.jpg                                           /mnt/c/zip/instagram_downloads/igor_os777/242299510_1764935730374174_4554904161948082469_n.jpg
/mnt/c/zip/instagram_downloads/reference/DSCF3473-HDR.jpg                                           /mnt/c/zip/instagram_downloads/igor_os777/242211240_388027456124622_2677077871810538266_n.jpg
/mnt/c/zip/instagram_downloads/reference/DSCF3491.jpg                                               /mnt/c/zip/instagram_downloads/igor_os777/241998262_379625187086227_7926922993137434281_n.jpg
/mnt/c/zip/instagram_downloads/reference/DSCF3496-Edit.jpg                                          /mnt/c/zip/instagram_downloads/igor_os777/241416012_1200120437079792_8577750566156040991_n.jpg
/mnt/c/zip/instagram_downloads/reference/DSCF3511-Edit-Edit-Edit.jpg                                /mnt/c/zip/instagram_downloads/igor_os777/241836399_562224451499860_3636591261485964266_n.jpg
/mnt/c/zip/instagram_downloads/reference/DSCF3559-HDR.jpg                                           /mnt/c/zip/instagram_downloads/igor_os777/242634080_4377870348965297_6992672767669165606_n.jpg
/mnt/c/zip/instagram_downloads/reference/IMG_1456-Enhanced-Edit-Edits.jpg                           /mnt/c/zip/instagram_downloads/igor_os777/242594762_904931250375633_4985131236172892403_n.jpg


Unique photos in /mnt/c/zip/instagram_downloads/reference:

/mnt/c/zip/instagram_downloads/reference/DSCF3141.jpg
/mnt/c/zip/instagram_downloads/reference/DSCF3531-Edit.jpg

I validated this result manually, and I’m proud to say that the script was right on the money. The next step was to see how the script coped with various post-processed versions of the same original photo. Comparing visually similar images is where the commercial tool I tried made a hash of things, telling me to delete a bunch of photos I had spent days perfecting.

The test folder in this example contains six photos – the original RAW format DNG image taken with an iPhone 12 Pro Max and several versions post-processed in Topaz Sharpen AI, Lightroom, and Photoshop:

# Original photo
original.DNG

# 16:9 crop of the original
original-cropped.dng

# Processed in Topaz Sharpen AI
original-cropped-edit1.tif

# Processed in Lightroom and Photoshop
original-cropped-edit2.tif

# Same as previous but converted to JPEG
original-cropped-edit2.jpg

# Same as previous but resized for Web use
original-cropped-edit2-resized.jpg

Visually, there are four distinct versions: the original, the crop, edit1, and edit2. So ideally, my script should tell me to keep at least those four.

Highlighted are the four photos I would like to keep. The remaining two are just lower resolution JPEG versions and are disposable.

The trick to comparing photos in the same folder is to list the same folder twice: both as the source and the target for comparison. Here’s the result:

[root:/tmp] # image-compare /mnt/c/zip/test/similar_photos /mnt/c/zip/test/similar_photos
Resizing photos in /mnt/c/zip/test/similar_photos
Resizing photos in /mnt/c/zip/test/similar_photos
Comparing photo 6/6 to 5/6


Matching photos in /tmp/tmp.8Ji0Pk4Bhr:

/mnt/c/zip/test/similar_photos/original-cropped-edit2-resized.jpg  /mnt/c/zip/test/similar_photos/original-cropped-edit2.jpg
/mnt/c/zip/test/similar_photos/original-cropped-edit2.jpg          /mnt/c/zip/test/similar_photos/original-cropped-edit2-resized.jpg
/mnt/c/zip/test/similar_photos/original-cropped-edit2.tif          /mnt/c/zip/test/similar_photos/original-cropped-edit2-resized.jpg


Unique photos in /mnt/c/zip/test/similar_photos:

/mnt/c/zip/test/similar_photos/original-cropped.dng
/mnt/c/zip/test/similar_photos/original-cropped-edit1.tif
/mnt/c/zip/test/similar_photos/original-cropped-edit2.tif
/mnt/c/zip/test/similar_photos/original.DNG

As you can see in the list of matching photos, some files appear multiple times, while others appear only once. The latter are added to the list of unique photos and should not be deleted. In this case, the result is exactly what I was hoping to see: if I delete the two images not on the “unique” list, I won’t lose anything I can’t easily recreate.
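That bookkeeping step can be reconstructed in a few lines of shell. This is a hypothetical sketch, not the script’s actual code: it assumes the match list is a two-column file of space-separated paths, and all the filenames below are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: a file mentioned at most once in the match list is "unique".
matches=$(mktemp)
cat > "$matches" <<'EOF'
/photos/edit2-resized.jpg /photos/edit2.jpg
/photos/edit2.jpg /photos/edit2-resized.jpg
/photos/edit2.tif /photos/edit2-resized.jpg
EOF

# Candidate files (would normally come from listing the folder)
files="/photos/original.DNG /photos/edit2.tif /photos/edit2.jpg /photos/edit2-resized.jpg"

unique=""
for f in $files; do
    n=$(grep -c -F "$f" "$matches")   # match-list lines mentioning $f
    if [ "$n" -le 1 ]; then           # 0 or 1 mentions -> keep it
        unique="$unique $f"
    fi
done
echo "Unique photos:$unique"
```

Run against the fabricated list above, this keeps original.DNG (never matched) and edit2.tif (matched only once), mirroring the behavior shown in the output.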

I have to say, I am impressed. Not so much with my scripting abilities, but with how well ImageMagick’s compare function works. Keep in mind that it was comparing 150×150 thumbnails!
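For anyone curious what that pairwise check looks like, here is a minimal sketch using ImageMagick’s `compare` with the RMSE metric (the metric choice and the 0.05 threshold are my assumptions, not necessarily what the script uses; the test images are generated on the fly):

```shell
#!/usr/bin/env bash
# Sketch: measure how different two images are with ImageMagick's compare.
work=$(mktemp -d)
convert -size 64x64 xc:gray  "$work/a.png"
convert -size 64x64 xc:gray  "$work/b.png"   # identical to a.png
convert -size 64x64 xc:white "$work/c.png"   # clearly different

# compare prints "absolute (normalized)" RMSE to stderr; 0 means identical
rmse_ab=$(compare -metric RMSE "$work/a.png" "$work/b.png" null: 2>&1)
rmse_ac=$(compare -metric RMSE "$work/a.png" "$work/c.png" null: 2>&1)
echo "a vs b: $rmse_ab"
echo "a vs c: $rmse_ac"

# Pull out the normalized value and apply a small "same photo" threshold
norm=$(echo "$rmse_ab" | tr -d '()' | awk '{print $2}')
if awk -v n="$norm" 'BEGIN { exit !(n < 0.05) }'; then
    echo "a and b look like the same photo"
fi
```

A fuzzy metric like RMSE is what makes this approach tolerant of re-encoding and mild edits, where a simple checksum comparison would see every file as unique.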

This script can be made much more robust (and I’m working on the Mk2 version) by taking file sizes and various EXIF data into consideration. When deciding which version of a photo to keep, the script could also give preference to lossless formats. I’ll post an update when I have something worth sharing.