
Automating Web Page Screenshots

July 13, 2024


Originally published July 13, 2018 @ 2:25 pm

Automatic Web page screenshots are commonly used for various monitoring tasks, such as detecting copyright infringement, website defacement, and other legal and security matters. Automatic screenshotting can also be a useful tool for researchers, developers, and journalists. There are free and open-source tools out there that can help you automate the task of creating Web page screenshots.

Screenshot Tools

I’ll briefly go over the following tools: CutyCapt, wkhtmltoimage, Firefox, PhantomJS, and Pageres-CLI, with a bit more focus on the last two.

Some of these tools allow you to specify the user agent, which can give you some control over how the page will be rendered. You can get a very long list of various user agent strings here.

CutyCapt

Download it here. Installation is fairly straightforward. You do need an X11 server running on your machine to use CutyCapt. Alternatively, you can run it via xvfb-run, which is not difficult either.

One issue you may notice with CutyCapt is that it misses some dynamic content that would only be visible if you scrolled down the page. This utility has quite a few features, but scrolling ain’t one of them.

url="https://igoroseledko.com/the-future-of-spaghetti-code/"
ua="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"

cutycapt --url="${url}" --user-agent="${ua}" --max-wait=10000 --out-format=png --method=get --javascript=on --java=on --plugins=on --private-browsing=off --auto-load-images=on --js-can-open-windows=off --js-can-access-clipboard=off --zoom-factor=1 --delay=0 --print-backgrounds=on --min-width=1280 --min-height=1280 --out=screenshot_cutycapt.png 2>/dev/null
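On a server with no X11 display, the same capture can be wrapped in xvfb-run. The following is a minimal sketch, assuming xvfb-run (from the Xvfb package) is installed; the helper names needs_xvfb and cutycapt_headless and the virtual screen size are illustrative and not part of CutyCapt.

```shell
# Returns success (0) when no X display is available, so Xvfb is needed.
needs_xvfb() {
  [ -z "${DISPLAY:-}" ]
}

# Hypothetical helper: take a screenshot with CutyCapt, using Xvfb when needed.
cutycapt_headless() {
  local url="$1" out="$2"
  if needs_xvfb; then
    # --auto-servernum picks a free display number for the virtual X server
    xvfb-run --auto-servernum --server-args="-screen 0 1280x1280x24" \
      cutycapt --url="${url}" --out="${out}" --min-width=1280 --min-height=1280
  else
    cutycapt --url="${url}" --out="${out}" --min-width=1280 --min-height=1280
  fi
}
```

Usage would look like cutycapt_headless "${url}" screenshot_cutycapt.png; any of the other CutyCapt options shown above can be appended in the same way.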

wkhtmltoimage

This tool is part of the wkhtmltopdf package. It is quick and has a good selection of options. You can install it with sudo apt install wkhtmltopdf on Debian, or get a precompiled package here.

There are some issues getting certain page elements. You will notice some missing graphics in the sidebar and the footer. This is dynamic content that is loaded when you scroll down, which you can’t do in this case.

url="https://igoroseledko.com/the-future-of-spaghetti-code/"
ua="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"

wkhtmltoimage --custom-header-propagation --custom-header User-Agent "${ua}" --images --cache-dir /tmp --enable-javascript --no-stop-slow-scripts --javascript-delay 20000 --format jpg --quality 95 --load-error-handling ignore "${url}" screenshot_wkhtmltoimage.jpg 2>/dev/null

Firefox

With the Firefox screenshot CLI option, you don’t really get any options except the --window-size parameter. It is significantly slower than the other tools discussed here, and it only supports the PNG format. With my Firefox 61.0, screenshots are missing all graphics; I am not sure why.

An interesting option here is that Firefox is supposed to make use of any configured plugins when running in headless mode to take screenshots. It would’ve been more interesting if it actually worked. So why am I including Firefox here? Because I like Firefox and I am hoping its headless functionality will improve.

url="https://igoroseledko.com/the-future-of-spaghetti-code/"

firefox -screenshot screenshot_firefox.png "${url}" 2>/dev/null
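Since --window-size is the one knob available, here is a small sketch of how it might be passed. The function name firefox_shot and the default 1280x1280 size are assumptions for illustration; the width,height form of the parameter is what I understand Firefox to accept, so check it against your version.

```shell
# Hypothetical helper: Firefox screenshot with an explicit viewport size.
# Arguments: URL, output file, optional width and height (default 1280x1280).
firefox_shot() {
  local url="$1" out="$2" width="${3:-1280}" height="${4:-1280}"
  firefox -screenshot "${out}" "${url}" --window-size="${width},${height}"
}
```

For example, firefox_shot "${url}" screenshot_firefox.png 1440 1280 would request a 1440-pixel-wide viewport.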

Pageres-CLI

This utility uses PhantomJS on the back-end to create the actual screenshots. The plus here is you don’t have to write scripts to access many of the advanced functions of PhantomJS.

Many but not all, obviously. There’s this same limitation of rendering dynamic content down toward the bottom of the page. There may be an option for this, but I could not find it.

url="https://igoroseledko.com/the-future-of-spaghetti-code/"
ua="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"

pageres --useragent="${ua}" --delay=5 "${url}" --filename="<%= date %> - <%= url %>-${RANDOM}"

PhantomJS

This is a scriptable headless browser that can actually interact with Web pages and not just take screenshots. When this article was first published, PhantomJS was a popular choice for browser-based tests in continuous integration, as well as for writing all sorts of Web bots.

Because of its extended functionality, PhantomJS is a bit more difficult to use. First you need to create a JS file, which is a script telling PhantomJS how to load and render the Web page. For the example below I am using this ~/phantomjs_rasterize.js:

var page = require('webpage').create(),
    system = require('system'),
    address, output, size;
page.settings.resourceTimeout = 10000;
var userAgent = system.args[1];
page.settings.userAgent = userAgent;

// Arguments: [0] script, [1] user agent, [2] URL, [3] output file,
// [4] optional paper size or format (PDF output only), [5] optional zoom factor
if (system.args.length < 4 || system.args.length > 6) {
    console.log('Usage: rasterize.js useragent URL filename [paperwidth*paperheight|paperformat] [zoom]');
    console.log('  paper (pdf output) examples: "5in*7.5in", "10cm*20cm", "A4", "Letter"');
    phantom.exit(1);
} else {
    address = system.args[2];
    output = system.args[3];
    page.viewportSize = {
        width: 1440,
        height: 1280
    };
    // The paper size only applies when the output is a PDF and the argument was given
    if (system.args.length > 4 && system.args[3].substr(-4) === ".pdf") {
        size = system.args[4].split('*');
        page.paperSize = size.length === 2 ? {
                width: size[0],
                height: size[1],
                margin: '0px'
            } :
            {
                format: system.args[4],
                orientation: 'portrait',
                margin: '1cm'
            };
    }
    if (system.args.length > 5) {
        page.zoomFactor = system.args[5];
    }
    page.open(address, function(status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit(1);
        } else {
            page.evaluate(function() {
                //$("#overlay, #modal").remove();
                var current = 0,
                    delta = 1280,
                    total = document.body.scrollHeight - delta;
                var style = document.createElement('style'),
                    text = document.createTextNode('body { background-color: #ffffff; }');
                style.setAttribute('type', 'text/css');
                style.appendChild(text);
                document.head.insertBefore(style, document.head.firstChild);
                window.scrollTo(0, current);

                function fakeScroll() {
                    if (current < total) {
                        current = current + delta;
                        window.scrollTo(0, current);

                        window.setTimeout(fakeScroll, 200);
                    } else {
                        window.scrollTo(0, 0);
                    }
                }

                fakeScroll()
            });
            window.setTimeout(function() {
                page.render(output);
                phantom.exit();
            }, 10000);
        }
    });
}

To make things easier, PhantomJS comes with a bunch of handy example JS scripts that you can quickly adapt to your needs. There are also plenty of resources, such as GitHub, where you can find just about anything you need.

Unlike the previous tools, PhantomJS is able to successfully simulate scrolling and load dynamic content all the way to the bottom of the page (although it did take me a bit of googling and trial-and-error before I got this to work). So, some assembly required, but, overall, this is the best solution for the most accurate screenshots.

url="https://igoroseledko.com/the-future-of-spaghetti-code/"
ua="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"

phantomjs --load-images=true --local-storage-path=/tmp --disk-cache=true --disk-cache-path=/tmp --cookies-file=/tmp/phantomjs_cookies.txt --ignore-ssl-errors=true --ssl-protocol=any --web-security=true ~/phantomjs_rasterize.js "${ua}" "${url}" screenshot_phantomjs.png 2>/dev/null

Preserving for Posterity

Let’s say you have a list of URLs that you want to screenshot periodically. The following script will loop through your list of links; make screenshots using PhantomJS; add a comment to the image metadata; and insert a hidden watermark message using steghide. The latter is not foolproof, but can be useful if you need to check if the images have been tampered with.

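# Assumes these variables are already set: url_list (file with one URL per
# line), basedir (output directory), ua (user agent string), and pass
# (steghide passphrase).

function phantom_do() {
  phantomjs --load-images=true --local-storage-path=/tmp --disk-cache=true \
  --disk-cache-path=/tmp --cookies-file=/tmp/phantomjs_cookies.txt \
  --ignore-ssl-errors=true --ssl-protocol=any --web-security=true \
  ~/phantomjs_rasterize.js "${ua}" "${url}" "${screenshot_out}" 2>/dev/null
}

function convert_do() {
  # Write the comment text to a temp file; steghide_do embeds the same file
  f="$(mktemp)"
  echo "Screenshot of ${url} taken on $(date +'%Y-%m-%d %H:%M:%S')" > "${f}"
  convert -comment "$(cat "${f}")" "${screenshot_out}" "${screenshot_out}" 2>/dev/null
}

function steghide_do() {
  steghide embed -cf "${screenshot_out}" -ef "${f}" -p "${pass}" -q -f
  /bin/rm -f "${f}"
}

function runtime_do() {
  phantom_do
  convert_do
  steghide_do
}

while IFS= read -r url; do
  [ -z "${url}" ] && continue
  # Strip the protocol (http:// or https://) and any query string
  url_nopro=${url#*://}
  url_rel=${url_nopro%%\?*}
  # Drop a trailing slash so the directory and file name join cleanly
  urldir="${basedir}/${url_rel%/}"
  mkdir -p "${urldir}"
  screenshot_out="${urldir}/$(date +'%Y-%m-%d_%H%M%S').jpg"

  echo "${url}"
  # Each URL runs in a background subshell with its own copy of the variables
  runtime_do &
done < "${url_list}"

# Wait for all background jobs to finish
wait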
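The loop above starts one background job per URL with no upper limit, which can overwhelm a machine when the list is long. Below is a rough sketch of one way to cap the number of concurrent jobs in Bash. The names max_jobs, throttle, and run_throttled are hypothetical, and the limit of 4 is an arbitrary assumption.

```shell
#!/bin/bash
max_jobs=4   # assumed limit; tune to your CPU and bandwidth

# Block until fewer than max_jobs background jobs are running.
throttle() {
  while (( $(jobs -rp | wc -l) >= max_jobs )); do
    sleep 0.2
  done
}

# Hypothetical wrapper: run a command once per line of a file,
# keeping at most max_jobs of them running at a time.
run_throttled() {
  local list="$1"; shift
  local item
  while IFS= read -r item; do
    [ -z "${item}" ] && continue
    throttle
    "$@" "${item}" &
  done < "${list}"
  # Wait for the remaining background jobs before returning
  wait
}
```

To use it with the screenshot script, the per-URL work would need to be wrapped in a function that takes the URL as its argument, for example run_throttled "${url_list}" screenshot_one, where screenshot_one is a name you would define yourself.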

And, if you need to extract the watermark message, use this command:

steghide extract -sf "${screenshot_out}" -xf "${screenshot_out}.txt" -p "${pass}" -f -q
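The extraction step can be turned into a simple integrity check. This is only a sketch: the function name verify_screenshot is made up for illustration, and it assumes the hidden message has the "Screenshot of URL taken on DATE" format written by the script above.

```shell
# Hypothetical check: extract the hidden message and confirm it names the expected URL.
# Returns 0 if the message matches, 1 if it does not, 2 if extraction fails.
verify_screenshot() {
  local img="$1" pass="$2" url="$3"
  local tmp msg
  tmp="$(mktemp)" || return 2
  if ! steghide extract -sf "${img}" -xf "${tmp}" -p "${pass}" -f -q 2>/dev/null; then
    rm -f "${tmp}"
    return 2
  fi
  msg="$(cat "${tmp}")"
  rm -f "${tmp}"
  case "${msg}" in
    "Screenshot of ${url} taken on "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A failed extraction (return code 2) suggests the wrong passphrase or an image with no usable hidden message, for example one that was altered or re-encoded after the watermark was embedded.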

Comparing Screenshots

The ability to compare two or more screenshots may come in handy at some point. Probably the best tool for this is ImageMagick. In the example below I took two screenshots of the same URL some time apart. Running the following command produced a diff image:

compare -metric rmse -subimage-search screenshot_phantomjs_2.png screenshot_phantomjs_1.png -compose src screenshot_phantomjs_diff.png

Here are the two screenshots and the resulting diff image, showing what changed:

As you can see, the highlighted changes are due to various dynamic content, such as my Twitter feed. It is possible to configure the compare utility to look for differences only in a particular area of the image. In this case, I would like it to check for changes in the main body of the post, while ignoring changes in dynamic content.
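One way to limit the comparison to a particular area is to crop the same region out of both screenshots and compare only the crops. The sketch below takes that approach with ImageMagick's convert and compare; the helper names region_geometry and compare_region are hypothetical, and the region itself has to be measured on your own screenshots.

```shell
# Build an ImageMagick geometry string (WIDTHxHEIGHT+X+Y) for the region of interest.
region_geometry() {
  printf '%dx%d+%d+%d' "$1" "$2" "$3" "$4"
}

# Hypothetical helper: crop the same region from two screenshots and compare the crops.
# Returns compare's exit status (0 = similar, 1 = dissimilar) or 2 on any error.
compare_region() {
  local img1="$1" img2="$2" geom="$3" diff="$4"
  local tmpdir rc
  tmpdir="$(mktemp -d)" || return 2
  if convert "${img1}" -crop "${geom}" +repage "${tmpdir}/a.png" &&
     convert "${img2}" -crop "${geom}" +repage "${tmpdir}/b.png"; then
    # compare prints the metric value to standard error
    compare -metric rmse "${tmpdir}/a.png" "${tmpdir}/b.png" "${diff}"
    rc=$?
  else
    rc=2
  fi
  rm -rf "${tmpdir}"
  return "${rc}"
}
```

For instance, compare_region screenshot_phantomjs_1.png screenshot_phantomjs_2.png "$(region_geometry 900 2000 0 150)" body_diff.png would compare a 900x2000 area starting 150 pixels from the top; those numbers are placeholders, not measurements of the page above.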

Igor

Experienced Unix/Linux System Administrator with 20-year background in Systems Analysis, Problem Resolution and Engineering Application Support in a large distributed Unix and Windows server environment. Strong problem determination skills. Good knowledge of networking, remote diagnostic techniques, firewalls and network security. Extensive experience with engineering application and database servers, high-availability systems, high-performance computing clusters, and process automation.
