
Server and Application Resiliency Testing

September 8, 2025


Originally published September 9, 2022 @ 1:15 pm

You are deploying a new application cluster and wonder how it will perform under less-than-ideal conditions: heavy system load, slow storage, network performance degradation. Application resiliency testing is integral to any application architecture but is often passed over because the process is considered overly complex and time-consuming. Here are some technical suggestions to make resiliency testing a little easier.

Most applications designed to run in a high-availability clustered environment are very resilient to a partial loss of cluster components: servers crashing, disks failing, network links dropping, and so on. Traditionally, the weak spot of such designs has always been a partial and fleeting degradation in performance.

For example, an application can usually handle a server crash but will be thrown for a loop if one of the servers restarts unexpectedly. As the remaining cluster nodes are trying to re-distribute roles and workload, the missing server reboots and tries to re-join the cluster, usually causing much confusion. Or, if one of the network links drops, a cluster can handle this through link aggregation, among other methods. On the other hand, intermittent network degradation – a bandwidth bottleneck, high latency, and dropped packets – will usually result in application performance and stability problems.

Simulating such events during application performance testing will save you a lot of weekend work down the road. The specific commands below were used on HPE DL380 servers running RHEL 8.6, but you can easily adapt the syntax for most other modern Linux flavors.

System Pre-Requisites

Install these diagnostic tools and configure system parameters.

# Install the required testing tools and helper utilities (stress and bonnie++ typically come from EPEL)
yum -y install iproute-tc net-tools bc stress kernel-modules-extra kernel-modules-extra-$(uname -r) stress-ng bonnie++

# Load the required kernel modules
modprobe sch_netem && lsmod | grep -q '^sch_netem' && echo OK || echo FAIL

# View all tc qdiscs for the primary NIC
tc qdisc show dev $(route | grep -m1 ^default | awk '{print $NF}')

# Clear all tc qdiscs from the primary NIC
tc qdisc del dev $(route | grep -m1 ^default | awk '{print $NF}') root 2>/dev/null
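
The route command comes from the legacy net-tools package, which may be missing on a minimal install. As an alternative, the sketch below looks up the same interface with the ip command and stores it in a variable for reuse. The function names parse_default_dev and primary_nic are illustrative and not part of any tool.

```shell
# Extract the interface name from "ip route" output supplied on stdin.
parse_default_dev() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Name of the NIC that carries the IPv4 default route (hypothetical helper).
primary_nic() {
    ip -4 route show default | parse_default_dev
}

# Example use:
#   NIC=$(primary_nic)
#   tc qdisc show dev "$NIC"
```

Searching for the "dev" keyword rather than taking the last field keeps the lookup working when the route line carries trailing attributes such as a metric.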

Network Testing

Introduce a 100ms delay with a random +/-10ms variation (uniform distribution) and a 25% correlation between successive packets

tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem delay 100ms 10ms 25%
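
To confirm that the delay is in effect, compare round-trip times to a remote host before and after adding the qdisc. The average should rise by roughly the configured delay. The following is a minimal sketch that assumes the iputils version of ping; avg_rtt and measure_rtt are hypothetical helper names, and 192.0.2.10 is a placeholder address.

```shell
# Pull the average round-trip time (ms) out of ping's summary line on stdin.
avg_rtt() {
    awk -F'/' '/^(rtt|round-trip) / { print $5 }'
}

# Hypothetical wrapper: average RTT to a host over 10 probes.
measure_rtt() {
    ping -c 10 -q "$1" | avg_rtt
}

# Example use (192.0.2.10 is a placeholder address):
#   measure_rtt 192.0.2.10    # run before and after adding the netem qdisc
```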

Introduce a 10% packet loss

tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem loss 10%

Corrupt 5% of the packets by introducing a single-bit error at a random offset

tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem corrupt 5%

Duplicate 1% of sent packets

tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem duplicate 1%

Limit egress bandwidth to 128kbit/s with a 32kbit burst buffer and a maximum queueing latency of 100ms

tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root tbf rate 128kbit burst 32kbit latency 100ms

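Clear all tc qdiscs from the primary NIC. Do this between tests: each tc qdisc add ... root command above fails if a root qdisc from a previous test is still attached.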

tc qdisc del dev $(route | grep -m1 ^default | awk '{print $NF}') root 2>/dev/null
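
netem accepts several impairments in a single qdisc, so delay, loss, and the other options can be combined in one command. The sketch below wraps this in a hypothetical function named impair that applies the options for a fixed time and then removes them. The TC variable is an assumption added here to allow a dry run.

```shell
TC=${TC:-tc}    # set TC=echo to print the commands instead of running them

# Apply netem options to a device for a number of seconds, then remove them.
# Usage: impair DEV SECONDS NETEM_OPTIONS...
impair() {
    local dev=$1 seconds=$2
    shift 2
    # "replace" succeeds whether or not a root qdisc is already attached
    $TC qdisc replace dev "$dev" root netem "$@" || return 1
    sleep "$seconds"
    $TC qdisc del dev "$dev" root
}

# Example use: 100ms +/-10ms delay plus 1% loss on eth0 for 60 seconds
#   impair eth0 60 delay 100ms 10ms 25% loss 1%
```

The function does not trap interrupts. If you stop it with Ctrl-C, clear the qdisc manually with the command above.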

System Stress Testing

This is another realistic test that emulates system resource limitations caused by factors like runaway processes, hardware failures, and resource contention.

Fully load half of all CPU cores, allocate half of total memory, and run one I/O (sync) worker for one minute:

stress --cpu $(( $(nproc) / 2 )) --io 1 --vm 1 --vm-bytes $(( $(awk '/^MemTotal:/ {print $2}' /proc/meminfo) / 2 ))K --timeout 60

A variation of the previous test with additional disk I/O

stress --hdd 4 --io 6 --vm 8 --cpu $(( $(nproc) / 2 )) --timeout 60
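
On a single-core machine, halving the CPU count with integer division gives zero CPU workers, which is probably not what you want. The sketch below guards against that case. The helper names half_at_least_one and run_half_load are hypothetical, and the optional argument is a timeout in seconds.

```shell
# Half of N, rounded down, but never less than 1.
half_at_least_one() {
    local n=$(( $1 / 2 ))
    if (( n < 1 )); then n=1; fi
    echo "$n"
}

# Hypothetical wrapper around the CPU and memory test above.
run_half_load() {
    local cpus mem_kb
    cpus=$(half_at_least_one "$(nproc)")
    mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    stress --cpu "$cpus" --io 1 --vm 1 \
           --vm-bytes "$(( mem_kb / 2 ))K" --timeout "${1:-60}"
}

# Example use:
#   run_half_load 120    # two minutes instead of the default 60 seconds
```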

Test /mnt/app filesystem performance using Bonnie++

NOTE: This test sets the size of the test file to twice the total system memory, so that the page cache cannot skew the performance data. If your server has a lot of RAM, the test will take a long time to complete and will need that much free space on /mnt/app.

if mountpoint -q /mnt/app; then ram_mb=$(free -m | awk '/^Mem:/ {print $2}'); bonnie++ -n 0 -u 0 -r "$ram_mb" -s $(( ram_mb * 2 )) -f -b -d /mnt/app; fi
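
Because the test file is twice the size of RAM, it is worth checking that the filesystem can hold it before starting a long run. The following pre-flight check is a sketch only; fits_test_file and check_bonnie_space are hypothetical names, and the check assumes GNU df and free.

```shell
# Succeeds only if AVAIL_MB can hold a test file of twice RAM_MB.
fits_test_file() {
    local avail_mb=$1 ram_mb=$2
    (( avail_mb > ram_mb * 2 ))
}

# Hypothetical pre-flight check for a mount point (default /mnt/app).
check_bonnie_space() {
    local dir=${1:-/mnt/app} avail_mb ram_mb
    avail_mb=$(df -Pm "$dir" | awk 'NR == 2 { print $4 }')
    ram_mb=$(free -m | awk '/^Mem:/ { print $2 }')
    if fits_test_file "$avail_mb" "$ram_mb"; then
        echo "OK: ${avail_mb} MB free, test file needs $(( ram_mb * 2 )) MB"
    else
        echo "FAIL: ${avail_mb} MB free, test file needs $(( ram_mb * 2 )) MB"
        return 1
    fi
}

# Example use:
#   check_bonnie_space /mnt/app && echo "safe to start bonnie++"
```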

Igor

Experienced Unix/Linux System Administrator with 20-year background in Systems Analysis, Problem Resolution and Engineering Application Support in a large distributed Unix and Windows server environment. Strong problem determination skills. Good knowledge of networking, remote diagnostic techniques, firewalls and network security. Extensive experience with engineering application and database servers, high-availability systems, high-performance computing clusters, and process automation.
