You are deploying a new application cluster and wonder how it will perform under less-than-ideal conditions: heavy system load, slow storage, or degraded network performance. Application resiliency testing is integral to any application architecture, but it is often passed over because the process is considered overly complex and time-consuming. Here are some technical suggestions to make resiliency testing a little easier.

Most applications designed to run in a high-availability clustered environment are quite resilient to the outright loss of cluster components: servers crashing, disks failing, network links dropping, and so on. Traditionally, the weak spot of such designs has been partial, transient performance degradation.

For example, an application can usually handle a clean server crash, but it will be thrown for a loop if that server restarts unexpectedly: while the remaining cluster nodes are redistributing roles and workload, the missing server reboots and tries to rejoin the cluster, usually causing much confusion. Similarly, if one of the network links drops, a cluster can handle this through link aggregation, among other methods. Intermittent network degradation, on the other hand – a bandwidth bottleneck, high latency, dropped packets – will usually result in application performance and stability problems.

Simulating such events during application performance testing will save you a lot of weekend work down the road. The specific commands below were used on HPE DL380 servers running RHEL 8.6, but you can easily adapt the syntax for most other modern Linux flavors.

System Prerequisites

Install the required testing tools and load the netem kernel module. (On RHEL, the stress and stress-ng packages typically come from the EPEL repository.)

# Install the required testing tools
yum -y install iproute-tc net-tools stress stress-ng bonnie++ kernel-modules-extra-$(uname -r)

# Load the required kernel modules
modprobe sch_netem; lsmod | grep -c sch_netem | sed -e 's/1/OK/g' -e 's/0/FAIL/g'

# View all tc qdiscs for the primary NIC
tc qdisc show dev $(route | grep -m1 ^default | awk '{print $NF}')

# Clear all tc qdiscs from the primary NIC
tc qdisc del dev $(route | grep -m1 ^default | awk '{print $NF}') root 2>/dev/null
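Every command below extracts the primary NIC name with the same route/grep/awk pipeline. As a small convenience, you can resolve the interface once into a shell variable (the NIC variable name is just a convention here); this sketch uses iproute2's ip route, which is present on RHEL 8 even when net-tools is not:

```shell
# Resolve the primary (default-route) interface once and reuse it
NIC=$(ip -4 route show default | awk '{print $5; exit}')
echo "Primary NIC: ${NIC}"

# Subsequent tc commands can then reference ${NIC}, e.g.:
# tc qdisc show dev ${NIC}
```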

Network Testing

Introduce a 100ms delay with a randomized +/-10ms uniform variation and a 25% correlation value

tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem delay 100ms 10ms 25%
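To confirm the delay rule took effect, compare ping round-trip times before and after adding it; 192.0.2.1 below is a placeholder for a host on the far side of the degraded link:

```shell
# Round-trip times should increase by roughly 100ms (+/-10ms) once the
# netem delay rule is active; 192.0.2.1 is a placeholder target host
ping -c 5 192.0.2.1 | tail -2
```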

Introduce a 10% packet loss

tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem loss 10%
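A quick sanity check of the configured loss rate is ping's own summary line; with a large enough sample, the observed percentage should converge on the configured 10% (192.0.2.1 is again a placeholder host):

```shell
# With enough probes, the reported loss should hover around 10%
ping -c 100 192.0.2.1 | grep -o '[0-9.]*% packet loss'
```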

Corrupt 5% of packets by introducing a single-bit error at a random offset

tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem corrupt 5%

Duplicate 1% of sent packets

tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem duplicate 1%

Limit egress bandwidth to 128kbit/s with a 32kbit burst buffer and 100ms maximum queueing latency

tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root tbf rate 128kbit burst 32kbit latency 100ms
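Whichever netem or tbf rule is active, the kernel keeps per-qdisc counters that show whether the rule is actually shaping traffic:

```shell
# Per-qdisc statistics for the primary NIC; non-zero "dropped" and
# "overlimits" counters confirm the shaper is doing its job
tc -s qdisc show dev $(route | grep -m1 ^default | awk '{print $NF}')
```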

Clear all tc qdiscs from the primary NIC

tc qdisc del dev $(route | grep -m1 ^default | awk '{print $NF}') root 2>/dev/null

System Stress Testing

This is another realistic test that emulates system resource limitations caused by factors such as runaway processes, hardware failures, and resource contention.

Fully utilize half of all CPU cores and half of all memory for one minute:

stress --cpu $(( $(nproc) / 2 )) --io 1 --vm 1 --vm-bytes $(awk '/MemTotal/ {print int($2/2)}' /proc/meminfo)K --timeout 60

A variation of the previous test with additional disk I/O:

stress --hdd 4 --io 6 --vm 8 --cpu $(( $(nproc) / 2 )) --timeout 60
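stress-ng, installed in the prerequisites but not used above, can run the same kind of mixed CPU/memory/disk workload with richer reporting; the stressor counts below are illustrative, not a benchmark standard:

```shell
# stress-ng variant of the mixed workload test; --metrics-brief prints
# per-stressor throughput figures when the run completes
stress-ng --cpu $(( $(nproc) / 2 )) --vm 2 --vm-bytes 50% --hdd 2 --timeout 60s --metrics-brief
```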

Test /mnt/app filesystem performance using Bonnie++

NOTE: This test sets the size of the test file to twice the server's total RAM, which is required for accurate performance data. If your server has a lot of memory, the test will take a long time to complete.

if mountpoint -q /mnt/app; then bonnie++ -n 0 -u 0 -r $(free -m | awk '/^Mem:/ {print $2}') -s $(( $(free -m | awk '/^Mem:/ {print $2}') * 2 )) -f -b -d /mnt/app; fi
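bonnie++ also emits a machine-readable CSV line that the bon_csv2txt helper, shipped with the bonnie++ package, renders as a compact table. As a sketch of a quick smoke test (a deliberately small 1GB file rather than the full 2x-RAM run, so the numbers will be cache-inflated):

```shell
# Quick 1GB smoke test of /mnt/app rendered as a text table;
# -q keeps stdout limited to the CSV line that bon_csv2txt expects
bonnie++ -q -n 0 -u 0 -d /mnt/app -s 1024 -r 512 -f | tail -1 | bon_csv2txt
```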