Chaos Testing Redis on Kubernetes with KubeDB and Chaos Mesh

Redis is a popular in-memory data store used for caching, session management, pub/sub messaging, leaderboards, and real-time analytics. In many architectures, Redis sits in the critical path — a failure can cause cascading degradation across the entire application stack.

Chaos engineering is the discipline of proactively injecting failures into a system to discover weaknesses before they manifest as production incidents.

In this post, we will:

  1. Deploy a production-style Redis Cluster using KubeDB
  2. Inject a series of failure scenarios using Chaos Mesh
  3. Observe and validate how Redis and KubeDB respond to each fault

What is Chaos Engineering?

Chaos engineering is the practice of intentionally introducing controlled failures into a system to observe how it behaves. The goal is not to break things — it is to learn how the system responds so you can make it more resilient.

For a Redis deployment, the questions chaos engineering helps answer are:

  • Does Redis recover automatically after a pod is killed?
  • Does the cluster re-converge after a network partition?
  • Can clients reconnect seamlessly after a disruption?
  • Does KubeDB restore the desired state after a fault?
  • How does Redis behave under CPU or memory pressure?

Tools Used

| Tool | Purpose |
|---|---|
| KubeDB | Manages Redis lifecycle on Kubernetes |
| Chaos Mesh | Injects chaos experiments into the cluster |
| Redis 7.4.0 | The database under test |

Prerequisites

Before you begin, make sure you have the following:

  • A running Kubernetes cluster (GKE, EKS, AKS, or a local cluster using Kind or Minikube)
  • kubectl configured to access the cluster
  • KubeDB operator installed — follow the KubeDB setup guide
  • Chaos Mesh installed — follow the Chaos Mesh installation guide
  • A default or usable StorageClass in the cluster

Step 1: Deploy Redis Cluster with KubeDB

We will deploy a Redis Cluster with 3 shards and 2 pods per shard (one master and one replica), for six pods in total. This gives us a proper distributed setup where we can observe failover and re-convergence behavior.

Create the namespace:

kubectl create ns demo

Apply the Redis manifest:

apiVersion: kubedb.com/v1
kind: Redis
metadata:
  name: redis-cluster
  namespace: demo
spec:
  version: "7.4.0"
  mode: Cluster
  cluster:
    shards: 3
    replicas: 2
  storageType: Durable
  storage:
    resources:
      requests:
        storage: 1Gi
    accessModes:
      - ReadWriteOnce
  deletionPolicy: WipeOut

kubectl apply -f redis-cluster.yaml

Wait for Redis to become ready:

kubectl get rd,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    73s

NAME                         READY   STATUS    RESTARTS   AGE
pod/redis-cluster-shard0-0   1/1     Running   0          70s
pod/redis-cluster-shard0-1   1/1     Running   0          54s
pod/redis-cluster-shard1-0   1/1     Running   0          67s
pod/redis-cluster-shard1-1   1/1     Running   0          53s
pod/redis-cluster-shard2-0   1/1     Running   0          66s
pod/redis-cluster-shard2-1   1/1     Running   0          53s
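
Rather than re-running the command by hand, the wait can be scripted. The following is a minimal sketch, assuming the STATUS column shown above is read from `.status.phase` on the Redis object. The function name `wait_for_phase` and the `KUBECTL` override are illustrative and not part of KubeDB.

```shell
#!/bin/sh
# Poll the Redis object until it reports the wanted phase or a timeout expires.
# Assumption: the STATUS column of `kubectl get rd` comes from .status.phase.
KUBECTL="${KUBECTL:-kubectl}"

wait_for_phase() {
  # usage: wait_for_phase <name> <namespace> <phase> [timeout_seconds] [interval_seconds]
  name=$1; ns=$2; want=$3; timeout=${4:-300}; interval=${5:-5}
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    phase=$($KUBECTL get redis.kubedb.com "$name" -n "$ns" \
      -o jsonpath='{.status.phase}' 2>/dev/null)
    if [ "$phase" = "$want" ]; then
      echo "$name is $want after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out after ${timeout}s waiting for $name (last phase: ${phase:-unknown})" >&2
  return 1
}

# Example: wait_for_phase redis-cluster demo Ready 600
```

The same helper is handy after each experiment, when the database has to return to Ready before the next fault is injected.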

Step 2: Verify Redis is Healthy

Retrieve the Redis password from the secret KubeDB created:

export PASSWORD=$(kubectl get secret -n demo redis-cluster-auth \
  -o jsonpath='{.data.password}' | base64 -d)

Check the cluster state:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c CLUSTER INFO
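
The CLUSTER INFO output can also be checked mechanically. This is a small sketch with a hypothetical helper, `cluster_is_healthy`, that reads the output on stdin and succeeds only when `cluster_state` is `ok`, all 16384 hash slots are assigned, and no slot is marked as failed.

```shell
#!/bin/sh
# Read CLUSTER INFO output on stdin; exit 0 only if the cluster looks healthy.
cluster_is_healthy() {
  # redis-cli output lines end in CR, so strip it before matching.
  tr -d '\r' | awk -F: '
    $1 == "cluster_state"          { state = $2 }
    $1 == "cluster_slots_assigned" { assigned = $2 }
    $1 == "cluster_slots_fail"     { failed = $2 }
    END { exit !(state == "ok" && assigned == 16384 && failed == 0) }
  '
}

# Example (no -it, because the output is piped rather than shown on a terminal):
#   kubectl exec -n demo redis-cluster-shard0-0 -c redis -- \
#     redis-cli -a "$PASSWORD" --no-auth-warning CLUSTER INFO \
#     | cluster_is_healthy && echo "cluster healthy"
```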

Write a test key that we will check after each experiment to verify data integrity:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c SET chaos-test "before-chaos"

Read it back:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test
Defaulted container "redis" out of: redis, redis-init (init)
"before-chaos"
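
Because this check is repeated after every experiment, it may be worth wrapping in a function. The sketch below uses a hypothetical helper named `verify_key`; it assumes the `PASSWORD` variable exported earlier and a container named `redis`, as shown in the output above. When redis-cli is not attached to a terminal it prints the raw value without quotes, so the expected value is passed unquoted.

```shell
#!/bin/sh
# Compare the value of a key against an expected value and report PASS or FAIL.
KUBECTL="${KUBECTL:-kubectl}"

verify_key() {
  # usage: verify_key <pod> <key> <expected>
  pod=$1; key=$2; expected=$3
  # -c redis picks the container explicitly, which avoids the
  # "Defaulted container" notice; --no-auth-warning silences the -a warning.
  actual=$($KUBECTL exec -n demo "$pod" -c redis -- \
    redis-cli -a "$PASSWORD" --no-auth-warning -c GET "$key" | tr -d '\r')
  if [ "$actual" = "$expected" ]; then
    echo "PASS: $key=$actual"
  else
    echo "FAIL: $key expected '$expected' but got '$actual'" >&2
    return 1
  fi
}

# Example: verify_key redis-cluster-shard0-0 chaos-test before-chaos
```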

Step 3: Verify Chaos Mesh is Ready

kubectl get pods -n chaos-mesh
NAME                                      READY   STATUS    RESTARTS   AGE
chaos-controller-manager-b8d65b98-75s8w   1/1     Running   0          2m15s
chaos-controller-manager-b8d65b98-jcmnt   1/1     Running   0          2m13s
chaos-controller-manager-b8d65b98-tfwfd   1/1     Running   0          2m14s
chaos-daemon-jhth2                        1/1     Running   0          2m15s
chaos-dashboard-566b9f5c4b-zmplh          1/1     Running   0          2m15s
chaos-dns-server-85b8846dc9-ksljn         1/1     Running   0          116m

Chaos Experiments

For each experiment below, the workflow is:

  1. Apply the manifest
  2. Watch pod and Redis status
  3. Verify data is still accessible
  4. Delete the experiment and wait for full recovery before running the next one
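
The four steps above can be sketched as a single shell function. This is an illustration rather than a prescribed harness: `run_experiment` is a hypothetical name, the status check is a single snapshot rather than a continuous watch, and the final wait assumes the Redis object exposes its state at `.status.phase`.

```shell
#!/bin/sh
# Run one chaos experiment end to end: apply, observe, verify, clean up.
KUBECTL="${KUBECTL:-kubectl}"

run_experiment() {
  # usage: run_experiment <manifest.yaml> [observe_seconds]
  manifest=$1; observe=${2:-60}

  # 1. Apply the manifest.
  $KUBECTL apply -f "$manifest" || return 1

  # 2. Let the fault play out, then take a status snapshot.
  sleep "$observe"
  $KUBECTL get rd,pods -n demo

  # 3. Verify the test key is still readable.
  $KUBECTL exec -n demo redis-cluster-shard0-0 -c redis -- \
    redis-cli -a "$PASSWORD" --no-auth-warning -c GET chaos-test

  # 4. Delete the experiment and wait for the database to report Ready again.
  $KUBECTL delete -f "$manifest"
  $KUBECTL wait --for=jsonpath='{.status.phase}'=Ready \
    redis.kubedb.com/redis-cluster -n demo --timeout=15m
}

# Example: run_experiment pod-failure.yaml 90
```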

Experiment 1: Master Pod Failure

What it does: Makes all Redis master pods unavailable for the duration of the experiment without deleting them. This simulates Kubernetes declaring a pod unhealthy.

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: rd-master-pod-failure-short
  namespace: demo
spec:
  action: pod-failure
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "1m"

Observe:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    39m

NAME                         READY   STATUS    RESTARTS         AGE
pod/redis-cluster-shard0-0   1/1     Running   10 (3m16s ago)   39m
pod/redis-cluster-shard0-1   1/1     Running   0                39m
pod/redis-cluster-shard1-0   1/1     Running   10 (3m16s ago)   39m
pod/redis-cluster-shard1-1   1/1     Running   0                39m
pod/redis-cluster-shard2-0   1/1     Running   10 (3m16s ago)   39m
pod/redis-cluster-shard2-1   1/1     Running   0                39m

All three master pods will go into a not-ready state. After the 1-minute duration, the experiment ends and the pods recover automatically.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

Defaulted container "redis" out of: redis, redis-init (init)
"before-chaos"

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 11:39:00 | - | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 11:44:07 | 0s | PodChaos applied (all master pods targeted) | Ready |
| 11:44:07 | +0s | All master pods marked unavailable | NotReady |
| 11:44:21 | +14s | Operator marks masters unhealthy | Critical |
| 11:45:07 | +1m00s | Chaos auto-recovered, master pods restart | Critical |
| 11:45:40 | ~+1m33s | Pods in RECOVERING (rejoining cluster) | Critical |
| ~11:48–52 | ~+4–8m | All masters reach Running, cluster re-converges | Ready |

Result: PASS — All master pods recovered automatically after the 1-minute fault window. KubeDB reconciled the Redis Cluster back to the desired state with zero data loss. The chaos-impacted pods rejoined the cluster and the test key chaos-test remained intact.
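
The Δ values in a timeline like this one can be derived from the wall-clock timestamps instead of being worked out by hand. A small sketch with two illustrative helpers (`delta_seconds` and `format_delta` are hypothetical names), assuming both timestamps fall on the same day:

```shell
#!/bin/sh
# Seconds elapsed between two HH:MM:SS timestamps on the same day.
delta_seconds() {
  printf '%s %s\n' "$1" "$2" | awk '{
    split($1, a, ":"); split($2, b, ":")
    print (b[1]*3600 + b[2]*60 + b[3]) - (a[1]*3600 + a[2]*60 + a[3])
  }'
}

# Render a number of seconds in the +XmYYs notation used in the timelines.
format_delta() {
  printf '+%dm%02ds\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# Example, using the chaos start time and the RECOVERING row:
#   delta_seconds 11:44:07 11:45:40     prints 93
#   format_delta 93                     prints +1m33s
```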


Experiment 2: Pod Kill Master

What it does: Kills all Redis master pods immediately (gracePeriod: 0), simulating an OOM kill or a sudden node failure. Unlike pod-failure, the pods are actually terminated and Kubernetes must recreate them.

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: rd-master-pod-kill-short
  namespace: demo
spec:
  action: pod-kill
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "5m"
  gracePeriod: 0

Observe:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    72m

NAME                         READY   STATUS    RESTARTS   AGE
pod/redis-cluster-shard0-0   1/1     Running   0          6m55s
pod/redis-cluster-shard0-1   1/1     Running   0          72m
pod/redis-cluster-shard1-0   1/1     Running   0          6m55s
pod/redis-cluster-shard1-1   1/1     Running   0          72m
pod/redis-cluster-shard2-0   1/1     Running   0          6m55s
pod/redis-cluster-shard2-1   1/1     Running   0          72m

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test
  
"before-chaos"

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 12:14:19 | - | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 12:14:19 | 0s | PodChaos applied: all master pods SIGKILLed | Ready |
| 12:14:19 | +0s | All master pods terminated (pod-kill injected for shard0-0, shard1-0, shard2-0) | Critical |
| 12:14:30 | ~+11s | Kubernetes reschedules master pods; replicas promoted | Critical |
| 12:14:55 | ~+36s | New master pods reach Running; cluster re-elects primaries | Critical |
| 12:19:19 | +5m00s | Chaos experiment window ends | Critical |
| ~12:21:00 | ~+7m | All shards re-converge; cluster topology stabilized | Ready |

Result: PASS — All master pods were hard-killed simultaneously. Kubernetes rescheduled them via the StatefulSet controller and KubeDB reconciled the cluster back to the desired state. Replica pods were promoted to masters during the kill window and the test key chaos-test remained intact after full recovery.


Experiment 3: Container Kill

What it does: Kills only the redis container inside a pod, without deleting the pod itself. This tests in-place container restart behavior and is useful when sidecars or init containers are involved.

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: rd-master-container-kill-short
  namespace: demo
spec:
  action: container-kill
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "5m"
  containerNames: ['redis']

Observe:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    80m

NAME                         READY   STATUS    RESTARTS      AGE
pod/redis-cluster-shard0-0   1/1     Running   1 (61s ago)   14m
pod/redis-cluster-shard0-1   1/1     Running   0             79m
pod/redis-cluster-shard1-0   1/1     Running   1 (61s ago)   14m
pod/redis-cluster-shard1-1   1/1     Running   0             79m
pod/redis-cluster-shard2-0   1/1     Running   1 (61s ago)   14m
pod/redis-cluster-shard2-1   1/1     Running   0             79m

The pod stays alive. Only the redis container restarts. Once it comes back up, it rejoins the cluster automatically.

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 12:27:43 | - | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 12:27:43 | 0s | PodChaos applied: redis container killed in shard0-0, shard1-0, shard2-0 | Ready |
| 12:27:43 | +0s | All 3 master containers killed simultaneously (container-kill injected) | Critical |
| 12:27:44 | ~+1s | Kubernetes detects container exit; kubelet restarts redis container in-place | Critical |
| 12:28:04 | ~+21s | Containers restart with 1 restart count; pods remain scheduled on same nodes | Critical |
| 12:28:44 | ~+1m | Restarted containers rejoin the Redis Cluster; cluster re-converges | Ready |
| 12:32:43 | +5m00s | Chaos experiment window ends; no further kills injected | Ready |

Result: PASS — All three master containers were killed simultaneously. Kubernetes restarted them in-place (pod identity preserved, no rescheduling needed). KubeDB reconciled the cluster back to the desired state with zero data loss. The test key chaos-test remained intact after full recovery.


Experiment 4: Network Delay

What it does: Injects artificial latency into the network traffic of Redis pods. This simulates cross-availability-zone communication, saturated network links, or noisy neighbour conditions.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: delay
  namespace: demo
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  delay:
    latency: '1000ms'
    correlation: '100'
    jitter: '50ms'
  duration: "60s"

Observe:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c PING

Commands that normally complete in under 1ms will now take 1000ms or more. After the 60-second window, latency returns to normal.

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| T+0:00 | - | Pre-chaos baseline (all 6 pods ONLINE) | Ready |
| T+0:00 | 0s | NetworkChaos applied: 1000ms latency injected on all master pods | Ready |
| T+0:05 | ~+5s | Redis commands begin timing out; write attempts fail with context deadline exceeded | NotReady |
| T+0:17 | ~+17s | Cluster marks affected shards unavailable; KubeDB detects degraded state | NotReady |
| T+1:00 | +60s | Chaos experiment window ends; latency injection removed | NotReady |
| T+3:00 | ~+3m | All shards re-converge; KubeDB reconciles cluster to desired state | Ready |

Result: PASS — During the 1000ms latency window, Redis write operations failed with context deadline exceeded errors due to cluster heartbeat timeouts exceeding the configured threshold. After the chaos window ended, all pods recovered automatically and KubeDB reconciled the cluster back to Ready. Data integrity was confirmed — the test key chaos-test remained intact after full recovery.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 5: Network Bandwidth

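What it does: Throttles the network bandwidth available to all Redis pods to a rate of 1mbps. Chaos Mesh follows tc-style units, in which bps means bytes per second, so this rate is roughly 1 MB/s rather than 1 Mbit/s. This simulates a congested or limited network link, such as a cross-region replication scenario or a degraded network interface.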

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bandwidth
  namespace: demo
spec:
  action: bandwidth
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
  bandwidth:
    rate: 1mbps
    limit: 20971520
    buffer: 10000
  duration: "60s"

Apply and observe:

kubectl get rd,pods -n demo
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c PING

With the rate capped at 1mbps, Redis inter-node gossip and replication traffic will compete with client traffic. Large key writes or bulk operations will slow significantly. After the 60-second window, bandwidth returns to normal.

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| T+0:00 | - | Pre-chaos baseline (all 6 pods ONLINE) | Ready |
| T+0:00 | 0s | NetworkChaos applied: 1mbps rate cap injected on all pods | Ready |
| T+0:05 | ~+5s | Replication and gossip traffic degraded; write latency increases | Ready |
| T+0:30 | ~+30s | Cluster remains functional but throughput is visibly throttled | Ready |
| T+1:00 | +60s | Chaos experiment window ends; bandwidth restriction lifted | Ready |
| T+1:10 | ~+1m10s | All pods return to full throughput; cluster fully converged | Ready |

Result: PASS. The 1mbps rate cap throttled replication and gossip traffic across all shards. Redis remained available throughout the experiment with elevated latency on large operations, and the cluster did not lose quorum. After the chaos window ended, all pods returned to normal throughput and the test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 6: Network Corruption

What it does: Corrupts a percentage of network packets for Redis pods. This simulates bit-flip errors from faulty hardware or bad cables.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: redis-network-corruption
  namespace: demo
spec:
  action: corrupt
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
  corrupt:
    corrupt: "100"
    correlation: "100"
  duration: "60s"

Observe:

kubectl get rd,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    145m

NAME                         READY   STATUS    RESTARTS      AGE
pod/redis-cluster-shard0-0   1/1     Running   1 (86m ago)   99m
pod/redis-cluster-shard0-1   1/1     Running   0             145m
pod/redis-cluster-shard1-0   1/1     Running   1 (86m ago)   99m
pod/redis-cluster-shard1-1   1/1     Running   0             145m
pod/redis-cluster-shard2-0   1/1     Running   1 (86m ago)   99m
pod/redis-cluster-shard2-1   1/1     Running   0             145m

TCP will detect and retransmit corrupted packets, so most Redis commands will still succeed but with higher latency and occasional errors. After the experiment, the network returns to normal.

Note: Chaos Mesh may initially fail to apply the tc (traffic control) rules on some pods and will retry automatically until injection succeeds. This is visible in the NetworkChaos status as repeated Failed and then Succeeded events per pod.

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 09:59:46 | - | Pre-chaos baseline (all 6 pods ONLINE) | Ready |
| 09:59:46 | 0s | NetworkChaos applied: 100% packet corruption injected on all pods | Ready |
| 09:59:46 | +0s | Chaos Mesh begins injecting tc rules; some pods fail initial injection and retry | Ready |
| 10:00:00 | ~+14s | All pods successfully injected; Redis inter-node gossip and replication traffic degraded | Ready |
| 10:00:10 | ~+24s | TCP retransmissions increase; Redis commands experience elevated latency and occasional errors | NotReady |
| 10:00:46 | +60s | Chaos experiment window ends; corruption rules removed from all pods | NotReady |
| 10:00:50 | ~+1m04s | All pods return to normal network; cluster fully converged | Ready |

Result: PASS — 100% packet corruption was injected across all 6 Redis pods. TCP retransmission handled most corrupted packets transparently, keeping the cluster available with elevated latency. Some pods required multiple injection retries by Chaos Mesh before tc rules were successfully applied. After the 60-second window ended, all pods recovered automatically. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 7: Network Partition

What it does: Isolates every Redis master pod from the replica pods, in both directions. This is the most aggressive network experiment and tests how the Redis Cluster handles a split-brain scenario.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition
  namespace: demo
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        app.kubernetes.io/name: redises.kubedb.com
        kubedb.com/role: slave
  duration: "10m"

Observe:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS     AGE
redis.kubedb.com/redis-cluster   7.4.0     Critical   119m

NAME                         READY   STATUS    RESTARTS      AGE
pod/redis-cluster-shard0-0   1/1     Running   1 (40m ago)   53m
pod/redis-cluster-shard0-1   1/1     Running   0             118m
pod/redis-cluster-shard1-0   1/1     Running   1 (40m ago)   53m
pod/redis-cluster-shard1-1   1/1     Running   0             118m
pod/redis-cluster-shard2-0   1/1     Running   1 (40m ago)   53m
pod/redis-cluster-shard2-1   1/1     Running   0             118m

Each master is cut off from the replicas, and the database status drops to Critical as shown above. Once the 10m window ends, the partition is lifted and the cluster re-converges.

Observe after 10m:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    126m

NAME                         READY   STATUS    RESTARTS      AGE
pod/redis-cluster-shard0-0   1/1     Running   1 (47m ago)   60m
pod/redis-cluster-shard0-1   1/1     Running   0             126m
pod/redis-cluster-shard1-0   1/1     Running   1 (47m ago)   60m
pod/redis-cluster-shard1-1   1/1     Running   0             126m
pod/redis-cluster-shard2-0   1/1     Running   1 (47m ago)   60m
pod/redis-cluster-shard2-1   1/1     Running   0             126m

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 13:04:51 | - | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 13:04:51 | 0s | NetworkChaos applied: bidirectional partition between all masters and all replicas | Ready |
| 13:04:52 | ~+1s | All 6 pods partitioned (masters isolated from replicas across all 3 shards) | Critical |
| 13:05:10 | ~+20s | Redis Cluster marks affected shards unavailable; KubeDB detects degraded state | Critical |
| 13:14:51 | +10m00s | Chaos experiment window ends; partition lifted on all pods | Critical |
| ~13:16:00 | ~+11m | All shards re-converge; cluster topology stabilized | Ready |

Result: PASS — All master pods were bidirectionally partitioned from their replicas simultaneously. During the 10-minute window, the Redis Cluster entered a Critical state as shards lost quorum visibility. Once the partition was lifted, all pods automatically re-converged and KubeDB reconciled the cluster back to Ready. The test key chaos-test remained intact after full recovery.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 8: CPU Stress

What it does: Stresses the CPU of all Redis master pods to simulate CPU-intensive workloads on the same node or a resource-constrained environment.

Current CPU usage:

kubectl top pods -n demo
NAME                     CPU(cores)   MEMORY(bytes)   
redis-cluster-shard0-0   3m           5Mi             
redis-cluster-shard0-1   4m           6Mi             
redis-cluster-shard1-0   4m           5Mi             
redis-cluster-shard1-1   4m           5Mi             
redis-cluster-shard2-0   3m           4Mi             
redis-cluster-shard2-1   3m           6Mi

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-example
  namespace: demo
spec:
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "2m"
  stressors:
    cpu:
      workers: 4
      load: 50

Apply and observe:

kubectl top pods -n demo
NAME                     CPU(cores)   MEMORY(bytes)   
redis-cluster-shard0-0   1901m        9Mi             
redis-cluster-shard0-1   3m           5Mi             
redis-cluster-shard1-0   2007m        9Mi             
redis-cluster-shard1-1   3m           6Mi             
redis-cluster-shard2-0   1104m        9Mi             
redis-cluster-shard2-1   3m           5Mi

Redis is single-threaded for command processing, so heavy CPU stress will noticeably increase command latency. After the experiment, CPU usage drops and performance returns to baseline.
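
To spot the stressed pods without reading the table by eye, the `kubectl top pods` output can be filtered. This is a sketch with a hypothetical helper, `hot_pods`, which assumes CPU is reported in millicores (values ending in `m`), as in the output above.

```shell
#!/bin/sh
# Read `kubectl top pods` output on stdin and print the pods whose CPU usage
# exceeds the given threshold (in millicores).
hot_pods() {
  threshold=$1
  awk -v limit="$threshold" 'NR > 1 {
    cpu = $2
    sub(/m$/, "", cpu)            # "1901m" -> 1901
    if (cpu + 0 > limit + 0) print $1, cpu "m"
  }'
}

# Example: kubectl top pods -n demo | hot_pods 1000
```

With the figures shown above and a threshold of 1000, this would list the three master pods and skip the replicas.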

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 13:28:05 | - | Pre-chaos baseline (all 3 master pods ONLINE, ~3–4m CPU) | Ready |
| 13:28:05 | 0s | StressChaos applied: 4 workers at 50% CPU load injected on all master pods | Ready |
| 13:28:05 | +0s | CPU stress injected into shard0-0, shard1-0, shard2-0 | Ready |
| 13:28:10 | ~+5s | CPU usage spikes to ~1100–2000m per master pod; command latency increases | Ready |
| 13:28:35 | ~+30s | Redis remains available; single-threaded command processing visibly slower | Ready |
| 13:30:05 | +2m00s | Chaos experiment window ends; CPU stress removed from all master pods | Ready |
| 13:30:10 | ~+2m05s | CPU usage returns to baseline (~3–4m); cluster fully converged | Ready |

Result: PASS — CPU stress raised master pod usage from ~3–4m to ~1100–2000m cores. Redis remained available throughout the experiment, but command latency increased due to CPU contention. After the 2-minute chaos window ended, all pods recovered to baseline CPU usage automatically. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 9: Memory Stress

What it does: Allocates a large chunk of memory inside a Redis pod to simulate memory pressure, approaching OOM conditions.

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: redis-memory-stress
  namespace: demo
spec:
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/instance: "redis-cluster"
      app.kubernetes.io/name: "redises.kubedb.com"
  stressors:
    memory:
      workers: 2
      size: "256MB"
  duration: "60s"

Observe:

kubectl top pods -n demo
NAME                     CPU(cores)   MEMORY(bytes)   
redis-cluster-shard0-0   3m           5Mi             
redis-cluster-shard0-1   4m           6Mi             
redis-cluster-shard1-0   4m           252Mi           
redis-cluster-shard1-1   4m           5Mi             
redis-cluster-shard2-0   3m           4Mi             
redis-cluster-shard2-1   3m           5Mi

Watch whether Redis begins evicting keys (depending on your maxmemory-policy setting) and whether it recovers cleanly once the stressor is removed.
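
Eviction can be checked from Redis's own counters. The sketch below defines a hypothetical helper, `info_field`, that extracts one field from `redis-cli INFO` output. Keep in mind that eviction is driven by Redis's own memory accounting against `maxmemory`, so a stressor running outside the Redis process may not trigger it at all.

```shell
#!/bin/sh
# Extract a single "name:value" field from redis-cli INFO output on stdin.
info_field() {
  # usage: redis-cli INFO <section> | info_field <name>
  tr -d '\r' | awk -F: -v f="$1" '$1 == f { print $2 }'
}

# Examples (run before and after the experiment and compare):
#   kubectl exec -n demo redis-cluster-shard1-0 -c redis -- \
#     redis-cli -a "$PASSWORD" --no-auth-warning INFO stats | info_field evicted_keys
#   kubectl exec -n demo redis-cluster-shard1-0 -c redis -- \
#     redis-cli -a "$PASSWORD" --no-auth-warning CONFIG GET maxmemory-policy
```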

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 14:04:17 | - | Pre-chaos baseline (all master pods ONLINE, ~5–6Mi memory) | Ready |
| 14:04:17 | 0s | StressChaos applied: 2 workers allocating 256MB injected on one master pod | Ready |
| 14:04:17 | +0s | Memory stress injected into redis-cluster-shard1-0 | Ready |
| 14:04:22 | ~+5s | Memory usage on shard1-0 spikes to ~252Mi; Redis under pressure | Ready |
| 14:04:47 | ~+30s | Redis remains available; no OOM kill triggered within limits | Ready |
| 14:05:17 | +1m00s | Chaos experiment window ends; memory stress removed from pod | Ready |
| 14:05:20 | ~+1m03s | Memory usage returns to baseline (~5Mi); cluster fully converged | Ready |

Result: PASS — Memory stress raised shard1-0 usage from ~5Mi to ~252Mi. Redis remained available throughout the experiment and no OOM kill was triggered. After the 60-second chaos window ended, the stressor was removed and memory returned to baseline automatically. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 10: I/O Chaos Latency

What it does: Injects latency into disk I/O operations for the Redis data directory. This simulates a slow or degraded storage backend affecting AOF writes and RDB snapshots.

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency-example
  namespace: demo
spec:
  action: latency
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  volumePath: /data
  path: '/data/**/*'
  delay: '15000ms'
  percent: 50
  duration: '100s'

Observe:

kubectl logs -n demo redis-cluster-shard0-0 --tail=50 -f

Redis will log warnings about slow I/O operations. AOF fsync and RDB snapshot latency will increase. After the experiment ends, I/O returns to normal and persistence operations resume.

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 09:16:01 | - | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 09:16:01 | 0s | IOChaos applied: 15000ms I/O latency injected on 50% of ops for shard0-0, shard1-0, shard2-0 | Ready |
| 09:16:01 | +0s | I/O latency injected into /data path on all master pods | NotReady |
| 09:16:10 | ~+9s | AOF fsync and RDB snapshot operations slow significantly; write latency increases | NotReady |
| 09:17:41 | +100s | Chaos experiment window ends; I/O latency removed from all master pods | Ready |
| 09:17:45 | ~+1m44s | Persistence operations return to baseline; cluster fully converged | Ready |

Result: PASS — I/O latency of 15000ms was injected into 50% of disk operations on the /data path across all master pods. Redis remained available throughout the experiment as it primarily operates in-memory, but persistence operations (AOF/RDB) were significantly delayed. After the 100-second chaos window ended, all pods recovered automatically. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 11: I/O Chaos Fault

What it does: Injects I/O faults (EIO — errno 5) into 50% of disk operations on the Redis /data directory. This simulates a failing or corrupted storage device, causing read/write syscalls to return errors rather than just slowing down.

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-fault-example
  namespace: demo
spec:
  action: fault
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  volumePath: /data
  path: /data/**/*
  errno: 5
  percent: 50
  duration: '120s'

Apply and observe:

kubectl get rd,pods -n demo
kubectl logs -n demo redis-cluster-shard0-0 --tail=20 -f

Unlike I/O latency (Experiment 10), this experiment causes syscalls to fail outright. Redis immediately starts logging AOF write errors. Because errno 5 (EIO) is injected on 50% of I/O syscalls, even kubectl exec into the affected pods fails — the working directory /data itself becomes unreadable:

error: Internal error occurred: error executing command in container: failed to exec in container:
OCI runtime exec failed: exec failed: unable to start container process:
chdir to cwd ("/data") set in config.json failed: input/output error

Redis logs during the experiment:

74:M 08 May 2026 09:30:16.611 # Error writing to the AOF file: Input/output error
74:M 08 May 2026 09:30:16.622 * AOF write error looks solved, Redis can write again.
74:M 08 May 2026 09:30:16.632 # Fail to fsync the AOF file: Input/output error

The intermittent nature of the fault (50% of ops) causes Redis to oscillate between detecting and resolving the error within the same second, exposing the retry logic in the AOF write path.
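
To quantify that oscillation, the three log messages shown above can be counted from the pod logs. A minimal sketch, with `aof_error_summary` as a hypothetical helper name; it matches only the message texts that appear in the excerpt above.

```shell
#!/bin/sh
# Count AOF error and recovery messages in Redis log lines read from stdin.
aof_error_summary() {
  awk '
    /Error writing to the AOF file/ { write_errors++ }
    /Fail to fsync the AOF file/    { fsync_errors++ }
    /AOF write error looks solved/  { recoveries++ }
    END {
      printf "write_errors=%d fsync_errors=%d recoveries=%d\n",
             write_errors, fsync_errors, recoveries
    }
  '
}

# Example: kubectl logs -n demo redis-cluster-shard0-0 -c redis | aof_error_summary
```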

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 09:30:11 | - | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 09:30:11 | 0s | IOChaos applied: errno 5 injected on 50% of /data I/O ops for shard0-0, shard1-0, shard2-0 | Ready |
| 09:30:16 | ~+5s | AOF fsync fails: Fail to fsync the AOF file: Input/output error on shard1-0; Error writing to the AOF file on shard0-0 | NotReady |
| 09:30:16 | ~+5s | kubectl exec into affected pods fails: /data working directory returns input/output error | NotReady |
| 09:32:11 | +120s | Chaos experiment window ends; I/O fault removed from all master pods | NotReady |
| ~09:32:13 | ~+2m02s | AOF write error resolves: AOF write error looks solved, Redis can write again | Ready |

Result: PASS — errno 5 (EIO) was injected into 50% of I/O syscalls on the /data path across all master pods. Redis detected and logged AOF write failures within 5 seconds. The fault was severe enough to prevent kubectl exec from entering affected containers. After the 120-second window ended, Redis self-healed — AOF writes resumed and the cluster re-converged without data loss.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 12: Network Loss

What it does: Drops 100% of outgoing network packets from all Redis master pods to the rest of the cluster. This simulates a complete one-directional network failure, where masters can receive traffic but cannot send any responses.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: rd-master-packet-loss
  namespace: demo
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  loss:
    loss: "100"
    correlation: "100"
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        app.kubernetes.io/component: database
  duration: "2m"

Observe:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS     AGE
redis.kubedb.com/redis-cluster   7.4.0     NotReady   5h26m

NAME                         READY   STATUS    RESTARTS   AGE
pod/redis-cluster-shard0-0   1/1     Running   0          5h26m
pod/redis-cluster-shard0-1   1/1     Running   0          5h26m
pod/redis-cluster-shard1-0   1/1     Running   0          5h26m
pod/redis-cluster-shard1-1   1/1     Running   0          5h26m
pod/redis-cluster-shard2-0   1/1     Running   0          5h26m
pod/redis-cluster-shard2-1   1/1     Running   0          5h26m

With 100% outgoing packet loss from all master pods, the cluster loses inter-node gossip and replication traffic completely. Replica pods will detect master unavailability and attempt failover. After the 2-minute window, packet loss is lifted and the cluster re-converges.
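To watch the effect while the fault is active, one option is to poll CLUSTER INFO from inside a pod and pull out the cluster_state field. The sketch below was not part of the original run: the helper name cluster_state is hypothetical, and it only parses the key:value text that CLUSTER INFO returns.

```shell
# Hypothetical helper: read CLUSTER INFO output on stdin and print the
# value of the cluster_state field ("ok" or "fail").
cluster_state() {
  # redis-cli may emit CRLF line endings, so strip carriage returns first.
  tr -d '\r' | awk -F: '$1 == "cluster_state" { print $2 }'
}

# Canned input so the helper can be tried without a cluster.
printf 'cluster_state:fail\r\ncluster_slots_assigned:16384\r\n' | cluster_state
```

Against the live cluster, the same helper could be fed the output of kubectl exec -n demo redis-cluster-shard0-0 -- redis-cli -a "$PASSWORD" cluster info, repeated every few seconds for the duration of the experiment.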

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 10:32:18 | - | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 10:32:18 | 0s | NetworkChaos applied: 100% outgoing packet loss injected on all master pods | Ready |
| 10:32:18 | +0s | All 6 pods targeted; master pods unable to send any packets to cluster peers | Critical |
| 10:32:25 | ~+7s | Replica pods detect master heartbeat loss; Redis Cluster initiates replica promotion | NotReady |
| 10:32:30 | ~+12s | KubeDB detects degraded cluster state | NotReady |
| 10:34:18 | +2m00s | Chaos experiment window ends; packet loss removed from all master pods | Critical |
| ~10:36:00 | ~+3m42s | All shards re-converge; cluster topology stabilized | Ready |
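The Δ column is the difference between two wall-clock readings on the same day. As a rough sketch, a pair of helpers like the following can reproduce it when building a timeline from your own run; the function names to_seconds and elapsed are hypothetical.

```shell
# Hypothetical helper: convert HH:MM:SS into seconds since midnight.
to_seconds() {
  # Split on ':' and strip one leading zero per field so the shell does
  # not try to parse values such as "08" as invalid octal numbers.
  old_ifs=$IFS
  IFS=:
  set -- $1
  IFS=$old_ifs
  echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# Hypothetical helper: print the time between two same-day readings as XmYYs.
elapsed() {
  start=$(to_seconds "$1")
  end=$(to_seconds "$2")
  diff=$(( end - start ))
  printf '%dm%02ds\n' $(( diff / 60 )) $(( diff % 60 ))
}

elapsed 10:32:18 10:36:00   # prints 3m42s
```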

Result: PASS — 100% outgoing packet loss was injected on all master pods for 2 minutes. The Redis Cluster detected the loss of master heartbeats and promoted replicas to maintain availability. After the chaos window ended, all pods recovered automatically and KubeDB reconciled the cluster back to Ready. The test key chaos-test remained intact after full recovery.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 13: Network Duplicate

What it does: Duplicates 100% of outgoing network packets from all Redis master pods to the rest of the cluster. This simulates a noisy or faulty network where duplicate packets cause redundant processing, increased bandwidth usage, and potential out-of-order delivery.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: rd-master-packet-duplicate
  namespace: demo
spec:
  action: duplicate
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duplicate:
    duplicate: "100"
    correlation: "100"
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        app.kubernetes.io/component: database
  duration: "2m"

Observe:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    6h33m

NAME                         READY   STATUS    RESTARTS   AGE
pod/redis-cluster-shard0-0   1/1     Running   0          6h33m
pod/redis-cluster-shard0-1   1/1     Running   0          6h32m
pod/redis-cluster-shard1-0   1/1     Running   0          6h33m
pod/redis-cluster-shard1-1   1/1     Running   0          6h32m
pod/redis-cluster-shard2-0   1/1     Running   0          6h33m
pod/redis-cluster-shard2-1   1/1     Running   0          6h32m

With 100% packet duplication on all master pods, Redis inter-node gossip and replication traffic will contain duplicate packets. TCP will deduplicate most of them transparently, but increased bandwidth usage and potential sequence number handling overhead may cause elevated latency. After the 2-minute window, duplication is lifted and the cluster re-converges.

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 11:38:35 | - | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 11:38:35 | 0s | NetworkChaos applied: 100% packet duplication injected on all master pods (bidirectional) | Ready |
| 11:38:35 | +0s | All 6 pods targeted; duplicate packets injected on both master and replica pods | Ready |
| 11:38:40 | ~+5s | Redis gossip and replication traffic contains duplicate packets; TCP deduplication handles most transparently | Ready |
| 11:38:55 | ~+20s | Cluster remains functional; elevated bandwidth usage observed across all shards | Ready |
| 11:40:35 | +2m00s | Chaos experiment window ends; duplication removed from all pods | Ready |
| 11:40:40 | ~+2m05s | All pods return to normal network behavior; cluster fully converged | Ready |

Result: PASS — 100% packet duplication was injected bidirectionally across all master and replica pods. TCP’s built-in deduplication handled the duplicate packets transparently, keeping the Redis Cluster available throughout the experiment. No quorum loss or data corruption was observed. After the 2-minute chaos window ended, all pods returned to normal network behavior automatically. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 14: Time Offset

What it does: Shifts the system clock of all Redis master pods back by 2 hours. This simulates clock skew between nodes, which can affect TTL expiry, token validation, certificate checks, and distributed coordination logic that depends on wall-clock time.

apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: rd-master-time-offset
  namespace: demo
spec:
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  clockIds:
    - CLOCK_REALTIME
  timeOffset: "-2h"
  duration: "2m"

Observe:

kubectl get rd,pods -n demo

With the clock skewed by -2 hours on all master pods, time-sensitive operations (TTL expiry, TLS certificate validation, token checks) behave as if the pods are 2 hours in the past. After the 2-minute window, the clock is restored to real time.
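To make the TTL concern concrete: Redis records key expiry as an absolute timestamp, so a process whose clock jumps backwards sees existing keys as further from expiry than they really are. The arithmetic below is a sketch only. It assumes the shifted CLOCK_REALTIME is actually visible to the Redis process, which this experiment did not measure, and the helper name apparent_ttl is hypothetical.

```shell
# Hypothetical helper: remaining TTL as seen by a process whose clock is
# shifted by the given offset (in seconds).
apparent_ttl() {
  remaining=$1   # seconds left on the key according to the real clock
  offset=$2      # clock offset in seconds; -7200 corresponds to timeOffset "-2h"
  echo $(( remaining - offset ))
}

# A key with 60 seconds left would appear to have 7260 seconds left.
apparent_ttl 60 -7200
```

If your workload depends on short TTLs, a test key with an expiry set before applying TimeChaos would give a more direct signal than the plain GET used here.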

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 11:52:10 | - | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 11:52:10 | 0s | TimeChaos applied: -2h clock offset injected on all master pods | Ready |
| 11:52:10 | +0s | CLOCK_REALTIME skewed by -2h on shard0-0, shard1-0, shard2-0 | Ready |
| 11:52:15 | ~+5s | Redis continues operating normally; cluster heartbeat and gossip unaffected | Ready |
| 11:54:10 | +2m00s | Chaos experiment window ends; clock offset recovered on all master pods | Ready |
| 11:54:10 | ~+2m00s | All master pod clocks restored to real time; cluster fully converged | Ready |

Result: PASS — A -2h clock offset was injected on all master pods via CLOCK_REALTIME. Redis remained available throughout the experiment — the cluster heartbeat and gossip protocol were unaffected by the skew. After the 2-minute chaos window ended, the clock was automatically restored on all pods. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 15: DNS Chaos

What it does: Injects DNS errors into all Redis master pods, causing DNS resolution failures for any hostname lookups. This simulates a DNS outage or misconfiguration that affects service discovery and inter-node communication relying on DNS.

apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: rd-master-dns-error
  namespace: demo
spec:
  action: error
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "2m"

Apply and observe:

kubectl get rd,pods -n demo

With DNS errors injected, any hostname-based lookups from master pods will fail. Redis Cluster primarily uses IP addresses for gossip and replication, so the impact is limited — but any client or sidecar relying on DNS-based service discovery will be affected.
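One way to check the IP-based claim on your own cluster is to look at the address field that each node advertises in CLUSTER NODES. The sketch below parses that output; the helper names node_hosts and all_ipv4 are hypothetical, and the sample lines are made-up data in the CLUSTER NODES layout rather than output from this run.

```shell
# Hypothetical helper: read CLUSTER NODES output on stdin and print the
# host part of each node's address field (field 2: host:port@cport[,hostname]).
node_hosts() {
  awk '{ sub(/:[0-9]+@.*/, "", $2); print $2 }'
}

# Hypothetical helper: succeed only if every line on stdin is a dotted
# IPv4 address, i.e. no node is advertised by hostname.
all_ipv4() {
  ! grep -Ev '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' >/dev/null
}

# Made-up sample input in the CLUSTER NODES format.
sample='aaa 10.244.0.12:6379@16379 myself,master - 0 0 1 connected 0-5460
bbb 10.244.0.13:6379@16379 slave aaa 0 0 1 connected'

if printf '%s\n' "$sample" | node_hosts | all_ipv4; then
  echo "all nodes advertise IP addresses"
fi
```

On a live cluster the input would come from redis-cli cluster nodes run inside one of the pods. If any node advertised a hostname instead, DNS chaos would be expected to have a larger effect than the one observed here.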

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 11:58:42 | - | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 11:58:42 | 0s | DNSChaos applied: DNS error injected on all master pods | Ready |
| 11:58:43 | ~+1s | DNS error injection succeeded on shard0-0, shard1-0, shard2-0 | Ready |
| 11:58:48 | ~+6s | DNS resolution failures observed; Redis cluster gossip unaffected (IP-based) | Ready |
| 12:00:42 | +2m00s | Chaos experiment window ends; DNS errors removed from all master pods | Ready |
| 12:00:42 | ~+2m00s | DNS resolution restored on all master pods; cluster fully converged | Ready |

Result: PASS. DNS errors were injected on all master pods for 2 minutes. Since Redis Cluster gossip and replication use IP addresses rather than hostnames, the cluster remained fully available throughout the experiment. DNS injection and recovery were confirmed via the DNSChaos status (AllInjected, then AllRecovered). The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 16: I/O Attr Override

What it does: Overrides the file permission attributes on all files in the Redis /data directory to perm: 72 (octal 0110 — execute-only for owner and group, no read or write). This simulates a scenario where the storage volume becomes inaccessible due to incorrect permissions, preventing Redis from reading or writing its data files.

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: rd-master-io-attr-override
  namespace: demo
spec:
  action: attrOverride
  attr:
    perm: 72
  duration: "2m"
  mode: all
  path: /data/**/*
  percent: 100
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  volumePath: /data

Apply and observe:

kubectl get rd,pods -n demo
kubectl logs -n demo redis-cluster-shard0-0 --tail=20 -f

With file permissions overridden to execute-only (--x--x---), Redis loses read and write access to its data directory. AOF and RDB persistence operations will fail immediately. After the 2-minute window ends, Chaos Mesh restores the original permissions.
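The manifest gives perm as the decimal integer 72, which is easy to misread as an octal mode. As a quick sanity check, the following sketch converts a decimal value to the octal and symbolic forms; the helper names perm_to_octal and perm_to_symbolic are hypothetical.

```shell
# Hypothetical helper: print a decimal permission value as 4-digit octal.
perm_to_octal() {
  printf '%04o\n' "$1"
}

# Hypothetical helper: print a decimal permission value in rwxrwxrwx form.
perm_to_symbolic() {
  mode=$1
  out=""
  # Walk the nine permission bits from owner-read down to other-execute.
  for bit in 256 128 64 32 16 8 4 2 1; do
    case $bit in
      256|32|4) ch=r ;;
      128|16|2) ch=w ;;
      *)        ch=x ;;
    esac
    if [ $(( mode & bit )) -ne 0 ]; then
      out="$out$ch"
    else
      out="$out-"
    fi
  done
  printf '%s\n' "$out"
}

perm_to_octal 72      # prints 0110
perm_to_symbolic 72   # prints --x--x---
```

Running the same helpers on a candidate value before applying the manifest helps avoid injecting a different permission set than intended.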

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 12:07:37 | - | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 12:07:37 | 0s | IOChaos applied: perm: 72 (execute-only) overridden on 100% of /data ops for all master pods | Ready |
| 12:07:37 | +0s | File attribute override injected on shard0-1, shard1-1, shard2-1 (AllInjected) | Ready |
| 12:07:42 | ~+5s | Redis persistence operations (AOF/RDB) fail; in-memory operations continue normally | Ready |
| 12:09:37 | +2m00s | Chaos experiment window ends; file permissions restored on all targeted pods (AllRecovered) | Ready |
| 12:09:40 | ~+2m03s | Persistence operations resume; cluster fully converged | Ready |

Result: PASS — File permissions on the /data path were overridden to execute-only (perm: 72) across all targeted pods for 2 minutes. Redis continued serving in-memory read/write operations normally, as the permission fault only affected disk-level persistence (AOF/RDB). After the chaos window ended, Chaos Mesh restored the original permissions and persistence operations resumed automatically. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 17: I/O Mistake

What it does: Injects data corruption into READ and WRITE operations on the Redis /data directory by overwriting up to 10 bytes with zeros on 100% of I/O operations. This simulates silent data corruption from a faulty storage device — where syscalls succeed but the data returned or written is silently corrupted.

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-mistake-example
  namespace: demo
spec:
  action: mistake
  duration: "120s"
  methods:
    - READ
    - WRITE
  mistake:
    filling: zero
    maxLength: 10
    maxOccurrences: 1
  mode: all
  path: /data/**/*
  percent: 100
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  volumePath: /data

Apply and observe:

kubectl get rd,pods -n demo
kubectl logs -n demo redis-cluster-shard0-0 --tail=20 -f

Unlike I/O fault (Experiment 11), this experiment does not return errors — syscalls appear to succeed, but up to 10 bytes per operation are silently zeroed out. This is particularly insidious as Redis may not detect the corruption immediately. AOF and RDB persistence operations are the most likely to be affected.
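Because nothing returns an error, this kind of fault only shows up when the bytes are compared against a known-good copy. The following local sketch illustrates the idea on a scratch file rather than on Redis data: it zeroes 10 bytes in place, mimicking filling: zero with maxLength: 10, and shows that the file size is unchanged while the content differs. The helper name compare_files and the file names are hypothetical.

```shell
# Hypothetical helper: print "same" if two files are byte-identical,
# otherwise "different".
compare_files() {
  if cmp -s "$1" "$2"; then echo same; else echo different; fi
}

workdir=$(mktemp -d)
printf 'SET chaos-test before-chaos\n' > "$workdir/original"
cp "$workdir/original" "$workdir/corrupted"

# Overwrite the first 10 bytes with zeros without truncating the file.
dd if=/dev/zero of="$workdir/corrupted" bs=1 count=10 conv=notrunc 2>/dev/null

compare_files "$workdir/original" "$workdir/original"    # prints same
compare_files "$workdir/original" "$workdir/corrupted"   # prints different

rm -rf "$workdir"
```

For real persistence files, the redis-check-aof and redis-check-rdb tools that ship with Redis are the more appropriate way to validate on-disk state after an experiment like this one.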

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 12:24:04 | - | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 12:24:04 | 0s | IOChaos applied: zero-fill mistake injected on 100% of READ/WRITE ops for all master pods | Ready |
| 12:24:04 | +0s | I/O mistake injected on shard0-1, shard1-1, shard2-1 (AllInjected) | Ready |
| 12:24:09 | ~+5s | Silent data corruption begins on AOF/RDB I/O ops; Redis in-memory operations unaffected | Ready |
| 12:26:04 | +2m00s | Chaos experiment window ends; I/O mistake removed from all targeted pods | Ready |
| 12:26:07 | ~+2m03s | Persistence operations return to normal; cluster fully converged | Ready |

Result: PASS. Silent zero-fill corruption was injected into 100% of READ and WRITE I/O operations on the /data path across all master pods for 2 minutes. Redis continued serving in-memory operations normally, because the fault only intercepts filesystem I/O under /data and so can reach only the persistence paths (AOF/RDB), not the in-memory dataset. After the chaos window ended, Chaos Mesh removed the injection (AllRecovered) and persistence resumed automatically. The test key chaos-test remained intact. Note that GET is served from memory, so this check alone does not confirm that the on-disk AOF/RDB files are free of corruption.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Summary

We ran 17 chaos experiments against a KubeDB-managed Redis Cluster on Kubernetes. Every experiment resulted in a PASS — the cluster recovered automatically and the test key chaos-test remained intact throughout.

| # | Experiment | Tool | Fault Type | Duration | Result |
|---|---|---|---|---|---|
| 1 | Master Pod Failure | PodChaos | Pod marked unavailable | 1m | ✅ PASS |
| 2 | Pod Kill Master | PodChaos | SIGKILL all master pods | 5m | ✅ PASS |
| 3 | Container Kill | PodChaos | Kill redis container in-place | 5m | ✅ PASS |
| 4 | Network Delay | NetworkChaos | 1000ms latency on master pods | 60s | ✅ PASS |
| 5 | Network Bandwidth | NetworkChaos | 1 Mbps cap on all pods | 60s | ✅ PASS |
| 6 | Network Corruption | NetworkChaos | 100% packet corruption on all pods | 60s | ✅ PASS |
| 7 | Network Partition | NetworkChaos | Bidirectional master↔replica split | 10m | ✅ PASS |
| 8 | CPU Stress | StressChaos | 4 workers at 50% CPU on masters | 2m | ✅ PASS |
| 9 | Memory Stress | StressChaos | 256MB allocation on one pod | 60s | ✅ PASS |
| 10 | I/O Latency | IOChaos | 15000ms latency on 50% of /data ops | 100s | ✅ PASS |
| 11 | I/O Fault | IOChaos | errno 5 on 50% of /data ops | 120s | ✅ PASS |
| 12 | Network Loss | NetworkChaos | 100% outgoing packet loss from masters | 2m | ✅ PASS |
| 13 | Network Duplicate | NetworkChaos | 100% packet duplication (bidirectional) | 2m | ✅ PASS |
| 14 | Time Offset | TimeChaos | -2h clock skew on all master pods | 2m | ✅ PASS |
| 15 | DNS Chaos | DNSChaos | DNS error injection on all master pods | 2m | ✅ PASS |
| 16 | I/O Attr Override | IOChaos | Execute-only permissions on /data | 2m | ✅ PASS |
| 17 | I/O Mistake | IOChaos | Silent zero-fill on 100% of READ/WRITE ops | 2m | ✅ PASS |

Key takeaways:

  • KubeDB continuously reconciles the Redis Cluster back to the desired state after every fault — no manual intervention was needed.
  • Redis Cluster mode provides built-in resilience: replica promotion, automatic re-election, and shard re-convergence all worked as expected.
  • Pod-level faults (kill, failure, container kill) recovered within seconds to a few minutes via Kubernetes StatefulSet rescheduling.
  • Network faults (partition, delay, bandwidth, corruption, loss, duplicate) caused temporary Critical or NotReady states but the cluster re-converged automatically once the fault was lifted.
  • Resource stress (CPU, memory) degraded performance but never caused data loss or cluster failure within the tested limits.
  • I/O faults are the most disruptive class — errno injection can prevent kubectl exec from entering pods — but Redis self-healed after the chaos window ended.
  • Time and DNS chaos had minimal impact on Redis Cluster because gossip and replication rely on IP addresses and in-memory state rather than wall-clock time or hostname resolution.

Cleanup

Once you are done with the experiments, remove all resources to avoid incurring unnecessary costs.

Delete all Chaos Mesh experiments:

kubectl delete podchaos,networkchaos,stresschaos,iochaos,timechaos,dnschaos --all -n demo

Delete the Redis Cluster:

kubectl delete redis redis-cluster -n demo

Delete the namespace:

kubectl delete ns demo

Uninstall Chaos Mesh (if installed via Helm):

helm uninstall chaos-mesh -n chaos-mesh
kubectl delete ns chaos-mesh

Uninstall KubeDB (if installed via Helm):

helm uninstall kubedb -n kubedb
kubectl delete ns kubedb
