Chaos Testing Redis on Kubernetes with KubeDB and Chaos Mesh

Redis is a popular in-memory data store used for caching, session management, pub/sub messaging, leaderboards, and real-time analytics. In many architectures, Redis sits in the critical path — a failure can cause cascading degradation across the entire application stack.

Chaos engineering is the discipline of proactively injecting failures into a system to discover weaknesses before they manifest as production incidents.

In this post, we will:

Deploy a production-style Redis Cluster using KubeDB
Inject twelve different failure scenarios using Chaos Mesh
Observe and validate how Redis and KubeDB respond to each fault

What is Chaos Engineering?

Chaos engineering is the practice of intentionally introducing controlled failures into a system to observe how it behaves. The goal is not to break things — it is to learn how the system responds so you can make it more resilient.

For a Redis deployment, the questions chaos engineering helps answer are:

Does Redis recover automatically after a pod is killed?
Does the cluster re-converge after a network partition?
Can clients reconnect seamlessly after a disruption?
Does KubeDB restore the desired state after a fault?
How does Redis behave under CPU or memory pressure?

Tools Used

Tool	Purpose
KubeDB	Manages Redis lifecycle on Kubernetes
Chaos Mesh	Injects chaos experiments into the cluster
Redis 7.4.0	The database under test

Prerequisites

Before you begin, make sure you have the following:

A running Kubernetes cluster (GKE, EKS, AKS, or a local cluster using Kind or Minikube)
kubectl configured to access the cluster
KubeDB operator installed — follow the KubeDB setup guide
Chaos Mesh installed — follow the Chaos Mesh installation guide
A default or usable StorageClass in the cluster

Step 1: Deploy Redis Cluster with KubeDB

We will deploy a Redis Cluster with 3 shards and 2 pods per shard (one master and one replica). This gives us a proper distributed setup where we can observe failover and re-convergence behavior.

Create the namespace:

kubectl create ns demo

Apply the Redis manifest:

apiVersion: kubedb.com/v1
kind: Redis
metadata:
  name: redis-cluster
  namespace: demo
spec:
  version: "7.4.0"
  mode: Cluster
  cluster:
    shards: 3
    replicas: 2
  storageType: Durable
  storage:
    resources:
      requests:
        storage: 1Gi
    accessModes:
      - ReadWriteOnce
  deletionPolicy: WipeOut

kubectl apply -f redis-cluster.yaml

Wait for Redis to become ready:

kubectl get rd,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    73s

NAME                         READY   STATUS    RESTARTS   AGE
pod/redis-cluster-shard0-0   1/1     Running   0          70s
pod/redis-cluster-shard0-1   1/1     Running   0          54s
pod/redis-cluster-shard1-0   1/1     Running   0          67s
pod/redis-cluster-shard1-1   1/1     Running   0          53s
pod/redis-cluster-shard2-0   1/1     Running   0          66s
pod/redis-cluster-shard2-1   1/1     Running   0          53s

Step 2: Verify Redis is Healthy

Retrieve the Redis password from the secret KubeDB created:

export PASSWORD=$(kubectl get secret -n demo redis-cluster-auth \
  -o jsonpath='{.data.password}' | base64 -d)

Check the cluster state:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c CLUSTER INFO
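
For scripted runs, the health check can be reduced to a pass/fail result. The sketch below is one possible approach, not part of KubeDB or Chaos Mesh: `cluster_info_ok` is a hypothetical helper name, and it relies only on the `cluster_state` and `cluster_slots_assigned` fields that Redis prints in `CLUSTER INFO`.

```shell
# Reads `CLUSTER INFO` output on stdin. Succeeds only if the cluster reports
# state "ok" and all 16384 hash slots are assigned.
# cluster_info_ok is a hypothetical helper name used for illustration.
cluster_info_ok() {
  info=$(tr -d '\r')   # Redis replies use CRLF line endings
  state=$(printf '%s\n' "$info" | awk -F: '$1 == "cluster_state" {print $2}')
  slots=$(printf '%s\n' "$info" | awk -F: '$1 == "cluster_slots_assigned" {print $2}')
  [ "$state" = "ok" ] && [ "$slots" = "16384" ]
}

# Usage against the cluster from this post (not executed here):
#   kubectl exec -n demo redis-cluster-shard0-0 -- \
#     redis-cli -a "$PASSWORD" CLUSTER INFO | cluster_info_ok && echo healthy
```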

Write a test key that we will check after each experiment to verify data integrity:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c SET chaos-test "before-chaos"
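
Because this key is checked after every experiment, the comparison can be wrapped in a small helper. This is a minimal sketch under the assumption that the value is the last non-empty line of the `redis-cli` output; `value_matches` is a hypothetical name, and the helper accepts the value with or without surrounding quotes.

```shell
# Reads the output of `redis-cli GET <key>` on stdin and compares the value on
# the last non-empty line (CR and surrounding quotes stripped) with $1.
# value_matches is a hypothetical helper name used for illustration.
value_matches() {
  expected=$1
  actual=$(tr -d '\r' | awk 'NF {last = $0} END {print last}' | sed 's/^"//; s/"$//')
  [ "$actual" = "$expected" ]
}

# Usage against the cluster from this post (not executed here):
#   kubectl exec -n demo redis-cluster-shard0-0 -- \
#     redis-cli -a "$PASSWORD" -c GET chaos-test | value_matches before-chaos \
#     && echo "data intact"
```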

Read it back:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test
Defaulted container "redis" out of: redis, redis-init (init)
"before-chaos"

Step 3: Verify Chaos Mesh is Ready

kubectl get pods -n chaos-mesh
NAME                                      READY   STATUS    RESTARTS   AGE
chaos-controller-manager-b8d65b98-75s8w   1/1     Running   0          2m15s
chaos-controller-manager-b8d65b98-jcmnt   1/1     Running   0          2m13s
chaos-controller-manager-b8d65b98-tfwfd   1/1     Running   0          2m14s
chaos-daemon-jhth2                        1/1     Running   0          2m15s
chaos-dashboard-566b9f5c4b-zmplh          1/1     Running   0          2m15s
chaos-dns-server-85b8846dc9-ksljn         1/1     Running   0          116m

Chaos Experiments

For each experiment below, the workflow is:

Apply the manifest
Watch pod and Redis status
Verify data is still accessible
Delete the experiment and wait for full recovery before running the next one
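
The "wait for full recovery" step can be automated with a polling loop. The sketch below is a generic helper rather than anything shipped with KubeDB; `wait_for_value` is a hypothetical name, and the usage comment assumes that the value shown in the STATUS column is exposed as `.status.phase` on the Redis object.

```shell
# Poll a command until its output equals the wanted value or the timeout expires.
# Arguments: <wanted> <timeout-seconds> <interval-seconds> <command...>
# wait_for_value is a hypothetical helper name used for illustration.
wait_for_value() {
  want=$1; timeout=$2; interval=$3
  shift 3
  elapsed=0
  while :; do
    got=$("$@" 2>/dev/null) || got=""
    if [ "$got" = "$want" ]; then
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s; last value: ${got:-none}" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Usage (assumption: the STATUS column maps to .status.phase; not executed here):
#   wait_for_value Ready 900 5 \
#     kubectl get rd redis-cluster -n demo -o jsonpath='{.status.phase}'
```

Passing the command as arguments keeps the loop independent of kubectl, so the same helper can wait on any single-value check.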

Experiment 1: Master Pod Failure

What it does: Makes the targeted Redis master pods unavailable for the fault window while leaving the pod objects in place, rather than deleting them. This simulates Kubernetes declaring the pods unhealthy.

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: rd-master-pod-failure-short
  namespace: demo
spec:
  action: pod-failure
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "1m"

Observe:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    39m

NAME                         READY   STATUS    RESTARTS         AGE
pod/redis-cluster-shard0-0   1/1     Running   10 (3m16s ago)   39m
pod/redis-cluster-shard0-1   1/1     Running   0                39m
pod/redis-cluster-shard1-0   1/1     Running   10 (3m16s ago)   39m
pod/redis-cluster-shard1-1   1/1     Running   0                39m
pod/redis-cluster-shard2-0   1/1     Running   10 (3m16s ago)   39m
pod/redis-cluster-shard2-1   1/1     Running   0                39m

All three master pods will go into a not-ready state. After the 1-minute duration, the experiment ends and the pods recover automatically.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

Defaulted container "redis" out of: redis, redis-init (init)
"before-chaos"

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
11:39:00	—	Pre-chaos baseline (all 3 master pods ONLINE)	`Ready`
11:44:07	0s	`PodChaos` applied (all master pods targeted)	`Ready`
11:44:07	+0s	All master pods marked unavailable	`NotReady`
11:44:21	+14s	Operator marks masters unhealthy	`Critical`
11:45:07	+1m00s	Chaos auto-recovered, master pods restart	`Critical`
11:45:40	~+1m33s	Pods in `RECOVERING` (rejoining cluster)	`Critical`
~11:48–52	~+4–8m	All masters reach `Running`, cluster re-converges	`Ready`
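
The "Δ from chaos" column is simple clock arithmetic. As a small sketch (the helper names `to_seconds` and `offset_from` are hypothetical), the offsets can be recomputed from the wall-clock values, assuming both timestamps fall on the same day:

```shell
# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
  echo "$1" | awk -F: '{print $1 * 3600 + $2 * 60 + $3}'
}

# Print the offset between two same-day timestamps as "+XmYYs".
offset_from() {
  start=$(to_seconds "$1")
  event=$(to_seconds "$2")
  diff=$((event - start))
  printf '+%dm%02ds\n' $((diff / 60)) $((diff % 60))
}

offset_from 11:44:07 11:44:21   # prints +0m14s (operator marks masters unhealthy)
offset_from 11:44:07 11:45:40   # prints +1m33s (pods rejoining the cluster)
```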

Result: PASS — All master pods recovered automatically after the 1-minute fault window. KubeDB reconciled the Redis Cluster back to the desired state with zero data loss. The chaos-impacted pods rejoined the cluster and the test key chaos-test remained intact.

Experiment 2: Pod Kill Master

What it does: Force-kills every Redis master pod (gracePeriod: 0), simulating an OOM kill or a sudden node failure. Unlike pod-failure, the pods are actually terminated and Kubernetes must recreate them.

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: rd-master-pod-kill-short
  namespace: demo
spec:
  action: pod-kill
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "5m"
  gracePeriod: 0

Observe:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    72m

NAME                         READY   STATUS    RESTARTS   AGE
pod/redis-cluster-shard0-0   1/1     Running   0          6m55s
pod/redis-cluster-shard0-1   1/1     Running   0          72m
pod/redis-cluster-shard1-0   1/1     Running   0          6m55s
pod/redis-cluster-shard1-1   1/1     Running   0          72m
pod/redis-cluster-shard2-0   1/1     Running   0          6m55s
pod/redis-cluster-shard2-1   1/1     Running   0          72m

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test
  
"before-chaos"

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
12:14:19	—	Pre-chaos baseline (all 3 master pods ONLINE)	`Ready`
12:14:19	0s	`PodChaos` applied — all master pods SIGKILLed	`Ready`
12:14:19	+0s	All master pods terminated (`pod-kill` injected for shard0-0, shard1-0, shard2-0)	`Critical`
12:14:30	~+11s	Kubernetes reschedules master pods; replicas promoted	`Critical`
12:14:55	~+36s	New master pods reach `Running`; cluster re-elects primaries	`Critical`
12:19:19	+5m00s	Chaos experiment window ends	`Critical`
~12:21:00	~+7m	All shards re-converge; cluster topology stabilized	`Ready`

Result: PASS — All master pods were hard-killed simultaneously. Kubernetes rescheduled them via the StatefulSet controller and KubeDB reconciled the cluster back to the desired state. Replica pods were promoted to masters during the kill window and the test key chaos-test remained intact after full recovery.
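
To confirm which nodes hold the master role after a failover, the output of `CLUSTER NODES` can be summarised. The sketch below assumes the standard layout in which the third field carries comma-separated flags (`master`, `slave`, `fail`, `fail?`); `count_roles` is a hypothetical helper name.

```shell
# Reads `CLUSTER NODES` output on stdin and prints
# "<healthy-masters> <healthy-replicas>". Nodes flagged fail or fail? are skipped.
# count_roles is a hypothetical helper name used for illustration.
count_roles() {
  tr -d '\r' | awk '
    {
      n = split($3, flags, ","); bad = 0; role = ""
      for (i = 1; i <= n; i++) {
        if (flags[i] == "fail" || flags[i] == "fail?") bad = 1
        if (flags[i] == "master" || flags[i] == "slave") role = flags[i]
      }
      if (bad) next
      if (role == "master") m++
      if (role == "slave") s++
    }
    END { printf "%d %d\n", m, s }'
}

# Usage against the cluster from this post (not executed here):
#   kubectl exec -n demo redis-cluster-shard0-1 -- \
#     redis-cli -a "$PASSWORD" CLUSTER NODES | count_roles
```

For the cluster in this post, a fully recovered topology would be expected to report three masters and three replicas.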

Experiment 3: Container Kill

What it does: Kills only the redis container inside each master pod, without deleting the pod itself. This tests in-place container restart behavior and is useful when sidecars or init containers are involved.

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: rd-master-container-kill-short
  namespace: demo
spec:
  action: container-kill
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "5m"
  containerNames: ['redis']

Observe:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    80m

NAME                         READY   STATUS    RESTARTS      AGE
pod/redis-cluster-shard0-0   1/1     Running   1 (61s ago)   14m
pod/redis-cluster-shard0-1   1/1     Running   0             79m
pod/redis-cluster-shard1-0   1/1     Running   1 (61s ago)   14m
pod/redis-cluster-shard1-1   1/1     Running   0             79m
pod/redis-cluster-shard2-0   1/1     Running   1 (61s ago)   14m
pod/redis-cluster-shard2-1   1/1     Running   0             79m

The pod stays alive. Only the redis container restarts. Once it comes back up, it rejoins the cluster automatically.

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
12:27:43	—	Pre-chaos baseline (all 3 master pods ONLINE)	`Ready`
12:27:43	0s	`PodChaos` applied — `redis` container killed in shard0-0, shard1-0, shard2-0	`Ready`
12:27:43	+0s	All 3 master containers killed simultaneously (`container-kill` injected)	`Critical`
12:27:44	~+1s	Kubernetes detects container exit; kubelet restarts `redis` container in-place	`Critical`
12:28:04	~+21s	Containers restart with 1 restart count; pods remain scheduled on same nodes	`Critical`
12:28:44	~+1m	Restarted containers rejoin the Redis Cluster; cluster re-converges	`Ready`
12:32:43	+5m00s	Chaos experiment window ends; no further kills injected	`Ready`

Result: PASS — All three master containers were killed simultaneously. Kubernetes restarted them in-place (pod identity preserved, no rescheduling needed). KubeDB reconciled the cluster back to the desired state with zero data loss. The test key chaos-test remained intact after full recovery.

Experiment 4: Network Delay

What it does: Injects artificial latency into the network traffic of Redis pods. This simulates cross-availability-zone communication, saturated network links, or noisy neighbour conditions.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: delay
  namespace: demo
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  delay:
    latency: '1000ms'
    correlation: '100'
    jitter: '50ms'
  duration: "60s"

Observe:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c PING

Commands that normally complete in under 1ms will now take 1000ms or more. After the 60-second window, latency returns to normal.

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
T+0:00	—	Pre-chaos baseline (all 6 pods ONLINE)	`Ready`
T+0:00	0s	`NetworkChaos` applied — 1000ms latency injected on all master pods	`Ready`
T+0:05	~+5s	Redis commands begin timing out; write attempts fail with `context deadline exceeded`	`NotReady`
T+0:17	~+17s	Cluster marks affected shards unavailable; KubeDB detects degraded state	`NotReady`
T+1:00	+60s	Chaos experiment window ends; latency injection removed	`NotReady`
T+3:00	~+3m	All shards re-converge; KubeDB reconciles cluster to desired state	`Ready`

Result: PASS — During the 1000ms latency window, Redis write operations failed with context deadline exceeded errors due to cluster heartbeat timeouts exceeding the configured threshold. After the chaos window ended, all pods recovered automatically and KubeDB reconciled the cluster back to Ready. Data integrity was confirmed — the test key chaos-test remained intact after full recovery.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 5: Network Bandwidth

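What it does: Throttles the network bandwidth available to all Redis pods to 1 Mbps. This simulates a congested or limited network link, such as a cross-region replication scenario or a degraded network interface. Note that Chaos Mesh rate units follow tc conventions, where the mbps suffix used in the manifest denotes megabytes per second; use mbit if a cap in megabits per second is intended.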

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bandwidth
  namespace: demo
spec:
  action: bandwidth
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
  bandwidth:
    rate: 1mbps
    limit: 20971520
    buffer: 10000
  duration: "60s"

Apply and observe:

kubectl get rd,pods -n demo
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c PING

With only 1 Mbps available, Redis inter-node gossip and replication traffic will compete with client traffic. Large key writes or bulk operations will slow significantly. After the 60-second window, bandwidth returns to normal.

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
T+0:00	—	Pre-chaos baseline (all 6 pods ONLINE)	`Ready`
T+0:00	0s	`NetworkChaos` applied — 1 Mbps cap injected on all pods	`Ready`
T+0:05	~+5s	Replication and gossip traffic degraded; write latency increases	`Ready`
T+0:30	~+30s	Cluster remains functional but throughput is visibly throttled	`Ready`
T+1:00	+60s	Chaos experiment window ends; bandwidth restriction lifted	`Ready`
T+1:10	~+1m10s	All pods return to full throughput; cluster fully converged	`Ready`

Result: PASS — The 1 Mbps bandwidth cap throttled replication and gossip traffic across all shards. Redis remained available throughout the experiment with elevated latency on large operations, but the cluster did not lose quorum. After the chaos window ended, all pods returned to normal throughput and the test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 6: Network Corruption

What it does: Corrupts a percentage of network packets for Redis pods. This simulates bit-flip errors from faulty hardware or bad cables.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: redis-network-corruption
  namespace: demo
spec:
  action: corrupt
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
  corrupt:
    corrupt: "100"
    correlation: "100"
  duration: "60s"

Observe:

kubectl get rd,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    145m

NAME                         READY   STATUS    RESTARTS      AGE
pod/redis-cluster-shard0-0   1/1     Running   1 (86m ago)   99m
pod/redis-cluster-shard0-1   1/1     Running   0             145m
pod/redis-cluster-shard1-0   1/1     Running   1 (86m ago)   99m
pod/redis-cluster-shard1-1   1/1     Running   0             145m
pod/redis-cluster-shard2-0   1/1     Running   1 (86m ago)   99m
pod/redis-cluster-shard2-1   1/1     Running   0             145m

TCP will detect and retransmit corrupted packets, so most Redis commands will still succeed but with higher latency and occasional errors. After the experiment, the network returns to normal.

Note: Chaos Mesh may initially fail to apply the tc (traffic control) rules on some pods and retries automatically until injection succeeds. This is visible in the NetworkChaos status as repeated Failed → Succeeded events per pod.

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
09:59:46	—	Pre-chaos baseline (all 6 pods ONLINE)	`Ready`
09:59:46	0s	`NetworkChaos` applied — 100% packet corruption injected on all pods	`Ready`
09:59:46	+0s	Chaos Mesh begins injecting `tc` rules; some pods fail initial injection and retry	`Ready`
10:00:00	~+14s	All pods successfully injected; Redis inter-node gossip and replication traffic degraded	`Ready`
10:00:10	~+24s	TCP retransmissions increase; Redis commands experience elevated latency and occasional errors	`NotReady`
10:00:46	+60s	Chaos experiment window ends; corruption rules removed from all pods	`NotReady`
10:00:50	~+1m04s	All pods return to normal network; cluster fully converged	`Ready`

Result: PASS — 100% packet corruption was injected across all 6 Redis pods. TCP retransmission handled most corrupted packets transparently, keeping the cluster available with elevated latency. Some pods required multiple injection retries by Chaos Mesh before tc rules were successfully applied. After the 60-second window ended, all pods recovered automatically. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 7: Network Partition

What it does: Cuts all network traffic, in both directions, between the Redis master pods and the replica pods. This is the most aggressive network experiment and tests how the Redis Cluster handles a split-brain scenario.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition
  namespace: demo
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        app.kubernetes.io/name: redises.kubedb.com
        kubedb.com/role: slave
  duration: "10m"

Observe:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS     AGE
redis.kubedb.com/redis-cluster   7.4.0     Critical   119m

NAME                         READY   STATUS    RESTARTS      AGE
pod/redis-cluster-shard0-0   1/1     Running   1 (40m ago)   53m
pod/redis-cluster-shard0-1   1/1     Running   0             118m
pod/redis-cluster-shard1-0   1/1     Running   1 (40m ago)   53m
pod/redis-cluster-shard1-1   1/1     Running   0             118m
pod/redis-cluster-shard2-0   1/1     Running   1 (40m ago)   53m
pod/redis-cluster-shard2-1   1/1     Running   0             118m

The masters are cut off from the replicas in every shard. Redis Cluster will mark the affected shards as unavailable. Once the 10m window ends, the partition is lifted and the cluster re-converges.

Observe after 10m:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    126m

NAME                         READY   STATUS    RESTARTS      AGE
pod/redis-cluster-shard0-0   1/1     Running   1 (47m ago)   60m
pod/redis-cluster-shard0-1   1/1     Running   0             126m
pod/redis-cluster-shard1-0   1/1     Running   1 (47m ago)   60m
pod/redis-cluster-shard1-1   1/1     Running   0             126m
pod/redis-cluster-shard2-0   1/1     Running   1 (47m ago)   60m
pod/redis-cluster-shard2-1   1/1     Running   0             126m

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
13:04:51	—	Pre-chaos baseline (all 3 master pods ONLINE)	`Ready`
13:04:51	0s	`NetworkChaos` applied — bidirectional partition between all masters and all replicas	`Ready`
13:04:52	~+1s	All 6 pods partitioned (masters isolated from replicas across all 3 shards)	`Critical`
13:05:10	~+20s	Redis Cluster marks affected shards unavailable; KubeDB detects degraded state	`Critical`
13:14:51	+10m00s	Chaos experiment window ends; partition lifted on all pods	`Critical`
~13:16:00	~+11m	All shards re-converge; cluster topology stabilized	`Ready`

Result: PASS — All master pods were bidirectionally partitioned from their replicas simultaneously. During the 10-minute window, the Redis Cluster entered a Critical state as shards lost quorum visibility. Once the partition was lifted, all pods automatically re-converged and KubeDB reconciled the cluster back to Ready. The test key chaos-test remained intact after full recovery.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 8: CPU Stress

What it does: Runs CPU stressors inside every Redis master pod (4 workers at 50% load each) to simulate CPU-intensive workloads on the same node or a resource-constrained environment.

Current CPU usage:

kubectl top pods -n demo
NAME                     CPU(cores)   MEMORY(bytes)   
redis-cluster-shard0-0   3m           5Mi             
redis-cluster-shard0-1   4m           6Mi             
redis-cluster-shard1-0   4m           5Mi             
redis-cluster-shard1-1   4m           5Mi             
redis-cluster-shard2-0   3m           4Mi             
redis-cluster-shard2-1   3m           6Mi

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-example
  namespace: demo
spec:
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "2m"
  stressors:
    cpu:
      workers: 4
      load: 50

Apply and Observe:

kubectl top pods -n demo
NAME                     CPU(cores)   MEMORY(bytes)   
redis-cluster-shard0-0   1901m        9Mi             
redis-cluster-shard0-1   3m           5Mi             
redis-cluster-shard1-0   2007m        9Mi             
redis-cluster-shard1-1   3m           6Mi             
redis-cluster-shard2-0   1104m        9Mi             
redis-cluster-shard2-1   3m           5Mi

Redis is single-threaded for command processing, so heavy CPU stress will noticeably increase command latency. After the experiment, CPU usage drops and performance returns to baseline.
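
To spot the stressed pods without reading the table by eye, the `kubectl top pods` output can be filtered. This is a minimal sketch that assumes CPU is reported in millicores with an `m` suffix, as in the output above; `pods_over_cpu` is a hypothetical helper name.

```shell
# Reads `kubectl top pods` output on stdin and prints the names of pods whose
# CPU usage in millicores exceeds the threshold given as $1.
# pods_over_cpu is a hypothetical helper name used for illustration.
pods_over_cpu() {
  threshold=$1
  awk -v limit="$threshold" 'NR > 1 {
    cpu = $2
    sub(/m$/, "", cpu)              # "1901m" -> "1901"
    if (cpu + 0 > limit + 0) print $1
  }'
}

# Usage (not executed here):
#   kubectl top pods -n demo | pods_over_cpu 1000
```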

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
13:28:05	—	Pre-chaos baseline (all 3 master pods ONLINE, ~3–4m CPU)	`Ready`
13:28:05	0s	`StressChaos` applied — 4 workers at 50% CPU load injected on all master pods	`Ready`
13:28:05	+0s	CPU stress injected into shard0-0, shard1-0, shard2-0	`Ready`
13:28:10	~+5s	CPU usage spikes to ~1100–2000m per master pod; command latency increases	`Ready`
13:28:35	~+30s	Redis remains available; single-threaded command processing visibly slower	`Ready`
13:30:05	+2m00s	Chaos experiment window ends; CPU stress removed from all master pods	`Ready`
13:30:10	~+2m05s	CPU usage returns to baseline (~3–4m); cluster fully converged	`Ready`

Result: PASS — CPU stress raised master pod usage from ~3–4m to ~1100–2000m cores. Redis remained available throughout the experiment, but command latency increased due to CPU contention. After the 2-minute chaos window ended, all pods recovered to baseline CPU usage automatically. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 9: Memory Stress

What it does: Allocates a large chunk of memory inside a Redis pod to simulate memory pressure, approaching OOM conditions.

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: redis-memory-stress
  namespace: demo
spec:
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/instance: "redis-cluster"
      app.kubernetes.io/name: "redises.kubedb.com"
  stressors:
    memory:
      workers: 2
      size: "256MB"
  duration: "60s"

Observe:

kubectl top pods -n demo
NAME                     CPU(cores)   MEMORY(bytes)   
redis-cluster-shard0-0   3m           5Mi             
redis-cluster-shard0-1   4m           6Mi             
redis-cluster-shard1-0   4m           252Mi           
redis-cluster-shard1-1   4m           5Mi             
redis-cluster-shard2-0   3m           4Mi             
redis-cluster-shard2-1   3m           5Mi

Watch whether Redis begins evicting keys (depending on your maxmemory-policy setting) and whether it recovers cleanly once the stressor is removed.
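
One way to watch for evictions is to compare the `evicted_keys` counter from `INFO stats` before and after the experiment. The sketch below only parses that counter; `evicted_keys` as a function name is a hypothetical choice, and whether evictions occur at all depends on the `maxmemory` settings of your deployment.

```shell
# Reads `INFO stats` output on stdin and prints the evicted_keys counter.
# The function name is a hypothetical choice made for illustration.
evicted_keys() {
  tr -d '\r' | awk -F: '$1 == "evicted_keys" {print $2}'
}

# Usage (not executed here): capture the counter before and after the stress run.
#   before=$(kubectl exec -n demo redis-cluster-shard1-0 -- \
#     redis-cli -a "$PASSWORD" INFO stats | evicted_keys)
#   # ... run the experiment ...
#   after=$(kubectl exec -n demo redis-cluster-shard1-0 -- \
#     redis-cli -a "$PASSWORD" INFO stats | evicted_keys)
#   echo "keys evicted during experiment: $((after - before))"
```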

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
14:04:17	—	Pre-chaos baseline (all master pods ONLINE, ~5–6Mi memory)	`Ready`
14:04:17	0s	`StressChaos` applied — 2 workers allocating 256MB injected on one master pod	`Ready`
14:04:17	+0s	Memory stress injected into `redis-cluster-shard1-0`	`Ready`
14:04:22	~+5s	Memory usage on `shard1-0` spikes to ~252Mi; Redis under pressure	`Ready`
14:04:47	~+30s	Redis remains available; no OOM kill triggered within limits	`Ready`
14:05:17	+1m00s	Chaos experiment window ends; memory stress removed from pod	`Ready`
14:05:20	~+1m03s	Memory usage returns to baseline (~5Mi); cluster fully converged	`Ready`

Result: PASS — Memory stress raised shard1-0 usage from ~5Mi to ~252Mi. Redis remained available throughout the experiment and no OOM kill was triggered. After the 60-second chaos window ended, the stressor was removed and memory returned to baseline automatically. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 10: I/O Chaos Latency

What it does: Injects latency into disk I/O operations for the Redis data directory. This simulates a slow or degraded storage backend affecting AOF writes and RDB snapshots.

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency-example
  namespace: demo
spec:
  action: latency
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  volumePath: /data
  path: '/data/**/*'
  delay: '15000ms'
  percent: 50
  duration: '100s'

Observe:

kubectl logs -n demo redis-cluster-shard0-0 --tail=50 -f

Redis will log warnings about slow I/O operations. AOF fsync and RDB snapshot latency will increase. After the experiment ends, I/O returns to normal and persistence operations resume.

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
09:16:01	—	Pre-chaos baseline (all 3 master pods ONLINE)	`Ready`
09:16:01	0s	`IOChaos` applied — 15000ms I/O latency injected on 50% of ops for shard0-0, shard1-0, shard2-0	`Ready`
09:16:01	+0s	I/O latency injected into `/data` path on all master pods	`NotReady`
09:16:10	~+9s	AOF fsync and RDB snapshot operations slow significantly; write latency increases	`NotReady`
09:17:41	+100s	Chaos experiment window ends; I/O latency removed from all master pods	`Ready`
09:17:45	~+1m44s	Persistence operations return to baseline; cluster fully converged	`Ready`

Result: PASS — I/O latency of 15000ms was injected into 50% of disk operations on the /data path across all master pods. Redis remained available throughout the experiment as it primarily operates in-memory, but persistence operations (AOF/RDB) were significantly delayed. After the 100-second chaos window ended, all pods recovered automatically. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 11: I/O Chaos Fault

What it does: Injects I/O faults (EIO — errno 5) into 50% of disk operations on the Redis /data directory. This simulates a failing or corrupted storage device, causing read/write syscalls to return errors rather than just slowing down.

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-fault-example
  namespace: demo
spec:
  action: fault
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  volumePath: /data
  path: /data/**/*
  errno: 5
  percent: 50
  duration: '120s'

Apply and observe:

kubectl get rd,pods -n demo
kubectl logs -n demo redis-cluster-shard0-0 --tail=20 -f

Unlike I/O latency (Experiment 10), this experiment causes syscalls to fail outright. Redis immediately starts logging AOF write errors. Because errno 5 (EIO) is injected on 50% of I/O syscalls, even kubectl exec into the affected pods fails — the working directory /data itself becomes unreadable:

error: Internal error occurred: error executing command in container: failed to exec in container:
OCI runtime exec failed: exec failed: unable to start container process:
chdir to cwd ("/data") set in config.json failed: input/output error

Redis logs during the experiment:

74:M 08 May 2026 09:30:16.611 # Error writing to the AOF file: Input/output error
74:M 08 May 2026 09:30:16.622 * AOF write error looks solved, Redis can write again.
74:M 08 May 2026 09:30:16.632 # Fail to fsync the AOF file: Input/output error

The intermittent nature of the fault (50% of ops) causes Redis to oscillate between detecting and resolving the error within the same second, exposing the retry logic in the AOF write path.
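
The oscillation can be quantified by counting error and recovery messages in the log. This is a rough sketch that matches only the three message texts shown above; `aof_error_summary` is a hypothetical helper name.

```shell
# Reads Redis log lines on stdin and prints "<errors> <recoveries>", counting
# AOF write/fsync failures and "looks solved" recovery messages.
# aof_error_summary is a hypothetical helper name used for illustration.
aof_error_summary() {
  awk '
    /Error writing to the AOF file/ || /Fail to fsync the AOF file/ { errors++ }
    /AOF write error looks solved/ { recoveries++ }
    END { printf "%d %d\n", errors, recoveries }'
}

# Usage (not executed here):
#   kubectl logs -n demo redis-cluster-shard0-0 --since=5m | aof_error_summary
```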

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
09:30:11	—	Pre-chaos baseline (all 3 master pods ONLINE)	`Ready`
09:30:11	0s	`IOChaos` applied — errno 5 injected on 50% of `/data` I/O ops for shard0-0, shard1-0, shard2-0	`Ready`
09:30:16	~+5s	AOF fsync fails: `Fail to fsync the AOF file: Input/output error` on shard1-0; `Error writing to the AOF file` on shard0-0	`NotReady`
09:30:16	~+5s	`kubectl exec` into affected pods fails — `/data` working directory returns `input/output error`	`NotReady`
09:32:11	+120s	Chaos experiment window ends; I/O fault removed from all master pods	`NotReady`
~09:32:13	~+2m02s	AOF write error resolves: `AOF write error looks solved, Redis can write again`	`Ready`

Result: PASS — errno 5 (EIO) was injected into 50% of I/O syscalls on the /data path across all master pods. Redis detected and logged AOF write failures within 5 seconds. The fault was severe enough to prevent kubectl exec from entering affected containers. After the 120-second window ended, Redis self-healed — AOF writes resumed and the cluster re-converged without data loss.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 12: Network Loss

What it does: Drops 100% of outgoing network packets from all Redis master pods to the rest of the cluster. This simulates a complete one-directional network failure, where masters can receive traffic but cannot send any responses.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: rd-master-packet-loss
  namespace: demo
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  loss:
    loss: "100"
    correlation: "100"
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        app.kubernetes.io/component: database
  duration: "2m"

Observe:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS     AGE
redis.kubedb.com/redis-cluster   7.4.0     NotReady   5h26m

NAME                         READY   STATUS    RESTARTS   AGE
pod/redis-cluster-shard0-0   1/1     Running   0          5h26m
pod/redis-cluster-shard0-1   1/1     Running   0          5h26m
pod/redis-cluster-shard1-0   1/1     Running   0          5h26m
pod/redis-cluster-shard1-1   1/1     Running   0          5h26m
pod/redis-cluster-shard2-0   1/1     Running   0          5h26m
pod/redis-cluster-shard2-1   1/1     Running   0          5h26m

With 100% outgoing packet loss from all master pods, the cluster loses inter-node gossip and replication traffic completely. Replica pods will detect master unavailability and attempt failover. After the 2-minute window, packet loss is lifted and the cluster re-converges.
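The failover and re-convergence described above can also be followed from the Redis side. The sketch below is one hedged way to do that. It defines two small helpers, cluster_state and healthy_masters (hypothetical names, not part of KubeDB or Chaos Mesh), that parse the text output of the standard CLUSTER INFO and CLUSTER NODES commands. The sample input is made up for illustration.

```shell
# Read `CLUSTER INFO` output on stdin and print the value of cluster_state.
# CLUSTER INFO uses CRLF line endings, so carriage returns are stripped first.
cluster_state() {
  tr -d '\r' | awk -F: '$1 == "cluster_state" { print $2 }'
}

# Read `CLUSTER NODES` output on stdin and count the masters whose flags
# (third field) contain neither "fail" nor "fail?".
healthy_masters() {
  awk '$3 ~ /master/ && $3 !~ /fail/ { n++ } END { print n + 0 }'
}

# Illustrative input only (hypothetical values, no cluster required):
printf 'cluster_state:fail\r\ncluster_slots_assigned:16384\r\n' | cluster_state
```

Against the cluster in this post, the helpers could be fed from a replica pod, for example: kubectl exec -n demo redis-cluster-shard0-1 -- redis-cli -a "$PASSWORD" cluster nodes | healthy_masters. Running this in a loop during the chaos window gives a Redis-level view to set beside the KubeDB status column.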

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
10:32:18	—	Pre-chaos baseline (all 3 master pods ONLINE)	`Ready`
10:32:18	0s	`NetworkChaos` applied — 100% outgoing packet loss injected on all master pods	`Ready`
10:32:18	+0s	All 6 pods targeted; master pods unable to send any packets to cluster peers	`Critical`
10:32:25	~+7s	Replica pods detect master heartbeat loss; Redis Cluster initiates replica promotion	`NotReady`
10:32:30	~+12s	KubeDB detects degraded cluster state	`NotReady`
10:34:18	+2m00s	Chaos experiment window ends; packet loss removed from all master pods	`Critical`
~10:36:00	~+3m42s	All shards re-converge; cluster topology stabilized	`Ready`

Result: PASS — 100% outgoing packet loss was injected on all master pods for 2 minutes. The Redis Cluster detected the loss of master heartbeats and promoted replicas to maintain availability. After the chaos window ended, all pods recovered automatically and KubeDB reconciled the cluster back to Ready. The test key chaos-test remained intact after full recovery.
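The "Δ from chaos" column in the timeline can be recomputed from the wall-clock values. The helper names below (to_seconds, delta) are hypothetical, and the sketch assumes both timestamps fall on the same day.

```shell
# Convert HH:MM:SS to seconds since midnight.
# One leading zero is stripped from each field so that values such as "09"
# are not parsed as invalid octal numbers in shell arithmetic.
to_seconds() {
  h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Print the elapsed time between two same-day timestamps as +XmYYs.
delta() {
  d=$(( $(to_seconds "$2") - $(to_seconds "$1") ))
  echo "+$(( d / 60 ))m$(printf '%02d' $(( d % 60 )))s"
}

# Chaos applied at 10:32:18, cluster Ready again at about 10:36:00.
delta 10:32:18 10:36:00   # prints +3m42s
```

This matches the ~+3m42s recovery entry in the table above: the cluster needed roughly 1m42s after the 2-minute chaos window closed to return to Ready.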

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 13: Network Duplicate

What it does: Duplicates 100% of the network packets exchanged between the Redis master pods and the rest of the cluster, in both directions (direction: both). This simulates a noisy or faulty network where duplicate packets cause redundant processing, increased bandwidth usage, and potential out-of-order delivery.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: rd-master-packet-duplicate
  namespace: demo
spec:
  action: duplicate
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duplicate:
    duplicate: "100"
    correlation: "100"
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        app.kubernetes.io/component: database
  duration: "2m"

Observe:

kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME                             VERSION   STATUS   AGE
redis.kubedb.com/redis-cluster   7.4.0     Ready    6h33m

NAME                         READY   STATUS    RESTARTS   AGE
pod/redis-cluster-shard0-0   1/1     Running   0          6h33m
pod/redis-cluster-shard0-1   1/1     Running   0          6h32m
pod/redis-cluster-shard1-0   1/1     Running   0          6h33m
pod/redis-cluster-shard1-1   1/1     Running   0          6h32m
pod/redis-cluster-shard2-0   1/1     Running   0          6h33m
pod/redis-cluster-shard2-1   1/1     Running   0          6h32m

With 100% packet duplication on all master pods, Redis inter-node gossip and replication traffic will contain duplicate packets. TCP will deduplicate most of them transparently, but increased bandwidth usage and potential sequence number handling overhead may cause elevated latency. After the 2-minute window, duplication is lifted and the cluster re-converges.

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
11:38:35	—	Pre-chaos baseline (all 3 master pods ONLINE)	`Ready`
11:38:35	0s	`NetworkChaos` applied — 100% packet duplication injected on all master pods (bidirectional)	`Ready`
11:38:35	+0s	All 6 pods targeted; duplicate packets injected on both master and replica pods	`Ready`
11:38:40	~+5s	Redis gossip and replication traffic contains duplicate packets; TCP deduplication handles most transparently	`Ready`
11:38:55	~+20s	Cluster remains functional; elevated bandwidth usage observed across all shards	`Ready`
11:40:35	+2m00s	Chaos experiment window ends; duplication removed from all pods	`Ready`
11:40:40	~+2m05s	All pods return to normal network behavior; cluster fully converged	`Ready`

Result: PASS — 100% packet duplication was injected bidirectionally across all master and replica pods. TCP’s built-in deduplication handled the duplicate packets transparently, keeping the Redis Cluster available throughout the experiment. No quorum loss or data corruption was observed. After the 2-minute chaos window ended, all pods returned to normal network behavior automatically. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 14: Time Offset

What it does: Shifts the system clock of all Redis master pods back by 2 hours. This simulates clock skew between nodes, which can affect TTL expiry, token validation, certificate checks, and distributed coordination logic that depends on wall-clock time.

apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: rd-master-time-offset
  namespace: demo
spec:
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  clockIds:
    - CLOCK_REALTIME
  timeOffset: "-2h"
  duration: "2m"

Observe:

kubectl get rd,pods -n demo

With the clock skewed by -2 hours on all master pods, time-sensitive operations (TTL expiry, TLS certificate validation, token checks) will behave as if the pods are 2 hours in the past. After the 2-minute window, the clock is restored to real time.
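One way to confirm the skew is to compare the time reported by the Redis server with the local clock. The sketch below assumes the injected offset is visible through the Redis TIME command. This is an assumption to verify in your environment. A plain date run through kubectl exec may not show the offset, since TimeChaos is documented as affecting the container's main process and its children rather than newly exec'd processes. The helper names are hypothetical.

```shell
# Print the difference (server - local) in whole seconds.
# $1: server epoch seconds (first line of `redis-cli TIME`)
# $2: local epoch seconds (`date +%s`)
clock_skew_seconds() {
  echo $(( $1 - $2 ))
}

# Format a skew in seconds as a signed hours/minutes string.
format_skew() {
  s=$1; sign=""
  if [ "$s" -lt 0 ]; then sign="-"; s=$(( 0 - s )); fi
  echo "${sign}$(( s / 3600 ))h$(( (s % 3600) / 60 ))m"
}

# Illustrative values: the server clock is 7200 seconds behind the local clock.
format_skew "$(clock_skew_seconds 1700000000 1700007200)"   # prints -2h0m
```

On the cluster, the server value could be captured with something like kubectl exec -n demo redis-cluster-shard0-0 -- redis-cli -a "$PASSWORD" TIME | head -n 1, and the local value with date +%s.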

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
11:52:10	—	Pre-chaos baseline (all 3 master pods ONLINE)	`Ready`
11:52:10	0s	`TimeChaos` applied — `-2h` clock offset injected on all master pods	`Ready`
11:52:10	+0s	`CLOCK_REALTIME` skewed by -2h on shard0-0, shard1-0, shard2-0	`Ready`
11:52:15	~+5s	Redis continues operating normally; cluster heartbeat and gossip unaffected	`Ready`
11:54:10	+2m00s	Chaos experiment window ends; clock offset recovered on all master pods	`Ready`
11:54:10	~+2m00s	All master pod clocks restored to real time; cluster fully converged	`Ready`

Result: PASS — A -2h clock offset was injected on all master pods via CLOCK_REALTIME. Redis remained available throughout the experiment — the cluster heartbeat and gossip protocol were unaffected by the skew. After the 2-minute chaos window ended, the clock was automatically restored on all pods. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 15: DNS Chaos

What it does: Injects DNS errors into all Redis master pods, causing DNS resolution failures for any hostname lookups. This simulates a DNS outage or misconfiguration that affects service discovery and inter-node communication relying on DNS.

apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: rd-master-dns-error
  namespace: demo
spec:
  action: error
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "2m"

Apply and observe:

kubectl get rd,pods -n demo

With DNS errors injected, any hostname-based lookups from master pods will fail. Redis Cluster primarily uses IP addresses for gossip and replication, so the impact is limited — but any client or sidecar relying on DNS-based service discovery will be affected.
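Whether the cluster bus really uses IP addresses can be checked from the CLUSTER NODES output, whose second field holds each node's endpoint. The sketch below uses a hypothetical helper, endpoint_kinds, and assumes IPv4 addresses. The sample input is invented for illustration.

```shell
# Read `CLUSTER NODES` output on stdin and classify each node's endpoint
# (the part of field 2 before the first colon) as "ip" or "hostname".
# Assumes IPv4; IPv6 endpoints would need a different pattern.
endpoint_kinds() {
  awk '{
    split($2, a, ":")
    if (a[1] ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) print "ip"
    else print "hostname"
  }'
}

# Illustrative input only (hypothetical node IDs and addresses):
printf 'a 10.244.0.12:6379@16379 myself,master - 0 0 1 connected 0-5460\n' | endpoint_kinds
```

If every line comes back as ip, gossip between nodes does not depend on DNS resolution, which is consistent with the limited impact described above.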

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
11:58:42	—	Pre-chaos baseline (all 3 master pods ONLINE)	`Ready`
11:58:42	0s	`DNSChaos` applied — DNS error injected on all master pods	`Ready`
11:58:43	~+1s	DNS error injection succeeded on shard0-0, shard1-0, shard2-0	`Ready`
11:58:48	~+6s	DNS resolution failures observed; Redis cluster gossip unaffected (IP-based)	`Ready`
12:00:42	+2m00s	Chaos experiment window ends; DNS errors removed from all master pods	`Ready`
12:00:42	~+2m00s	DNS resolution restored on all master pods; cluster fully converged	`Ready`

Result: PASS — DNS errors were injected on all master pods for 2 minutes. Since Redis Cluster gossip and replication use IP addresses rather than hostnames, the cluster remained fully available throughout the experiment. DNS injection and recovery were confirmed via the DNSChaos status (AllInjected → AllRecovered). The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 16: I/O Attr Override

What it does: Overrides the file permission attributes on all files in the Redis /data directory to perm: 72 (octal 0110 — execute-only for owner and group, no read or write). This simulates a scenario where the storage volume becomes inaccessible due to incorrect permissions, preventing Redis from reading or writing its data files.

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: rd-master-io-attr-override
  namespace: demo
spec:
  action: attrOverride
  attr:
    perm: 72
  duration: "2m"
  mode: all
  path: /data/**/*
  percent: 100
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  volumePath: /data

Apply and observe:

kubectl get rd,pods -n demo
kubectl logs -n demo redis-cluster-shard0-0 --tail=20 -f

With file permissions overridden to execute-only (--x--x---), Redis loses read and write access to its data directory. AOF and RDB persistence operations will fail immediately. After the 2-minute window ends, Chaos Mesh restores the original permissions.
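The perm field is written in decimal, so it is easy to misread. A short sketch of the conversion, using hypothetical helper names, shows how decimal 72 maps to octal 0110 and to the symbolic form --x--x---:

```shell
# Print a decimal permission value as a four-digit octal mode.
perm_to_octal() {
  printf '%04o\n' "$1"
}

# Print a decimal permission value as a nine-character rwx string.
perm_to_symbolic() {
  oct=$(printf '%03o' "$1")
  out=""
  # Walk the three octal digits (owner, group, other) one at a time.
  for d in $(echo "$oct" | sed 's/./& /g'); do
    case $d in
      0) out="${out}---" ;; 1) out="${out}--x" ;;
      2) out="${out}-w-" ;; 3) out="${out}-wx" ;;
      4) out="${out}r--" ;; 5) out="${out}r-x" ;;
      6) out="${out}rw-" ;; 7) out="${out}rwx" ;;
    esac
  done
  echo "$out"
}

perm_to_octal 72      # prints 0110
perm_to_symbolic 72   # prints --x--x---
```

For comparison, the common mode 0644 is 420 in decimal, which perm_to_symbolic renders as rw-r--r--.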

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
12:07:37	—	Pre-chaos baseline (all 3 master pods ONLINE)	`Ready`
12:07:37	0s	`IOChaos` applied — `perm: 72` (execute-only) overridden on 100% of `/data` ops for all master pods	`Ready`
12:07:37	+0s	File attribute override injected on shard0-1, shard1-1, shard2-1 (`AllInjected`)	`Ready`
12:07:42	~+5s	Redis persistence operations (AOF/RDB) fail; in-memory operations continue normally	`Ready`
12:09:37	+2m00s	Chaos experiment window ends; file permissions restored on all targeted pods (`AllRecovered`)	`Ready`
12:09:40	~+2m03s	Persistence operations resume; cluster fully converged	`Ready`

Result: PASS — File permissions on the /data path were overridden to execute-only (perm: 72) across all targeted pods for 2 minutes. Redis continued serving in-memory read/write operations normally, as the permission fault only affected disk-level persistence (AOF/RDB). After the chaos window ended, Chaos Mesh restored the original permissions and persistence operations resumed automatically. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Experiment 17: I/O Mistake

What it does: Injects data corruption into READ and WRITE operations on the Redis /data directory by overwriting up to 10 bytes with zeros on 100% of I/O operations. This simulates silent data corruption from a faulty storage device — where syscalls succeed but the data returned or written is silently corrupted.

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-mistake-example
  namespace: demo
spec:
  action: mistake
  duration: "120s"
  methods:
    - READ
    - WRITE
  mistake:
    filling: zero
    maxLength: 10
    maxOccurrences: 1
  mode: all
  path: /data/**/*
  percent: 100
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  volumePath: /data

Apply and observe:

kubectl get rd,pods -n demo
kubectl logs -n demo redis-cluster-shard0-0 --tail=20 -f

Unlike I/O fault (Experiment 11), this experiment does not return errors — syscalls appear to succeed, but up to 10 bytes per operation are silently zeroed out. This is particularly insidious as Redis may not detect the corruption immediately. AOF and RDB persistence operations are the most likely to be affected.
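To see why this kind of fault is hard to notice, the effect can be reproduced on an ordinary local file, outside Kubernetes. The sketch below is only an illustration of zero-fill corruption and not how Chaos Mesh implements it: it overwrites 10 bytes of a copy with zeros and shows that the file size is unchanged while the content differs. The file content is invented for the example.

```shell
# Create a small file and a copy that will be corrupted.
orig=$(mktemp); copy=$(mktemp)
printf 'SET chaos-test before-chaos\n' > "$orig"
cp "$orig" "$copy"

# Overwrite 10 bytes starting at offset 4 with zeros.
# conv=notrunc keeps the rest of the file, so the length does not change.
dd if=/dev/zero of="$copy" bs=1 seek=4 count=10 conv=notrunc 2>/dev/null

# A size check does not notice the change ...
size_orig=$(wc -c < "$orig" | tr -d ' ')
size_copy=$(wc -c < "$copy" | tr -d ' ')

# ... but a byte-for-byte comparison does.
if cmp -s "$orig" "$copy"; then result="identical"; else result="corrupted"; fi
echo "sizes: $size_orig vs $size_copy, content: $result"

rm -f "$orig" "$copy"
```

Note that the GET check used in this post reads the key from memory, so it does not exercise the files on disk. Redis provides a redis-check-aof utility for validating AOF files; whether it is available in the image used here is an assumption to verify.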

Observed timeline:

Wall-clock	Δ from chaos	Event	DB Status
12:24:04	—	Pre-chaos baseline (all 3 master pods ONLINE)	`Ready`
12:24:04	0s	`IOChaos` applied — zero-fill mistake injected on 100% of READ/WRITE ops for all master pods	`Ready`
12:24:04	+0s	I/O mistake injected on shard0-1, shard1-1, shard2-1 (`AllInjected`)	`Ready`
12:24:09	~+5s	Silent data corruption begins on AOF/RDB I/O ops; Redis in-memory operations unaffected	`Ready`
12:26:04	+2m00s	Chaos experiment window ends; I/O mistake removed from all targeted pods	`Ready`
12:26:07	~+2m03s	Persistence operations return to normal; cluster fully converged	`Ready`

Result: PASS. Silent zero-fill corruption, limited to at most 10 bytes per operation, was injected into 100% of READ and WRITE I/O operations on the /data path across all master pods for 2 minutes. Redis continued serving in-memory operations normally, because the fault only touches file I/O under /data (the AOF/RDB persistence paths) while Redis serves reads and writes from memory. After the chaos window ended, Chaos Mesh removed the injection (AllRecovered) and persistence resumed automatically. The test key chaos-test remained intact.

Verify data:

kubectl exec -it -n demo redis-cluster-shard0-0 -- \
  redis-cli -a $PASSWORD -c GET chaos-test

"before-chaos"

Summary

We ran 17 chaos experiments against a KubeDB-managed Redis Cluster on Kubernetes. Every experiment resulted in a PASS — the cluster recovered automatically and the test key chaos-test remained intact throughout.

#	Experiment	Chaos Kind	Fault Type	Duration	Result
1	Master Pod Failure	PodChaos	Pod marked unavailable	1m	✅ PASS
2	Pod Kill Master	PodChaos	SIGKILL all master pods	5m	✅ PASS
3	Container Kill	PodChaos	Kill `redis` container in-place	5m	✅ PASS
4	Network Delay	NetworkChaos	1000ms latency on master pods	60s	✅ PASS
5	Network Bandwidth	NetworkChaos	1 Mbps cap on all pods	60s	✅ PASS
6	Network Corruption	NetworkChaos	100% packet corruption on all pods	60s	✅ PASS
7	Network Partition	NetworkChaos	Bidirectional master↔replica split	10m	✅ PASS
8	CPU Stress	StressChaos	4 workers at 50% CPU on masters	2m	✅ PASS
9	Memory Stress	StressChaos	256MB allocation on one pod	60s	✅ PASS
10	I/O Latency	IOChaos	15000ms latency on 50% of `/data` ops	100s	✅ PASS
11	I/O Fault	IOChaos	errno 5 on 50% of `/data` ops	120s	✅ PASS
12	Network Loss	NetworkChaos	100% outgoing packet loss from masters	2m	✅ PASS
13	Network Duplicate	NetworkChaos	100% packet duplication (bidirectional)	2m	✅ PASS
14	Time Offset	TimeChaos	-2h clock skew on all master pods	2m	✅ PASS
15	DNS Chaos	DNSChaos	DNS error injection on all master pods	2m	✅ PASS
16	I/O Attr Override	IOChaos	Execute-only permissions on `/data`	2m	✅ PASS
17	I/O Mistake	IOChaos	Silent zero-fill on 100% of READ/WRITE ops	2m	✅ PASS

Key takeaways:

KubeDB continuously reconciles the Redis Cluster back to the desired state after every fault — no manual intervention was needed.
Redis Cluster mode provides built-in resilience: replica promotion, automatic re-election, and shard re-convergence all worked as expected.
Pod-level faults (kill, failure, container kill) recovered within seconds to a few minutes via Kubernetes StatefulSet rescheduling.
Network faults (partition, delay, bandwidth, corruption, loss, duplicate) caused temporary Critical or NotReady states but the cluster re-converged automatically once the fault was lifted.
Resource stress (CPU, memory) degraded performance but never caused data loss or cluster failure within the tested limits.
I/O faults are the most disruptive class — errno injection can prevent kubectl exec from entering pods — but Redis self-healed after the chaos window ended.
Time and DNS chaos had minimal impact on Redis Cluster because gossip and replication rely on IP addresses and in-memory state rather than wall-clock time or hostname resolution.

Cleanup

Once you are done with the experiments, remove all resources to avoid incurring unnecessary costs.

Delete all Chaos Mesh experiments:

kubectl delete podchaos,networkchaos,stresschaos,iochaos,timechaos,dnschaos --all -n demo

Delete the Redis Cluster:

kubectl delete redis redis-cluster -n demo

Delete the namespace:

kubectl delete ns demo

Uninstall Chaos Mesh (if installed via Helm):

helm uninstall chaos-mesh -n chaos-mesh
kubectl delete ns chaos-mesh

Uninstall KubeDB (if installed via Helm):

helm uninstall kubedb -n kubedb
kubectl delete ns kubedb
