
Redis is a popular in-memory data store used for caching, session management, pub/sub messaging, leaderboards, and real-time analytics. In many architectures, Redis sits in the critical path — a failure can cause cascading degradation across the entire application stack.
Chaos engineering is the discipline of proactively injecting failures into a system to discover weaknesses before they manifest as production incidents.
In this post, we will:
- Deploy a production-style Redis Cluster using KubeDB
- Inject thirteen different failure scenarios using Chaos Mesh
- Observe and validate how Redis and KubeDB respond to each fault
What is Chaos Engineering?
Chaos engineering is the practice of intentionally introducing controlled failures into a system to observe how it behaves. The goal is not to break things — it is to learn how the system responds so you can make it more resilient.
For a Redis deployment, the questions chaos engineering helps answer are:
- Does Redis recover automatically after a pod is killed?
- Does the cluster re-converge after a network partition?
- Can clients reconnect seamlessly after a disruption?
- Does KubeDB restore the desired state after a fault?
- How does Redis behave under CPU or memory pressure?
Tools Used
| Tool | Purpose |
|---|---|
| KubeDB | Manages Redis lifecycle on Kubernetes |
| Chaos Mesh | Injects chaos experiments into the cluster |
| Redis 7.4.0 | The database under test |
Prerequisites
Before you begin, make sure you have the following:
- A running Kubernetes cluster (GKE, EKS, AKS, or a local cluster using Kind or Minikube)
- kubectl configured to access the cluster
- KubeDB operator installed (follow the KubeDB setup guide)
- Chaos Mesh installed (follow the Chaos Mesh installation guide)
- A default or usable StorageClass in the cluster
Step 1: Deploy Redis Cluster with KubeDB
We will deploy a Redis Cluster with 3 shards and 2 replicas per shard, which in this manifest means two pods per shard (one master and one replica), six pods in total. This gives us a proper distributed setup where we can observe failover and re-convergence behavior.
Create the namespace:
kubectl create ns demo
Apply the Redis manifest:
apiVersion: kubedb.com/v1
kind: Redis
metadata:
  name: redis-cluster
  namespace: demo
spec:
  version: "7.4.0"
  mode: Cluster
  cluster:
    shards: 3
    replicas: 2
  storageType: Durable
  storage:
    resources:
      requests:
        storage: 1Gi
    accessModes:
      - ReadWriteOnce
  deletionPolicy: WipeOut
kubectl apply -f redis-cluster.yaml
Wait for Redis to become ready:
NAME VERSION STATUS AGE
redis.kubedb.com/redis-cluster 7.4.0 Ready 73s
NAME READY STATUS RESTARTS AGE
pod/redis-cluster-shard0-0 1/1 Running 0 70s
pod/redis-cluster-shard0-1 1/1 Running 0 54s
pod/redis-cluster-shard1-0 1/1 Running 0 67s
pod/redis-cluster-shard1-1 1/1 Running 0 53s
pod/redis-cluster-shard2-0 1/1 Running 0 66s
pod/redis-cluster-shard2-1 1/1 Running 0 53s
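Rather than re-running the status command by hand, a small polling loop can block until the database reports Ready. The following is a minimal sketch: the function name `wait_for_phase` is hypothetical, and it assumes the value in the STATUS column above is stored at `.status.phase` on the Redis object.

```shell
#!/usr/bin/env bash
# Poll a KubeDB Redis object until it reaches the wanted phase or a timeout expires.
# Assumption: the phase shown in the STATUS column is stored at .status.phase.
wait_for_phase() {
  local name="$1" namespace="$2" want="${3:-Ready}" timeout="${4:-300}" interval="${5:-5}"
  local waited=0 phase=""
  while [ "$waited" -le "$timeout" ]; do
    # "rd" is the short resource name used elsewhere in this post.
    phase="$(kubectl get rd "$name" -n "$namespace" \
      -o jsonpath='{.status.phase}' 2>/dev/null || true)"
    if [ "$phase" = "$want" ]; then
      echo "$name is $want after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out after ${timeout}s; last phase: ${phase:-unknown}" >&2
  return 1
}

# Usage: wait_for_phase redis-cluster demo Ready 300
```

The same helper can be reused after each experiment to confirm the database is back to Ready before the next fault is injected.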
Step 2: Verify Redis is Healthy
Retrieve the Redis password from the secret KubeDB created:
export PASSWORD=$(kubectl get secret -n demo redis-cluster-auth \
-o jsonpath='{.data.password}' | base64 -d)
Check the cluster state:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c CLUSTER INFO
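CLUSTER INFO prints a list of `field:value` lines, and the one that matters most here is `cluster_state`, which should read `ok` on a healthy cluster. As a rough sketch, the check could be scripted as follows (the function name `cluster_state_from_info` is made up for illustration):

```shell
# Extract the cluster_state field from CLUSTER INFO output read on stdin.
# Lines may end in \r\n, so carriage returns are stripped first.
cluster_state_from_info() {
  tr -d '\r' | awk -F: '$1 == "cluster_state" { print $2 }'
}

# Usage (assumes $PASSWORD is exported as shown above):
#   kubectl exec -n demo redis-cluster-shard0-0 -- \
#     redis-cli -a "$PASSWORD" -c CLUSTER INFO | cluster_state_from_info
```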
Write a test key that we will check after each experiment to verify data integrity:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c SET chaos-test "before-chaos"
Read it back:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
Defaulted container "redis" out of: redis, redis-init (init)
"before-chaos"
Step 3: Verify Chaos Mesh is Ready
kubectl get pods -n chaos-mesh
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-b8d65b98-75s8w 1/1 Running 0 2m15s
chaos-controller-manager-b8d65b98-jcmnt 1/1 Running 0 2m13s
chaos-controller-manager-b8d65b98-tfwfd 1/1 Running 0 2m14s
chaos-daemon-jhth2 1/1 Running 0 2m15s
chaos-dashboard-566b9f5c4b-zmplh 1/1 Running 0 2m15s
chaos-dns-server-85b8846dc9-ksljn 1/1 Running 0 116m
Chaos Experiments
For each experiment below, the workflow is:
- Apply the manifest
- Watch pod and Redis status
- Verify data is still accessible
- Delete the experiment and wait for full recovery before running the next one
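The workflow above can be wrapped in a small helper so every experiment is run the same way. This is only a sketch under stated assumptions: `run_experiment` is a hypothetical name, the manifest path and wait time are supplied by the caller, `$PASSWORD` is exported as in Step 2, and the verification read goes through `redis-cluster-shard0-0`.

```shell
# Apply one chaos manifest, wait out its window, check the test key, then clean up.
run_experiment() {
  local manifest="$1" wait_seconds="$2" value=""
  kubectl apply -f "$manifest" || return 1
  kubectl get rd,pods -n demo          # snapshot of status right after injection
  sleep "$wait_seconds"                # let the chaos window elapse
  # Without a TTY, redis-cli prints the raw value (no surrounding quotes).
  value="$(kubectl exec -n demo redis-cluster-shard0-0 -- \
    redis-cli -a "$PASSWORD" -c GET chaos-test 2>/dev/null | tr -d '\r')"
  kubectl delete -f "$manifest"
  if [ "$value" = "before-chaos" ]; then
    echo "PASS: chaos-test intact"
  else
    echo "FAIL: chaos-test returned '${value}'"
    return 1
  fi
}

# Usage: run_experiment pod-failure.yaml 90
```

The helper does not wait for recovery on its own, so confirm the database is Ready again before starting the next run.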
Experiment 1: Master Pod Failure
What it does: Makes every Redis master pod unavailable for a fixed duration without deleting the pod objects. This simulates Kubernetes declaring the pods unhealthy.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: rd-master-pod-failure-short
  namespace: demo
spec:
  action: pod-failure
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "1m"
Observe:
kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME VERSION STATUS AGE
redis.kubedb.com/redis-cluster 7.4.0 Ready 39m
NAME READY STATUS RESTARTS AGE
pod/redis-cluster-shard0-0 1/1 Running 10 (3m16s ago) 39m
pod/redis-cluster-shard0-1 1/1 Running 0 39m
pod/redis-cluster-shard1-0 1/1 Running 10 (3m16s ago) 39m
pod/redis-cluster-shard1-1 1/1 Running 0 39m
pod/redis-cluster-shard2-0 1/1 Running 10 (3m16s ago) 39m
pod/redis-cluster-shard2-1 1/1 Running 0 39m
All three master pods go into a not-ready state. After the 1-minute duration, the experiment ends and the pods recover automatically.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
Defaulted container "redis" out of: redis, redis-init (init)
"before-chaos"
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 11:39:00 | — | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 11:44:07 | 0s | PodChaos applied (all master pods targeted) | Ready |
| 11:44:07 | +0s | All master pods marked unavailable | NotReady |
| 11:44:21 | +14s | Operator marks masters unhealthy | Critical |
| 11:45:07 | +1m00s | Chaos auto-recovered, master pods restart | Critical |
| 11:45:40 | ~+1m33s | Pods in RECOVERING (rejoining cluster) | Critical |
| ~11:48–52 | ~+4–8m | All masters reach Running, cluster re-converges | Ready |
Result: PASS — All master pods recovered automatically after the 1-minute fault window. KubeDB reconciled the Redis Cluster back to the desired state with zero data loss. The chaos-impacted pods rejoined the cluster and the test key chaos-test remained intact.
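The Δ column in these timelines is the wall-clock time minus the moment the chaos was applied. When building a timeline from logs, a small helper can do that arithmetic; `to_seconds` and `delta_seconds` are hypothetical names, and the sketch assumes both timestamps fall on the same day.

```shell
# Convert an HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print the number of seconds from the first timestamp to the second.
delta_seconds() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

delta_seconds 11:44:07 11:45:40   # prints 93, matching the ~+1m33s row above
```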
Experiment 2: Pod Kill Master
What it does: Sends SIGKILL to a Redis pod, simulating an OOM kill or a sudden node failure. Unlike pod-failure, the pod process is actually terminated and Kubernetes must reschedule it.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: rd-master-pod-kill-short
  namespace: demo
spec:
  action: pod-kill
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "5m"
  gracePeriod: 0
Observe:
kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME VERSION STATUS AGE
redis.kubedb.com/redis-cluster 7.4.0 Ready 72m
NAME READY STATUS RESTARTS AGE
pod/redis-cluster-shard0-0 1/1 Running 0 6m55s
pod/redis-cluster-shard0-1 1/1 Running 0 72m
pod/redis-cluster-shard1-0 1/1 Running 0 6m55s
pod/redis-cluster-shard1-1 1/1 Running 0 72m
pod/redis-cluster-shard2-0 1/1 Running 0 6m55s
pod/redis-cluster-shard2-1 1/1 Running 0 72m
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 12:14:19 | — | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 12:14:19 | 0s | PodChaos applied — all master pods SIGKILLed | Ready |
| 12:14:19 | +0s | All master pods terminated (pod-kill injected for shard0-0, shard1-0, shard2-0) | Critical |
| 12:14:30 | ~+11s | Kubernetes reschedules master pods; replicas promoted | Critical |
| 12:14:55 | ~+36s | New master pods reach Running; cluster re-elects primaries | Critical |
| 12:19:19 | +5m00s | Chaos experiment window ends | Critical |
| ~12:21:00 | ~+7m | All shards re-converge; cluster topology stabilized | Ready |
Result: PASS — All master pods were hard-killed simultaneously. Kubernetes rescheduled them via the StatefulSet controller and KubeDB reconciled the cluster back to the desired state. Replica pods were promoted to masters during the kill window and the test key chaos-test remained intact after full recovery.
Experiment 3: Container Kill
What it does: Kills only the redis container inside a pod, without deleting the pod itself. This tests in-place container restart behavior and is useful when sidecars or init containers are involved.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: rd-master-container-kill-short
  namespace: demo
spec:
  action: container-kill
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "5m"
  containerNames: ['redis']
Observe:
kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME VERSION STATUS AGE
redis.kubedb.com/redis-cluster 7.4.0 Ready 80m
NAME READY STATUS RESTARTS AGE
pod/redis-cluster-shard0-0 1/1 Running 1 (61s ago) 14m
pod/redis-cluster-shard0-1 1/1 Running 0 79m
pod/redis-cluster-shard1-0 1/1 Running 1 (61s ago) 14m
pod/redis-cluster-shard1-1 1/1 Running 0 79m
pod/redis-cluster-shard2-0 1/1 Running 1 (61s ago) 14m
pod/redis-cluster-shard2-1 1/1 Running 0 79m
The pod stays alive. Only the redis container restarts. Once it comes back up, it rejoins the cluster automatically.
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 12:27:43 | — | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 12:27:43 | 0s | PodChaos applied — redis container killed in shard0-0, shard1-0, shard2-0 | Ready |
| 12:27:43 | +0s | All 3 master containers killed simultaneously (container-kill injected) | Critical |
| 12:27:44 | ~+1s | Kubernetes detects container exit; kubelet restarts redis container in-place | Critical |
| 12:28:04 | ~+21s | Containers restart with 1 restart count; pods remain scheduled on same nodes | Critical |
| 12:28:44 | ~+1m | Restarted containers rejoin the Redis Cluster; cluster re-converges | Ready |
| 12:32:43 | +5m00s | Chaos experiment window ends; no further kills injected | Ready |
Result: PASS — All three master containers were killed simultaneously. Kubernetes restarted them in-place (pod identity preserved, no rescheduling needed). KubeDB reconciled the cluster back to the desired state with zero data loss. The test key chaos-test remained intact after full recovery.
Experiment 4: Network Delay
What it does: Injects artificial latency into the network traffic of Redis pods. This simulates cross-availability-zone communication, saturated network links, or noisy neighbour conditions.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: delay
  namespace: demo
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  delay:
    latency: '1000ms'
    correlation: '100'
    jitter: '50ms'
  duration: "60s"
Observe:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c PING
Commands that normally complete in under 1ms will now take 1000ms or more. After the 60-second window, latency returns to normal.
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| T+0:00 | — | Pre-chaos baseline (all 6 pods ONLINE) | Ready |
| T+0:00 | 0s | NetworkChaos applied — 1000ms latency injected on all master pods | Ready |
| T+0:05 | ~+5s | Redis commands begin timing out; write attempts fail with context deadline exceeded | NotReady |
| T+0:17 | ~+17s | Cluster marks affected shards unavailable; KubeDB detects degraded state | NotReady |
| T+1:00 | +60s | Chaos experiment window ends; latency injection removed | NotReady |
| T+3:00 | ~+3m | All shards re-converge; KubeDB reconciles cluster to desired state | Ready |
Result: PASS — During the 1000ms latency window, Redis write operations failed with context deadline exceeded errors due to cluster heartbeat timeouts exceeding the configured threshold. After the chaos window ended, all pods recovered automatically and KubeDB reconciled the cluster back to Ready. Data integrity was confirmed — the test key chaos-test remained intact after full recovery.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Experiment 5: Network Bandwidth
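What it does: Throttles the network bandwidth available to all Redis pods to 1 Mbps (rate: 1mbps in the manifest; Chaos Mesh follows tc-style units, where bps means bytes per second, so the effective cap is about 1 MB/s). This simulates a congested or limited network link, such as a cross-region replication scenario or a degraded network interface.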
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bandwidth
  namespace: demo
spec:
  action: bandwidth
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
  bandwidth:
    rate: 1mbps
    limit: 20971520
    buffer: 10000
  duration: "60s"
Apply and observe:
kubectl get rd,pods -n demo
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c PING
With only 1 Mbps available, Redis inter-node gossip and replication traffic will compete with client traffic. Large key writes or bulk operations will slow significantly. After the 60-second window, bandwidth returns to normal.
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| T+0:00 | — | Pre-chaos baseline (all 6 pods ONLINE) | Ready |
| T+0:00 | 0s | NetworkChaos applied — 1 Mbps cap injected on all pods | Ready |
| T+0:05 | ~+5s | Replication and gossip traffic degraded; write latency increases | Ready |
| T+0:30 | ~+30s | Cluster remains functional but throughput is visibly throttled | Ready |
| T+1:00 | +60s | Chaos experiment window ends; bandwidth restriction lifted | Ready |
| T+1:10 | ~+1m10s | All pods return to full throughput; cluster fully converged | Ready |
Result: PASS — The 1 Mbps bandwidth cap throttled replication and gossip traffic across all shards. Redis remained available throughout the experiment with elevated latency on large operations, but the cluster did not lose quorum. After the chaos window ended, all pods returned to normal throughput and the test key chaos-test remained intact.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Experiment 6: Network Corruption
What it does: Corrupts a percentage of network packets for Redis pods. This simulates bit-flip errors from faulty hardware or bad cables.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: redis-network-corruption
  namespace: demo
spec:
  action: corrupt
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
  corrupt:
    corrupt: "100"
    correlation: "100"
  duration: "60s"
Observe:
kubectl get rd,pods -n demo
NAME VERSION STATUS AGE
redis.kubedb.com/redis-cluster 7.4.0 Ready 145m
NAME READY STATUS RESTARTS AGE
pod/redis-cluster-shard0-0 1/1 Running 1 (86m ago) 99m
pod/redis-cluster-shard0-1 1/1 Running 0 145m
pod/redis-cluster-shard1-0 1/1 Running 1 (86m ago) 99m
pod/redis-cluster-shard1-1 1/1 Running 0 145m
pod/redis-cluster-shard2-0 1/1 Running 1 (86m ago) 99m
pod/redis-cluster-shard2-1 1/1 Running 0 145m
TCP will detect and retransmit corrupted packets, so most Redis commands will still succeed but with higher latency and occasional errors. After the experiment, the network returns to normal.
Note: Chaos Mesh may initially fail to apply the tc (traffic control) rules on some pods and retries automatically until injection succeeds. This is visible in the NetworkChaos status as repeated Failed → Succeeded events per pod.
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 09:59:46 | — | Pre-chaos baseline (all 6 pods ONLINE) | Ready |
| 09:59:46 | 0s | NetworkChaos applied — 100% packet corruption injected on all pods | Ready |
| 09:59:46 | +0s | Chaos Mesh begins injecting tc rules; some pods fail initial injection and retry | Ready |
| 10:00:00 | ~+14s | All pods successfully injected; Redis inter-node gossip and replication traffic degraded | Ready |
| 10:00:10 | ~+24s | TCP retransmissions increase; Redis commands experience elevated latency and occasional errors | NotReady |
| 10:00:46 | +60s | Chaos experiment window ends; corruption rules removed from all pods | NotReady |
| 10:00:50 | ~+1m04s | All pods return to normal network; cluster fully converged | Ready |
Result: PASS — 100% packet corruption was injected across all 6 Redis pods. TCP retransmission handled most corrupted packets transparently, keeping the cluster available with elevated latency. Some pods required multiple injection retries by Chaos Mesh before tc rules were successfully applied. After the 60-second window ended, all pods recovered automatically. The test key chaos-test remained intact.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Experiment 7: Network Partition
What it does: Blocks all traffic in both directions between the Redis master pods and the replica pods. This is the most aggressive network experiment and tests how the Redis Cluster handles a split-brain scenario.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition
  namespace: demo
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        app.kubernetes.io/name: redises.kubedb.com
        kubedb.com/role: slave
  duration: "10m"
Observe:
kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME VERSION STATUS AGE
redis.kubedb.com/redis-cluster 7.4.0 Critical 119m
NAME READY STATUS RESTARTS AGE
pod/redis-cluster-shard0-0 1/1 Running 1 (40m ago) 53m
pod/redis-cluster-shard0-1 1/1 Running 0 118m
pod/redis-cluster-shard1-0 1/1 Running 1 (40m ago) 53m
pod/redis-cluster-shard1-1 1/1 Running 0 118m
pod/redis-cluster-shard2-0 1/1 Running 1 (40m ago) 53m
pod/redis-cluster-shard2-1 1/1 Running 0 118m
The masters are cut off from the replicas, and Redis Cluster marks the affected shards as unavailable. Once the 10m window ends, the partition is lifted and the cluster re-converges.
Observe after 10m:
kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME VERSION STATUS AGE
redis.kubedb.com/redis-cluster 7.4.0 Ready 126m
NAME READY STATUS RESTARTS AGE
pod/redis-cluster-shard0-0 1/1 Running 1 (47m ago) 60m
pod/redis-cluster-shard0-1 1/1 Running 0 126m
pod/redis-cluster-shard1-0 1/1 Running 1 (47m ago) 60m
pod/redis-cluster-shard1-1 1/1 Running 0 126m
pod/redis-cluster-shard2-0 1/1 Running 1 (47m ago) 60m
pod/redis-cluster-shard2-1 1/1 Running 0 126m
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 13:04:51 | — | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 13:04:51 | 0s | NetworkChaos applied — bidirectional partition between all masters and all replicas | Ready |
| 13:04:52 | ~+1s | All 6 pods partitioned (masters isolated from replicas across all 3 shards) | Critical |
| 13:05:10 | ~+20s | Redis Cluster marks affected shards unavailable; KubeDB detects degraded state | Critical |
| 13:14:51 | +10m00s | Chaos experiment window ends; partition lifted on all pods | Critical |
| ~13:16:00 | ~+11m | All shards re-converge; cluster topology stabilized | Ready |
Result: PASS — All master pods were bidirectionally partitioned from their replicas simultaneously. During the 10-minute window, the Redis Cluster entered a Critical state as shards lost quorum visibility. Once the partition was lifted, all pods automatically re-converged and KubeDB reconciled the cluster back to Ready. The test key chaos-test remained intact after full recovery.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Experiment 8: CPU Stress
What it does: Saturates the CPU of the Redis master pods to simulate CPU-intensive workloads on the same node or a resource-constrained environment.
Current CPU usage:
kubectl top pods -n demo
NAME CPU(cores) MEMORY(bytes)
redis-cluster-shard0-0 3m 5Mi
redis-cluster-shard0-1 4m 6Mi
redis-cluster-shard1-0 4m 5Mi
redis-cluster-shard1-1 4m 5Mi
redis-cluster-shard2-0 3m 4Mi
redis-cluster-shard2-1 3m 6Mi
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-example
  namespace: demo
spec:
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "2m"
  stressors:
    cpu:
      workers: 4
      load: 50
Apply and Observe:
kubectl top pods -n demo
NAME CPU(cores) MEMORY(bytes)
redis-cluster-shard0-0 1901m 9Mi
redis-cluster-shard0-1 3m 5Mi
redis-cluster-shard1-0 2007m 9Mi
redis-cluster-shard1-1 3m 6Mi
redis-cluster-shard2-0 1104m 9Mi
redis-cluster-shard2-1 3m 5Mi
Redis is single-threaded for command processing, so heavy CPU stress will noticeably increase command latency. After the experiment, CPU usage drops and performance returns to baseline.
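The observed figures are roughly what the stressor settings predict. As a back-of-the-envelope sketch, assuming each worker can occupy at most `load` percent of one core (the helper name `expected_stress_millicores` is hypothetical), 4 workers at 50% load add up to about two cores:

```shell
# Rough upper bound on CPU added by a StressChaos cpu stressor, in millicores.
# Assumption: each worker occupies at most `load` percent of one core.
expected_stress_millicores() {
  local workers="$1" load_percent="$2"
  # 1 core = 1000m, so 1 percent of a core = 10m.
  echo $(( workers * load_percent * 10 ))
}

expected_stress_millicores 4 50   # prints 2000
```

That lines up with the ~1900m to ~2000m seen on shard0-0 and shard1-0; a pod that stays well below the bound, like shard2-0 here, may simply be competing for CPU on its node.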
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 13:28:05 | — | Pre-chaos baseline (all 3 master pods ONLINE, ~3–4m CPU) | Ready |
| 13:28:05 | 0s | StressChaos applied — 4 workers at 50% CPU load injected on all master pods | Ready |
| 13:28:05 | +0s | CPU stress injected into shard0-0, shard1-0, shard2-0 | Ready |
| 13:28:10 | ~+5s | CPU usage spikes to ~1100–2000m per master pod; command latency increases | Ready |
| 13:28:35 | ~+30s | Redis remains available; single-threaded command processing visibly slower | Ready |
| 13:30:05 | +2m00s | Chaos experiment window ends; CPU stress removed from all master pods | Ready |
| 13:30:10 | ~+2m05s | CPU usage returns to baseline (~3–4m); cluster fully converged | Ready |
Result: PASS — CPU stress raised master pod usage from ~3–4m to ~1100–2000m cores. Redis remained available throughout the experiment, but command latency increased due to CPU contention. After the 2-minute chaos window ended, all pods recovered to baseline CPU usage automatically. The test key chaos-test remained intact.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Experiment 9: Memory Stress
What it does: Allocates a large chunk of memory inside a Redis pod to simulate memory pressure, approaching OOM conditions.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: redis-memory-stress
  namespace: demo
spec:
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/instance: "redis-cluster"
      app.kubernetes.io/name: "redises.kubedb.com"
  stressors:
    memory:
      workers: 2
      size: "256MB"
  duration: "60s"
Observe:
kubectl top pods -n demo
NAME CPU(cores) MEMORY(bytes)
redis-cluster-shard0-0 3m 5Mi
redis-cluster-shard0-1 4m 6Mi
redis-cluster-shard1-0 4m 252Mi
redis-cluster-shard1-1 4m 5Mi
redis-cluster-shard2-0 3m 4Mi
redis-cluster-shard2-1 3m 5Mi
Watch whether Redis begins evicting keys (depending on your maxmemory-policy setting) and whether it recovers cleanly once the stressor is removed.
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 14:04:17 | — | Pre-chaos baseline (all master pods ONLINE, ~5–6Mi memory) | Ready |
| 14:04:17 | 0s | StressChaos applied — 2 workers allocating 256MB injected on one master pod | Ready |
| 14:04:17 | +0s | Memory stress injected into redis-cluster-shard1-0 | Ready |
| 14:04:22 | ~+5s | Memory usage on shard1-0 spikes to ~252Mi; Redis under pressure | Ready |
| 14:04:47 | ~+30s | Redis remains available; no OOM kill triggered within limits | Ready |
| 14:05:17 | +1m00s | Chaos experiment window ends; memory stress removed from pod | Ready |
| 14:05:20 | ~+1m03s | Memory usage returns to baseline (~5Mi); cluster fully converged | Ready |
Result: PASS — Memory stress raised shard1-0 usage from ~5Mi to ~252Mi. Redis remained available throughout the experiment and no OOM kill was triggered. After the 60-second chaos window ended, the stressor was removed and memory returned to baseline automatically. The test key chaos-test remained intact.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Experiment 10: I/O Chaos Latency
What it does: Injects latency into disk I/O operations for the Redis data directory. This simulates a slow or degraded storage backend affecting AOF writes and RDB snapshots.
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency-example
  namespace: demo
spec:
  action: latency
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  volumePath: /data
  path: '/data/**/*'
  delay: '15000ms'
  percent: 50
  duration: '100s'
Observe:
kubectl logs -n demo redis-cluster-shard0-0 --tail=50 -f
Redis will log warnings about slow I/O operations. AOF fsync and RDB snapshot latency will increase. After the experiment ends, I/O returns to normal and persistence operations resume.
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 09:16:01 | — | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 09:16:01 | 0s | IOChaos applied — 15000ms I/O latency injected on 50% of ops for shard0-0, shard1-0, shard2-0 | Ready |
| 09:16:01 | +0s | I/O latency injected into /data path on all master pods | NotReady |
| 09:16:10 | ~+9s | AOF fsync and RDB snapshot operations slow significantly; write latency increases | NotReady |
| 09:17:41 | +100s | Chaos experiment window ends; I/O latency removed from all master pods | Ready |
| 09:17:45 | ~+1m44s | Persistence operations return to baseline; cluster fully converged | Ready |
Result: PASS — I/O latency of 15000ms was injected into 50% of disk operations on the /data path across all master pods. Redis remained available throughout the experiment as it primarily operates in-memory, but persistence operations (AOF/RDB) were significantly delayed. After the 100-second chaos window ended, all pods recovered automatically. The test key chaos-test remained intact.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Experiment 11: I/O Chaos Fault
What it does: Injects I/O faults (EIO — errno 5) into 50% of disk operations on the Redis /data directory. This simulates a failing or corrupted storage device, causing read/write syscalls to return errors rather than just slowing down.
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-fault-example
  namespace: demo
spec:
  action: fault
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  volumePath: /data
  path: /data/**/*
  errno: 5
  percent: 50
  duration: '120s'
Apply and observe:
kubectl get rd,pods -n demo
kubectl logs -n demo redis-cluster-shard0-0 --tail=20 -f
Unlike I/O latency (Experiment 10), this experiment causes syscalls to fail outright. Redis immediately starts logging AOF write errors. Because errno 5 (EIO) is injected on 50% of I/O syscalls, even kubectl exec into the affected pods fails — the working directory /data itself becomes unreadable:
error: Internal error occurred: error executing command in container: failed to exec in container:
OCI runtime exec failed: exec failed: unable to start container process:
chdir to cwd ("/data") set in config.json failed: input/output error
Redis logs during the experiment:
74:M 08 May 2026 09:30:16.611 # Error writing to the AOF file: Input/output error
74:M 08 May 2026 09:30:16.622 * AOF write error looks solved, Redis can write again.
74:M 08 May 2026 09:30:16.632 # Fail to fsync the AOF file: Input/output error
The intermittent nature of the fault (50% of ops) causes Redis to oscillate between detecting and resolving the error within the same second, exposing the retry logic in the AOF write path.
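To get a feel for how often Redis flips between failing and recovering, the log lines quoted above can be tallied. The following is an illustrative sketch only; `summarize_aof_log` is a hypothetical helper that matches the three message texts shown in this section.

```shell
# Count AOF write/fsync errors and recoveries in Redis log text read from stdin.
summarize_aof_log() {
  awk '
    /Error writing to the AOF file/ || /Fail to fsync the AOF file/ { errors++ }
    /AOF write error looks solved/                                  { recoveries++ }
    END { printf "errors=%d recoveries=%d\n", errors, recoveries }
  '
}

# Usage:
#   kubectl logs -n demo redis-cluster-shard0-0 --since=5m | summarize_aof_log
```

A high count on both sides within a short window is consistent with the intermittent 50% fault rate rather than a permanently failed disk.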
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 09:30:11 | — | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 09:30:11 | 0s | IOChaos applied — errno 5 injected on 50% of /data I/O ops for shard0-0, shard1-0, shard2-0 | Ready |
| 09:30:16 | ~+5s | AOF fsync fails: Fail to fsync the AOF file: Input/output error on shard1-0; Error writing to the AOF file on shard0-0 | NotReady |
| 09:30:16 | ~+5s | kubectl exec into affected pods fails — /data working directory returns input/output error | NotReady |
| 09:32:11 | +120s | Chaos experiment window ends; I/O fault removed from all master pods | NotReady |
| ~09:32:13 | ~+2m02s | AOF write error resolves: AOF write error looks solved, Redis can write again | Ready |
Result: PASS — errno 5 (EIO) was injected into 50% of I/O syscalls on the /data path across all master pods. Redis detected and logged AOF write failures within 5 seconds. The fault was severe enough to prevent kubectl exec from entering affected containers. After the 120-second window ended, Redis self-healed — AOF writes resumed and the cluster re-converged without data loss.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Experiment 12: Network Loss
What it does: Drops 100% of outgoing network packets from all Redis master pods to the rest of the cluster. This simulates a complete one-directional network failure, where masters can receive traffic but cannot send any responses.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: rd-master-packet-loss
  namespace: demo
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  loss:
    loss: "100"
    correlation: "100"
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        app.kubernetes.io/component: database
  duration: "2m"
Observe:
kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME VERSION STATUS AGE
redis.kubedb.com/redis-cluster 7.4.0 NotReady 5h26m
NAME READY STATUS RESTARTS AGE
pod/redis-cluster-shard0-0 1/1 Running 0 5h26m
pod/redis-cluster-shard0-1 1/1 Running 0 5h26m
pod/redis-cluster-shard1-0 1/1 Running 0 5h26m
pod/redis-cluster-shard1-1 1/1 Running 0 5h26m
pod/redis-cluster-shard2-0 1/1 Running 0 5h26m
pod/redis-cluster-shard2-1 1/1 Running 0 5h26m
With 100% outgoing packet loss from all master pods, the cluster loses inter-node gossip and replication traffic completely. Replica pods will detect master unavailability and attempt failover. After the 2-minute window, packet loss is lifted and the cluster re-converges.
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 10:32:18 | — | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 10:32:18 | 0s | NetworkChaos applied — 100% outgoing packet loss injected on all master pods | Ready |
| 10:32:18 | +0s | All 6 pods targeted; master pods unable to send any packets to cluster peers | Critical |
| 10:32:25 | ~+7s | Replica pods detect master heartbeat loss; Redis Cluster initiates replica promotion | NotReady |
| 10:32:30 | ~+12s | KubeDB detects degraded cluster state | NotReady |
| 10:34:18 | +2m00s | Chaos experiment window ends; packet loss removed from all master pods | Critical |
| ~10:36:00 | ~+3m42s | All shards re-converge; cluster topology stabilized | Ready |
Result: PASS — 100% outgoing packet loss was injected on all master pods for 2 minutes. The Redis Cluster detected the loss of master heartbeats and promoted replicas to maintain availability. After the chaos window ended, all pods recovered automatically and KubeDB reconciled the cluster back to Ready. The test key chaos-test remained intact after full recovery.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Experiment 13: Network Duplicate
What it does: Duplicates 100% of the network packets exchanged between the Redis master pods and the rest of the cluster, in both directions. This simulates a noisy or faulty network where duplicate packets cause redundant processing, increased bandwidth usage, and potential out-of-order delivery.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: rd-master-packet-duplicate
  namespace: demo
spec:
  action: duplicate
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duplicate:
    duplicate: "100"
    correlation: "100"
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        app.kubernetes.io/component: database
  duration: "2m"
Observe:
kubectl get rd,rds,rdops,rdsops,pods -n demo
NAME VERSION STATUS AGE
redis.kubedb.com/redis-cluster 7.4.0 Ready 6h33m
NAME READY STATUS RESTARTS AGE
pod/redis-cluster-shard0-0 1/1 Running 0 6h33m
pod/redis-cluster-shard0-1 1/1 Running 0 6h32m
pod/redis-cluster-shard1-0 1/1 Running 0 6h33m
pod/redis-cluster-shard1-1 1/1 Running 0 6h32m
pod/redis-cluster-shard2-0 1/1 Running 0 6h33m
pod/redis-cluster-shard2-1 1/1 Running 0 6h32m
With 100% packet duplication on all master pods, Redis inter-node gossip and replication traffic will contain duplicate packets. TCP discards duplicate segments by sequence number, so Redis never sees them, but the extra traffic increases bandwidth usage and may raise latency. After the 2-minute window, duplication is lifted and the cluster re-converges.
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 11:38:35 | — | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 11:38:35 | 0s | NetworkChaos applied — 100% packet duplication injected on all master pods (bidirectional) | Ready |
| 11:38:35 | +0s | All 6 pods targeted; duplicate packets injected on both master and replica pods | Ready |
| 11:38:40 | ~+5s | Redis gossip and replication traffic contains duplicate packets; TCP deduplication handles most transparently | Ready |
| 11:38:55 | ~+20s | Cluster remains functional; elevated bandwidth usage observed across all shards | Ready |
| 11:40:35 | +2m00s | Chaos experiment window ends; duplication removed from all pods | Ready |
| 11:40:40 | ~+2m05s | All pods return to normal network behavior; cluster fully converged | Ready |
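The Δ column in these timelines is plain subtraction of wall-clock times. A minimal sketch of that arithmetic in POSIX shell follows; the helper names to_secs and delta are hypothetical, and it assumes both timestamps fall on the same day.

```shell
# Convert HH:MM:SS to seconds since midnight.
# ${x#0} strips one leading zero so values like "08" are not parsed as octal.
to_secs() {
  h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Print the elapsed time between two same-day timestamps as <min>m<sec>s.
delta() {
  d=$(( $(to_secs "$2") - $(to_secs "$1") ))
  printf '%dm%02ds\n' $(( d / 60 )) $(( d % 60 ))
}

delta 11:38:35 11:40:40   # chaos start to full convergence in the table above
```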
Result: PASS — 100% packet duplication was injected bidirectionally across all master and replica pods. TCP’s built-in deduplication handled the duplicate packets transparently, keeping the Redis Cluster available throughout the experiment. No quorum loss or data corruption was observed. After the 2-minute chaos window ended, all pods returned to normal network behavior automatically. The test key chaos-test remained intact.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Experiment 14: Time Offset
What it does: Shifts the system clock of all Redis master pods back by 2 hours. This simulates clock skew between nodes, which can affect TTL expiry, token validation, certificate checks, and distributed coordination logic that depends on wall-clock time.
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: rd-master-time-offset
  namespace: demo
spec:
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  clockIds:
    - CLOCK_REALTIME
  timeOffset: "-2h"
  duration: "2m"
Observe:
kubectl get rd,pods -n demo
With the clock skewed by -2 hours on all master pods, any time-sensitive operations (TTL expiry, TLS certificate validation, token checks) will behave as if the pods are 2 hours in the past. After the 2-minute window, the clock is restored to the real time.
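To make the TTL effect concrete, here is the arithmetic for a key whose expiry is stored as an absolute timestamp. This is an illustrative sketch only: the timestamps are made-up Unix seconds, and it assumes the skewed process compares the stored expiry against its own shifted clock.

```shell
# Illustrative arithmetic only; all timestamps are hypothetical Unix seconds.
real_now=1700000000                     # real wall-clock time
offset=-7200                            # timeOffset "-2h" expressed in seconds
skewed_now=$(( real_now + offset ))     # what a skewed master believes "now" is

# A key that was given an absolute expiry 10 minutes ahead of real time.
expire_at=$(( real_now + 600 ))

ttl_real=$(( expire_at - real_now ))
ttl_skewed=$(( expire_at - skewed_now ))

echo "TTL on an unskewed node: ${ttl_real}s"
echo "TTL on a skewed node:    ${ttl_skewed}s"
```

Under these assumptions the skewed node would keep the key about two hours longer than intended, which is why a short 2-minute window with no expiring keys shows no visible impact.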
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 11:52:10 | — | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 11:52:10 | 0s | TimeChaos applied — -2h clock offset injected on all master pods | Ready |
| 11:52:10 | +0s | CLOCK_REALTIME skewed by -2h on shard0-0, shard1-0, shard2-0 | Ready |
| 11:52:15 | ~+5s | Redis continues operating normally; cluster heartbeat and gossip unaffected | Ready |
| 11:54:10 | +2m00s | Chaos experiment window ends; clock offset recovered on all master pods | Ready |
| 11:54:10 | ~+2m00s | All master pod clocks restored to real time; cluster fully converged | Ready |
Result: PASS — A -2h clock offset was injected on all master pods via CLOCK_REALTIME. Redis remained available throughout the experiment — the cluster heartbeat and gossip protocol were unaffected by the skew. After the 2-minute chaos window ended, the clock was automatically restored on all pods. The test key chaos-test remained intact.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Experiment 15: DNS Chaos
What it does: Injects DNS errors into all Redis master pods, causing DNS resolution failures for any hostname lookups. This simulates a DNS outage or misconfiguration that affects service discovery and inter-node communication relying on DNS.
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: rd-master-dns-error
  namespace: demo
spec:
  action: error
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  duration: "2m"
Apply and observe:
kubectl get rd,pods -n demo
With DNS errors injected, any hostname-based lookups from master pods will fail. Redis Cluster primarily uses IP addresses for gossip and replication, so the impact is limited — but any client or sidecar relying on DNS-based service discovery will be affected.
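One way to confirm that nodes address each other by IP is to look at the second field of CLUSTER NODES output. The sketch below parses a hypothetical sample line (the node ID, IP, and timestamps are invented for illustration); the helper name node_addr is also an assumption.

```shell
# Hypothetical sample line in CLUSTER NODES format; ID, IP, and timestamps are invented.
sample='e7d1eecce10fd6bb5eb35b9f99a514335d9ba9ca 10.244.0.12:6379@16379 myself,master - 0 1700000000000 1 connected 0-5460'

# Field 2 has the form ip:port@cluster-bus-port; keep only the ip:port part.
node_addr() {
  printf '%s\n' "$1" | awk '{ split($2, a, "@"); print a[1] }'
}

addr=$(node_addr "$sample")

# Report whether the address is a dotted-quad IP or something else (e.g. a hostname).
if printf '%s\n' "$addr" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+$'; then
  kind="ip"
else
  kind="hostname"
fi
echo "$addr ($kind)"

# Against the live cluster, the input would come from something like:
# kubectl exec -n demo redis-cluster-shard0-0 -- redis-cli -a $PASSWORD cluster nodes
```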
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 11:58:42 | — | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 11:58:42 | 0s | DNSChaos applied — DNS error injected on all master pods | Ready |
| 11:58:43 | ~+1s | DNS error injection succeeded on shard0-0, shard1-0, shard2-0 | Ready |
| 11:58:48 | ~+6s | DNS resolution failures observed; Redis cluster gossip unaffected (IP-based) | Ready |
| 12:00:42 | +2m00s | Chaos experiment window ends; DNS errors removed from all master pods | Ready |
| 12:00:42 | ~+2m00s | DNS resolution restored on all master pods; cluster fully converged | Ready |
Result: PASS — DNS errors were injected on all master pods for 2 minutes. Since Redis Cluster gossip and replication use IP addresses rather than hostnames, the cluster remained fully available throughout the experiment. DNS injection and recovery were confirmed via the DNSChaos status (AllInjected → AllRecovered). The test key chaos-test remained intact.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Experiment 16: io attr-override
What it does: Overrides the file permission attributes on all files in the Redis /data directory to perm: 72 (octal 0110 — execute-only for owner and group, no read or write). This simulates a scenario where the storage volume becomes inaccessible due to incorrect permissions, preventing Redis from reading or writing its data files.
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: rd-master-io-attr-override
  namespace: demo
spec:
  action: attrOverride
  attr:
    perm: 72
  duration: "2m"
  mode: all
  path: /data/**/*
  percent: 100
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  volumePath: /data
Apply and observe:
kubectl get rd,pods -n demo
kubectl logs -n demo redis-cluster-shard0-0 --tail=20 -f
With file permissions overridden to execute-only (--x--x---), Redis loses read and write access to its data directory. AOF and RDB persistence operations will fail immediately. After the 2-minute window ends, Chaos Mesh restores the original permissions.
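The perm value in the manifest is written in decimal, which makes it easy to misread. The following sketch converts it to octal and to the familiar rwx string; the variable names are arbitrary and only used for this illustration.

```shell
perm=72                                  # decimal value from the IOChaos manifest
octal=$(printf '%04o' "$perm")           # same value in octal

# Render the nine permission bits, most significant first, as an rwx string.
symbolic=""
for pos in 8 7 6 5 4 3 2 1 0; do
  case $(( pos % 3 )) in
    2) ch=r ;;
    1) ch=w ;;
    0) ch=x ;;
  esac
  # Replace the letter with "-" when the bit is not set.
  [ $(( (perm >> pos) & 1 )) -eq 1 ] || ch=-
  symbolic="$symbolic$ch"
done

echo "$perm (decimal) = $octal (octal) = $symbolic"
```

Running it prints `72 (decimal) = 0110 (octal) = --x--x---`, matching the execute-only mode described above.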
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 12:07:37 | — | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 12:07:37 | 0s | IOChaos applied — perm: 72 (execute-only) overridden on 100% of /data ops for all master pods | Ready |
| 12:07:37 | +0s | File attribute override injected on shard0-1, shard1-1, shard2-1 (AllInjected) | Ready |
| 12:07:42 | ~+5s | Redis persistence operations (AOF/RDB) fail; in-memory operations continue normally | Ready |
| 12:09:37 | +2m00s | Chaos experiment window ends; file permissions restored on all targeted pods (AllRecovered) | Ready |
| 12:09:40 | ~+2m03s | Persistence operations resume; cluster fully converged | Ready |
Result: PASS — File permissions on the /data path were overridden to execute-only (perm: 72) across all targeted pods for 2 minutes. Redis continued serving in-memory read/write operations normally, as the permission fault only affected disk-level persistence (AOF/RDB). After the chaos window ended, Chaos Mesh restored the original permissions and persistence operations resumed automatically. The test key chaos-test remained intact.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Experiment 17: I/O Mistake
What it does: Injects data corruption into READ and WRITE operations on the Redis /data directory by overwriting up to 10 bytes with zeros on 100% of I/O operations. This simulates silent data corruption from a faulty storage device — where syscalls succeed but the data returned or written is silently corrupted.
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-mistake-example
  namespace: demo
spec:
  action: mistake
  duration: "120s"
  methods:
    - READ
    - WRITE
  mistake:
    filling: zero
    maxLength: 10
    maxOccurrences: 1
  mode: all
  path: /data/**/*
  percent: 100
  selector:
    namespaces:
      - demo
    labelSelectors:
      app.kubernetes.io/name: redises.kubedb.com
      kubedb.com/role: master
  volumePath: /data
Apply and observe:
kubectl get rd,pods -n demo
kubectl logs -n demo redis-cluster-shard0-0 --tail=20 -f
Unlike I/O fault (Experiment 11), this experiment does not return errors — syscalls appear to succeed, but up to 10 bytes per operation are silently zeroed out. This is particularly insidious as Redis may not detect the corruption immediately. AOF and RDB persistence operations are the most likely to be affected.
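The effect of a zero-fill mistake can be reproduced locally on a scratch file, which shows why it is hard to notice: the write succeeds and the file size is unchanged, yet the content no longer matches. This is a rough local simulation using dd, not what Chaos Mesh does internally; the file names and sample content are made up.

```shell
tmp=$(mktemp -d)

# Stand-in for a persisted data file (content is made up for illustration).
printf 'chaos-test=before-chaos; this line stands in for persisted data\n' > "$tmp/original"
cp "$tmp/original" "$tmp/corrupted"

# Overwrite 10 bytes at offset 11 with zeros, keeping the file length unchanged.
dd if=/dev/zero of="$tmp/corrupted" bs=1 seek=11 count=10 conv=notrunc 2>/dev/null

size_before=$(( $(wc -c < "$tmp/original") ))
size_after=$(( $(wc -c < "$tmp/corrupted") ))

# A byte-for-byte comparison is what reveals the damage.
if cmp -s "$tmp/original" "$tmp/corrupted"; then
  verdict="identical"
else
  verdict="differs"
fi

echo "size before=$size_before after=$size_after content=$verdict"
rm -rf "$tmp"
```

A size or existence check alone would pass here; only a content comparison or checksum catches this class of fault.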
Observed timeline:
| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 12:24:04 | — | Pre-chaos baseline (all 3 master pods ONLINE) | Ready |
| 12:24:04 | 0s | IOChaos applied — zero-fill mistake injected on 100% of READ/WRITE ops for all master pods | Ready |
| 12:24:04 | +0s | I/O mistake injected on shard0-1, shard1-1, shard2-1 (AllInjected) | Ready |
| 12:24:09 | ~+5s | Silent data corruption begins on AOF/RDB I/O ops; Redis in-memory operations unaffected | Ready |
| 12:26:04 | +2m00s | Chaos experiment window ends; I/O mistake removed from all targeted pods | Ready |
| 12:26:07 | ~+2m03s | Persistence operations return to normal; cluster fully converged | Ready |
Result: PASS. Silent zero-fill corruption was injected into 100% of READ and WRITE I/O operations on the /data path across all master pods for 2 minutes. Redis continued serving in-memory operations normally, because the fault only intercepts file I/O under /data and therefore touches only the disk-level persistence paths (AOF/RDB), not the in-memory dataset. After the chaos window ended, Chaos Mesh removed the injection (AllRecovered) and persistence resumed automatically. The test key chaos-test remained intact.
Verify data:
kubectl exec -it -n demo redis-cluster-shard0-0 -- \
redis-cli -a $PASSWORD -c GET chaos-test
"before-chaos"
Summary
We ran 17 chaos experiments against a KubeDB-managed Redis Cluster on Kubernetes. Every experiment resulted in a PASS — the cluster recovered automatically and the test key chaos-test remained intact throughout.
| # | Experiment | Tool | Fault Type | Duration | Result |
|---|---|---|---|---|---|
| 1 | Master Pod Failure | PodChaos | Pod marked unavailable | 1m | ✅ PASS |
| 2 | Pod Kill Master | PodChaos | SIGKILL all master pods | 5m | ✅ PASS |
| 3 | Container Kill | PodChaos | Kill redis container in-place | 5m | ✅ PASS |
| 4 | Network Delay | NetworkChaos | 1000ms latency on master pods | 60s | ✅ PASS |
| 5 | Network Bandwidth | NetworkChaos | 1 Mbps cap on all pods | 60s | ✅ PASS |
| 6 | Network Corruption | NetworkChaos | 100% packet corruption on all pods | 60s | ✅ PASS |
| 7 | Network Partition | NetworkChaos | Bidirectional master↔replica split | 10m | ✅ PASS |
| 8 | CPU Stress | StressChaos | 4 workers at 50% CPU on masters | 2m | ✅ PASS |
| 9 | Memory Stress | StressChaos | 256MB allocation on one pod | 60s | ✅ PASS |
| 10 | I/O Latency | IOChaos | 15000ms latency on 50% of /data ops | 100s | ✅ PASS |
| 11 | I/O Fault | IOChaos | errno 5 on 50% of /data ops | 120s | ✅ PASS |
| 12 | Network Loss | NetworkChaos | 100% outgoing packet loss from masters | 2m | ✅ PASS |
| 13 | Network Duplicate | NetworkChaos | 100% packet duplication (bidirectional) | 2m | ✅ PASS |
| 14 | Time Offset | TimeChaos | -2h clock skew on all master pods | 2m | ✅ PASS |
| 15 | DNS Chaos | DNSChaos | DNS error injection on all master pods | 2m | ✅ PASS |
| 16 | I/O Attr Override | IOChaos | Execute-only permissions on /data | 2m | ✅ PASS |
| 17 | I/O Mistake | IOChaos | Silent zero-fill on 100% of READ/WRITE ops | 2m | ✅ PASS |
Key takeaways:
- KubeDB continuously reconciles the Redis Cluster back to the desired state after every fault — no manual intervention was needed.
- Redis Cluster mode provides built-in resilience: replica promotion, automatic re-election, and shard re-convergence all worked as expected.
- Pod-level faults (kill, failure, container kill) recovered within seconds to a few minutes via Kubernetes StatefulSet rescheduling.
- Network faults (partition, delay, bandwidth, corruption, loss, duplicate) caused temporary Critical or NotReady states, but the cluster re-converged automatically once the fault was lifted.
- Resource stress (CPU, memory) degraded performance but never caused data loss or cluster failure within the tested limits.
- I/O faults are the most disruptive class (errno injection can prevent kubectl exec from entering pods), but Redis self-healed after the chaos window ended.
- Time and DNS chaos had minimal impact on Redis Cluster because gossip and replication rely on IP addresses and in-memory state rather than wall-clock time or hostname resolution.
Cleanup
Once you are done with the experiments, remove all resources to avoid incurring unnecessary costs.
Delete all Chaos Mesh experiments:
kubectl delete podchaos,networkchaos,stresschaos,iochaos,timechaos,dnschaos --all -n demo
Delete the Redis Cluster:
kubectl delete redis redis-cluster -n demo
Delete the namespace:
kubectl delete ns demo
Uninstall Chaos Mesh (if installed via Helm):
helm uninstall chaos-mesh -n chaos-mesh
kubectl delete ns chaos-mesh
Uninstall KubeDB (if installed via Helm):
helm uninstall kubedb -n kubedb
kubectl delete ns kubedb





