Chaos Engineering KubeDB MySQL Group Replication on Kubernetes: Testing Group Replication Cluster Resilience

Overview

We conducted 80+ chaos experiments across 3 MySQL versions (8.0.36, 8.4.8, 9.6.0) and 2 Group Replication topologies (Single-Primary and Multi-Primary) on KubeDB-managed 3-node clusters. The goal: validate that KubeDB MySQL delivers zero data loss, automatic failover, and self-healing recovery under realistic failure conditions with production-level write loads.

The result: every experiment passed with zero data loss, zero split-brain, and zero errant GTIDs.

This post summarizes the methodology, results, and key findings from comprehensive chaos testing of KubeDB MySQL Group Replication.

Why Chaos Testing?

Running databases on Kubernetes introduces failure modes that traditional infrastructure does not have — pods can be evicted, nodes can go down, network policies can partition traffic, and resource limits can trigger OOMKills at any time. Chaos engineering deliberately injects these failures so we can verify the system recovers correctly before they occur in production.

For a MySQL Group Replication cluster managed by KubeDB, we needed to answer:

  • Does the cluster lose data when a primary is killed mid-transaction?
  • Does automatic failover work under network partitions?
  • Can the cluster self-heal after a full outage with no manual intervention?
  • Are GTIDs consistent across all nodes after recovery?
  • Does the cluster survive combined failures (CPU + memory + load simultaneously)?

Test Environment

Component          Details
Kubernetes         kind (local cluster)
KubeDB Version     2026.4.27
Cluster Topology   3-node Group Replication (Single-Primary & Multi-Primary)
MySQL Versions     8.0.36, 8.4.8, 9.6.0
Storage            2Gi PVC per node (Durable, ReadWriteOnce)
Memory Limit       1.5Gi per MySQL pod
CPU Request        500m per pod
Chaos Engine       Chaos Mesh
Load Generator     sysbench oltp_write_only, 4-12 tables, 4-16 threads
Baseline TPS       ~2,400 (Single-Primary) / ~1,150 (Multi-Primary)

All experiments were run under sustained sysbench write load to simulate production traffic during failures.

Setup Guide

Step 1: Create a kind Cluster

We used kind (Kubernetes IN Docker) as our local Kubernetes cluster. Follow the kind installation guide to install it, then create a cluster:

kind create cluster --name chaos-test
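
Verify that kubectl now points at the new cluster (kind prefixes context names with kind-):

kubectl cluster-info --context kind-chaos-test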

Step 2: Install KubeDB

Install KubeDB operator using Helm:

helm install kubedb oci://ghcr.io/appscode-charts/kubedb \
  --version v2026.4.27 \
  --namespace kubedb --create-namespace \
  --set-file global.license=/path/to/license.txt \
  --wait --burst-limit=10000 --debug

Step 3: Install Chaos Mesh

Install Chaos Mesh for fault injection:

helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update chaos-mesh

helm upgrade -i chaos-mesh chaos-mesh/chaos-mesh \
  -n chaos-mesh --create-namespace \
  --set dashboard.create=true \
  --set dashboard.securityMode=false \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set chaosDaemon.privileged=true
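
Verify the Chaos Mesh components are running before injecting any faults:

kubectl get pods -n chaos-mesh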

Step 4: Deploy MySQL Cluster

Create the namespace:

kubectl create namespace demo

Single-Primary Mode:

apiVersion: kubedb.com/v1
kind: MySQL
metadata:
  name: mysql-ha-cluster
  namespace: demo
spec:
  deletionPolicy: Delete
  podTemplate:
    spec:
      containers:
        - name: mysql
          resources:
            limits:
              memory: 1.5Gi
            requests:
              cpu: 500m
              memory: 1.5Gi
  replicas: 3
  storage:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 2Gi
  storageType: Durable
  topology:
    mode: GroupReplication
    group:
      mode: Single-Primary
  version: 8.4.8

Multi-Primary Mode (change only the group mode):

Note: Multi-Primary mode in KubeDB is available from MySQL version 8.4.2 and above.

apiVersion: kubedb.com/v1
kind: MySQL
metadata:
  name: mysql-ha-cluster
  namespace: demo
spec:
  deletionPolicy: Delete
  podTemplate:
    spec:
      containers:
        - name: mysql
          resources:
            limits:
              memory: 1.5Gi
            requests:
              cpu: 500m
              memory: 1.5Gi
  replicas: 3
  storage:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 2Gi
  storageType: Durable
  topology:
    mode: GroupReplication
    group:
      mode: Multi-Primary
  version: 8.4.8

Deploy and wait for Ready:

kubectl apply -f mysql-ha-cluster.yaml
kubectl wait --for=jsonpath='{.status.phase}'=Ready mysql/mysql-ha-cluster -n demo --timeout=5m

Step 5: Deploy sysbench Load Generator

Save this yaml as sysbench.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sysbench-load
  namespace: demo
  labels:
    app: sysbench
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sysbench
  template:
    metadata:
      labels:
        app: sysbench
    spec:
      containers:
        - name: sysbench
          image: perconalab/sysbench:latest
          command: ["/bin/sleep", "infinity"]
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2"
              memory: "2Gi"
          env:
            - name: MYSQL_HOST
              value: "mysql-ha-cluster.demo.svc.cluster.local"
            - name: MYSQL_PORT
              value: "3306"
            - name: MYSQL_USER
              value: "root"
            - name: MYSQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mysql-ha-cluster-auth
                  key: password
            - name: MYSQL_DB
              value: "sbtest"

Apply the deployment:

kubectl apply -f sysbench.yaml

Step 6: Prepare sysbench Tables

#!/bin/bash

# Get the MySQL root password
PASS=$(kubectl get secret mysql-ha-cluster-auth -n demo -o jsonpath='{.data.password}' | base64 -d)

# Drop any existing sbtest database
kubectl exec -n demo svc/mysql-ha-cluster -c mysql -- \
  mysql -uroot -p"$PASS" -h mysql-ha-cluster.demo -e "DROP DATABASE IF EXISTS sbtest;"

# Create the sbtest database
kubectl exec -n demo svc/mysql-ha-cluster -c mysql -- \
  mysql -uroot -p"$PASS" -h mysql-ha-cluster.demo -e "CREATE DATABASE IF NOT EXISTS sbtest;"

# Get the sysbench pod name
SBPOD=$(kubectl get pods -n demo -l app=sysbench -o jsonpath='{.items[0].metadata.name}')

# Prepare 12 sbtest tables with 100k rows each
kubectl exec -n demo $SBPOD -- sysbench oltp_write_only \
  --mysql-host=mysql-ha-cluster --mysql-port=3306 \
  --mysql-user=root --mysql-password="$PASS" \
  --mysql-db=sbtest --tables=12 --table-size=100000 \
  --threads=8 prepare

Step 7: Run sysbench During Chaos

# Standard write load (used during most experiments)
kubectl exec -n demo $SBPOD -- sysbench oltp_write_only \
  --mysql-host=mysql-ha-cluster --mysql-port=3306 \
  --mysql-user=root --mysql-password="$PASS" \
  --mysql-db=sbtest --tables=12 --table-size=100000 \
  --threads=8 --time=60 --report-interval=10 run

Chaos Testing

We will run chaos experiments to see how our cluster behaves under failure scenarios like pod kill, OOM kill, network partition, network latency, IO latency, IO fault, and more. We will use sysbench to simulate high write load on the cluster during each experiment.

Important Notes on Database Status:

  • Ready — Database is fully operational. All pods are ONLINE.
  • Critical — Primary is accepting connections and operational, but one or more replicas may be down.
  • NotReady — Primary is not available. No writes can be accepted.

You can read from and write to the database in both Ready and Critical states, so even when the cluster is Critical, uptime is not compromised.
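
To see this for yourself during any of the experiments below, run a quick write probe while the cluster is Critical (a sketch: the probe table name is arbitrary, and $PASS comes from Step 6):

kubectl exec -n demo svc/mysql-ha-cluster -c mysql -- \
  mysql -uroot -p"$PASS" -e "CREATE TABLE IF NOT EXISTS sbtest.probe (id INT AUTO_INCREMENT PRIMARY KEY, ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP); INSERT INTO sbtest.probe () VALUES (); SELECT COUNT(*) FROM sbtest.probe;"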

Verify Cluster is Ready

Before starting chaos experiments, verify the cluster is healthy:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE    ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    174m

NAME                                 READY   STATUS    RESTARTS   AGE    ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          174m   primary
pod/mysql-ha-cluster-1               2/2     Running   0          174m   standby
pod/mysql-ha-cluster-2               2/2     Running   0          174m   standby
pod/sysbench-load-849bdc4cdc-h2zpx   1/1     Running   0          8d

Inspect the GR members:

➤ kubectl exec -n demo mysql-ha-cluster-0 -c mysql -- \
    mysql -uroot -p"$PASS" -e "SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE \
    FROM performance_schema.replication_group_members;"
MEMBER_HOST                                           MEMBER_PORT  MEMBER_STATE  MEMBER_ROLE
mysql-ha-cluster-2.mysql-ha-cluster-pods.demo         3306         ONLINE        SECONDARY
mysql-ha-cluster-1.mysql-ha-cluster-pods.demo         3306         ONLINE        SECONDARY
mysql-ha-cluster-0.mysql-ha-cluster-pods.demo         3306         ONLINE        PRIMARY

The pod labeled kubedb.com/role=primary is the primary, and the pods labeled kubedb.com/role=standby are the secondaries. With the cluster ready and sysbench tables prepared, we can start the chaos experiments.

Chaos#1: Kill the Primary Pod

We are about to kill the primary pod and see how fast the failover happens. Save this yaml as tests/01-pod-kill.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: mysql-primary-pod-kill
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  gracePeriod: 0

What this chaos does: Terminates the primary pod abruptly with grace-period=0, forcing an immediate failover to a standby replica.

  • Expected behavior: Primary pod is killed → cluster transitions Ready → Critical (1 replica missing after failover to a standby) → when the killed pod rejoins as standby, cluster returns to Ready. Zero data loss, GTIDs and checksums consistent across all 3 nodes.

  • Actual result: Pod-0 killed → pod-2 elected as new primary in ~1 second → pod-0 rejoined as standby ~30s later → cluster returned to Ready. All 3 members ONLINE in GR, GTIDs match across nodes, checksums match on every sysbench table. PASS.

Before running, let’s see who is the primary:

➤ kubectl get pods -n demo -L kubedb.com/role
NAME                                 READY   STATUS    RESTARTS   AGE    ROLE
mysql-ha-cluster-0                   2/2     Running   0          174m   primary
mysql-ha-cluster-1                   2/2     Running   0          174m   standby
mysql-ha-cluster-2                   2/2     Running   0          174m   standby

Now run watch kubectl get mysql,pods -n demo -L kubedb.com/role in one terminal, and apply the chaos in another:

➤ kubectl apply -f tests/01-pod-kill.yaml
podchaos.chaos-mesh.org/mysql-primary-pod-kill created

Within seconds, the primary pod is killed and a new primary is elected. The database goes Critical:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS     AGE    ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Critical   174m

NAME                                 READY   STATUS    RESTARTS   AGE    ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          14s    standby
pod/mysql-ha-cluster-1               2/2     Running   0          174m   standby
pod/mysql-ha-cluster-2               2/2     Running   0          174m   primary

Note the STATUS: Critical — this means the new primary (pod-2) is accepting connections and operational, but the killed pod-0 hasn’t rejoined yet. After ~30 seconds, the old primary comes back as a standby and the cluster returns to Ready:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h1m

NAME                                 READY   STATUS    RESTARTS   AGE     ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          7m24s   standby
pod/mysql-ha-cluster-1               2/2     Running   0          3h1m    standby
pod/mysql-ha-cluster-2               2/2     Running   0          3h1m    primary

Verify data integrity — all 3 nodes must have matching GR status, GTIDs and checksums:

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
MEMBER_HOST                                           MEMBER_PORT  MEMBER_STATE  MEMBER_ROLE
mysql-ha-cluster-2.mysql-ha-cluster-pods.demo         3306         ONLINE        PRIMARY
mysql-ha-cluster-1.mysql-ha-cluster-pods.demo         3306         ONLINE        SECONDARY
mysql-ha-cluster-0.mysql-ha-cluster-pods.demo         3306         ONLINE        SECONDARY

All 3 members ONLINE. Check GTIDs and checksums:

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-7743:1000001-1000004
pod-1: 65a93aae-...:1-7743:1000001-1000004
pod-2: 65a93aae-...:1-7743:1000001-1000004

# Checksums — all match ✅
pod-0: sbtest1=2141984737, sbtest2=779706826, sbtest3=3549430025, sbtest4=3045058695
pod-1: sbtest1=2141984737, sbtest2=779706826, sbtest3=3549430025, sbtest4=3045058695
pod-2: sbtest1=2141984737, sbtest2=779706826, sbtest3=3549430025, sbtest4=3045058695
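
The GTID and checksum outputs in this post were collected per pod with a loop along these lines (a sketch, assuming $PASS from Step 6; the outputs above list the first four tables):

# Compare executed GTID sets and table checksums across all 3 pods
for i in 0 1 2; do
  echo "--- pod-$i ---"
  kubectl exec -n demo mysql-ha-cluster-$i -c mysql -- \
    mysql -uroot -p"$PASS" -N -e "SELECT @@GLOBAL.gtid_executed;"
  kubectl exec -n demo mysql-ha-cluster-$i -c mysql -- \
    mysql -uroot -p"$PASS" -N -e "CHECKSUM TABLE sbtest.sbtest1, sbtest.sbtest2, sbtest.sbtest3, sbtest.sbtest4;"
done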

Result: PASS — Zero data loss. Failover completed in ~5 seconds. The old primary rejoined as standby automatically. All GTIDs and checksums match across all 3 nodes.

Clean up:

➤ kubectl delete -f tests/01-pod-kill.yaml
podchaos.chaos-mesh.org "mysql-primary-pod-kill" deleted

Chaos#2: Pod Failure on Primary (5 min)

Inject a 5-minute pod-failure fault into the current primary pod — Chaos Mesh keeps the container in a failed state for the entire duration before clearing the fault. Unlike pod-kill (Chaos#1), the pod is not deleted and rescheduled — it stays in place but its container is unavailable, exercising long-duration primary unreachability and the operator’s clone-vs-incremental rejoin logic.

  • Expected behavior: Primary container becomes unavailable → cluster transitions Ready → NotReady → Group Replication elects a new primary and the operator marks the failed pod unhealthy → state moves to Critical. When the 5-minute fault clears, the old primary container restarts, rejoins as SECONDARY, and the cluster returns to Ready. Zero data loss.

  • Actual result: Failover completed in 8 s. Cluster transitioned Ready → NotReady (+5s) → Critical (+21s). Old primary auto-recovered after the 5-min duration and rejoined via incremental recovery; cluster reached Ready ~9–13 min after chaos cleared. Zero errors after sysbench reconnect, zero data loss, zero errant GTIDs. PASS.

Save this yaml as tests/02-pod-failure.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: mysql-pod-failure-primary
  namespace: demo
spec:
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  mode: all
  action: pod-failure
  duration: 5m

Apply the chaos:

➤ kubectl apply -f tests/02-pod-failure.yaml
podchaos.chaos-mesh.org/mysql-pod-failure-primary created

# During chaos — pod-1 promoted, pod-2 stuck failed
➤ kubectl get mysql -n demo
NAME               VERSION   STATUS     AGE
mysql-ha-cluster   8.4.8     Critical   16h

➤ kubectl get pods -n demo -L kubedb.com/role
NAME                 READY   STATUS    RESTARTS   ROLE
mysql-ha-cluster-0   2/2     Running   2          standby
mysql-ha-cluster-1   2/2     Running   2          primary    # promoted (was standby)
mysql-ha-cluster-2   2/2     Running   4          standby    # under chaos

# sysbench during failover
[ 10s ] thds: 8 tps: 1012.65 qps: 6079.89 lat (ms,95%): 16.41 err/s: 0.00
[ 20s ] thds: 8 tps:  577.40 qps: 3464.41 lat (ms,95%): 37.56 err/s: 0.00
[ 30s ] thds: 8 tps:  399.20 qps: 2394.80 lat (ms,95%): 45.79 err/s: 0.00
[ 40s ] thds: 8 tps:  314.70 qps: 1888.58 lat (ms,95%): 64.47 err/s: 0.00
[ 60s ] thds: 8 tps: 1007.99 qps: 6048.34 lat (ms,95%): 21.11 err/s: 0.00

# After 5-min duration — chaos auto-recovered, pod-2 rejoins
➤ SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;
mysql-ha-cluster-2.…  ONLINE  SECONDARY
mysql-ha-cluster-1.…  ONLINE  PRIMARY
mysql-ha-cluster-0.…  ONLINE  SECONDARY

➤ kubectl get mysql -n demo
NAME               VERSION   STATUS   AGE
mysql-ha-cluster   8.4.8     Ready    16h

➤ SELECT COUNT(*) FROM sbtest.sbtest1;   # 100000 — intact

Observed timeline:

Wall-clock   Δ from chaos   Event                                                DB Status
11:39:00     —              Pre-chaos baseline (pod-2 = primary, all 3 ONLINE)   Ready
11:39:21     0s             PodChaos applied (pod-2 targeted)                    Ready
11:39:26     +5s            Primary unreachable detected                         NotReady
11:39:29     +8s            Group Replication promotes pod-1 → PRIMARY           NotReady
11:39:42     +21s           Operator marks pod-2 unhealthy                       Critical
11:44:21     +5m00s         Chaos auto-recovered, pod-2 container restarts       Critical
11:45:40     +6m19s         pod-2 in RECOVERING (incremental catch-up)           Critical
~11:48–52    ~+9–13m        pod-2 reaches ONLINE SECONDARY                       Ready

Result: PASS — Group Replication failed over in 8 s, sysbench recovered to ~1000 TPS on the new primary, and the chaos-impacted pod rejoined the group automatically once the fault expired. Zero data loss, zero errant GTIDs across all 3 nodes.

➤ kubectl delete -f tests/02-pod-failure.yaml

Chaos#3: Scheduled Pod Kill (Every 1 Min, 3 Min Duration)

We schedule random pod kills every minute for 3 minutes — simulating repeated intermittent failures.

Save this yaml as tests/03-scheduled-pod-kill.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: mysql-scheduled-pod-kill
  namespace: chaos-mesh
spec:
  schedule: "*/1 * * * *"
  historyLimit: 5
  concurrencyPolicy: "Allow"
  type: "PodChaos"
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - demo
      labelSelectors:
        "app.kubernetes.io/instance": "mysql-ha-cluster"

What this chaos does: Kills a random MySQL pod every minute for 3 minutes. The cluster must repeatedly recover from pod failures.

  • Expected behavior: Random pod killed each minute → rapid primary reelection when primary is hit, quick rejoin when standby is hit → between kills cluster returns to Ready → after schedule ends and enough time passes, all 3 members ONLINE. Zero data loss.

  • Actual result: Multiple pods killed over 3-minute window. Cluster auto-recovered after each kill. After schedule removed, all 3 members ONLINE, pod-0 primary, pod-1/pod-2 standbys. GTIDs and checksums match. PASS.

➤ kubectl apply -f tests/03-scheduled-pod-kill.yaml
schedule.chaos-mesh.org/mysql-scheduled-pod-kill created

After 3 minutes of repeated kills, multiple pods have been restarted (note the different ages — pod-1 is 89s, pod-2 is 29s):

➤ kubectl get pods -n demo -L kubedb.com/role
NAME                             READY   STATUS    RESTARTS   AGE    ROLE
mysql-ha-cluster-0               2/2     Running   0          10m    primary
mysql-ha-cluster-1               2/2     Running   0          89s    standby
mysql-ha-cluster-2               2/2     Running   0          29s    standby

After deleting the schedule and waiting for full recovery:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h49m

NAME                                 READY   STATUS    RESTARTS   AGE     ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          11m     primary
pod/mysql-ha-cluster-1               2/2     Running   0          2m29s   standby
pod/mysql-ha-cluster-2               2/2     Running   0          89s     standby

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
+-----------------------------------------------+-------------+--------------+-------------+

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-130027:1000001-1000052
pod-1: 65a93aae-...:1-130027:1000001-1000052
pod-2: 65a93aae-...:1-130027:1000001-1000052

# Checksums — all match ✅
pod-0: sbtest1=3941398607, sbtest2=2282875795, sbtest3=2179429078, sbtest4=2316165905
pod-1: sbtest1=3941398607, sbtest2=2282875795, sbtest3=2179429078, sbtest4=2316165905
pod-2: sbtest1=3941398607, sbtest2=2282875795, sbtest3=2179429078, sbtest4=2316165905

Result: PASS — Multiple pods killed on schedule, all auto-recovered. Zero data loss.

Clean up:

➤ kubectl delete schedule mysql-scheduled-pod-kill -n chaos-mesh

Chaos#4: Double Primary Kill

Kill the primary, wait for new election, then immediately kill the new primary. Tests whether the cluster survives two consecutive leader failures.

  • Expected behavior: First primary killed → new primary elected → second kill (of newly elected primary) → third election → surviving standby becomes primary → killed pods rejoin → cluster returns to Ready. Zero data loss despite rapid leader churn.

  • Actual result: Pod-2 killed → pod-1 elected → pod-1 killed within ~15s → pod-2 re-elected (after its restart) as third primary. All pods rejoined. All 3 members ONLINE, GTIDs and checksums match. PASS.

➤ kubectl get pods -n demo -L kubedb.com/role
NAME                             READY   STATUS    RESTARTS   AGE   ROLE
mysql-ha-cluster-0               2/2     Running   0          2m    standby
mysql-ha-cluster-1               2/2     Running   0          5m    standby
mysql-ha-cluster-2               2/2     Running   0          4m    primary

# Kill first primary (pod-2)
kubectl delete pod mysql-ha-cluster-2 -n demo --force --grace-period=0
pod "mysql-ha-cluster-2" force deleted

After 15 seconds, pod-1 is elected as new primary:

➤ kubectl get pods -n demo -L kubedb.com/role
NAME                             READY   STATUS    RESTARTS   AGE   ROLE
mysql-ha-cluster-0               2/2     Running   0          2m    standby
mysql-ha-cluster-1               2/2     Running   0          5m    primary   ← new primary
mysql-ha-cluster-2               2/2     Running   0          15s   standby

# Kill second primary (pod-1) immediately
kubectl delete pod mysql-ha-cluster-1 -n demo --force --grace-period=0
pod "mysql-ha-cluster-1" force deleted

Database goes Critical — two primaries killed in quick succession:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS     AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Critical   3h53m

NAME                                 READY   STATUS    RESTARTS   AGE   ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          2m    standby
pod/mysql-ha-cluster-1               2/2     Running   0          17s   standby
pod/mysql-ha-cluster-2               2/2     Running   0          32s   primary

pod-2 was re-elected as third primary. After full recovery:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h55m

NAME                                 READY   STATUS    RESTARTS   AGE   ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          4m    standby
pod/mysql-ha-cluster-1               2/2     Running   0          110s  standby
pod/mysql-ha-cluster-2               2/2     Running   0          2m    primary

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
+-----------------------------------------------+-------------+--------------+-------------+

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-130185:1000001-1000056
pod-1: 65a93aae-...:1-130185:1000001-1000056
pod-2: 65a93aae-...:1-130185:1000001-1000056

# Checksums — all match ✅
pod-0: sbtest1=740913138, sbtest2=3199164728, sbtest3=1207551779, sbtest4=3950955642
pod-1: sbtest1=740913138, sbtest2=3199164728, sbtest3=1207551779, sbtest4=3950955642
pod-2: sbtest1=740913138, sbtest2=3199164728, sbtest3=1207551779, sbtest4=3950955642

Result: PASS — Two consecutive primary kills survived. The cluster elected a third primary and recovered fully. Zero data loss.


Chaos#5: Rolling Restart (0→1→2)

Simulate a rolling upgrade — delete each pod sequentially with 40-second gaps. Tests graceful rolling restart behavior.

  • Expected behavior: Pods deleted one at a time (0 → 1 → 2) with 40s gap between kills → each pod rejoins before the next is killed → when the primary is hit, a quick failover occurs → cluster returns to Ready between each step. Zero data loss.

  • Actual result: All 3 pods restarted in sequence. Single failover when primary (pod-2) was killed. Each pod rejoined within ~40s. GTIDs and checksums match across all 3 nodes. PASS.
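
The walkthrough below drives the kills manually; the same rolling restart can be scripted in a few lines (a sketch; the 40-second sleep matches the gap used in the test):

# Force-delete each pod in order, waiting 40s between kills
for i in 0 1 2; do
  kubectl delete pod mysql-ha-cluster-$i -n demo --force --grace-period=0
  sleep 40
done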

# Delete pod-0 (standby)
kubectl delete pod mysql-ha-cluster-0 -n demo --force --grace-period=0

# 40s later — pod-0 recovered, pod-2 still primary
kubectl get pods -n demo -L kubedb.com/role
NAME                             READY   STATUS    RESTARTS   AGE   ROLE
mysql-ha-cluster-0               2/2     Running   0          40s   standby
mysql-ha-cluster-1               2/2     Running   0          3m    standby
mysql-ha-cluster-2               2/2     Running   0          3m    primary

# Delete pod-1 (standby)
kubectl delete pod mysql-ha-cluster-1 -n demo --force --grace-period=0

# 40s later — pod-1 recovered, pod-2 still primary
kubectl get pods -n demo -L kubedb.com/role
NAME                             READY   STATUS    RESTARTS   AGE   ROLE
mysql-ha-cluster-0               2/2     Running   0          80s   standby
mysql-ha-cluster-1               2/2     Running   0          40s   standby
mysql-ha-cluster-2               2/2     Running   0          4m    primary

# Delete pod-2 (primary!) — triggers failover
kubectl delete pod mysql-ha-cluster-2 -n demo --force --grace-period=0

# 60s later — pod-1 elected as new primary
kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h58m

NAME                                 READY   STATUS    RESTARTS   AGE    ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          2m     standby
pod/mysql-ha-cluster-1               2/2     Running   0          100s   primary
pod/mysql-ha-cluster-2               2/2     Running   0          60s    standby

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
+-----------------------------------------------+-------------+--------------+-------------+

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-130225:1000001-1000058
pod-1: 65a93aae-...:1-130225:1000001-1000058
pod-2: 65a93aae-...:1-130225:1000001-1000058

# Checksums — all match ✅
pod-0: sbtest1=740913138, sbtest2=3199164728, sbtest3=1207551779, sbtest4=3950955642
pod-1: sbtest1=740913138, sbtest2=3199164728, sbtest3=1207551779, sbtest4=3950955642
pod-2: sbtest1=740913138, sbtest2=3199164728, sbtest3=1207551779, sbtest4=3950955642

Result: PASS — All 3 pods deleted sequentially. Failover triggered only when primary was deleted. Each pod recovered and rejoined within ~40s. Zero data loss.


Chaos#6: Full Cluster Kill (All 3 Pods)

The ultimate stress test — we force-delete all 3 MySQL pods simultaneously. The entire cluster goes down. Can it recover automatically?

Method: kubectl delete pod --force --grace-period=0 on all 3 pods at once.

  • Expected behavior: All 3 pods deleted → cluster NotReady (no primary) → coordinator detects full outage → identifies pod with highest GTID → bootstraps new cluster from it → other pods rejoin as standbys → cluster returns to Ready. Zero data loss.

  • Actual result: Full cluster outage. Coordinator detected outage and elected pod-0 (highest GTID) as new primary. Cluster recovered in ~2 minutes. All 3 members ONLINE, GTIDs and checksums match across all 3 nodes. PASS.

Before running:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h38m

NAME                                 READY   STATUS    RESTARTS   AGE     ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          43m     standby
pod/mysql-ha-cluster-1               2/2     Running   0          3h38m   primary
pod/mysql-ha-cluster-2               2/2     Running   0          3h38m   standby

Kill all 3 pods simultaneously:

➤ kubectl delete pod mysql-ha-cluster-0 mysql-ha-cluster-1 mysql-ha-cluster-2 \
    -n demo --force --grace-period=0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated.
pod "mysql-ha-cluster-0" force deleted
pod "mysql-ha-cluster-1" force deleted
pod "mysql-ha-cluster-2" force deleted

The database immediately goes NotReady — no primary available:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS     AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     NotReady   3h38m

NAME                                 READY   STATUS    RESTARTS   AGE   ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          16s   standby
pod/mysql-ha-cluster-1               2/2     Running   0          14s   standby
pod/mysql-ha-cluster-2               2/2     Running   0          12s   standby

All 3 pods are restarting. The coordinator on one of the pods will detect a full outage, find the pod with the highest GTID, and bootstrap a new cluster from it. After ~2 minutes:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h42m

NAME                                 READY   STATUS    RESTARTS   AGE   ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          4m    primary
pod/mysql-ha-cluster-1               2/2     Running   0          3m    standby
pod/mysql-ha-cluster-2               2/2     Running   0          3m    standby

Verify data integrity:

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
+-----------------------------------------------+-------------+--------------+-------------+

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-83046:1000001-1000052
pod-1: 65a93aae-...:1-83046:1000001-1000052
pod-2: 65a93aae-...:1-83046:1000001-1000052

# Checksums — all match ✅
pod-0: sbtest1=4068722757, sbtest2=1583533506, sbtest3=1588922375, sbtest4=1810696374
pod-1: sbtest1=4068722757, sbtest2=1583533506, sbtest3=1588922375, sbtest4=1810696374
pod-2: sbtest1=4068722757, sbtest2=1583533506, sbtest3=1588922375, sbtest4=1810696374

Result: PASS — Complete cluster outage (all 3 pods killed simultaneously). The coordinator automatically detected the outage, elected the pod with highest GTID as new primary, and rebuilt the cluster. Recovery in ~2 minutes. Zero data loss.


Chaos#7: PVC Delete + Pod Kill (Full Data Rebuild)

Completely destroy a node’s data — delete both the pod and its PVC. The node must rebuild from scratch using the CLONE plugin.

  • Expected behavior: Pod and PVC deleted → new pod provisioned with fresh empty PVC → pod enters Init:0/1 while storage binds → operator initiates CLONE from a healthy primary → data fully restored → node rejoins as standby → cluster returns to Ready. Zero data loss on remaining nodes.

  • Actual result: Pod-0 destroyed (pod + PVC). Pod reprovisioned, CLONE plugin rebuilt the datadir from primary. Cluster Critical during rebuild (~2 min), then Ready with all 3 members ONLINE. GTIDs and checksums match across all 3 nodes. PASS.

➤ kubectl delete pod mysql-ha-cluster-0 -n demo --force --grace-period=0
pod "mysql-ha-cluster-0" force deleted
➤ kubectl delete pvc data-mysql-ha-cluster-0 -n demo
persistentvolumeclaim "data-mysql-ha-cluster-0" deleted

The pod enters Init:0/1 while waiting for a new PVC:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS     AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Critical   4h26m

NAME                                 READY   STATUS     RESTARTS   AGE   ROLE
pod/mysql-ha-cluster-0               0/2     Init:0/1   0          9m
pod/mysql-ha-cluster-1               2/2     Running    0          29m   standby
pod/mysql-ha-cluster-2               2/2     Running    0          28m   primary

➤ kubectl get pvc -n demo
NAME                      STATUS    AGE
data-mysql-ha-cluster-0   Pending   45s    ← new PVC being provisioned
data-mysql-ha-cluster-1   Bound     4h27m
data-mysql-ha-cluster-2   Bound     4h27m
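
While the rebuild runs, clone progress is visible through the CLONE plugin's performance_schema tables once mysqld is up on the recipient (these tables are standard on MySQL 8.0.17+; shown here as a sketch):

kubectl exec -n demo mysql-ha-cluster-0 -c mysql -- \
  mysql -uroot -p"$PASS" -e "SELECT STAGE, STATE, ESTIMATE, DATA FROM performance_schema.clone_progress; SELECT STATE, SOURCE, ERROR_NO FROM performance_schema.clone_status;"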

Once the new PVC is bound, the CLONE plugin copies a full data snapshot from a donor. After ~2 minutes, the node is fully recovered:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    4h30m

NAME                                 READY   STATUS    RESTARTS   AGE   ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          99s   standby   ← rebuilt from scratch
pod/mysql-ha-cluster-1               2/2     Running   0          33m   standby
pod/mysql-ha-cluster-2               2/2     Running   0          33m   primary

➤ kubectl get pvc -n demo
NAME                      STATUS   CAPACITY   AGE
data-mysql-ha-cluster-0   Bound    2Gi        3m     ← brand new PVC
data-mysql-ha-cluster-1   Bound    2Gi        4h30m
data-mysql-ha-cluster-2   Bound    2Gi        4h30m

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
+-----------------------------------------------+-------------+--------------+-------------+

# GTIDs — all match ✅ (pod-0 has identical GTIDs after CLONE)
pod-0: 65a93aae-...:1-166554:1000001-1000198
pod-1: 65a93aae-...:1-166554:1000001-1000198
pod-2: 65a93aae-...:1-166554:1000001-1000198

# Checksums — all match ✅
pod-0: sbtest1=2710640468, sbtest2=1851529474, sbtest3=809748038, sbtest4=3029682236
pod-1: sbtest1=2710640468, sbtest2=1851529474, sbtest3=809748038, sbtest4=3029682236
pod-2: sbtest1=2710640468, sbtest2=1851529474, sbtest3=809748038, sbtest4=3029682236

Result: PASS — Complete data destruction (PVC deleted). The CLONE plugin rebuilt the node from scratch with identical data. Zero data loss. This is the ultimate recovery test — MySQL 8.0+ handles it fully automatically.


Chaos#8: OOMKill the Primary Pod

Now we are going to OOMKill the primary pod. This is a realistic scenario — in production, your primary pod might get OOMKilled due to high memory usage from large queries or connection spikes.

Save this yaml as tests/08-oomkill.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: mysql-primary-memory-stress
  namespace: chaos-mesh
spec:
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  stressors:
    memory:
      workers: 2
      size: "1200MB"
  duration: "10m"

What this chaos does: Allocates 1200MB of extra memory on the primary pod. Combined with MySQL’s memory usage (~500MB), this exceeds the 1.5Gi limit and triggers an OOMKill.

  • Expected behavior: Primary pod is OOMKilled during sysbench writes → cluster goes NotReady during failover → after ~20s (based on config) a new primary is elected and the OOMKilled pod rejoins as standby → cluster returns to Ready. Zero data loss, GTIDs and checksums consistent across nodes.

  • Actual result: Primary pod-2 OOMKilled mid-writes. Sysbench lost connection (error 2013). After ~20s (as group_replication_unreachable_majority_timeout = 20) the unreachable member was expelled, pod-1 elected as new primary. Pod-2 restarted (Restarts: 1) and rejoined as standby ~82s after kill. All 3 ONLINE in GR, GTIDs match, checksums match. PASS.

First, start the sysbench load test so we have writes in-flight when the OOMKill hits:

➤ kubectl exec -n demo $SBPOD -- sysbench oltp_write_only \
    --mysql-host=mysql-ha-cluster --mysql-port=3306 \
    --mysql-user=root --mysql-password="$PASS" \
    --mysql-db=sbtest --tables=12 --table-size=100000 \
    --threads=4 --time=120 --report-interval=10 run

[ 10s ] thds: 4 tps: 744.67 qps: 4469.72 lat (ms,95%): 9.56 err/s: 0.00

Now apply the memory stress while sysbench is running:

➤ kubectl apply -f tests/08-oomkill.yaml
stresschaos.chaos-mesh.org/mysql-primary-memory-stress created

The primary pod (pod-2) gets OOMKilled. Sysbench loses connection immediately:

FATAL: mysql_stmt_execute() returned error 2013 (Lost connection to MySQL server during query)

The database goes NotReady during failover:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS     AGE    ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     NotReady   3h3m

NAME                                 READY   STATUS    RESTARTS     AGE    ROLE
pod/mysql-ha-cluster-0               2/2     Running   0            8m     standby
pod/mysql-ha-cluster-1               2/2     Running   0            3h3m   standby
pod/mysql-ha-cluster-2               2/2     Running   1 (9s ago)   3h3m   standby

Note the Restarts: 1 on pod-2 — it was OOMKilled and restarted by Kubernetes. The status NotReady means failover is in progress. This is an ungraceful shutdown for the primary node. The node remains part of the group but becomes unreachable. After 20 seconds, the unreachable node is expelled from the group and a new primary is elected. Within approximately 60 seconds, the cluster fully recovers, and the failed node rejoins as a standby.
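
The 20-second window comes from Group Replication's failure-detection settings, which you can inspect directly (both are standard MySQL system variables):

kubectl exec -n demo mysql-ha-cluster-1 -c mysql -- \
  mysql -uroot -p"$PASS" -e "SELECT @@group_replication_unreachable_majority_timeout AS unreachable_majority_timeout, @@group_replication_member_expel_timeout AS member_expel_timeout;"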

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE    ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h4m

NAME                                 READY   STATUS    RESTARTS      AGE    ROLE
pod/mysql-ha-cluster-0               2/2     Running   0             10m    standby
pod/mysql-ha-cluster-1               2/2     Running   0             3h4m   primary
pod/mysql-ha-cluster-2               2/2     Running   1 (82s ago)   3h4m   standby

pod-1 is now the new primary. pod-2 (the OOMKilled pod) rejoined as standby. Verify data integrity:

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
MEMBER_HOST                                           MEMBER_PORT  MEMBER_STATE  MEMBER_ROLE
mysql-ha-cluster-2.mysql-ha-cluster-pods.demo         3306         ONLINE        SECONDARY
mysql-ha-cluster-1.mysql-ha-cluster-pods.demo         3306         ONLINE        PRIMARY
mysql-ha-cluster-0.mysql-ha-cluster-pods.demo         3306         ONLINE        SECONDARY

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-19216:1000001-1000006
pod-1: 65a93aae-...:1-19216:1000001-1000006
pod-2: 65a93aae-...:1-19216:1000001-1000006

# Checksums — all match ✅
pod-0: sbtest1=1213558811, sbtest2=1030289216, sbtest3=1306867904, sbtest4=3604669046
pod-1: sbtest1=1213558811, sbtest2=1030289216, sbtest3=1306867904, sbtest4=3604669046
pod-2: sbtest1=1213558811, sbtest2=1030289216, sbtest3=1306867904, sbtest4=3604669046

Result: PASS — OOMKill triggered on primary during active writes. Failover completed automatically. The OOMKilled pod restarted and rejoined as standby. Zero data loss — all GTIDs and checksums match across all 3 nodes.

Clean up:

➤ kubectl delete -f tests/08-oomkill.yaml
stresschaos.chaos-mesh.org "mysql-primary-memory-stress" deleted

Chaos#9: CPU Stress (98%) on Primary

We apply 98% CPU stress on the primary pod to test how MySQL handles extreme CPU pressure.

Save this yaml as tests/09-cpu-stress.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: mysql-primary-cpu-stress
  namespace: chaos-mesh
spec:
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  stressors:
    cpu:
      workers: 2
      load: 98
  duration: "5m"

What this chaos does: Consumes 98% of the CPU on the primary pod, leaving minimal CPU for MySQL query processing and Paxos consensus.

  • Expected behavior: Primary CPU saturated at 98% → TPS drops significantly → cluster stays Ready (no failover — CPU contention slows MySQL but doesn’t make it unresponsive to GR heartbeats) → recovers when chaos ends. Zero data loss.

  • Actual result: TPS reduced from ~686 → ~212 (~69% drop). No failover, no errors. All 3 members stayed ONLINE. GTIDs and checksums match across all 3 nodes. PASS.

Apply the chaos and run sysbench:

➤ kubectl apply -f tests/09-cpu-stress.yaml
stresschaos.chaos-mesh.org/mysql-primary-cpu-stress created

➤ kubectl exec -n demo $SBPOD -- sysbench oltp_write_only \
    --mysql-host=mysql-ha-cluster --mysql-port=3306 \
    --mysql-user=root --mysql-password="$PASS" \
    --mysql-db=sbtest --tables=12 --table-size=100000 \
    --threads=4 --time=60 --report-interval=10 run

[ 10s ] thds: 4 tps: 685.57 lat (ms,95%): 10.65 err/s: 0.00  # before full stress
[ 20s ] thds: 4 tps: 566.01 lat (ms,95%): 18.28 err/s: 0.00
[ 30s ] thds: 4 tps: 448.70 lat (ms,95%): 27.66 err/s: 0.00  # degrading
[ 40s ] thds: 4 tps: 405.90 lat (ms,95%): 31.94 err/s: 0.00
[ 50s ] thds: 4 tps: 384.90 lat (ms,95%): 33.12 err/s: 0.00
[ 60s ] thds: 4 tps: 211.80 lat (ms,95%): 41.85 err/s: 0.00  # lowest point

SQL statistics:
    transactions:                        27033  (449.80 per sec.)
    ignored errors:                      0      (0.00 per sec.)

During the experiment, the cluster stays Ready — no failover triggered. All 3 members remain ONLINE:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h24m

NAME                                 READY   STATUS    RESTARTS   AGE     ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          29m     standby
pod/mysql-ha-cluster-1               2/2     Running   0          3h24m   standby
pod/mysql-ha-cluster-2               2/2     Running   0          3h24m   primary

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
+-----------------------------------------------+-------------+--------------+-------------+

Verify data integrity:

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-70415:1000001-1000048
pod-1: 65a93aae-...:1-70415:1000001-1000048
pod-2: 65a93aae-...:1-70415:1000001-1000048

# Checksums — all match ✅
pod-0: sbtest1=2711127555, sbtest2=1662235926, sbtest3=1102673461, sbtest4=1599950507
pod-1: sbtest1=2711127555, sbtest2=1662235926, sbtest3=1102673461, sbtest4=1599950507
pod-2: sbtest1=2711127555, sbtest2=1662235926, sbtest3=1102673461, sbtest4=1599950507

Result: PASS — 98% CPU stress reduced TPS from ~686 to ~212 (~69% reduction) but the cluster stayed Ready with all members ONLINE. Zero errors, zero data loss.

Clean up:

➤ kubectl delete -f tests/09-cpu-stress.yaml
stresschaos.chaos-mesh.org "mysql-primary-cpu-stress" deleted

Chaos#10: Combined Stress (Memory + CPU + Write Load)

This is the most aggressive test — we apply memory stress on the primary, CPU stress on all nodes, and sustained write load simultaneously. This simulates a “worst case” production incident where multiple things go wrong at once.

Chaos YAMLs applied simultaneously:

  • stress-memory-primary.yaml — 1200MB memory stress on primary
  • stress-cpu-all.yaml — 90% CPU stress on all 3 nodes

What this chaos does: Applies memory pressure + CPU exhaustion while the database is under active write load. The primary is likely to get OOMKilled.
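
The memory manifest is the same StressChaos used in Chaos#8. stress-cpu-all.yaml is not reproduced in this post; following the Chaos#9 pattern it would look roughly like this (a sketch; mode: all drops the role selector so every node is stressed, and the duration is an assumption):

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: mysql-all-cpu-stress
  namespace: chaos-mesh
spec:
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
  stressors:
    cpu:
      workers: 2
      load: 90
  duration: "5m"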

  • Expected behavior: Combined memory + CPU stress under sysbench load → primary OOMKilled → cluster NotReady during failover → new primary elected from standby → OOMKilled pod restarts and rejoins → cluster returns to Ready. Zero data loss.

  • Actual result: Primary pod-2 OOMKilled (restart count 2). Cluster went NotReady briefly. Pod-1 elected as new primary. Pod-2 rejoined as standby within ~90s. All 3 members ONLINE, GTIDs match, checksums match. PASS.

Start sysbench load first (~1186 TPS baseline), then apply both stress experiments:

➤ kubectl exec -n demo $SBPOD -- sysbench oltp_write_only \
    --threads=8 --time=120 --report-interval=10 run
[ 10s ] thds: 8 tps: 1186.74 lat (ms,95%): 13.70 err/s: 0.00  # baseline under load

➤ kubectl apply -f tests/stress-memory-primary.yaml
stresschaos.chaos-mesh.org/mysql-primary-memory-stress created
➤ kubectl apply -f tests/stress-cpu-all.yaml
stresschaos.chaos-mesh.org/mysql-all-cpu-stress created

The primary pod (pod-2) gets OOMKilled under the combined pressure. Sysbench loses all connections:

FATAL: mysql_stmt_execute() returned error 2013 (Lost connection to MySQL server during query)

During recovery, the database goes NotReady and the OOMKilled pod restarts:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS     AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     NotReady   3h30m

NAME                                 READY   STATUS    RESTARTS      AGE     ROLE
pod/mysql-ha-cluster-0               2/2     Running   0             35m     standby
pod/mysql-ha-cluster-1               2/2     Running   0             3h30m   standby
pod/mysql-ha-cluster-2               2/2     Running   2 (10s ago)   3h30m   standby

After ~90 seconds, the cluster fully recovers:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h33m

NAME                                 READY   STATUS    RESTARTS        AGE     ROLE
pod/mysql-ha-cluster-0               2/2     Running   0               38m     standby
pod/mysql-ha-cluster-1               2/2     Running   0               3h33m   primary
pod/mysql-ha-cluster-2               2/2     Running   2 (3m ago)      3h33m   standby

pod-1 is now the primary. pod-2 (OOMKilled) rejoined as standby. Verify data integrity:

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
+-----------------------------------------------+-------------+--------------+-------------+

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-82950:1000001-1000052
pod-1: 65a93aae-...:1-82950:1000001-1000052
pod-2: 65a93aae-...:1-82950:1000001-1000052

# Checksums — all match ✅
pod-0: sbtest1=4068722757, sbtest2=1583533506, sbtest3=1588922375, sbtest4=1810696374
pod-1: sbtest1=4068722757, sbtest2=1583533506, sbtest3=1588922375, sbtest4=1810696374
pod-2: sbtest1=4068722757, sbtest2=1583533506, sbtest3=1588922375, sbtest4=1810696374

Result: PASS — Combined memory + CPU + write load caused OOMKill on the primary. Failover completed automatically. The OOMKilled pod rejoined as standby. Zero data loss — all GTIDs and checksums match.

Clean up:

➤ kubectl delete -f tests/stress-memory-primary.yaml
➤ kubectl delete -f tests/stress-cpu-all.yaml

Chaos#11: OOMKill via Natural Load (90 JOINs + Writes)

Instead of using StressChaos, we try to trigger an OOMKill naturally by running 90 concurrent large JOIN queries (5-table cross-joins) across all 3 pods while sysbench writes are in-flight.

Method: Launch 90 heavy JOIN queries across all pods + 4-thread sysbench for 120s.

  • Expected behavior: Heavy JOINs + concurrent writes push memory toward the 1.5 Gi limit → either (a) primary survives by spilling to temp tables, staying Ready throughout, or (b) OOMKill triggers and cluster auto-recovers via failover. Zero data loss either way.

  • Actual result: MySQL 8.4.8 survived — no OOMKill, no pod restarts. 388 TPS sustained across the full 120s. Zero errors, GTIDs and checksums match. PASS. (Note: same test triggers OOMKill on MySQL 9.6.0 due to different memory allocation — also recovers cleanly.)
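
The JOIN launcher itself is summarized rather than listed in this post; an illustrative version might look like this (a sketch; the real test used 5-table cross-joins, so the exact query shape and row bounds here are assumptions):

# Fan 90 heavy cross-joins out across the 3 pods in the background
for i in $(seq 1 90); do
  POD="mysql-ha-cluster-$((i % 3))"
  kubectl exec -n demo "$POD" -c mysql -- \
    mysql -uroot -p"$PASS" -e "SELECT COUNT(*) FROM sbtest.sbtest1 a JOIN sbtest.sbtest2 b JOIN sbtest.sbtest3 c JOIN sbtest.sbtest4 d JOIN sbtest.sbtest5 e WHERE a.id < 40 AND b.id < 40 AND c.id < 40 AND d.id < 40 AND e.id < 40;" &
done
wait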

# Launch 90 large JOINs + sysbench writes
SQL statistics:
    transactions:                        46889  (388.43 per sec.)
    ignored errors:                      0      (0.00 per sec.)

MySQL 8.4.8 survived — no OOMKill triggered. The 1.5Gi memory limit provides sufficient headroom. No pod restarts:

➤ kubectl get pods -n demo -L kubedb.com/role
NAME                             READY   STATUS    RESTARTS   AGE   ROLE
mysql-ha-cluster-0               2/2     Running   0          6m    primary
mysql-ha-cluster-1               2/2     Running   0          6m    standby
mysql-ha-cluster-2               2/2     Running   0          6m    standby

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-129971:1000001-1000052
pod-1: 65a93aae-...:1-129971:1000001-1000052
pod-2: 65a93aae-...:1-129971:1000001-1000052

# Checksums — all match ✅
pod-0: sbtest1=3941398607, sbtest2=2282875795, sbtest3=2179429078, sbtest4=2316165905
pod-1: sbtest1=3941398607, sbtest2=2282875795, sbtest3=2179429078, sbtest4=2316165905
pod-2: sbtest1=3941398607, sbtest2=2282875795, sbtest3=2179429078, sbtest4=2316165905

Result: PASS — MySQL 8.4.8 handles memory conservatively and did not OOMKill under 90 concurrent large JOINs. 388 TPS sustained. Zero errors, zero data loss.

Note: The same test triggers OOMKill on MySQL 9.6.0 due to different memory allocation behavior. Both versions pass with zero data loss.


Chaos#12: Continuous OOM Stress on Primary (15× 30s loop)

Inject a tight loop of memory-stress chaos against the current primary to mimic a runaway query / leak that keeps OOM-killing mysqld while sysbench keeps writing. Each iteration allocates 100 GB with oomScoreAdj=-1000 so the kernel kills the MySQL container almost immediately. Fifteen iterations were applied back-to-back (≈ 8 minutes of effective OOM pressure).

  • Expected behavior: Primary container is OOMKilled repeatedly → cluster transitions Ready → Critical once one replica goes unhealthy → after Group Replication notices the primary is gone, fail over to one of the secondaries → original primary enters CrashLoopBackOff until OOM pressure clears → afterwards the pod rejoins the group, GTIDs reconcile, and the cluster returns to Ready. Zero data loss expected.

  • Actual result: Cluster transitioned Ready → Critical (+1m52s, pod-2 1/2 not ready) → NotReady briefly (+2m02s) → pod-1 promoted PRIMARY (+2m14s) → Critical until pod-2 stabilized. Pod-2 hit 11 restarts (CrashLoopBackOff while OOM loop continued) and then went into RECOVERING for ~9 minutes while the coordinator emitted the errant-GTID warning. Pod-2 finally reached ONLINE SECONDARY and the cluster returned to Ready at +12 minutes. Final GTIDs match exactly on all 3 nodes. PASS.

Save this yaml as tests/12-oom-primary.yaml (replace <primary-pod-name> with the actual current primary):

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: mysql-oom-primary
  namespace: demo
spec:
  selector:
    namespaces: [demo]
    labelSelectors:
      "statefulset.kubernetes.io/pod-name": "<primary-pod-name>"
  mode: all
  stressors:
    memory:
      workers: 1
      size: "100GB"
      oomScoreAdj: -1000
  duration: 30s

Run a tight loop so the OOM keeps recurring (otherwise a single 30 s pulse may be absorbed before sysbench notices):

PRIMARY=$(kubectl get pods -n demo -l app.kubernetes.io/name=mysqls.kubedb.com,kubedb.com/role=primary \
  -o jsonpath='{.items[0].metadata.name}')

for i in $(seq 1 15); do
  sed "s/mysql-oom-primary/mysql-oom-primary-${i}/; s/<primary-pod-name>/${PRIMARY}/" \
    tests/12-oom-primary.yaml | kubectl apply -f -
  sleep 2
done

# During chaos — pod-2 OOMKilled, pod-1 promoted
➤ kubectl get mysql -n demo
NAME               VERSION   STATUS     AGE
mysql-ha-cluster   8.4.8     Critical   17h

➤ kubectl get pods -n demo -L kubedb.com/role
NAME                 READY   STATUS             RESTARTS   ROLE
mysql-ha-cluster-0   2/2     Running            2          standby
mysql-ha-cluster-1   2/2     Running            0          primary    # promoted
mysql-ha-cluster-2   1/2     CrashLoopBackOff   8          standby

# sysbench during OOM loop
[ 10s ] thds: 8 tps: 1195.65 qps: 7177.51 lat (ms,95%): 12.98 err/s: 0.00
[ 60s ] thds: 8 tps:  201.10 qps: 1207.20 lat (ms,95%): 110.66 err/s: 0.00
[ 80s ] thds: 8 tps:   78.10 qps:  469.00 lat (ms,95%): 434.83 err/s: 0.00
[120s ] thds: 8 tps:    0.00 qps:    0.00 lat (ms,95%):   0.00 err/s: 0.00
[140s ] thds: 8 tps:    0.00 qps:    0.00 lat (ms,95%):   0.00 reconn/s: 0.60
[150s ] thds: 8 tps:  633.48 qps: 3804.70 lat (ms,95%):  20.74 err/s: 0.00 reconn/s: 0.20

# Coordinator surfaces errant GTIDs from the partial commits before OOM
➤ kubectl logs -n demo mysql-ha-cluster-2 -c mysql-coordinator | grep -i errant
WARNING: instance mysql-ha-cluster-2 has extra GTIDs not on primary:
  b5a48606-...:384291-384301 (these will be lost if clone proceeds)
instance mysql-ha-cluster-2 has extra transactions not present on the primary
  — waiting for manual approval to use clone so sync group
to approve clone, create the file:
  kubectl exec -n demo mysql-ha-cluster-2 -c mysql -- touch /scripts/approve-clone

# After OOM pressure stops, pod-2 rejoins via GR distributed recovery
➤ SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;
mysql-ha-cluster-2.…  ONLINE  SECONDARY
mysql-ha-cluster-1.…  ONLINE  PRIMARY
mysql-ha-cluster-0.…  ONLINE  SECONDARY

# GTIDs match on all 3 nodes
pod-0: b5a48606-…:1-421633:1000001-1353570
pod-1: b5a48606-…:1-421633:1000001-1353570
pod-2: b5a48606-…:1-421633:1000001-1353570

➤ kubectl get mysql -n demo
NAME               VERSION   STATUS   AGE
mysql-ha-cluster   8.4.8     Ready    17h

Observed timeline:

| Wall-clock | Δ from first OOM | Event | DB Status |
|---|---|---|---|
| 12:06:46 | | Pre-chaos baseline (pod-2 = primary) | Ready |
| 12:08:35 | 0s | First OOM iteration applied | Ready |
| 12:08:38 | +3s | pod-2 health degraded | Critical |
| 12:08:48 | +13s | pod-2 not ready (container restart) | NotReady |
| 12:09:00 | +25s | pod-1 promoted PRIMARY | NotReady |
| 12:09:10 | +35s | Operator marks pod-2 unhealthy | Critical |
| 12:09–12:14 | +1–6m | OOM loop continues, pod-2 cycles CrashLoopBackOff (×11) | Critical |
| 12:14:13 | +5m38s | OOM finished, pod-2 enters GR RECOVERING | Critical |
| 12:18+ | +9–12m | Coordinator logs errant-GTID warning every 10 s | Critical |
| 12:20:41 | +12m6s | pod-2 reaches ONLINE SECONDARY, GTIDs match | Ready |

Notable safety behavior — the KubeDB coordinator explicitly refuses to auto-clone the recovering pod when it detects extra GTIDs that don’t exist on the primary (these are local transactions committed before the OOM crash that never replicated). Cloning would silently discard those transactions. Operator approval (touch /scripts/approve-clone) is required for that destructive action; without it the pod still rejoins through GR’s distributed-recovery channel and the cluster converges. Zero silent data loss.
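
Before approving anything, you can reproduce the coordinator’s check yourself with MySQL’s built-in GTID_SUBTRACT() function. A minimal sketch (the GTID set literal is a placeholder; paste the primary’s actual @@GLOBAL.gtid_executed value):

# 1. Read the primary's executed GTID set
➤ kubectl exec -n demo mysql-ha-cluster-1 -c mysql -- mysql -uroot -p$PASS -N \
    -e "SELECT @@GLOBAL.gtid_executed"

# 2. On the recovering node, subtract it; whatever remains exists only locally (errant)
➤ kubectl exec -n demo mysql-ha-cluster-2 -c mysql -- mysql -uroot -p$PASS -N \
    -e "SELECT GTID_SUBTRACT(@@GLOBAL.gtid_executed, '<primary-gtid-set>') AS errant_gtids"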

Result: PASS — Group Replication failed over to a healthy secondary, sysbench survived (with one ~30 s zero-TPS window during failover), and the cluster fully reconverged after OOM pressure cleared. The coordinator’s errant-GTID gate prevented destructive auto-clone — a deliberate safety choice that surfaces in this scenario.

➤ kubectl delete stresschaos -n demo --all

Chaos#13: Network Partition the Primary

We are going to isolate the primary from all standby replicas. The standbys will lose contact with the primary and elect a new one.

Save this yaml as tests/13-network-partition.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mysql-primary-network-partition
  namespace: chaos-mesh
spec:
  action: partition
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        "kubedb.com/role": "standby"
  direction: both
  duration: "2m"

What this chaos does: Creates a complete network partition between the primary and all standby replicas for 2 minutes. The primary loses quorum and is expelled from the group. The standbys elect a new primary.
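
To watch the election as it happens, keep a watch on the role labels in a second terminal before applying the chaos; the ROLE column flips live as the operator relabels pods:

# -w streams updates as the operator relabels pods during failover
➤ kubectl get pods -n demo -L kubedb.com/role -w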

  • Expected behavior: Primary isolated from standbys → within group_replication_unreachable_majority_timeout (~20s) the isolated primary loses quorum → standbys elect a new primary → cluster goes Critical → after partition removed the old primary rejoins → cluster returns to Ready. Zero data loss.

  • Actual result: Partition applied for 2 minutes. New primary (pod-2) elected in ~20s. Old primary (pod-1) stayed isolated until partition ended, then rejoined automatically via coordinator restart. Cluster returned to Ready at ~4 minutes total. GTIDs and checksums match across all 3 nodes. PASS.

Before running:

➤ kubectl get pods -n demo -L kubedb.com/role
NAME                             READY   STATUS    RESTARTS       AGE    ROLE
mysql-ha-cluster-0               2/2     Running   0              11m    standby
mysql-ha-cluster-1               2/2     Running   0              3h5m   primary
mysql-ha-cluster-2               2/2     Running   1 (2m ago)     3h5m   standby

Apply the chaos:

➤ kubectl apply -f tests/13-network-partition.yaml
networkchaos.chaos-mesh.org/mysql-primary-network-partition created

Within ~20 seconds (the group_replication_unreachable_majority_timeout), the isolated primary loses quorum. The standbys elect a new primary, and the database goes Critical:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS     AGE    ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Critical   3h6m

NAME                                 READY   STATUS    RESTARTS       AGE    ROLE
pod/mysql-ha-cluster-0               2/2     Running   0              11m    standby
pod/mysql-ha-cluster-1               2/2     Running   0              3h6m   standby
pod/mysql-ha-cluster-2               2/2     Running   1 (3m ago)     3h6m   primary

Critical means the new primary (pod-2) is accepting connections and operational, but the isolated pod-1 is not yet back in the group. Let’s check GR from pod-0:

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
MEMBER_HOST                                           MEMBER_PORT  MEMBER_STATE  MEMBER_ROLE
mysql-ha-cluster-2.mysql-ha-cluster-pods.demo         3306         ONLINE        PRIMARY
mysql-ha-cluster-0.mysql-ha-cluster-pods.demo         3306         ONLINE        SECONDARY

Only 2 members visible — pod-1 is isolated. After the 2-minute partition expires, the coordinator automatically restarts the isolated node and it rejoins the group:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h10m

NAME                                 READY   STATUS    RESTARTS       AGE     ROLE
pod/mysql-ha-cluster-0               2/2     Running   0              15m     standby
pod/mysql-ha-cluster-1               2/2     Running   0              3h10m   standby
pod/mysql-ha-cluster-2               2/2     Running   1 (7m ago)     3h10m   primary

Verify data integrity:

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
MEMBER_HOST                                           MEMBER_PORT  MEMBER_STATE  MEMBER_ROLE
mysql-ha-cluster-2.mysql-ha-cluster-pods.demo         3306         ONLINE        PRIMARY
mysql-ha-cluster-1.mysql-ha-cluster-pods.demo         3306         ONLINE        SECONDARY
mysql-ha-cluster-0.mysql-ha-cluster-pods.demo         3306         ONLINE        SECONDARY

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-19238:1000001-1000048
pod-1: 65a93aae-...:1-19238:1000001-1000048
pod-2: 65a93aae-...:1-19238:1000001-1000048

# Checksums — all match ✅
pod-0: sbtest1=1213558811, sbtest2=1030289216, sbtest3=1306867904, sbtest4=3604669046
pod-1: sbtest1=1213558811, sbtest2=1030289216, sbtest3=1306867904, sbtest4=3604669046
pod-2: sbtest1=1213558811, sbtest2=1030289216, sbtest3=1306867904, sbtest4=3604669046
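
The GTID and checksum figures above come from querying each member in turn. A minimal verification sketch, assuming $PASS holds the MySQL root password as in the earlier commands:

# Any divergence in gtid_executed or in the table checksums would indicate
# lost or extra transactions on that member.
for i in 0 1 2; do
  echo "== pod-$i =="
  kubectl exec -n demo mysql-ha-cluster-$i -c mysql -- mysql -uroot -p$PASS -N \
    -e "SELECT @@GLOBAL.gtid_executed;
        CHECKSUM TABLE sbtest.sbtest1, sbtest.sbtest2, sbtest.sbtest3, sbtest.sbtest4"
done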

Result: PASS — Network partition triggered failover in ~20 seconds. The isolated node rejoined automatically after the partition was removed. Zero data loss — all GTIDs and checksums match across all 3 nodes.

Clean up:

➤ kubectl delete -f tests/13-network-partition.yaml
networkchaos.chaos-mesh.org "mysql-primary-network-partition" deleted

Chaos#14: Long Network Partition (10 min)

Save this yaml as tests/14-network-partition-long.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mysql-primary-network-partition-long
  namespace: chaos-mesh
spec:
  action: partition
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  target:
    mode: all
    selector:
      namespaces: [demo]
      labelSelectors:
        "kubedb.com/role": "standby"
  direction: both
  duration: "10m"

What this chaos does: Isolates the primary from all replicas for 10 minutes — 5x longer than the standard test.

  • Expected behavior: Primary isolated for 10 min → failover in ~20s → cluster stays Critical (isolated node unreachable) throughout the 10 min → when partition lifts, isolated node rejoins via GR distributed recovery → cluster returns to Ready. Zero data loss.

  • Actual result: Failover happened in ~20s. Pod-1 went UNREACHABLE (and restarted once during the 10-min window). After the partition was removed, pod-1 rejoined cleanly. Cluster returned to Ready. GTIDs and checksums match across all 3 nodes. PASS.

➤ kubectl apply -f tests/14-network-partition-long.yaml
networkchaos.chaos-mesh.org/mysql-primary-network-partition-long created

Failover happens within ~20 seconds. During the 10-minute partition, the cluster is Critical with only 2 members visible:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS     AGE    ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Critical   4h1m

NAME                                 READY   STATUS    RESTARTS      AGE   ROLE
pod/mysql-ha-cluster-0               2/2     Running   0             4m    standby
pod/mysql-ha-cluster-1               2/2     Running   1 (83s ago)   4m    standby
pod/mysql-ha-cluster-2               2/2     Running   0             3m    primary

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
+-----------------------------------------------+-------------+--------------+-------------+

After 10 minutes, the isolated node rejoins and the cluster returns to Ready:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    4h13m

NAME                                 READY   STATUS    RESTARTS   AGE   ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          17m   standby
pod/mysql-ha-cluster-1               2/2     Running   0          16m   standby
pod/mysql-ha-cluster-2               2/2     Running   0          15m   primary

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
+-----------------------------------------------+-------------+--------------+-------------+

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-137173:1000001-1000198
pod-1: 65a93aae-...:1-137173:1000001-1000198
pod-2: 65a93aae-...:1-137173:1000001-1000198

# Checksums — all match ✅
pod-0: sbtest1=4029255859, sbtest2=3385379889, sbtest3=3777442529, sbtest4=914443400
pod-1: sbtest1=4029255859, sbtest2=3385379889, sbtest3=3777442529, sbtest4=914443400
pod-2: sbtest1=4029255859, sbtest2=3385379889, sbtest3=3777442529, sbtest4=914443400

Result: PASS — 10-minute partition survived. Isolated node rejoined cleanly via GR distributed recovery. Zero data loss.


Chaos#15: Network Latency (1s) Between Primary and Replicas

We inject 1-second network latency between the primary and all standby replicas. Group Replication uses Paxos consensus — every write must be acknowledged by the majority. With 1s latency on every packet, writes become extremely slow.

Save this yaml as tests/15-network-latency.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mysql-replication-latency
  namespace: chaos-mesh
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        "app.kubernetes.io/instance": "mysql-ha-cluster"
        "kubedb.com/role": "standby"
  delay:
    latency: "1s"
    jitter: "50ms"
  duration: "10m"
  direction: both

What this chaos does: Adds 1-second delay (with 50ms jitter) to all network traffic between the primary and standby replicas. Paxos consensus requires majority acknowledgment for each transaction, so every write now takes at least 1 second to commit.
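
A rough latency budget explains the numbers below (illustrative arithmetic, not a measurement):

# direction: both delays each leg, so one certification exchange costs at least 2 s:
#   one-way delay            ≈ 1 s (± 50 ms jitter)
#   message out + ack back   ≥ 2 s per consensus exchange
#   observed p95 of 4–9 s    ≈ a few delayed exchanges plus queuing at 4 threads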

  • Expected behavior: Every Paxos round-trip delayed by ≥1 s → TPS collapses (each commit waits for majority ack) → cluster stays Ready (no failover — group replication doesn’t treat slow nodes as failed within unreachable_majority_timeout) → TPS recovers after chaos. Zero data loss.

  • Actual result: TPS dropped from ~460 → 0.91 (99.8%). No failover, no errors. All 3 members stayed ONLINE throughout the 10-minute delay window. GTIDs and checksums match across all 3 nodes. PASS.

Apply the chaos and run sysbench:

➤ kubectl apply -f tests/15-network-latency.yaml
networkchaos.chaos-mesh.org/mysql-replication-latency created

➤ kubectl exec -n demo $SBPOD -- sysbench oltp_write_only \
    --mysql-host=mysql-ha-cluster --mysql-port=3306 \
    --mysql-user=root --mysql-password="$PASS" \
    --mysql-db=sbtest --tables=12 --table-size=100000 \
    --threads=4 --time=60 --report-interval=10 run

[ 10s ] thds: 4 tps: 0.80 lat (ms,95%): 8333.38 err/s: 0.00
[ 20s ] thds: 4 tps: 0.60 lat (ms,95%): 9624.59 err/s: 0.00
[ 30s ] thds: 4 tps: 0.80 lat (ms,95%): 4943.53 err/s: 0.00
[ 40s ] thds: 4 tps: 1.20 lat (ms,95%): 4055.23 err/s: 0.00
[ 50s ] thds: 4 tps: 0.80 lat (ms,95%): 4128.91 err/s: 0.00
[ 60s ] thds: 4 tps: 1.20 lat (ms,95%): 4517.90 err/s: 0.00

SQL statistics:
    transactions:                        58     (0.91 per sec.)
    ignored errors:                      0      (0.00 per sec.)

During the experiment, the cluster stays Ready — all 3 members remain ONLINE. No failover triggered:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h20m

NAME                                 READY   STATUS    RESTARTS   AGE     ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          26m     standby
pod/mysql-ha-cluster-1               2/2     Running   0          3h20m   standby
pod/mysql-ha-cluster-2               2/2     Running   0          3h20m   primary

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
+-----------------------------------------------+-------------+--------------+-------------+

Verify data integrity:

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-43340:1000001-1000048
pod-1: 65a93aae-...:1-43340:1000001-1000048
pod-2: 65a93aae-...:1-43340:1000001-1000048

# Checksums — all match ✅
pod-0: sbtest1=3906784400, sbtest2=822321605, sbtest3=970778000, sbtest4=1508996567
pod-1: sbtest1=3906784400, sbtest2=822321605, sbtest3=970778000, sbtest4=1508996567
pod-2: sbtest1=3906784400, sbtest2=822321605, sbtest3=970778000, sbtest4=1508996567

Result: PASS — 1-second network latency reduced TPS from ~460 to 0.91 (99.8% reduction) because every Paxos round-trip now takes >1 second. However, the cluster stayed fully operational — no failover, no errors, zero data loss.

Clean up:

➤ kubectl delete -f tests/15-network-latency.yaml
networkchaos.chaos-mesh.org "mysql-replication-latency" deleted

Chaos#16: Packet Loss (30%) Across Cluster

We inject 30% packet loss on all MySQL pods. This simulates an unreliable network — common in cloud environments with degraded network switches or cross-AZ communication issues.

Save this yaml as tests/16-packet-loss.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mysql-cluster-packet-loss
  namespace: chaos-mesh
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
  loss:
    loss: "30"
    correlation: "25"
  duration: "5m"

What this chaos does: Drops 30% of all network packets on every MySQL pod (with 25% correlation — dropped packets tend to cluster together). This affects both client connections and GR inter-node communication.

  • Expected behavior: Packet drops disrupt GR heartbeats → some members go UNREACHABLE (transient, not failed) → TPS collapses but writes still eventually commit → cluster remains functional (primary stays primary) → after chaos, all members return to ONLINE. Zero data loss.

  • Actual result: TPS dropped to 2.70 (a 99.4% reduction). Pod-1 was observed UNREACHABLE during the chaos, but no failover was triggered. After the chaos was removed, all 3 members returned to ONLINE immediately. GTIDs and checksums match across all 3 nodes. PASS.

Apply the chaos and run sysbench:

➤ kubectl apply -f tests/16-packet-loss.yaml
networkchaos.chaos-mesh.org/mysql-cluster-packet-loss created

➤ kubectl exec -n demo $SBPOD -- sysbench oltp_write_only \
    --mysql-host=mysql-ha-cluster --mysql-port=3306 \
    --mysql-user=root --mysql-password="$PASS" \
    --mysql-db=sbtest --tables=12 --table-size=100000 \
    --threads=4 --time=60 --report-interval=10 run

[ 10s ] thds: 4 tps: 3.50 lat (ms,95%): 1708.63 err/s: 0.00
[ 20s ] thds: 4 tps: 1.60 lat (ms,95%): 6026.41 err/s: 0.00
[ 30s ] thds: 4 tps: 2.70 lat (ms,95%): 3386.99 err/s: 0.00
[ 40s ] thds: 4 tps: 2.50 lat (ms,95%): 2680.11 err/s: 0.00
[ 50s ] thds: 4 tps: 2.30 lat (ms,95%): 3706.08 err/s: 0.00
[ 60s ] thds: 4 tps: 4.00 lat (ms,95%): 2009.23 err/s: 0.00

SQL statistics:
    transactions:                        170    (2.70 per sec.)
    ignored errors:                      0      (0.00 per sec.)

During the experiment, some members may appear as UNREACHABLE due to packet loss disrupting the GR heartbeat:

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | UNREACHABLE  | SECONDARY   |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
+-----------------------------------------------+-------------+--------------+-------------+

Note: The UNREACHABLE state means the GR heartbeat packets to pod-1 are being dropped. This is a transient state — pod-1 is still running, it just can’t be reached reliably. Once packet loss is removed, it transitions back to ONLINE.
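
How long GR tolerates an UNREACHABLE member before expelling it is governed by group_replication_member_expel_timeout (default 5 seconds on MySQL 8.0.21 and later). You can confirm the effective value on any member:

➤ kubectl exec -n demo mysql-ha-cluster-2 -c mysql -- mysql -uroot -p$PASS \
    -e "SELECT @@group_replication_member_expel_timeout"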

After removing the chaos, all members recover to ONLINE and the cluster returns to Ready:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h28m

NAME                                 READY   STATUS    RESTARTS   AGE     ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          33m     standby
pod/mysql-ha-cluster-1               2/2     Running   0          3h27m   standby
pod/mysql-ha-cluster-2               2/2     Running   0          3h27m   primary

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
+-----------------------------------------------+-------------+--------------+-------------+

Verify data integrity:

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-70618:1000001-1000048
pod-1: 65a93aae-...:1-70618:1000001-1000048
pod-2: 65a93aae-...:1-70618:1000001-1000048

# Checksums — all match ✅
pod-0: sbtest1=2999610996, sbtest2=3490399001, sbtest3=2322312797, sbtest4=3764645300
pod-1: sbtest1=2999610996, sbtest2=3490399001, sbtest3=2322312797, sbtest4=3764645300
pod-2: sbtest1=2999610996, sbtest2=3490399001, sbtest3=2322312797, sbtest4=3764645300

Result: PASS — 30% packet loss reduced TPS to 2.70 (a 99.4% reduction) and caused transient UNREACHABLE member states, but zero data loss and zero errors. After the packet loss was removed, all members recovered to ONLINE immediately.

Clean up:

➤ kubectl delete -f tests/16-packet-loss.yaml
networkchaos.chaos-mesh.org "mysql-cluster-packet-loss" deleted

Chaos#17: 100% Outbound Packet Loss on Primary (2 min)

Drop 100% of outbound packets from the primary for 2 minutes — the primary is alive, accepting reads, but cannot ship anything to the secondaries (heartbeats, binlog events, GR Paxos messages all dropped). This is the network equivalent of a one-way mute and is one of the harshest single-node faults you can inject without killing the process.
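
Because direction: to drops only egress, the asymmetry is directly observable: a client on a secondary can still send packets to the primary, but no reply ever comes back, so the connection hangs until it times out. A quick probe (a sketch; assumes the coreutils timeout binary exists in the image):

➤ kubectl exec -n demo mysql-ha-cluster-1 -c mysql -- timeout 5 \
    mysql -h mysql-ha-cluster-0.mysql-ha-cluster-pods.demo -uroot -p$PASS -e "SELECT 1" \
    || echo "no reply from the primary (egress dropped)"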

  • Expected behavior: Primary cannot communicate with the group → secondaries notice the primary timed out → cluster transitions Ready → Critical → NotReady → secondaries form a quorum and elect a new primary → state moves back to Critical. After the loss clears, the old primary rejoins as SECONDARY and the cluster returns to Ready. Zero data loss.

  • Actual result: Cluster transitioned Ready → Critical (+1m23s) → NotReady (+1m42s) → new primary elected (+2m01s, just after chaos cleared) → Critical (+2m10s) → Ready (+2m29s). Failover happened cleanly to one of the secondaries, and the old primary rejoined incrementally. Final GTIDs match exactly on all 3 nodes (1-148930). PASS.

Save this yaml as tests/17-packet-loss-100.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mysql-packet-loss-primary
  namespace: demo
spec:
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  mode: all
  action: loss
  loss:
    loss: '100'
    correlation: '100'
  direction: to
  duration: 2m

➤ kubectl apply -f tests/17-packet-loss-100.yaml
networkchaos.chaos-mesh.org/mysql-packet-loss-primary created

# During chaos — primary in ERROR state in its own GR view
➤ kubectl exec -n demo mysql-ha-cluster-0 -c mysql -- mysql -uroot -p$PASS \
    -e "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members"
mysql-ha-cluster-2.…  UNREACHABLE  SECONDARY
mysql-ha-cluster-1.…  UNREACHABLE  SECONDARY
mysql-ha-cluster-0.…  ERROR

# Same query on a healthy secondary — primary already failed over
➤ kubectl exec -n demo mysql-ha-cluster-2 -c mysql -- mysql -uroot -p$PASS \
    -e "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members"
mysql-ha-cluster-2.…  ONLINE  PRIMARY
mysql-ha-cluster-1.…  ONLINE  SECONDARY

# sysbench during loss — TPS collapses to 0 then resumes after failover
[ 10s ] thds: 8 tps:   3.80 qps:  26.00 lat (ms,95%):   9.56 err/s: 0.00
[ 20s ] thds: 8 tps:   0.00 qps:   0.00 lat (ms,95%):   0.00 err/s: 0.00
[ 30s ] thds: 8 tps:   0.00 qps:   0.00 lat (ms,95%):   0.00 err/s: 0.00

# After packet loss cleared — pod-0 rejoined as SECONDARY
➤ SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;
mysql-ha-cluster-2.…  ONLINE  PRIMARY
mysql-ha-cluster-1.…  ONLINE  SECONDARY
mysql-ha-cluster-0.…  ONLINE  SECONDARY

# GTIDs match
pod-0: 32ee0840-…:1-148930:1000004-1000011
pod-1: 32ee0840-…:1-148930:1000004-1000011
pod-2: 32ee0840-…:1-148930:1000004-1000011

➤ kubectl get mysql -n demo
NAME               VERSION   STATUS   AGE
mysql-ha-cluster   8.4.8     Ready    14m

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 13:22:54 | | Pre-chaos baseline (pod-0 = primary) | Ready |
| 13:23:56 | 0s | 100% packet loss applied (pod-0 outbound) | Ready |
| 13:24:19 | +23s | Old primary’s local view stale; both pod-0 and pod-2 carry kubedb.com/role=primary briefly | Ready |
| 13:25:19 | +1m23s | Operator notices health degradation | Critical |
| 13:25:38 | +1m42s | Brief role rebalancing | NotReady |
| 13:25:56 | +2m00s | Chaos auto-recovered, packet flow restored | |
| 13:25:57 | +2m01s | pod-0 label corrected to standby | NotReady |
| 13:26:06 | +2m10s | Operator marks pod-2 PRIMARY (definitive) | Critical |
| 13:26:25 | +2m29s | pod-0 finished rejoining as ONLINE SECONDARY | Ready |

Result: PASS — failover and recovery completed in 2 m 29 s end-to-end, faster than the bandwidth-throttle case because once the loss cleared, normal binlog shipping resumed instantly. Zero data loss, GTIDs perfectly aligned across all 3 nodes.

➤ kubectl delete -f tests/17-packet-loss-100.yaml

Chaos#18: 100% Packet Duplication on Primary (2 min)

Make every outbound packet from the primary go out twice for 2 minutes. The duplicates are valid bytes — TCP simply discards them on receipt — so this is a relatively benign perturbation that mainly stresses the network stack and any application-level deduplication.

  • Expected behavior: TCP-level duplicate detection silently discards the extra packets; GR Paxos carries on normally. Cluster should stay Ready, with at most a small TPS dip from the doubled outbound bandwidth.

  • Actual result: Cluster status never changed — it stayed Ready for the entire 2-minute window. Sysbench TPS bounced between 181 and 616 (normal variance under sustained write load) with zero errors and no reconnects. PASS.

Save this yaml as tests/18-packet-duplicate-100.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mysql-packet-duplicate-primary
  namespace: demo
spec:
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  mode: all
  action: duplicate
  duplicate:
    duplicate: '100'
    correlation: '100'
  direction: to
  duration: 2m

➤ kubectl apply -f tests/18-packet-duplicate-100.yaml
networkchaos.chaos-mesh.org/mysql-packet-duplicate-primary created

# During chaos — cluster never leaves Ready
➤ kubectl get mysql -n demo
NAME               VERSION   STATUS   AGE
mysql-ha-cluster   8.4.8     Ready    67m

➤ kubectl get pods -n demo -L kubedb.com/role
NAME                 READY   STATUS    ROLE
mysql-ha-cluster-0   2/2     Running   standby
mysql-ha-cluster-1   2/2     Running   standby
mysql-ha-cluster-2   2/2     Running   primary

# sysbench during duplication — minor TPS variance, no errors
[190s ] thds: 8 tps: 597.01 lat (ms,95%): 42.61 err/s: 0.00 reconn/s: 0.00
[210s ] thds: 8 tps: 616.10 lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[220s ] thds: 8 tps: 181.00 lat (ms,95%): 114.72 err/s: 0.00 reconn/s: 0.00
[260s ] thds: 8 tps: 393.70 lat (ms,95%):  86.00 err/s: 0.00 reconn/s: 0.00

➤ SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;
mysql-ha-cluster-2.…  ONLINE  PRIMARY
mysql-ha-cluster-1.…  ONLINE  SECONDARY
mysql-ha-cluster-0.…  ONLINE  SECONDARY

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 14:16:46 | | Pre-chaos baseline (pod-2 = primary) | Ready |
| 14:16:54 | 0s | Duplicate chaos applied | Ready |
| 14:18:54 | +2m00s | Chaos auto-recovered | Ready |
| 14:21:15 | +4m21s | Confirmed steady state | Ready |

Result: PASS — the cluster fully tolerated 100% packet duplication with zero status changes and zero errors. Because GR ships binlog events over TCP, the transport layer discards duplicate segments transparently.

➤ kubectl delete -f tests/18-packet-duplicate-100.yaml

Chaos#19: 100% Packet Corruption on Primary (2 min)

Flip random bits in 100% of outbound packets from the primary for 2 minutes. Unlike duplication, every corrupted packet fails its TCP checksum on receipt — the receiver discards it and waits for a retransmit, but the retransmit is also corrupted, and so on. Effectively the primary cannot deliver any meaningful traffic to the secondaries: a slower-burning version of 100% packet loss.

  • Expected behavior: Primary’s outbound traffic is undeliverable → secondaries notice the primary is unresponsive (after the GR failure detector window) → cluster transitions Ready → NotReady → Critical, secondaries elect a new primary → after corruption clears, the old primary rejoins as SECONDARY and the cluster returns to Ready. Zero data loss expected.

  • Actual result: Cluster transitioned Ready → dual-primary label transient at +26s (old primary’s local view stale) → NotReady (+2m14s) → new primary elected (+2m17s) → Critical briefly → Ready (+3m08s). Recovery was faster than the loss case because once the corruption cleared, GR’s failure detector saw the old primary recover before its expulsion fully completed. Final GTIDs were still converging under live writes; the lag closed fully within seconds of the test ending. PASS.

Save this yaml as tests/19-packet-corrupt-100.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mysql-packet-corrupt-primary
  namespace: demo
spec:
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  mode: all
  action: corrupt
  corrupt:
    corrupt: '100'
    correlation: '100'
  direction: to
  duration: 2m

➤ kubectl apply -f tests/19-packet-corrupt-100.yaml
networkchaos.chaos-mesh.org/mysql-packet-corrupt-primary created

# During chaos — failover triggered
➤ kubectl get mysql -n demo
NAME               VERSION   STATUS     AGE
mysql-ha-cluster   8.4.8     Critical   73m

➤ kubectl get pods -n demo -L kubedb.com/role
NAME                 READY   STATUS    ROLE
mysql-ha-cluster-0   2/2     Running   standby
mysql-ha-cluster-1   2/2     Running   primary    # newly promoted
mysql-ha-cluster-2   2/2     Running   standby    # was primary, corrupted

# After corruption cleared and pod-2 rejoined
➤ SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;
mysql-ha-cluster-2.…  ONLINE  SECONDARY
mysql-ha-cluster-1.…  ONLINE  PRIMARY
mysql-ha-cluster-0.…  ONLINE  SECONDARY

# GTIDs converged after a few sysbench cycles
pod-0: 32ee0840-…:1-350037:1000004-1003196
pod-1: 32ee0840-…:1-350040:1000004-1003196
pod-2: 32ee0840-…:1-350040:1000004-1003196

➤ kubectl get mysql -n demo
NAME               VERSION   STATUS   AGE
mysql-ha-cluster   8.4.8     Ready    77m

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 14:21:58 | | Pre-chaos baseline (pod-2 = primary) | Ready |
| 14:22:32 | 0s | 100% corruption applied (pod-2 outbound) | Ready |
| 14:22:58 | +26s | Dual-primary label transient (pod-2’s local view stale) | Ready |
| 14:24:32 | +2m00s | Chaos auto-recovered | |
| 14:24:46 | +2m14s | Operator probe times out | NotReady |
| 14:24:49 | +2m17s | pod-1 promoted PRIMARY, pod-2 demoted | Critical |
| 14:25:40 | +3m08s | pod-2 rejoined as ONLINE SECONDARY | Ready |

Result: PASS — failover handled cleanly even when the primary’s outbound traffic was completely garbled, no SQL-level errors leaked through (TCP layer rejected every corrupted segment). Recovery completed in 3 m 08 s with zero data loss.

➤ kubectl delete -f tests/19-packet-corrupt-100.yaml

Chaos#20: Bandwidth Throttle (1 Mbps)

Limit the primary’s outbound network bandwidth to 1 Mbps, simulating a degraded cross-AZ network link.

  • Expected behavior: Primary’s outbound bandwidth throttled to 1 Mbps → Paxos writes back up behind limited capacity → TPS drops substantially but commits still succeed → no failover (heartbeats fit within the bandwidth budget) → TPS recovers after chaos. Zero data loss.

  • Actual result: TPS dropped from 618 → 136 (~80% drop). Zero errors, no failover. Cluster stayed Ready throughout. GTIDs match across all 3 nodes. PASS.

Save this yaml as tests/20-bandwidth-throttle.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mysql-bandwidth-throttle
  namespace: chaos-mesh
spec:
  action: bandwidth
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  bandwidth:
    rate: "1mbps"
    limit: 20971520
    buffer: 10000
  duration: "3m"

➤ kubectl apply -f tests/20-bandwidth-throttle.yaml
networkchaos.chaos-mesh.org/mysql-bandwidth-throttle created

➤ sysbench oltp_write_only --threads=4 --time=60 run
[ 10s ] thds: 4 tps: 157.99 lat (ms,95%): 30.81 err/s: 0.00
[ 20s ] thds: 4 tps: 158.00 lat (ms,95%): 30.81 err/s: 0.00
[ 30s ] thds: 4 tps: 157.20 lat (ms,95%): 30.81 err/s: 0.00
[ 40s ] thds: 4 tps: 78.60  lat (ms,95%): 99.33 err/s: 0.00
[ 50s ] thds: 4 tps: 120.50 lat (ms,95%): 86.00 err/s: 0.00
[ 60s ] thds: 4 tps: 144.80 lat (ms,95%): 44.98 err/s: 0.00

    transactions:                        8175   (136.22 per sec.)
    ignored errors:                      0      (0.00 per sec.)

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-200129:1000001-1000236
pod-1: 65a93aae-...:1-200129:1000001-1000236
pod-2: 65a93aae-...:1-200129:1000001-1000236

Result: PASS — throttling to 1 Mbps cut TPS by ~80% (618 → 136) with zero errors and no failover. The cluster stayed completely stable under the bandwidth constraint. Zero data loss.
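
A back-of-envelope check makes the ~136 TPS floor plausible (the per-transaction payload size is an assumption, not a measured value):

# 1 Mbps ≈ 125 KB/s of replication bandwidth out of the primary
# 125 KB/s ÷ ~0.9 KB of binlog payload per sysbench write transaction ≈ 139 txn/s
# which lands close to the observed 136 TPS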

➤ kubectl delete -f tests/20-bandwidth-throttle.yaml

Chaos#21: Extreme Bandwidth Throttle on Primary (1 bps, 2 min)

Push the bandwidth throttle to its absolute limit — 1 bit per second on the primary’s outbound traffic for 2 minutes. At this rate Group Replication can transmit no useful data, effectively isolating the primary from its quorum partners while leaving the pod itself responsive on the local socket. This stresses how the cluster behaves when a primary is “alive but useless.”

  • Expected behavior: Primary is unable to ship binlog events or send GR heartbeats → cluster transitions Ready → Critical once the secondaries notice the primary is unresponsive → the secondaries form a quorum and elect a new primary → state may briefly become NotReady during the role flip → after the throttle clears, the old primary rejoins as SECONDARY and the cluster returns to Ready. Zero data loss.

  • Actual result: Cluster transitioned Ready → Critical (+44s) → NotReady (+48s) → back to Critical. Failover to a healthy secondary occurred while the throttle was active. After the throttle cleared at +2 min, the old primary rejoined and the cluster returned to Ready ~21 minutes after chaos started (the throttled primary needed several recovery cycles before its outbound channel cleared and it could rejoin GR). Final GTIDs match exactly on all 3 nodes (1-562731). PASS.

Save this yaml as tests/21-bandwidth-1bps.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mysql-bandwidth-1bps-primary
  namespace: demo
spec:
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  action: bandwidth
  mode: all
  bandwidth:
    rate: '1bps'
    limit: 20971520
    buffer: 10000
  duration: 2m

➤ kubectl apply -f tests/21-bandwidth-1bps.yaml
networkchaos.chaos-mesh.org/mysql-bandwidth-1bps-primary created

# During chaos — failover triggered, old primary cannot communicate
➤ kubectl get mysql -n demo
NAME               VERSION   STATUS     AGE
mysql-ha-cluster   8.4.8     Critical   17h

# sysbench survived with brief latency spikes (longest single COMMIT > 230 s
# while the in-flight transaction waited for the primary to time out)
[ 60s ] thds: 8 tps:  840.30 lat (ms,95%):  20.74 err/s: 0.00
[ 90s ] thds: 8 tps:    0.00 lat (ms,95%):   0.00 err/s: 0.00
[120s ] thds: 8 tps:  120.40 lat (ms,95%):  77.19 err/s: 0.00 reconn/s: 0.20
[150s ] thds: 8 tps:  802.10 lat (ms,95%):  21.89 err/s: 0.00

# After throttle cleared and old primary rejoined
➤ SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;
mysql-ha-cluster-2.…  ONLINE  SECONDARY    # was primary, throttled
mysql-ha-cluster-1.…  ONLINE  PRIMARY      # promoted
mysql-ha-cluster-0.…  ONLINE  SECONDARY

# GTIDs match
pod-0: b5a48606-…:1-562731:1000001-1541180
pod-1: b5a48606-…:1-562731:1000001-1541180
pod-2: b5a48606-…:1-562731:1000001-1541180

➤ kubectl get mysql -n demo
NAME               VERSION   STATUS   AGE
mysql-ha-cluster   8.4.8     Ready    17h

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 12:36:56 | | Pre-chaos baseline (pod-2 = primary) | Ready |
| 12:39:52 | 0s | Bandwidth chaos applied (1 bps on pod-2) | Ready |
| 12:40:21 | +29s | Old primary still labelled primary in its isolated view | Ready |
| 12:40:36 | +44s | Operator detects primary unresponsive | Critical |
| 12:40:40 | +48s | Role rebalancing | NotReady |
| 12:40:55 | +1m03s | New primary (pod-1) holds; old primary still labelled primary locally | Critical |
| 12:41:52 | +2m00s | Chaos auto-recovered, throttle cleared | Critical |
| 12:41:55 | +2m03s | Old primary’s local view catches up — label corrected to standby | Critical |
| ~13:00 | +20m | Old primary fully rejoined as ONLINE SECONDARY | Ready |

Note on “two primary labels” — for the ~95 s window between the chaos being applied and the old primary’s local view converging, both the new primary (pod-1) and the throttled pod (pod-2) reported kubedb.com/role=primary, because each still saw itself as PRIMARY in its own local copy of performance_schema.replication_group_members. Group Replication itself remained Single-Primary (only the majority partition could commit); the label transient is a side effect of each pod independently reading its local GR view during a network isolation, and it resolves once the isolated pod’s view converges.
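
The practical rule that falls out of this: during any isolation event, treat only a majority-side member as authoritative and ignore the isolated pod’s local view. For example:

# Query a member known to be in the majority partition for the real group view
➤ kubectl exec -n demo mysql-ha-cluster-0 -c mysql -- mysql -uroot -p$PASS \
    -e "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members"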

Result: PASS — failover completed even under the harshest bandwidth condition, no writes accepted on the throttled side (no split-brain at the data layer), and the cluster fully reconverged after the throttle cleared. Zero data loss.

➤ kubectl delete -f tests/21-bandwidth-1bps.yaml

Chaos#22: IO Latency on Primary (100ms)

We inject 100ms latency on every disk I/O operation on the primary’s data volume. This simulates a slow storage system — a common issue in cloud environments with noisy neighbors or degraded storage backends.

Save this yaml as tests/22-io-latency.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: mysql-primary-io-latency
  namespace: chaos-mesh
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  volumePath: "/var/lib/mysql"
  path: "/**"
  delay: "100ms"
  percent: 100
  duration: "3m"

What this chaos does: Adds 100ms delay to every disk read/write operation on the primary’s MySQL data directory. Every fsync, write, and read is slowed down.
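
You can confirm the injected delay independently of MySQL with a synchronous-write probe inside the pod (a sketch; assumes dd is available in the image and briefly writes a scratch file into the data directory):

# Each 4 KB dsync write should take ~100 ms under the chaos (sub-millisecond normally);
# dd prints the aggregate timing when it finishes.
➤ kubectl exec -n demo mysql-ha-cluster-2 -c mysql -- \
    dd if=/dev/zero of=/var/lib/mysql/.chaos-probe bs=4k count=10 oflag=dsync
➤ kubectl exec -n demo mysql-ha-cluster-2 -c mysql -- rm -f /var/lib/mysql/.chaos-probe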

  • Expected behavior: Primary’s disk I/O slowed by 100ms → TPS drops significantly → cluster stays Ready (no failover — InnoDB should handle slow disk gracefully, not crash) → when chaos ends, TPS recovers. Zero data loss, GTIDs/checksums consistent.

  • Actual result: TPS degraded from 703 → 104 (~85% drop). No failover was triggered. All 3 members stayed ONLINE throughout. After the chaos was removed, TPS recovered. GTIDs and checksums match across all 3 nodes. PASS.

Apply the chaos and run sysbench simultaneously:

➤ kubectl apply -f tests/22-io-latency.yaml
iochaos.chaos-mesh.org/mysql-primary-io-latency created

➤ kubectl exec -n demo $SBPOD -- sysbench oltp_write_only \
    --mysql-host=mysql-ha-cluster --mysql-port=3306 \
    --mysql-user=root --mysql-password="$PASS" \
    --mysql-db=sbtest --tables=12 --table-size=100000 \
    --threads=4 --time=60 --report-interval=10 run

[ 10s ] thds: 4 tps: 703.87 lat (ms,95%): 9.91 err/s: 0.00   # before IO latency kicks in
[ 20s ] thds: 4 tps: 525.21 lat (ms,95%): 22.69 err/s: 0.00  # latency increasing
[ 30s ] thds: 4 tps: 459.99 lat (ms,95%): 26.20 err/s: 0.00
[ 40s ] thds: 4 tps: 358.60 lat (ms,95%): 34.33 err/s: 0.00
[ 50s ] thds: 4 tps: 238.70 lat (ms,95%): 48.34 err/s: 0.00  # degrading further
[ 60s ] thds: 4 tps: 104.10 lat (ms,95%): 81.48 err/s: 0.00  # heavily impacted

SQL statistics:
    transactions:                        23909  (398.42 per sec.)
    ignored errors:                      0      (0.00 per sec.)

During the experiment, the cluster stays Ready — no failover triggered:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h13m

NAME                                 READY   STATUS    RESTARTS   AGE     ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          19m     standby
pod/mysql-ha-cluster-1               2/2     Running   0          3h13m   standby
pod/mysql-ha-cluster-2               2/2     Running   0          3h13m   primary

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
MEMBER_HOST                                           MEMBER_PORT  MEMBER_STATE  MEMBER_ROLE
mysql-ha-cluster-2.mysql-ha-cluster-pods.demo         3306         ONLINE        PRIMARY
mysql-ha-cluster-1.mysql-ha-cluster-pods.demo         3306         ONLINE        SECONDARY
mysql-ha-cluster-0.mysql-ha-cluster-pods.demo         3306         ONLINE        SECONDARY

Verify data integrity after removing the chaos:

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-43193:1000001-1000048
pod-1: 65a93aae-...:1-43193:1000001-1000048
pod-2: 65a93aae-...:1-43193:1000001-1000048

# Checksums — all match ✅
pod-0: sbtest1=2507068505, sbtest2=3482084518, sbtest3=994648857, sbtest4=2618915730
pod-1: sbtest1=2507068505, sbtest2=3482084518, sbtest3=994648857, sbtest4=2618915730
pod-2: sbtest1=2507068505, sbtest2=3482084518, sbtest3=994648857, sbtest4=2618915730

Result: PASS — IO latency severely degraded TPS (703 → 104, ~85% reduction) but the cluster stayed fully operational with zero errors and zero data loss. No failover was triggered — MySQL’s InnoDB engine handles IO latency gracefully.

Clean up:

➤ kubectl delete -f tests/22-io-latency.yaml
iochaos.chaos-mesh.org "mysql-primary-io-latency" deleted

Chaos#23: IO Latency 2s on /var/lib/mysql (Primary, 2 min)

Inject a 2-second per-operation latency on every file read and write under /var/lib/mysql on the primary pod, for 2 minutes. Compared to Chaos#22 (100 ms IO latency), this is 20× more severe — every InnoDB redo log flush, every binlog write, every checkpoint stalls for 2 seconds.

  • Expected behavior: Primary’s storage IO becomes effectively unusable → operator probes time out → cluster transitions Ready → NotReady. GR’s failure detector may or may not trip — Paxos heartbeats use the network, not local IO, so the primary may still appear “alive” to peers. After IO returns to normal, the primary catches up (replication queue drained) and the cluster returns to Ready. Zero data loss expected.

  • Actual result: Cluster transitioned Ready → NotReady (+12s, operator probe timed out) → no failover (primary stayed pod-2 throughout) → returned to Ready 3s after the chaos auto-cleared. Group Replication tolerated the IO stall because GR-to-GR communication is over TCP and the pod’s network stack stayed responsive. PASS.

Comparison with Chaos#22 (100 ms IO latency): the milder version slowed TPS without flipping the cluster status. The 2-second variant flipped the status to NotReady because the operator’s TCP probe to mysqld timed out, but GR itself never triggered a failover.

Save this yaml as tests/23-io-latency-2s.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: mysql-io-latency-primary
  namespace: demo
spec:
  action: latency
  mode: all
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  volumePath: /var/lib/mysql
  path: '/var/lib/mysql/**/*'
  delay: '2000ms'
  percent: 100
  duration: 2m

➤ kubectl apply -f tests/23-io-latency-2s.yaml
iochaos.chaos-mesh.org/mysql-io-latency-primary created

# During chaos — primary still PRIMARY but operator probe times out
➤ kubectl get mysql -n demo
NAME               VERSION   STATUS     AGE
mysql-ha-cluster   8.4.8     NotReady   114m

➤ kubectl get pods -n demo -L kubedb.com/role
NAME                 READY   STATUS    ROLE
mysql-ha-cluster-0   2/2     Running   standby
mysql-ha-cluster-1   2/2     Running   standby
mysql-ha-cluster-2   2/2     Running   primary    # unchanged

# GR view from primary — all members still ONLINE
➤ SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;
mysql-ha-cluster-2.…  ONLINE  PRIMARY
mysql-ha-cluster-1.…  ONLINE  SECONDARY
mysql-ha-cluster-0.…  ONLINE  SECONDARY

# After chaos cleared — back to Ready in 3s
➤ kubectl get mysql -n demo
NAME               VERSION   STATUS   AGE
mysql-ha-cluster   8.4.8     Ready    116m

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 15:02:15 | | Pre-chaos baseline (pod-2 = primary) | Ready |
| 15:02:05 | 0s | IO latency 2s applied to pod-2 | Ready |
| 15:02:27 | +22s | Operator probe times out | NotReady |
| 15:04:05 | +2m00s | Chaos auto-recovered, IO restored | |
| 15:04:08 | +2m03s | Operator probe succeeds, status restored | Ready |

Result: PASS — no failover required, no data loss; pod-2 retained PRIMARY role and resumed normal IO immediately after chaos cleared. Demonstrates that catastrophic local-disk slowness is contained at the operator-status layer without escalating to a GR membership change.

➤ kubectl delete -f tests/23-io-latency-2s.yaml

Chaos#24: IO Fault (EIO Errors, 50%)

Inject actual I/O read/write errors on 50% of disk operations. Unlike IO latency which slows things down, IO faults cause operations to fail — simulating a failing disk.

Save this yaml as tests/24-io-fault.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: mysql-primary-io-fault
  namespace: chaos-mesh
spec:
  action: fault
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  volumePath: "/var/lib/mysql"
  path: "/**"
  errno: 5
  percent: 50
  duration: "3m"

What this chaos does: Returns EIO (errno 5) on 50% of all disk I/O operations on the primary’s MySQL data directory. InnoDB cannot write to disk reliably.
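
To watch mysqld hit the injected errors, tail the container log while the fault is active. The exact message text varies by MySQL version, so the grep pattern below is only illustrative:

➤ kubectl logs -n demo mysql-ha-cluster-2 -c mysql --tail=100 | grep -iE "i/o error|innodb|aborting"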

  • Expected behavior: InnoDB hits real I/O errors → MySQL crashes on primary → GR expels the unreachable member → standby elected as new primary → crashed pod restarts, InnoDB crash recovery repairs the datadir → pod rejoins as standby → cluster returns to Ready. Zero data loss.

  • Actual result: Primary pod-2 crashed (sysbench reported error 2013, lost connection). Pod-2 was observed UNREACHABLE, and failover went to pod-1. After the chaos was removed, InnoDB crash recovery plus GR distributed recovery brought pod-2 back as standby within ~90s. All 3 members ONLINE, GTIDs match. PASS.

➤ kubectl apply -f tests/24-io-fault.yaml
iochaos.chaos-mesh.org/mysql-primary-io-fault created

FATAL: Lost connection to MySQL server during query   # MySQL crashed from IO errors

The primary (pod-2) crashes. During recovery, it appears as UNREACHABLE:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS     AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     NotReady   4h32m

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | UNREACHABLE  | PRIMARY     |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
+-----------------------------------------------+-------------+--------------+-------------+

After removing the chaos and restarting the crashed pod, InnoDB crash recovery repairs the data and the node rejoins:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    4h37m

NAME                                 READY   STATUS    RESTARTS   AGE   ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          7m    standby
pod/mysql-ha-cluster-1               2/2     Running   0          40m   primary
pod/mysql-ha-cluster-2               2/2     Running   0          90s   standby

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-166608:1000001-1000236
pod-1: 65a93aae-...:1-166608:1000001-1000236
pod-2: 65a93aae-...:1-166608:1000001-1000236

Result: PASS — 50% IO errors crashed MySQL on the primary, triggering failover. InnoDB crash recovery + GR distributed recovery handled it with zero data loss.

➤ kubectl delete -f tests/24-io-fault.yaml

Chaos#25: 100% IO Fault (errno=5 / EIO) on Primary’s Data Directory (2 min)

Force every read/write under /var/lib/mysql on the primary to return EIO (errno 5) for 2 minutes — simulating a complete underlying disk failure. Compared to Chaos#24 (50% EIO), this leaves no successful operation on the primary’s data files: every InnoDB read, log write, and binlog flush fails immediately.

  • Expected behavior: Primary cannot read or write its data directory → mysqld either aborts (InnoDB shuts down on persistent IO error) or returns errors to clients → cluster transitions Ready → NotReady → Critical → secondaries elect a new primary → after IO returns, the failed primary restarts and rejoins as SECONDARY (likely via incremental recovery, since its data files were not corrupted, only blocked). Zero data loss.

  • Actual result: Cluster transitioned Ready → NotReady (+4s, primary IO collapsed) → role rebalancing (+7s) → pod-1 promoted PRIMARY (+29s) → Critical (+1m20s). After the chaos auto-cleared at +2m, pod-2 entered RECOVERING (~82k GTID gap to catch up via GR distributed recovery). Cluster reached Ready ~9 min after chaos start. Final GTIDs match exactly on all 3 nodes (1-802850). PASS.

Comparison with Chaos#24 (50% EIO): the partial-fault variant also ended in failover, but only after mysqld crashed. The 100% variant produces the same outcome faster — failover in 29 s, without waiting for the primary process to crash.

Save this yaml as tests/25-io-fault-eio.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: mysql-io-fault-primary
  namespace: demo
spec:
  action: fault
  mode: all
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  volumePath: /var/lib/mysql
  path: '/var/lib/mysql/**/*'
  errno: 5
  percent: 100
  duration: 2m

➤ kubectl apply -f tests/25-io-fault-eio.yaml
iochaos.chaos-mesh.org/mysql-io-fault-primary created

# During chaos — failover triggered
➤ kubectl get mysql -n demo
NAME               VERSION   STATUS     AGE
mysql-ha-cluster   8.4.8     Critical   122m

➤ kubectl get pods -n demo -L kubedb.com/role
NAME                 READY   STATUS    ROLE
mysql-ha-cluster-0   2/2     Running   standby
mysql-ha-cluster-1   2/2     Running   primary    # promoted
mysql-ha-cluster-2   2/2     Running   standby    # IO faulted

# After chaos cleared — pod-2 rejoins as SECONDARY
➤ SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;
mysql-ha-cluster-2.…  ONLINE  SECONDARY
mysql-ha-cluster-1.…  ONLINE  PRIMARY
mysql-ha-cluster-0.…  ONLINE  SECONDARY

# GTIDs match
pod-0: 32ee0840-…:1-802850:1000004-1052958
pod-1: 32ee0840-…:1-802850:1000004-1052958
pod-2: 32ee0840-…:1-802850:1000004-1052958

➤ kubectl get mysql -n demo
NAME               VERSION   STATUS   AGE
mysql-ha-cluster   8.4.8     Ready    126m

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 15:10:43 | | Pre-chaos baseline (pod-2 = primary) | Ready |
| 15:11:33 | 0s | 100% IO fault (EIO) applied to pod-2 | Ready |
| 15:11:37 | +4s | mysqld stops servicing reads/writes | NotReady |
| 15:11:40 | +7s | Role rebalancing (all standby briefly) | NotReady |
| 15:12:02 | +29s | pod-1 promoted PRIMARY | NotReady |
| 15:12:53 | +1m20s | pod-2 marked unhealthy | Critical |
| 15:13:33 | +2m00s | Chaos auto-recovered, IO restored | Critical |
| ~15:18 | +6–7m | pod-2 in RECOVERING, applying ~82k pending GTIDs | Critical |
| 15:20:17 | +8m44s | pod-2 reached ONLINE SECONDARY | Ready |

Result: PASS — even with every disk operation failing on the primary, GR’s failover and KubeDB’s incremental rejoin completed without any human intervention. Zero data loss across all 3 nodes.

➤ kubectl delete -f tests/25-io-fault-eio.yaml

Chaos#26: File Attribute Override on Primary’s Data Directory (5 min)

Use Chaos Mesh’s IOChaos.attrOverride to overwrite the file attributes (mode bits) of every file under /var/lib/mysql on the primary pod for 5 minutes. With perm: 72 (octal 0110 — execute-only, no read or write), mysqld immediately loses access to its own data files even though the files themselves are intact.
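
The perm value is decimal in the Chaos Mesh spec; a one-liner confirms the octal mode it maps to:

➤ printf '%o\n' 72
110    # i.e. --x--x---: execute bits only, no read or write for anyone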

  • Expected behavior: mysqld can no longer read or write its data files → it errors out and likely shuts down. Cluster transitions Ready → NotReady → Critical. Secondaries elect a new primary. After chaos clears (5 min), the original file attributes return, mysqld restarts cleanly, and the pod rejoins as SECONDARY. Zero data loss.

  • Actual result: Cluster transitioned Ready → NotReady (+6s) → pod-2 promoted PRIMARY (+28s) → Critical (+34s). After the 5-minute attr-override window cleared, pod-1 took a few extra minutes to fully restart and rejoin (mysqld had to re-open files whose mode bits had just flipped back). Final GTIDs match exactly on all 3 nodes (1-884692). Total recovery ~21 minutes from chaos start. PASS.

Save this yaml as tests/26-io-attroverride.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: mysql-io-attroverride-primary
  namespace: demo
spec:
  action: attrOverride
  mode: all
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  volumePath: /var/lib/mysql
  path: '/var/lib/mysql/**/*'
  attr:
    perm: 72
  percent: 100
  duration: 5m

➤ kubectl apply -f tests/26-io-attroverride.yaml
iochaos.chaos-mesh.org/mysql-io-attroverride-primary created

# During chaos — primary cannot access its own files
➤ kubectl exec -it -n demo mysql-ha-cluster-1 -c mysql -- mysql -uroot -p$PASS
ERROR 2002 (HY000): Can't connect to local MySQL server through socket
       '/var/run/mysqld/mysqld.sock' (111)

➤ kubectl get mysql -n demo
NAME               VERSION   STATUS     AGE
mysql-ha-cluster   8.4.8     Critical   128m

➤ kubectl get pods -n demo -L kubedb.com/role
NAME                 READY   STATUS    ROLE
mysql-ha-cluster-0   2/2     Running   standby
mysql-ha-cluster-1   2/2     Running   standby    # was primary, file perms broken
mysql-ha-cluster-2   2/2     Running   primary    # promoted

# After 5-min chaos duration cleared and pod-1 rejoined
➤ SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;
mysql-ha-cluster-2.…  ONLINE  PRIMARY
mysql-ha-cluster-1.…  ONLINE  SECONDARY
mysql-ha-cluster-0.…  ONLINE  SECONDARY

# GTIDs match
pod-0: 32ee0840-…:1-884692:1000004-1271273
pod-1: 32ee0840-…:1-884692:1000004-1271273
pod-2: 32ee0840-…:1-884692:1000004-1271273

Observed timeline:

| Wall-clock | Δ from chaos | Event | DB Status |
|---|---|---|---|
| 15:22:10 | | Pre-chaos baseline (pod-1 = primary) | Ready |
| 15:22:11 | 0s | IOChaos.attrOverride (perm=72) applied to pod-1 | Ready |
| 15:22:17 | +6s | mysqld lost access to its files | NotReady |
| 15:22:39 | +28s | pod-2 promoted PRIMARY | NotReady |
| 15:22:45 | +34s | Operator marks pod-1 unhealthy | Critical |
| 15:27:11 | +5m00s | Chaos auto-recovered, file perms restored | Critical |
| ~15:43 | +21m | pod-1 fully rejoined as ONLINE SECONDARY | Ready |

Result: PASS — failover handled cleanly even when the underlying filesystem made the data files unreadable. Once normal permissions returned, the affected pod rejoined automatically. Zero data loss.

➤ kubectl delete -f tests/26-io-attroverride.yaml

Chaos#27: Random IO Mistake (READ/WRITE byte corruption) on Primary (2 min)

Use IOChaos.action: mistake with methods: [READ, WRITE] to silently corrupt up to 10 bytes of every read and write under /var/lib/mysql on the primary, for 2 minutes. Unlike IOChaos.fault (which returns errors), mistake succeeds — but with garbled bytes — so applications never see an IO failure. Critically, the corrupted bytes from WRITE operations are persisted to the real PVC, surviving the chaos.

This is the only test in the suite that produces durable on-disk damage by design. It is included to demonstrate where automatic recovery’s boundary lies and what KubeDB’s safety gates do when the boundary is crossed.

  • Expected behavior: Primary continues writing while corrupted bytes silently land on disk. When the chaos auto-clears at +2 min, mysqld may keep running until something forces it to re-read a corrupted page (next restart, next failover, etc.). Once a corrupted page is read, InnoDB will detect a checksum/page-id mismatch and refuse to start. The cluster fails over to a healthy secondary; the corrupted pod cannot self-recover because the corrupted data is on its real disk. Recovery requires explicitly approving a CLONE from the healthy primary (touch /scripts/approve-clone) so the bad data files are replaced.

  • Actual result: Cluster transitioned Ready → NotReady (+2m13s, after chaos cleared) → pod-1 promoted PRIMARY (+2m35s) → Critical (+3m). The chaos itself recovered cleanly at +2m (Chaos Mesh Recovered: Successfully), but the corrupted bytes were already persisted on pod-2’s PVC. mysqld on pod-2 entered a crash loop with [ERROR] [MY-011906] [InnoDB] Database page corruption on disk … on page [page id: space=13, page number=667]. The cluster stays in Critical (2 of 3 nodes healthy) until an operator approves clone recovery on pod-2. Cluster remains writable on pod-1 throughout — no data loss on the surviving nodes. PASS — with documented manual-recovery requirement for the corrupted pod.

Why this is the only test that doesn’t fully self-heal: every other chaos type either intercepts syscalls in memory (latency, fault, attrOverride) or perturbs network packets, so removing the chaos restores normal operation. mistake actually rewrites bytes that hit the disk — those bytes are real corruption from the storage layer’s perspective. KubeDB’s coordinator deliberately refuses to auto-clone to avoid silent data loss in scenarios where the “extra GTIDs” might be legitimate; this test shows that refusal is the correct behavior. The recovery path is deliberately explicit: either touch /scripts/approve-clone on the corrupted pod, or delete its PVC and let KubeDB rebuild the node via CLONE.
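
For operators who want independent confirmation before approving the clone, MySQL ships an offline page-verification tool, innochecksum. It only runs against a stopped server, which is exactly the state pod-2 is in here; the tablespace path below is illustrative, and this assumes the crash-looping container stays up long enough to exec into:

➤ kubectl exec -n demo mysql-ha-cluster-2 -c mysql -- \
    innochecksum /var/lib/mysql/sbtest/sbtest1.ibd
# a page-checksum mismatch here corroborates the MY-011906 errors in the log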

Save this yaml as tests/27-io-mistake.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: mysql-io-mistake-primary
  namespace: demo
spec:
  action: mistake
  mode: all
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  volumePath: /var/lib/mysql
  path: '/var/lib/mysql/**/*'
  mistake:
    filling: zero
    maxOccurrences: 1
    maxLength: 10
  methods:
    - READ
    - WRITE
  percent: 100
  duration: 2m
➤ kubectl apply -f tests/27-io-mistake.yaml
iochaos.chaos-mesh.org/mysql-io-mistake-primary created

# Chaos auto-recovers at +2m, but pod-2 keeps crashing
➤ kubectl describe iochaos mysql-io-mistake-primary -n demo | grep -E "TimeUp|Recovered"
Normal  TimeUp     Time up according to the duration
Normal  Recovered  Successfully recover chaos for demo/mysql-ha-cluster-2/mysql

# pod-2 mysqld crash loop — InnoDB detects on-disk corruption
➤ kubectl logs -n demo mysql-ha-cluster-2 -c mysql --tail=20
[ERROR] [MY-011906] [InnoDB] Database page corruption on disk or a failed file
read of page [page id: space=13, page number=667]. You may have to recover from
a backup.
[ERROR] [MY-011906] [InnoDB] Database page corruption on disk or a failed file
read of page [page id: space=13, page number=955].

# pod-2 socket unreachable; coordinator stuck in retry loop
➤ kubectl exec -n demo mysql-ha-cluster-2 -c mysql -- \
    mysql -uroot -p"$PASS" -e "SELECT 1"
ERROR 2002 (HY000): Can't connect to local MySQL server through socket
       '/var/run/mysqld/mysqld.sock' (111)

➤ kubectl get mysql -n demo
NAME               VERSION   STATUS     AGE
mysql-ha-cluster   8.4.8     Critical   160m

# GR view from a healthy member — only 2 of 3 visible
➤ SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;
mysql-ha-cluster-1.…  ONLINE  PRIMARY     # promoted, accepting writes
mysql-ha-cluster-0.…  ONLINE  SECONDARY

# Recovery: approve a clone from the healthy primary
➤ kubectl exec -n demo mysql-ha-cluster-2 -c mysql -- touch /scripts/approve-clone

# …or alternatively, delete the corrupted PVC and let KubeDB rebuild via clone
➤ kubectl delete pvc data-mysql-ha-cluster-2 -n demo
➤ kubectl delete pod mysql-ha-cluster-2 -n demo
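
Whichever path is taken, the rebuild’s progress is visible on the recovering node through the clone plugin’s standard performance_schema views; a sketch:

# Stage-by-stage clone progress, queried on the recovering node
➤ SELECT STAGE, STATE, ESTIMATE, DATA FROM performance_schema.clone_progress;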

Observed timeline:

Wall-clock | Δ from chaos | Event | DB Status
---------- | ------------ | ----- | ---------
15:44:33 | — | Pre-chaos baseline (pod-2 = primary) | Ready
15:44:44 | 0s | IOChaos.mistake (READ/WRITE, 100%) applied to pod-2 | Ready
15:46:44 | +2m00s | Chaos auto-recovered (FUSE intercept removed) | Ready
15:46:53 | +2m09s | mysqld on pod-2 starts hitting corrupted pages | Ready
15:46:57 | +2m13s | pod-2 health degraded | NotReady
15:47:19 | +2m35s | pod-1 promoted PRIMARY | NotReady
15:47:44 | +3m00s | pod-2 marked unhealthy, mysqld crash-looping | Critical
Manual | — | Operator approves clone or deletes PVC | (recovers)

Result: PASS (with documented manual recovery) — failover to a healthy secondary worked, no data loss on the surviving nodes, and KubeDB’s coordinator correctly refused to silently auto-clone the corrupted pod (which would mask whether the extra bytes were data or chaos artifacts). This is the limit of zero-touch recovery: silent on-disk corruption is by definition something an operator must consciously authorize replacing.

➤ kubectl delete -f tests/27-io-mistake.yaml

Chaos#28: Clock Skew (-5 min)

Shift the primary’s system clock back by 5 minutes. Tests whether clock drift breaks Paxos consensus.

  • Expected behavior: Primary’s wall clock shifted -5 min → MySQL’s logical/GTID ordering unaffected (GR uses logical clocks, not wall-clock) → sysbench may see modest TPS drop → no failover, no errors. Zero data loss.

  • Actual result: TPS dropped from 618 → 359 (~42% drop, largely from sysbench/query timing noise). No failover, no errors. GTIDs match across all 3 nodes. PASS — confirms GR Paxos uses logical clocks.

Save this yaml as tests/28-clock-skew.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: mysql-primary-clock-skew
  namespace: chaos-mesh
spec:
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  timeOffset: "-5m"
  duration: "3m"
➤ kubectl apply -f tests/28-clock-skew.yaml
timechaos.chaos-mesh.org/mysql-primary-clock-skew created
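
To confirm the skew is live, compare server-side clocks: NOW() is evaluated inside mysqld, the process TimeChaos injects into, so the primary should report a timestamp about 5 minutes behind a secondary. The pod names below are illustrative; check which pod currently holds the primary role first:

# Compare server-side clocks on the primary vs a secondary
➤ kubectl exec -n demo mysql-ha-cluster-1 -c mysql -- mysql -uroot -p"$PASS" -e "SELECT NOW();"
➤ kubectl exec -n demo mysql-ha-cluster-0 -c mysql -- mysql -uroot -p"$PASS" -e "SELECT NOW();"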

➤ sysbench oltp_write_only --threads=4 --time=60 run
[ 10s ] thds: 4 tps: 617.97 lat (ms,95%): 11.87 err/s: 0.00
[ 20s ] thds: 4 tps: 477.40 lat (ms,95%): 22.69 err/s: 0.00
[ 30s ] thds: 4 tps: 255.50 lat (ms,95%): 34.33 err/s: 0.00  # degraded
[ 40s ] thds: 4 tps: 429.50 lat (ms,95%): 27.17 err/s: 0.00
[ 50s ] thds: 4 tps: 384.10 lat (ms,95%): 32.53 err/s: 0.00
[ 60s ] thds: 4 tps: 358.90 lat (ms,95%): 34.95 err/s: 0.00

    transactions:                        25238  (420.56 per sec.)
    ignored errors:                      0      (0.00 per sec.)

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-191908:1000001-1000236
pod-1: 65a93aae-...:1-191908:1000001-1000236
pod-2: 65a93aae-...:1-191908:1000001-1000236

Result: PASS — Clock skew (-5 min) reduced TPS ~42% (618→359) but caused no failover and no errors. GR’s Paxos protocol uses logical clocks, not wall-clock time. Zero data loss.

➤ kubectl delete -f tests/28-clock-skew.yaml

Chaos#29: Clock Skew −10 min on Primary (2 min)

Shift the primary pod’s wall clock back 10 minutes for 2 minutes. Distinct from Chaos#28 (−5 min) — at this magnitude, when the chaos lifts the OS clock has to step forward by 10 minutes in one shot. Many database internals (replication heartbeats, mutex timeouts, certification timestamps) are not robust to large forward time jumps.

  • Expected behavior: Primary’s clock drifts back 10 min during chaos → cluster status may briefly flip while operator probes notice the discrepancy. After chaos clears, the 10-minute forward jump may stress mysqld’s internal time-keeping and trigger a crash. If primary crashes, GR fails over to a healthy secondary, then the old primary restarts and rejoins. Zero data loss.

  • Actual result: Cluster stayed Ready during the 2-minute chaos window. ~3 minutes after chaos cleared (when the clock jumped forward by 10 min), the primary’s mysqld container crashed. This triggered: Critical (+4m47s from chaos start) → NotReady (+4m50s) → pod-2 promoted PRIMARY (+5m16s) → cluster Ready. Old primary restarted automatically and rejoined as SECONDARY. Final GTIDs match exactly on all 3 nodes (1-504798). PASS.

Comparison with Chaos#28 (−5 min): the smaller skew was absorbed without any failover or container restart. The −10 min variant deterministically triggered a primary restart when the time jumped forward, exercising the failover path. KubeDB handled both gracefully.

Save this yaml as tests/29-clock-skew-10m.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: mysql-clock-skew-primary
  namespace: demo
spec:
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  mode: all
  timeOffset: '-10m'
  clockIds:
    - CLOCK_REALTIME
  duration: 2m
➤ kubectl apply -f tests/29-clock-skew-10m.yaml
timechaos.chaos-mesh.org/mysql-clock-skew-primary created

# During chaos — cluster stays Ready
➤ kubectl get mysql -n demo
NAME               VERSION   STATUS   AGE
mysql-ha-cluster   8.4.8     Ready    93m

# ~3 min after chaos cleared, primary mysqld restarts → failover triggers
➤ kubectl get pods -n demo -L kubedb.com/role
NAME                 READY   STATUS    RESTARTS    ROLE
mysql-ha-cluster-0   2/2     Running   0           standby
mysql-ha-cluster-1   2/2     Running   1           standby    # was primary, mysqld crashed on time jump
mysql-ha-cluster-2   2/2     Running   0           primary    # promoted

➤ SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;
mysql-ha-cluster-2.…  ONLINE  PRIMARY
mysql-ha-cluster-1.…  ONLINE  SECONDARY
mysql-ha-cluster-0.…  ONLINE  SECONDARY

# GTIDs match
pod-0: 32ee0840-…:1-504798:1000004-1003213
pod-1: 32ee0840-…:1-504798:1000004-1003213
pod-2: 32ee0840-…:1-504798:1000004-1003213

Observed timeline:

Wall-clock | Δ from chaos | Event | DB Status
---------- | ------------ | ----- | ---------
14:39:52 | — | Pre-chaos baseline (pod-1 = primary) | Ready
14:40:20 | 0s | Clock skew −10 min applied to pod-1 | Ready
14:42:20 | +2m00s | Chaos auto-recovered, clock jumps forward 10 min | Ready
14:45:04 | +4m44s | pod-1 mysqld crashes from time discontinuity, container restarts | —
14:45:07 | +4m47s | Operator marks pod-1 unhealthy | Critical
14:45:10 | +4m50s | Role rebalancing | NotReady
14:45:36 | +5m16s | pod-2 promoted PRIMARY | Ready
~14:47 | +6–7m | pod-1 rejoined as ONLINE SECONDARY | Ready

Result: PASS — even though the time jump caused the primary to crash, GR’s failover and KubeDB’s rejoin logic handled it without manual intervention. Zero data loss, GTIDs perfectly aligned.

➤ kubectl delete -f tests/29-clock-skew-10m.yaml

Chaos#30: DNS Failure on Primary

Block all DNS resolution on the primary for 3 minutes. GR uses hostnames for communication.

  • Expected behavior: DNS resolution fails on primary → existing TCP connections remain open (already resolved) → writes continue with modest TPS drop → no failover (heartbeats go over existing sockets) → when DNS recovers, TPS returns to baseline. Zero data loss.

  • Actual result: TPS slid from ~720 at the start of the run to ~360 by the end, with the run averaging 486 TPS (roughly a 33% drop). No failover, no errors. GR heartbeats kept working over established sockets. GTIDs match across all 3 nodes. PASS.

Save this yaml as tests/30-dns-error.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: mysql-dns-error-primary
  namespace: chaos-mesh
spec:
  action: error
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  duration: "3m"
➤ kubectl apply -f tests/30-dns-error.yaml
dnschaos.chaos-mesh.org/mysql-dns-error-primary created
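
The failure can also be observed from inside the primary by resolving a peer hostname directly. This assumes getent is available in the image (typical for glibc-based images) and that pod-1 currently holds the primary role:

# During chaos, name resolution inside the primary should fail
➤ kubectl exec -n demo mysql-ha-cluster-1 -c mysql -- \
    getent hosts mysql-ha-cluster-0.mysql-ha-cluster-pods.demo.svc.cluster.local
# expect no output / non-zero exit while the DNSChaos is active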

➤ sysbench oltp_write_only --threads=4 --time=60 run
[ 10s ] thds: 4 tps: 720.47 lat (ms,95%): 9.56 err/s: 0.00
[ 20s ] thds: 4 tps: 560.00 lat (ms,95%): 17.95 err/s: 0.00
[ 30s ] thds: 4 tps: 449.90 lat (ms,95%): 26.20 err/s: 0.00
[ 40s ] thds: 4 tps: 411.10 lat (ms,95%): 29.72 err/s: 0.00
[ 50s ] thds: 4 tps: 417.40 lat (ms,95%): 32.53 err/s: 0.00
[ 60s ] thds: 4 tps: 360.00 lat (ms,95%): 36.24 err/s: 0.00

    transactions:                        29193  (486.49 per sec.)
    ignored errors:                      0      (0.00 per sec.)

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-166380:1000001-1000198
pod-1: 65a93aae-...:1-166380:1000001-1000198
pod-2: 65a93aae-...:1-166380:1000001-1000198

Result: PASS — the DNS failure cut throughput by roughly a third on average (sliding from 720 to 360 TPS over the run) but caused no failover and no errors. Existing TCP connections between GR members stayed open. Zero data loss.

➤ kubectl delete -f tests/30-dns-error.yaml

Chaos#31: Coordinator Crash

Kill only the mysql-coordinator sidecar container on the primary pod, leaving the MySQL process running. Tests whether MySQL GR operates independently of the coordinator.

  • Expected behavior: Coordinator sidecar killed, MySQL process untouched → coordinator container restarted by Kubernetes → MySQL keeps serving writes throughout → no failover, no TPS drop, cluster stays Ready. Zero data loss.

  • Actual result: Coordinator restarted automatically. MySQL stayed primary, no role change. Sysbench ran at 691 TPS (baseline) throughout. All 3 members ONLINE, GTIDs match. PASS — confirms the coordinator is a management-layer sidecar; GR runs independently.

# Kill the coordinator process (PID 1) on the primary
➤ kubectl exec -n demo mysql-ha-cluster-1 -c mysql-coordinator -- kill 1

The coordinator container restarts automatically. MySQL stays running — no failover, no interruption. Database stays Ready:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h59m

NAME                                 READY   STATUS    RESTARTS   AGE    ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          3m     standby
pod/mysql-ha-cluster-1               2/2     Running   0          2m     primary   ← still primary
pod/mysql-ha-cluster-2               2/2     Running   0          2m     standby

Writes work immediately at full speed (691 TPS):

➤ sysbench oltp_write_only --threads=4 --time=10 run
[ 5s ] thds: 4 tps: 678.95 lat (ms,95%): 10.84 err/s: 0.00
[10s ] thds: 4 tps: 703.00 lat (ms,95%): 10.65 err/s: 0.00
    transactions:                        6914   (691.13 per sec.)

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
+-----------------------------------------------+-------------+--------------+-------------+

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-137155:1000001-1000058
pod-1: 65a93aae-...:1-137155:1000001-1000058
pod-2: 65a93aae-...:1-137155:1000001-1000058

Result: PASS — Coordinator crash has zero impact on MySQL. No failover, no write interruption, 691 TPS (full speed). The coordinator is a management layer — MySQL GR operates independently.


Chaos#32: Degraded Failover (IO Latency + Pod Kill Workflow)

A complex workflow: first inject IO latency on the primary to degrade it, then kill the degraded primary while it’s struggling. This simulates a cascading failure.

Save this yaml as tests/32-degraded-failover.yaml:

apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: mysql-degraded-failover-scenario
  namespace: chaos-mesh
spec:
  entry: start-degradation-and-kill
  templates:
    - name: start-degradation-and-kill
      templateType: Parallel
      children:
        - inject-io-latency
        - delayed-kill-sequence
    - name: inject-io-latency
      templateType: IOChaos
      deadline: "2m"
      ioChaos:
        action: latency
        mode: one
        selector:
          namespaces: ["demo"]
          labelSelectors:
            "app.kubernetes.io/instance": "mysql-ha-cluster"
            "kubedb.com/role": "primary"
        volumePath: "/var/lib/mysql"
        delay: "50ms"
        percent: 100
    - name: delayed-kill-sequence
      templateType: Serial
      children: [wait-30s, kill-primary-pod]
    - name: wait-30s
      templateType: Suspend
      deadline: "30s"
    - name: kill-primary-pod
      templateType: PodChaos
      deadline: "1m"
      podChaos:
        action: pod-kill
        mode: one
        selector:
          namespaces: ["demo"]
          labelSelectors:
            "app.kubernetes.io/instance": "mysql-ha-cluster"
            "kubedb.com/role": "primary"

What this chaos does: runs two branches in parallel. The IO latency starts immediately; the second branch suspends for 30s and then kills the primary pod while it is already degraded.

  • Expected behavior: IO latency slows primary → after 30s the degraded primary is killed → failover elects healthy standby as new primary → cluster Critical → killed pod rejoins → cluster returns to Ready. Despite cascading fault, zero data loss.

  • Actual result: Sysbench saw slow writes, then lost connection at ~30s when primary killed. Pod-2 elected as new primary, pod-1 (old primary) rejoined as standby. Cluster returned to Ready. GTIDs and checksums match across all 3 nodes. PASS.

Apply and run sysbench:

➤ kubectl apply -f tests/32-degraded-failover.yaml
workflow.chaos-mesh.org/mysql-degraded-failover-scenario created
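
While the workflow runs, each template surfaces as a WorkflowNode object with a generated name, so listing them is the easiest way to follow the two branches (a sketch using standard Chaos Mesh resources):

➤ kubectl get workflownodes -n chaos-mesh
➤ kubectl describe workflow mysql-degraded-failover-scenario -n chaos-mesh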

Sysbench shows slow writes during IO latency, then loses connection when pod is killed at ~30s:

[ 10s ] thds: 4 tps: 3.50 lat (ms,95%): 831.46 err/s: 0.00    # IO latency active
[ 20s ] thds: 4 tps: 2.50 lat (ms,95%): 11523.48 err/s: 0.00  # severely degraded
[ 30s ] thds: 4 tps: 3.80 lat (ms,95%): 1235.62 err/s: 0.00
FATAL: Lost connection to MySQL server during query                # pod killed at ~30s

After the workflow completes, the cluster recovers. Failover to pod-2:

➤ kubectl get mysql,pods -n demo -L kubedb.com/role
NAME                                VERSION   STATUS   AGE     ROLE
mysql.kubedb.com/mysql-ha-cluster   8.4.8     Ready    3h51m

NAME                                 READY   STATUS    RESTARTS   AGE   ROLE
pod/mysql-ha-cluster-0               2/2     Running   0          50s   standby
pod/mysql-ha-cluster-1               2/2     Running   0          4m    standby
pod/mysql-ha-cluster-2               2/2     Running   0          3m    primary

➤ SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;
+-----------------------------------------------+-------------+--------------+-------------+
| MEMBER_HOST                                   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+-----------------------------------------------+-------------+--------------+-------------+
| mysql-ha-cluster-2.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | PRIMARY     |
| mysql-ha-cluster-1.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
| mysql-ha-cluster-0.mysql-ha-cluster-pods.demo |        3306 | ONLINE       | SECONDARY   |
+-----------------------------------------------+-------------+--------------+-------------+

# GTIDs — all match ✅
pod-0: 65a93aae-...:1-130155:1000001-1000054
pod-1: 65a93aae-...:1-130155:1000001-1000054
pod-2: 65a93aae-...:1-130155:1000001-1000054

# Checksums — all match ✅
pod-0: sbtest1=740913138, sbtest2=3199164728, sbtest3=1207551779, sbtest4=3950955642
pod-1: sbtest1=740913138, sbtest2=3199164728, sbtest3=1207551779, sbtest4=3950955642
pod-2: sbtest1=740913138, sbtest2=3199164728, sbtest3=1207551779, sbtest4=3950955642

Result: PASS — Cascading failure (IO latency + pod kill) handled gracefully. Failover completed automatically. Zero data loss.

Clean up:

➤ kubectl delete workflow mysql-degraded-failover-scenario -n chaos-mesh

Chaos Testing Results Summary

# | Experiment | Failover | TPS Impact | Data Loss | Verdict
--- | ---------- | -------- | ---------- | --------- | -------
1 | Pod Kill Primary | Yes | Connection lost | Zero | PASS
2 | Pod Failure (5 min) | Yes (8s) | Connection lost | Zero | PASS
3 | Scheduled Pod Kill | Multiple | Intermittent drops | Zero | PASS
4 | Double Primary Kill | Yes (x2) | Connection lost | Zero | PASS
5 | Rolling Restart | Yes (x1) | Brief interruptions | Zero | PASS
6 | Full Cluster Kill | Yes | Cluster down ~2min | Zero | PASS
7 | PVC Delete + Pod Kill | Yes | Rebuild ~2min | Zero | PASS
8 | OOMKill Primary | Yes | Connection lost | Zero | PASS
9 | CPU Stress (98%) | No | 686→212 (~69%) | Zero | PASS
10 | Combined Stress | Yes (OOMKill) | Connection lost | Zero | PASS
11 | OOMKill Natural | No (survived) | 388 TPS sustained | Zero | PASS
12 | Continuous OOM Loop (15× 30s) | Yes | 1012→0 then recovered | Zero | PASS
13 | Network Partition (2 min) | Yes | Connection lost | Zero | PASS
14 | Long Partition (10 min) | Yes | Connection lost | Zero | PASS
15 | Network Latency (1s) | No | 460→0.91 (99.8%) | Zero | PASS
16 | Packet Loss (30%) | No* | 460→2.70 (99.4%) | Zero | PASS
17 | Packet Loss (100%) | Yes | TPS collapsed | Zero | PASS
18 | Packet Duplication (100%) | No | Minor variance only | Zero | PASS
19 | Packet Corruption (100%) | Yes | TPS collapsed | Zero | PASS
20 | Bandwidth Throttle (1mbps) | No | 618→136 (~80%) | Zero | PASS
21 | Bandwidth Throttle (1 bps) | Yes | TPS collapsed | Zero | PASS
22 | IO Latency (100ms) | No | 703→104 (~85%) | Zero | PASS
23 | IO Latency (2 s) | No | Write stall during chaos | Zero | PASS
24 | IO Fault (EIO 50%) | Yes (crash) | Connection lost | Zero | PASS
25 | IO Fault (EIO 100%) | Yes | TPS collapsed | Zero | PASS
26 | File Attribute Override | Yes | Connection lost | Zero | PASS
27 | IO Mistake (READ/WRITE corruption) | Yes | TPS collapsed | Zero on surviving nodes | PASS†
28 | Clock Skew (-5 min) | No | 618→359 (~42%) | Zero | PASS
29 | Clock Skew (-10 min) | Yes (delayed) | Connection lost on time jump | Zero | PASS
30 | DNS Failure | No | 720→360 (~33% avg) | Zero | PASS
31 | Coordinator Crash | No | 691 TPS (no impact) | Zero | PASS
32 | Degraded Failover | Yes | IO latency + crash | Zero | PASS

*Exp 16: UNREACHABLE member state observed but no failover triggered. †Exp 27: corrupted pod requires manual approve-clone (by design — coordinator refuses silent data loss).

All 32 Group Replication experiments PASSED with zero data loss, zero errant GTIDs, and full data consistency across the surviving nodes.

The 32-Experiment Matrix

Every MySQL version and topology was tested against a comprehensive experiment matrix covering single-node failures, resource exhaustion, network degradation, I/O faults, multi-fault scenarios, and advanced recovery tests:

# | Experiment | Chaos Type | What It Tests
--- | ---------- | ---------- | -------------
1 | Pod Kill | PodChaos | Ungraceful termination (grace-period=0)
2 | Pod Failure (5 min) | PodChaos | Long-duration container failure (pod stays in place)
3 | Scheduled Pod Kill | PodChaos / Schedule | Repeated kills every 30s–1min
4 | Double Primary Kill | kubectl delete x2 | Kill primary, then immediately kill newly elected primary
5 | Rolling Restart | kubectl delete x3 | Delete pods one at a time (0→1→2) under write load
6 | Full Cluster Kill | kubectl delete | All 3 pods deleted at once
7 | PVC Delete + Pod Kill | kubectl delete | Destroy pod + persistent storage, rebuild via CLONE
8 | OOMKill | StressChaos | Memory exhaustion beyond pod limits
9 | CPU Stress (98%) | StressChaos | Extreme CPU pressure on nodes
10 | Combined Stress | StressChaos x3 | Memory + CPU + load simultaneously
11 | OOMKill Natural | Load | 128-thread queries to exhaust memory
12 | Continuous OOM Loop | StressChaos x15 | Repeated OOMKills — surfaces errant-GTID handling
13 | Network Partition | NetworkChaos | Isolate primary from secondaries
14 | Long Network Partition | NetworkChaos | 10-minute partition (5x longer than standard)
15 | Network Latency (1s) | NetworkChaos | Replication traffic delays
16 | Packet Loss (30%) | NetworkChaos | Unreliable network across cluster
17 | Packet Loss (100%) | NetworkChaos | Drop every outbound packet from primary
18 | Packet Duplication (100%) | NetworkChaos | Every outbound packet sent twice (TCP discards duplicates)
19 | Packet Corruption (100%) | NetworkChaos | Bit-flip every outbound packet (TCP rejects via checksum)
20 | Bandwidth Throttle (1mbps) | NetworkChaos | Limit primary’s network bandwidth to 1mbps
21 | Bandwidth Throttle (1 bps) | NetworkChaos | Extreme bandwidth limit — primary becomes effectively mute
22 | IO Latency (100ms) | IOChaos | Mild disk I/O delays on primary
23 | IO Latency (2 s) | IOChaos | Severe disk latency, primary remains in role
24 | IO Fault (EIO 50%) | IOChaos | 50% of disk I/O operations return EIO
25 | IO Fault (EIO 100%) | IOChaos | Every disk operation returns EIO
26 | File Attribute Override | IOChaos | Set file mode bits on data files to make them unreadable
27 | IO Mistake (READ/WRITE) | IOChaos | Silent byte corruption — only test producing persistent on-disk damage
28 | Clock Skew (-5 min) | TimeChaos | Shift primary’s system clock back 5 minutes
29 | Clock Skew (-10 min) | TimeChaos | Larger time skew that triggers mysqld crash on time jump
30 | DNS Failure | DNSChaos | Block DNS resolution on primary for 3 minutes
31 | Coordinator Crash | kill PID 1 | Kill coordinator sidecar, mysqld stays running
32 | Degraded Failover | Workflow | IO latency + pod kill in sequence

Data Integrity Validation

Every experiment verified data integrity through 4 checks across all 3 nodes:

  1. GTID Consistency — SELECT @@gtid_executed must match on all nodes after recovery
  2. Checksum Verification — CHECKSUM TABLE on all sysbench tables must match across nodes
  3. Row Count Validation — Cumulative tracking table row counts must be preserved
  4. Errant GTID Detection — No local server_uuid GTIDs outside the group UUID
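
These checks lend themselves to a simple loop over the pods. Below is a minimal sketch of checks 1, 2, and 4, assuming the root password is exported as $PASS and the sysbench tables live in the sbtest schema; it is illustrative rather than the exact harness used for these runs:

#!/usr/bin/env bash
# Illustrative consistency sweep across the three pods
set -euo pipefail

for pod in mysql-ha-cluster-0 mysql-ha-cluster-1 mysql-ha-cluster-2; do
  echo "== $pod =="

  # 1. GTID consistency: these sets must be identical on every pod
  kubectl exec -n demo "$pod" -c mysql -- \
    mysql -uroot -p"$PASS" -N -e "SELECT @@gtid_executed;"

  # 2. Checksum verification: per-table checksums must match across pods
  kubectl exec -n demo "$pod" -c mysql -- \
    mysql -uroot -p"$PASS" -N -e "CHECKSUM TABLE sbtest.sbtest1, sbtest.sbtest2;"

  # 4. Errant GTID detection: every group write is tagged with the group
  #    UUID, so no executed GTID should carry this member's own server_uuid
  uuid=$(kubectl exec -n demo "$pod" -c mysql -- \
    mysql -uroot -p"$PASS" -N -e "SELECT @@server_uuid;")
  gtids=$(kubectl exec -n demo "$pod" -c mysql -- \
    mysql -uroot -p"$PASS" -N -e "SELECT @@gtid_executed;")
  if printf '%s' "$gtids" | grep -qF "$uuid"; then
    echo "ERRANT GTIDs present on $pod"
  fi
done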

Results — Single-Primary Mode

MySQL 9.6.0 — All 12 PASSED

# | Experiment | Failover | Data Loss | Errant GTIDs | Verdict
--- | ---------- | -------- | --------- | ------------ | -------
1 | Pod Kill Primary | Yes | Zero | 0 | PASS
3 | Scheduled Replica Kill | Multiple | Zero | 0 | PASS
6 | Full Cluster Kill | Yes | Zero | 0 | PASS
8 | OOMKill | Yes | Zero | 0 | PASS
9 | CPU Stress (98%) | No | Zero | 0 | PASS
10 | Combined Stress | Yes (OOMKill) | Zero | 0 | PASS
11 | OOMKill Natural | No (survived) | Zero | 0 | PASS
13 | Network Partition | Yes | Zero | 0 | PASS
15 | Network Latency (1s) | No | Zero | 0 | PASS
16 | Packet Loss (30%) | Yes | Zero | 0 | PASS
22 | IO Latency (100ms) | No | Zero | 0 | PASS
32 | Degraded Failover | Yes | Zero | 0 | PASS

MySQL 8.4.8 — All 32 PASSED

# | Experiment | Failover | Data Loss | Errant GTIDs | Verdict
--- | ---------- | -------- | --------- | ------------ | -------
1 | Pod Kill Primary | Yes | Zero | 0 | PASS
2 | Pod Failure (5 min) | Yes (8s) | Zero | 0 | PASS
3 | Scheduled Pod Kill | Multiple | Zero | 0 | PASS
4 | Double Primary Kill | Yes (x2) | Zero | 0 | PASS
5 | Rolling Restart (0→1→2) | Yes (x3) | Zero | 0 | PASS
6 | Full Cluster Kill | Yes | Zero | 0 | PASS
7 | PVC Delete + Pod Kill | Yes | Zero | 0 | PASS
8 | OOMKill | No (survived) | Zero | 0 | PASS
9 | CPU Stress (98%) | No | Zero | 0 | PASS
10 | Combined Stress | Yes (OOMKill) | Zero | 0 | PASS
11 | OOMKill Natural | No (survived) | Zero | 0 | PASS
12 | Continuous OOM Loop | Yes | Zero | 0 | PASS
13 | Network Partition (2 min) | Yes | Zero | 0 | PASS
14 | Long Network Partition (10 min) | Yes | Zero | 0 | PASS
15 | Network Latency (1s) | No | Zero | 0 | PASS
16 | Packet Loss (30%) | No | Zero | 0 | PASS
17 | Packet Loss (100%) | Yes | Zero | 0 | PASS
18 | Packet Duplication (100%) | No | Zero | 0 | PASS
19 | Packet Corruption (100%) | Yes | Zero | 0 | PASS
20 | Bandwidth Throttle (1mbps) | No | Zero | 0 | PASS
21 | Bandwidth Throttle (1 bps) | Yes | Zero | 0 | PASS
22 | IO Latency (100ms) | No | Zero | 0 | PASS
23 | IO Latency (2 s) | No | Zero | 0 | PASS
24 | IO Fault (EIO 50%) | Yes (crash) | Zero | 0 | PASS
25 | IO Fault (EIO 100%) | Yes | Zero | 0 | PASS
26 | File Attribute Override | Yes | Zero | 0 | PASS
27 | IO Mistake (READ/WRITE) | Yes | Zero on surviving | 0 | PASS
28 | Clock Skew (-5 min) | No | Zero | 0 | PASS
29 | Clock Skew (-10 min) | Yes (delayed) | Zero | 0 | PASS
30 | DNS Failure | No | Zero | 0 | PASS
31 | Coordinator Crash | No | Zero | 0 | PASS
32 | Degraded Failover | Yes | Zero | 0 | PASS

MySQL 8.0.36 — All 12 PASSED

# | Experiment | Failover | Data Loss | Errant GTIDs | Verdict
--- | ---------- | -------- | --------- | ------------ | -------
1 | Pod Kill Primary | Yes | Zero | 0 | PASS
3 | Scheduled Replica Kill | Multiple | Zero | 0 | PASS
6 | Full Cluster Kill | Yes | Zero | 0 | PASS
8 | OOMKill | No (survived) | Zero | 0 | PASS
9 | CPU Stress (98%) | No | Zero | 0 | PASS
10 | Combined Stress | Yes (OOMKill) | Zero | 0 | PASS
11 | OOMKill Natural | Yes | Zero | 0 | PASS
13 | Network Partition | Yes | Zero | 0 | PASS
15 | Network Latency (1s) | No | Zero | 0 | PASS
16 | Packet Loss (30%) | Yes | Zero | 0 | PASS
22 | IO Latency (100ms) | No | Zero | 0 | PASS
32 | Degraded Failover | Yes | Zero | 0 | PASS

Results — Multi-Primary Mode (MySQL 8.4.8)

In Multi-Primary mode, all 3 nodes accept writes — there is no primary/replica distinction. This changes the failure dynamics significantly: no failover election is needed, but Paxos consensus must be maintained across all writable nodes.
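
Before running the suite it is worth confirming which topology is active. From any member, the standard GR variable group_replication_single_primary_mode should read OFF in Multi-Primary mode, and every member should report the PRIMARY role:

➤ SELECT @@group_replication_single_primary_mode;
➤ SELECT MEMBER_HOST, MEMBER_ROLE FROM performance_schema.replication_group_members;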

# | Experiment | Data Loss | GTIDs | Checksums | Verdict
--- | ---------- | --------- | ----- | --------- | -------
1 | Pod Kill (random) | Zero | MATCH | MATCH | PASS
3 | Scheduled Pod Kill (every 1 min) | Zero | MATCH | MATCH | PASS
6 | Full Cluster Kill | Zero | MATCH | MATCH | PASS
8 | OOMKill (1200MB stress) | Zero | MATCH | MATCH | PASS
9 | CPU Stress (98%, 3 min) | Zero | MATCH | MATCH | PASS
10 | Combined Stress (mem+cpu+load) | Zero | MATCH | MATCH | PASS
11 | OOMKill Natural (90 JOINs) | Zero | MATCH | MATCH | PASS
13 | Network Partition (3 min) | Zero | MATCH | MATCH | PASS
15 | Network Latency (1s, 3 min) | Zero | MATCH | MATCH | PASS
16 | Packet Loss (30%, 3 min) | Zero | MATCH | MATCH | PASS
22 | IO Latency (100ms, 3 min) | Zero | MATCH | MATCH | PASS
32 | Degraded Failover (IO + Kill) | Zero | MATCH | MATCH | PASS

Failover Performance (Single-Primary)

Scenario | Failover Time | Full Recovery Time
-------- | ------------- | ------------------
Pod Kill Primary | ~2-3 seconds | ~30-33 seconds
OOMKill Primary | ~2-3 seconds | ~30 seconds
Network Partition | ~3 seconds | ~3 minutes
Packet Loss (30%) | ~30 seconds | ~2 minutes
Full Cluster Kill | ~10 seconds | ~1-2 minutes
Combined Stress (OOMKill) | ~3 seconds | ~4 minutes

Performance Impact Under Chaos

Single-Primary Mode

Chaos Type | TPS During Chaos | Reduction from Baseline (~730)
---------- | ---------------- | ------------------------------
IO Latency (100ms) | 2-3.5 | 99.5%
Network Latency (1s) | 1.2-1.4 | 99.8%
CPU Stress (98%) | 1,300-1,370 | ~46%
Packet Loss (30%) | Variable | Triggers failover
IO Fault (EIO 50%) | 703 then crash | Failover triggered
Clock Skew (-5 min) | 404 | ~45%
Bandwidth Throttle (1mbps) | 147 | ~80%
DNS Failure | 497 | ~32%

Multi-Primary Mode

Chaos Type | TPS During Chaos | Impact
---------- | ---------------- | ------
IO Latency (100ms) | 272 | ~73% drop
Network Latency (1s) | 1.57 | 99.9% drop
CPU Stress (98%) | 0 (writes blocked) | Paxos consensus fails
Packet Loss (30%) | 4.98 | 99.6% drop
Combined Stress | ~530 then OOMKill | ~44% drop

Multi-Primary vs Single-Primary

Aspect | Multi-Primary | Single-Primary
------ | ------------- | --------------
Failover needed | No (all primaries) | Yes (election ~1s)
Write availability | All nodes writable | Only primary writable
CPU stress 98% | All writes blocked (Paxos fails) | ~46% TPS reduction
IO latency impact | ~73% TPS drop | ~99.9% TPS drop
Packet loss 30% | 4.98 TPS (stayed ONLINE) | Triggers failover
High concurrency | GR certification conflicts possible | No conflicts (single writer)
Recovery mechanism | Rejoin as PRIMARY | Election + rejoin
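
The certification-conflict row deserves a concrete illustration. When two members accept writes to the same row at the same moment, both pass local locking, and GR's certification then rolls one of them back at commit time. The sketch below is illustrative rather than output captured from these runs; the table and row id are arbitrary:

# Session A on member 0 and session B on member 1, racing on the same row:
UPDATE sbtest.sbtest1 SET k = k + 1 WHERE id = 42;
# The certification loser fails with an error like:
# ERROR 3101 (HY000): Plugin instructed the server to rollback the current transaction.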

Key Takeaways

  1. KubeDB MySQL achieves zero data loss across all 57 Group Replication chaos experiments in Single-Primary and Multi-Primary topologies.

  2. Automatic failover works reliably — primary election completes in seconds, and full recovery typically lands within a few minutes, including after a double primary kill; the heaviest disk-failure scenarios (100% IO fault, attribute override) also recovered automatically, just on a longer timeline.

  3. Multi-Primary mode is production-ready — all 12 experiments passed on MySQL 8.4.8. Be aware that multi-primary has higher sensitivity to CPU stress and network issues due to Paxos consensus requirements on all writable nodes.

  4. Full data rebuild works automatically — even after complete PVC deletion, the CLONE plugin rebuilds a node from scratch in ~90 seconds with zero manual intervention.

  5. Coordinator crash has zero impact — MySQL GR operates independently of the coordinator sidecar. Killing the coordinator does not trigger failover or interrupt writes.

  6. Disk failures trigger safe failover — 50% I/O error rate eventually crashes MySQL, but InnoDB crash recovery + GR distributed recovery handles it with zero data loss after pod restart.

  7. Clock skew and bandwidth limits are tolerated — GR’s Paxos protocol is resilient to 5-minute clock drift (~45% TPS drop, no errors) and 1mbps bandwidth limits (~80% TPS drop, no errors).

  8. Transient GTID mismatches are normal — brief mismatches (15-30 seconds) during recovery are expected and resolve automatically via GR distributed recovery.

What’s Next

  • Multi-Primary testing on additional MySQL versions — extend chaos testing to MySQL 9.6.0 in Multi-Primary mode
  • InnoDB Cluster chaos testing — test InnoDB Cluster with MySQL Router for transparent failover capabilities
  • Long-duration soak testing — extended chaos runs (hours/days) to validate stability under sustained failure injection

Support

To speak with us, please leave a message on our website.

To receive product announcements, follow us on X.

To watch tutorials on various production-grade Kubernetes tools, subscribe to our YouTube channel.

If you have found a bug with KubeDB or want to request new features, please file an issue.

