
Chaos Testing KubeDB Managed PostgreSQL with Chaos-Mesh
Setup Cluster
To follow along with this tutorial, you will need:
- A running Kubernetes cluster.
- KubeDB installed in your cluster.
- kubectl command-line tool configured to communicate with your cluster.
- Chaos-Mesh installed in your cluster.
helm upgrade -i chaos-mesh chaos-mesh/chaos-mesh \
  -n chaos-mesh \
  --create-namespace \
  --set dashboard.create=true \
  --set dashboard.securityMode=false \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set chaosDaemon.privileged=true
Note: Make sure to set the correct container runtime and runtime socket path in the command above. For example, for containerd use chaosDaemon.socketPath=/run/containerd/containerd.sock; on k3s, set chaosDaemon.socketPath=/run/k3s/containerd/containerd.sock.
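If you are not sure which runtime your nodes use, you can query it and map it to the usual default socket path. This is only a sketch: the RUNTIME value below is a hardcoded sample standing in for the output of the kubectl command shown in the comment, and the socket paths are common defaults that may differ on your distribution (e.g. k3s).

```shell
# Sample value; on a live cluster, fetch it with:
#   kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.containerRuntimeVersion}'
RUNTIME="containerd://1.7.2"

# Map the runtime name (the part before "://") to its usual default socket path
case "${RUNTIME%%:*}" in
  containerd) echo "/run/containerd/containerd.sock" ;;
  docker)     echo "/var/run/docker.sock" ;;
  cri-o)      echo "/var/run/crio/crio.sock" ;;
esac
# → /run/containerd/containerd.sock
```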
Verify KubeDB and Chaos-Mesh Installation
➤ kubectl get pods -n kubedb
NAME READY STATUS RESTARTS AGE
kubedb-kubedb-autoscaler-0 1/1 Running 0 24d
kubedb-kubedb-ops-manager-0 1/1 Running 0 22d
kubedb-kubedb-provisioner-0 1/1 Running 0 146m
kubedb-kubedb-webhook-server-699bf949df-24w5k 1/1 Running 0 146m
kubedb-operator-shard-manager-77c8df4946-4gwhc 1/1 Running 0 146m
kubedb-petset-869495bb7f-2cln2 1/1 Running 0 146m
kubedb-sidekick-794cf489b4-t9rgf 1/1 Running 0 146m
---
➤ kubectl get pods -n chaos-mesh
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-7d44db47fb-4cwc9 1/1 Running 0 3d17h
chaos-controller-manager-7d44db47fb-cqvf7 1/1 Running 0 3d15h
chaos-controller-manager-7d44db47fb-x4xnt 1/1 Running 0 3d17h
chaos-daemon-f779s 1/1 Running 0 3d17h
chaos-dashboard-6855b9d4c-phkht 1/1 Running 0 4d1h
chaos-dns-server-85b8846dc9-ngcwm 1/1 Running 0 4d1h
Introduction to Chaos Engineering
Chaos Engineering is a disciplined approach to testing distributed systems by deliberately introducing controlled failure scenarios to discover vulnerabilities and weaknesses before they impact your users. Rather than waiting for production incidents, chaos engineering proactively identifies how your system behaves under adverse conditions—such as pod failures, network outages, resource exhaustion, and data corruption.
This methodology is particularly crucial for database systems, where failures can lead to data loss, service downtime, and compromised data consistency. By testing these scenarios in controlled environments, you gain confidence that your system can recover gracefully and maintain availability.
What This Blog Covers
In this comprehensive guide, we will:
- Deploy a Highly Available PostgreSQL Cluster on Kubernetes using KubeDB, configured with replication and automatic failover capabilities
- Run 16+ Chaos Engineering Experiments using Chaos-Mesh to simulate real-world failure scenarios
- Observe Cluster Behavior during failures including pod crashes, network issues, resource exhaustion, and disk I/O errors
- Measure Resilience by tracking data consistency, failover speed, and recovery capabilities
- Learn Best Practices for configuring PostgreSQL replication and failover strategies to maximize availability
Each experiment progressively tests different aspects of the system—from simple pod failures to complex scenarios involving multiple simultaneous failures. By the end, you’ll have a thorough understanding of how your PostgreSQL cluster behaves under various failure modes and how to configure it for maximum resilience.
You can see the Chaos Testing Results Summary for a quick view of what we have done in this blog.
Create a High-Availability PostgreSQL Cluster
First, we need to deploy a PostgreSQL cluster configured for High Availability. Unlike a Standalone instance, a HA cluster consists of a primary pod and one or more standby pods that are ready to take over if the leader fails.
Save the following YAML as setup/pg-ha-cluster.yaml. This manifest defines a 3-node PostgreSQL cluster with streaming replication enabled.
apiVersion: kubedb.com/v1
kind: Postgres
metadata:
  name: pg-ha-cluster
  namespace: demo
spec:
  clientAuthMode: md5
  deletionPolicy: Delete
  podTemplate:
    spec:
      containers:
        - name: postgres
          resources:
            limits:
              memory: 3Gi
            requests:
              cpu: 2
              memory: 2Gi
  replicas: 3
  replication:
    walKeepSize: 5000
    walLimitPolicy: WALKeepSize
    # forceFailoverAcceptingDataLossAfter: 30s # uncomment this if you want to accept data loss during failover, but want to have minimal downtime.
    standbyMode: Hot
  storage:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 50Gi
  storageType: Durable
  version: "16.4"
Important Notes:
- We have set walLimitPolicy to WALKeepSize and walKeepSize to 5000, which means the cluster will retain 5000 MB of WAL files. If your write volume is very high, you may want to increase this value; we suggest setting it to at least 15 - 30% of your storage size.
- If you can tolerate some data loss but want your primary up and running at all times with minimal downtime, you can set .spec.replication.forceFailoverAcceptingDataLossAfter: 30s.
- You can read/write to your database in both Ready and Critical states. So even if your database is in Critical state, your uptime is not compromised. Critical means one or more replicas are offline, but the primary is up and running, likely along with some other replicas.
- All the results/metrics shown in this blog relate to the chaos scenarios. In general, a failover takes ~5 seconds with no data loss, ensuring high availability and data safety.
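Following the 15 - 30% guideline above, here is a quick way to size walKeepSize (in MB, as the field expects) for the 50Gi volume in this manifest. A sketch only; the manifest uses 5000 MB (~10%), which is fine for moderate write loads:

```shell
STORAGE_GI=50                       # .spec.storage request from the manifest
STORAGE_MB=$((STORAGE_GI * 1024))   # 51200 MB
LOW=$((STORAGE_MB * 15 / 100))      # 15% → 7680 MB
HIGH=$((STORAGE_MB * 30 / 100))     # 30% → 15360 MB
echo "suggested walKeepSize: ${LOW}-${HIGH} MB"
# → suggested walKeepSize: 7680-15360 MB
```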
Now, create the namespace and apply the manifest:
# Create the namespace if it doesn't exist
kubectl create ns demo
# Apply the manifest to deploy the cluster
kubectl apply -f setup/pg-ha-cluster.yaml
You can monitor the status until all pods are ready:
watch kubectl get pg,petset,pods -n demo
See that the database status is Ready.
➤ kubectl get pg,petset,pods -n demo
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 4m45s
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 4m41s
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 4m41s
pod/pg-ha-cluster-1 2/2 Running 0 2m45s
pod/pg-ha-cluster-2 2/2 Running 0 2m39s
Inspect who is primary and who is standby.
# you can inspect who is primary
# and who is secondary like below
➤ kubectl get pods -n demo --show-labels | grep role
pg-ha-cluster-0 2/2 Running 0 20m app.kubernetes.io/component=database,app.kubernetes.io/instance=pg-ha-cluster,app.kubernetes.io/managed-by=kubedb.com,app.kubernetes.io/name=postgreses.kubedb.com,apps.kubernetes.io/pod-index=0,controller-revision-hash=pg-ha-cluster-6c5954fd77,kubedb.com/role=primary,statefulset.kubernetes.io/pod-name=pg-ha-cluster-0
pg-ha-cluster-1 2/2 Running 0 19m app.kubernetes.io/component=database,app.kubernetes.io/instance=pg-ha-cluster,app.kubernetes.io/managed-by=kubedb.com,app.kubernetes.io/name=postgreses.kubedb.com,apps.kubernetes.io/pod-index=1,controller-revision-hash=pg-ha-cluster-6c5954fd77,kubedb.com/role=standby,statefulset.kubernetes.io/pod-name=pg-ha-cluster-1
pg-ha-cluster-2 2/2 Running 0 18m app.kubernetes.io/component=database,app.kubernetes.io/instance=pg-ha-cluster,app.kubernetes.io/managed-by=kubedb.com,app.kubernetes.io/name=postgreses.kubedb.com,apps.kubernetes.io/pod-index=2,controller-revision-hash=pg-ha-cluster-6c5954fd77,kubedb.com/role=standby,statefulset.kubernetes.io/pod-name=pg-ha-cluster-2
The pod having kubedb.com/role=primary is the primary and kubedb.com/role=standby are the standbys.
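Grepping the --show-labels output works, but a label selector is tidier. Both approaches are sketched below; the pod lines are a trimmed, hypothetical sample so the snippet is self-contained:

```shell
# In a live cluster, the cleaner equivalent is:
#   kubectl get pods -n demo -l kubedb.com/role=primary -o name
SAMPLE='pg-ha-cluster-0 2/2 Running 0 20m kubedb.com/role=primary
pg-ha-cluster-1 2/2 Running 0 19m kubedb.com/role=standby
pg-ha-cluster-2 2/2 Running 0 18m kubedb.com/role=standby'

# Same filter as the grep/awk pipeline above, applied to the sample lines
printf '%s\n' "$SAMPLE" | awk '/kubedb.com\/role=primary/ {print $1}'
# → pg-ha-cluster-0
```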
Chaos Testing
We will run some chaos experiments to see how our cluster behaves under failure scenarios like OOM kill, network latency, network partition, IO latency, and IO fault. We will use a PostgreSQL client application to simulate high write and read load on the cluster.
PostgreSQL High Write/Read Load Client
You can apply the following YAMLs to create a client application that continuously writes to and reads from the database. This helps us see how the cluster behaves under load and during chaos scenarios. Make sure you change the database password in the Secret YAML below.
# k8s/01-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pg-load-test-config
  namespace: demo
  labels:
    app: pg-load-test
data:
  # Test Duration (in seconds)
  TEST_RUN_DURATION: "400"
  # Concurrency Settings
  CONCURRENT_WRITERS: "20"
  # Workload Distribution (must sum to 100)
  READ_PERCENT: "80"
  INSERT_PERCENT: "10"
  UPDATE_PERCENT: "10"
  # Batch Sizes
  BATCH_SIZE: "100"
  READ_BATCH_SIZE: "100"
  # Database Settings
  TABLE_NAME: "load_test_data"
  # Connection Pool Settings
  MAX_OPEN_CONNS: "60"
  MAX_IDLE_CONNS: "10"
  CONN_MAX_LIFETIME: "300"
  # Connection Safety
  MIN_FREE_CONNS: "5"
  # Reporting
  REPORT_INTERVAL: "20"
---
# k8s/02-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: pg-load-test-secret
  namespace: demo
  labels:
    app: pg-load-test
type: Opaque
data:
  # Base64 encoded database credentials
  # Replace these with your actual base64-encoded values
  # Example: echo -n "your-postgres-host" | base64
  DB_HOST: cGctaGEtY2x1c3Rlci5kZW1vLnN2Yy5jbHVzdGVyLmxvY2Fs
  # Example: echo -n "5432" | base64
  DB_PORT: NTQzMg==
  # Example: echo -n "postgres" | base64
  DB_USER: cG9zdGdyZXM=
  # Example: echo -n "your-password" | base64
  # IMPORTANT: Replace this with your actual password
  DB_PASSWORD: NihrMkohSXVYdChGSSpmSg==
  # Example: echo -n "postgres" | base64
  DB_NAME: cG9zdGdyZXM=
---
# How to encode your credentials:
# echo -n "127.0.0.1" | base64
# echo -n "5678" | base64
# echo -n "postgres" | base64
# echo -n "CIX6TzfTYFn8~pj4" | base64
# echo -n "postgres" | base64
---
# k8s/03-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pg-load-test-job
  namespace: demo
  labels:
    app: pg-load-test
    version: v2
spec:
  completions: 1
  backoffLimit: 0
  ttlSecondsAfterFinished: 86400
  template:
    metadata:
      labels:
        app: pg-load-test
        version: v2
    spec:
      restartPolicy: Never
      containers:
        - name: load-test
          # Replace with your image registry and tag
          image: souravbiswassanto/high-write-load-client:v0.0.0
          imagePullPolicy: Always
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          envFrom:
            - configMapRef:
                name: pg-load-test-config
            - secretRef:
                name: pg-load-test-secret
          volumeMounts:
            - name: results
              mountPath: /results
      volumes:
        - name: results
          persistentVolumeClaim:
            claimName: pg-load-test-results
---
# k8s/04-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-load-test-results
  namespace: demo
  labels:
    app: pg-load-test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
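The credentials in the Secret above are base64-encoded. Here is a quick round trip (with a made-up password, not a value from this blog) to sanity-check your encoded values before pasting them in:

```shell
PASSWORD='my-secret-password'   # hypothetical value; use your real one

# Encode (printf '%s' / echo -n: no trailing newline must sneak into the value)
ENCODED=$(printf '%s' "$PASSWORD" | base64)
echo "$ENCODED"
# → bXktc2VjcmV0LXBhc3N3b3Jk

# Decode to verify the round trip
printf '%s' "$ENCODED" | base64 -d
# → my-secret-password
```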
As a standard mix, we will use 10% insert, 10% update, and 80% read operations. In 5 minutes of high load, it should generate around 30 GB of data, with more than 3M rows inserted and more than 30M rows read.
Note: If you do not want to generate this much data, you can reduce the INSERT_PERCENT and BATCH_SIZE values.
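Since the ConfigMap requires the workload percentages to sum to 100, a small guard before applying is cheap insurance. The values below mirror the ConfigMap above:

```shell
READ_PERCENT=80
INSERT_PERCENT=10
UPDATE_PERCENT=10

TOTAL=$((READ_PERCENT + INSERT_PERCENT + UPDATE_PERCENT))
if [ "$TOTAL" -eq 100 ]; then
  echo "workload distribution OK"
else
  echo "workload distribution must sum to 100, got $TOTAL" >&2
  exit 1
fi
# → workload distribution OK
```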
Save the above yamls. Then make a script like below:
➤ cat run-k8s.sh
#!/usr/bin/env bash
kubectl delete -f k8s/03-job.yaml
kubectl delete -f k8s/04-pvc.yaml
kubectl apply -f k8s/01-configmap.yaml
kubectl apply -f k8s/03-job.yaml
kubectl apply -f k8s/04-pvc.yaml
Run the script to start the load test.
chmod +x run-k8s.sh
./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config configured
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
I have attached a sample output of the load test job below. These metrics will be printed after every REPORT_INTERVAL seconds. You can see that we are generating around 38GB of data, more than 4M rows inserted, more than 32M rows read in 7 minutes of high load.
Test Duration: 7m3s
-----------------------------------------------------------------
Cumulative Statistics:
Total Operations: 408454 (Reads: 326500, Inserts: 40908, Updates: 41046)
Total Number of Rows Reads: 32650000, Inserts: 4090800, Updates: 41046
Total Errors: 0
Total Data Transferred: 38187.80 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 0.00 (Reads: 0.00/s, Inserts: 0.00/s, Updates: 0.00/s)
Throughput: 0.00 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 12.097ms, P95: 83.291ms, P99: 100.506ms
Inserts - Avg: 58.51ms, P95: 146.231ms, P99: 218.178ms
Updates - Avg: 37.444ms, P95: 100.994ms, P99: 192.838ms
-----------------------------------------------------------------
Connection Pool:
Active: 13, Max: 100, Available: 87
=================================================================
=================================================================
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 4140800
Records Found in DB: 4140800
I0406 04:26:53.097674 1 load_generator_v2.go:555] Total records in table: 4140800
I0406 04:26:53.097700 1 load_generator_v2.go:556] totalRows in LoadGenerator: 4140800
Records Lost: 0
Data Loss Percentage: 0.00%
=================================================================
No data loss detected - all inserted records are present in database
Cleaning up test data...
Cleaning up test table...
Cleanup completed
Test data table deleted successfully
Test completed successfully!
You can see these logs by running the kubectl logs -n demo job/pg-load-test-job command.
With this load on the cluster, we are ready to run some chaos experiments and see how our cluster behaves under failure scenarios.
Chaos#1: Kill the Primary Pod
We will skip the load test for this experiment.
We are about to kill the primary pod and see how fast the failover happens. We will use Chaos-Mesh to do this. You can also do this manually by running kubectl delete pod command, but using Chaos-Mesh will give you more insights about the failover process.
Save this yaml as tests/01-pod-kill.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pg-primary-pod-kill
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "pg-ha-cluster"
      "kubedb.com/role": "primary"
  gracePeriod: 0
  duration: "30s"
What this chaos does: Terminates the primary pod abruptly, forcing an immediate failover to a standby replica.
We are selecting the primary pod using label selector and killing it. The duration field specifies how long the chaos will last. In this case, we are killing the primary pod for 30 seconds.
Our expectation is that within 30 seconds, the primary pod will be killed, and one of the standby pods will be promoted to primary. The killed pod will be brought back by our PetSet operator and will join the cluster as a standby.
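To put a number on how fast the failover happens, you can poll the primary label while the chaos runs. This is a sketch: get_primary is stubbed with a fixed name so the snippet is self-contained, but in a real cluster its body would be the kubectl query shown in the comment.

```shell
# Stub for illustration; in a live cluster, use instead:
#   kubectl get pods -n demo -l kubedb.com/role=primary -o jsonpath='{.items[0].metadata.name}'
get_primary() { echo "pg-ha-cluster-1"; }

OLD="pg-ha-cluster-0"   # primary observed before applying the chaos
START=$(date +%s)
# Poll until the primary label moves to a different pod
while [ "$(get_primary)" = "$OLD" ]; do
  sleep 1
done
echo "failover took $(( $(date +%s) - START ))s (new primary: $(get_primary))"
```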
Before running, let’s see who is the primary
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{print $1}'
pg-ha-cluster-0
Now run watch kubectl get pg,petset,pods -n demo.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 09:36:19 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 2d15h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 2d15h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 3m44s
pod/pg-ha-cluster-1 2/2 Running 0 59s
pod/pg-ha-cluster-2 2/2 Running 0 57s
While watching the pods, run the chaos experiment.
kubectl apply -f tests/01-pod-kill.yaml
podchaos.chaos-mesh.org/pg-primary-pod-kill created
kubectl get pg,petset,pods -n demo
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Critical 2d15h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 2d15h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 1 (8s ago) 10s
pod/pg-ha-cluster-1 2/2 Running 0 3m36s
pod/pg-ha-cluster-2 2/2 Running 0 3m34s
Note the RESTARTS column; you can see the primary pod was killed 8 seconds ago, and the failover happened almost immediately. The database state is now Critical, which means your new primary is ready to accept connections, but one or more of your replicas are not ready. The old primary will be ready after chaos.spec.duration seconds, which is 30 seconds in our case.
Let’s see who is the new primary.
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{print $1}'
pg-ha-cluster-1
Now wait some time and you should see the old primary is back and the database state is Ready again.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 09:39:50 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 2d15h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 2d15h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 1 (62s ago) 64s
pod/pg-ha-cluster-1 2/2 Running 0 4m30s
pod/pg-ha-cluster-2 2/2 Running 0 4m28s
Now let’s clean up the chaos experiment.
kubectl delete -f tests/01-pod-kill.yaml
podchaos.chaos-mesh.org "pg-primary-pod-kill" deleted
Chaos#2: OOMKill the Primary Pod
Now we are going to OOMKill the primary pod. This is a more realistic scenario than just killing the pod, because in real life, your primary pod might get OOMKilled due to high memory usage.
Save this yaml as tests/02-oomkill.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: pg-primary-oom
  namespace: chaos-mesh
spec:
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "pg-ha-cluster"
      "kubedb.com/role": "primary"
  stressors:
    memory:
      workers: 1
      size: "5000MB" # Exceed the 3Gi limit to trigger OOM
  duration: "10m"
What this chaos does: Allocates excessive memory on the primary pod to exceed its limits, triggering an OOMKill that forces failover.
Before running this, we will run the load test job.
./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config configured
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
We can see the database is in ready state while the load test job is running.
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 2d16h
---------------------------------------------------------------
pod/pg-load-test-job-z8bxf 1/1 Running 0 22s
Let’s see the log from the load test job:
➤ kubectl logs -f -n demo job/pg-load-test-job
Test Duration: 43s
-----------------------------------------------------------------
Cumulative Statistics:
Total Operations: 70123 (Reads: 55952, Inserts: 7053, Updates: 7118)
Total Number of Rows Reads: 5595200, Inserts: 705300, Updates: 7118
Total Errors: 0
Total Data Transferred: 6548.86 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 1526.62 (Reads: 1219.18/s, Inserts: 158.02/s, Updates: 149.42/s)
Throughput: 143.24 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 5.042ms, P95: 27.845ms, P99: 63.214ms
Inserts - Avg: 50.112ms, P95: 128.465ms, P99: 274.199ms
Updates - Avg: 22.783ms, P95: 87.802ms, P99: 211.079ms
-----------------------------------------------------------------
Connection Pool:
Active: 20, Max: 100, Available: 80
=================================================================
Now run the chaos experiment.
kubectl apply -f tests/02-oomkill.yaml
stresschaos.chaos-mesh.org/pg-primary-oom created
Now you should see the primary pod is OOMKilled and the failover happens. The database state will be Critical during the failover and will be Ready again after the old primary is back as standby.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 10:47:30 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Critical 2d16h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 2d16h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 54m
pod/pg-ha-cluster-1 2/2 Running 1 (3s ago) 56m # NOTE: This shows the Restarts counter. It indicates that the pod is OOMKilled and restarted by Kubernetes
pod/pg-ha-cluster-2 2/2 Running 0 54m
pod/pg-load-test-job-z8bxf 1/1 Running 0 113s
You can check the status of the chaos experiment by running the kubectl get stresschaos -n chaos-mesh pg-primary-oom -o yaml command.
...
status:
conditions:
- status: "True"
type: Selected
- status: "False"
type: AllInjected
- status: "True" # All chaos recovered
type: AllRecovered
- status: "False"
type: Paused
Now after some time, you should see the old primary is back and the database state is Ready again.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 10:48:18 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 2d16h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 2d16h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 55m
pod/pg-ha-cluster-1 2/2 Running 1 (51s ago) 57m
pod/pg-ha-cluster-2 2/2 Running 0 55m
pod/pg-load-test-job-z8bxf 1/1 Running 0 2m41s
Now check the data loss report from the load test job logs once the test is completed.
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 4095300
I0406 04:52:42.162937 1 load_generator_v2.go:555] Total records in table: 4095300
I0406 04:52:42.162960 1 load_generator_v2.go:556] totalRows in LoadGenerator: 4095300
Records Found in DB: 4095300
Records Lost: 0
Data Loss Percentage: 0.00%
=================================================================
No data loss detected - all inserted records are present in database
Cleaning up test data...
Cleaning up test table...
Cleanup completed
Test data table deleted successfully
Test completed successfully!
Clean up the chaos experiment.
kubectl delete -f tests/02-oomkill.yaml
stresschaos.chaos-mesh.org "pg-primary-oom" deleted
Chaos#3: Kill Postgres process in the Primary Pod
Now we are going to kill the postgres process in the primary pod. Save this yaml as tests/03-kill-postgres-process.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pg-kill-postgres-process
  namespace: chaos-mesh
spec:
  action: container-kill
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "pg-ha-cluster"
      "kubedb.com/role": "primary"
  containerNames:
    - postgres
  duration: "30s"
What this chaos does: Forcefully terminates the PostgreSQL process in the primary container, simulating a database crash without pod termination.
Create the load test job. I will shorten the load test duration to 1 minute, since this chaos experiment is generally shorter: change TEST_RUN_DURATION to "60" in the ConfigMap YAML and apply all the YAMLs again.
./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config configured
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
pod/pg-load-test-job-79k9p 1/1 Running 0 10s # NOTE the load test job is running
Now run the chaos experiment.
kubectl apply -f tests/03-kill-postgres-process.yaml
podchaos.chaos-mesh.org/pg-kill-postgres-process created
As soon as you run the chaos experiment, you should see the postgres container in the primary pod get killed. A failover may or may not happen, depending on the possibility of data loss. If all the replicas were synced up with the primary before it went down, a failover happens immediately. Conversely, if there was some lag between the primary and the replicas, there is a possibility of data loss, and in that case a failover will not happen until the primary is back and the replicas have caught up.
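The failover decision just described boils down to comparing WAL positions. Below is a toy sketch with made-up LSN byte offsets; the operator works with the real WAL positions (what pg_current_wal_lsn() and the standby's replayed LSN report), not these constants:

```shell
# Hypothetical WAL positions, expressed as byte offsets
PRIMARY_LSN=5000200
STANDBY_LSN=5000200   # equal → the standby has replayed everything

if [ "$STANDBY_LSN" -ge "$PRIMARY_LSN" ]; then
  echo "no replication lag: safe to fail over immediately"
else
  echo "standby lagging by $((PRIMARY_LSN - STANDBY_LSN)) bytes: wait for old primary to avoid data loss"
fi
# → no replication lag: safe to fail over immediately
```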
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 11:15:07 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 NotReady 2d17h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 2d17h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 81m
pod/pg-ha-cluster-1 2/2 Running 2 (9s ago) 84m
pod/pg-ha-cluster-2 2/2 Running 0 81m
pod/pg-load-test-job-79k9p 1/1 Running 0 39s
You can see the postgres container was killed and restarted by Kubernetes. A failover was not performed and the database state is NotReady. The reason the database didn't go Ready is that Chaos-Mesh killed the postgres process immediately, without giving the standbys time to receive the last WAL the primary generated under high load. A failover in this situation could lose data, so it is not performed, to protect your data. However, there are APIs with which you can force a failover in this case as well.
Now wait some time and you should see the old primary is back and the database state is Ready again.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 11:15:32 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 2d17h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 2d17h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 82m
pod/pg-ha-cluster-1 2/2 Running 2 (35s ago) 84m
pod/pg-ha-cluster-2 2/2 Running 0 82m
pod/pg-load-test-job-79k9p 1/1 Running 0 65s
Now check the data loss report from the load test job logs once the test is completed.
Cumulative Statistics:
Total Operations: 83211 (Reads: 66607, Inserts: 8355, Updates: 8249)
Total Number of Rows Reads: 6660700, Inserts: 835500, Updates: 8249
Total Errors: 19548
Total Data Transferred: 7790.99 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 1298.86 (Reads: 974.14/s, Inserts: 259.77/s, Updates: 64.94/s)
Throughput: 129.14 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 5.366ms, P95: 30.093ms, P99: 72.567ms
Inserts - Avg: 53.477ms, P95: 135.148ms, P99: 238.446ms
Updates - Avg: 31.277ms, P95: 99.222ms, P99: 202.694ms
-----------------------------------------------------------------
Connection Pool:
Active: 14, Max: 100, Available: 86
=================================================================
=================================================================
Performance Summary:
Average Throughput: 1327.47 operations/sec
Read Operations: 66607 (1062.59/sec avg)
Insert Operations: 8355 (133.29/sec avg)
Update Operations: 8249 (131.60/sec avg)
Error Rate: 19.0232%
Total Data Transferred: 7.61 GB
=================================================================
=================================================================
Checking for Data Loss...
=================================================================
=================================================================
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 885500
Records Found in DB: 885500
Records Lost: 0
Data Loss Percentage: 0.00%
=================================================================
No data loss detected - all inserted records are present in database
Cleaning up test data...
Cleaning up test table...
I0406 05:15:34.394443 1 load_generator_v2.go:555] Total records in table: 885500
I0406 05:15:34.394469 1 load_generator_v2.go:556] totalRows in LoadGenerator: 885500
Cleanup completed
Test data table deleted successfully
Test completed successfully!
Clean up the chaos experiment.
kubectl delete -f tests/03-kill-postgres-process.yaml
podchaos.chaos-mesh.org "pg-kill-postgres-process" deleted
Chaos#4: Primary Pod Failure
In this experiment, we are going to simulate a complete, prolonged failure of the primary pod. This is a more extreme scenario than just killing the pod or the postgres process, because the pod stays unavailable for the entire chaos duration.
Save this yaml as tests/04-pod-failure.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pg-primary-pod-failure
  namespace: chaos-mesh
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "pg-ha-cluster"
      "kubedb.com/role": "primary"
  duration: "5m"
What this chaos does: Replaces the pod's entrypoint so the postgres process cannot run, leaving the pod unavailable for the entire chaos duration.
NOTE: Chaos-Mesh will simulate a pod failure for .spec.duration amount of time; in our case, 5 minutes. As this simulates the complete failure of a pod for 5 minutes, our database will be in either a NotReady or Critical state for that period. Once this chaos is Recovered, the database will move back to the Ready state automatically.
We will skip the load test for this experiment, as in Chaos#1.
Before running this, let’s examine the database state.
➤ kubectl get pg -n demo
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 2d17h
---------------------------------------------------------------
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{print $1}'
pg-ha-cluster-1 # Primary pod
See the primary pod is in running state.
pod/pg-ha-cluster-0 2/2 Running 0 102m
pod/pg-ha-cluster-1 2/2 Running 2 (21m ago) 105m
pod/pg-ha-cluster-2 2/2 Running 0 102m
Now run the chaos experiment.
kubectl apply -f tests/04-pod-failure.yaml
podchaos.chaos-mesh.org/pg-primary-pod-failure created
See that the database went into NotReady state. Now, based on the possibility of data loss, a failover will either happen or be prevented.
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 NotReady 2d17h
A failover happened immediately, as there was no possibility of data loss. The database is now in Critical state, which means the new primary is ready to accept connections, but one or more replicas are not ready; in this case, the old primary is not ready. The old primary will become ready after chaos.spec.duration, when the chaos is recovered, which is 5 minutes in our case.
postgres.kubedb.com/pg-ha-cluster 16.4 Critical 2d17h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 2d17h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 103m
pod/pg-ha-cluster-1 2/2 Running 2 (22m ago) 106m
pod/pg-ha-cluster-2 2/2 Running 0 103m
Let’s see who is the new primary.
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{print $1}'
pg-ha-cluster-0
Now let’s wait 5 minutes and follow the status of the chaos experiment by running the kubectl get podchaos -n chaos-mesh pg-primary-pod-failure -o yaml command.
status:
conditions:
- status: "False"
type: Paused
- status: "True"
type: Selected
- status: "False"
type: AllInjected
- status: "True"
type: AllRecovered
If the AllRecovered condition is True, the chaos experiment has recovered; you should now see the old primary back and the database state Ready again.
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 2d17h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 2d17h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 106m
pod/pg-ha-cluster-1 2/2 Running 4 (10m ago) 110m
pod/pg-ha-cluster-2 2/2 Running 0 106m
Clean up the chaos experiment.
kubectl delete -f tests/04-pod-failure.yaml
podchaos.chaos-mesh.org "pg-primary-pod-failure" deleted
Chaos#5: Network Partition Primary Pod
NOTE: The only way to avoid data loss in the network-partition case is to use synchronous replication. You can do this by setting db.spec.streamingMode: Synchronous. With synchronous replication, there won't be any data loss.
Caution: This experiment can cause data loss if you are using asynchronous replication, so run it carefully and only in non-production environments.
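If you want to opt into synchronous replication as the note above suggests, the change is a single field on the Postgres object. A sketch of the relevant fragment; merge it into the manifest from earlier and re-apply:

```yaml
# Fragment of the Postgres spec; the default when unset is asynchronous
spec:
  streamingMode: Synchronous
```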
In this experiment, we simulate a network partition affecting the primary pod in a PostgreSQL cluster.
Let’s say we have a cluster with 3 nodes: one primary and two standbys. Now we are going to create a network partition between the primary and the standby pods. After the split, the primary will be in the minority partition and the standbys will be in the majority partition.
Cluster (3 nodes)
-----------------
| Partition A | Partition B |
|---------------|-----------------|
| primary-0 | standby-1 |
| | standby-2 |
The primary will keep running as primary in the minority partition, and one of the standbys will be promoted to primary in the majority partition. Because the majority quorum can't reach the primary across the partition, it considers the primary down and promotes one of the standbys to primary via leader election.
After Split
-----------
| Partition A | Partition B |
|--------------------|--------------------|
| primary-0 (active) | standby-1 → primary|
| | standby-2 |
Partition Check
---------------
| Partition A | Nodes: 1 | No quorum |
| Partition B | Nodes: 2 | Has quorum |
We will detect this situation and shut down the primary in the minority partition to avoid data loss as much as possible.
Safe Outcome
------------
| Partition A | Partition B |
|--------------------|--------------------|
| primary-0 (stopped)| standby-1 → primary|
| | standby-2 |
But again, there is a data loss window, which is generally small (30 seconds to 1 minute). How much data might be lost depends on your write load during that time; it might be none if there was no write load at all.
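The partition check above boils down to a strict-majority quorum test. A minimal sketch of that decision logic in Python (illustrative only, not KubeDB's actual implementation):

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """A partition may keep or elect a primary only if it holds a strict majority."""
    return reachable_nodes > cluster_size // 2

# 3-node cluster split 1 vs 2, as in the diagrams above
cluster_size = 3
partition_a = 1  # the isolated primary
partition_b = 2  # the two standbys

print(has_quorum(partition_a, cluster_size))  # False: primary must be shut down
print(has_quorum(partition_b, cluster_size))  # True: may promote a new primary
```

Note that a strict majority also means a 2+2 split of a 4-node cluster has no quorum on either side, which is one reason odd cluster sizes are preferred.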
Now save this YAML as tests/05-network-partition.yaml. We will test this scenario in both asynchronous and synchronous replication modes and compare the results.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pg-primary-network-partition
  namespace: chaos-mesh
spec:
  action: partition
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "pg-ha-cluster"
      "kubedb.com/role": "primary"
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        "app.kubernetes.io/instance": "pg-ha-cluster"
        "kubedb.com/role": "standby"
  direction: both
  duration: "4m"
What this chaos does: Blocks network connectivity between the primary pod and all standby pods, forcing a split-brain scenario where standbys promote a new primary in their partition while the isolated primary continues running.
Let's first test on the current Postgres, which is running in asynchronous replication mode. It is the default mode if you have not set anything in the .spec.streamingMode field of the Postgres object.
Now let's apply the load test job, but first modify some config values.
BATCH_SIZE: "100"
TEST_RUN_DURATION: "600" # updated this, 10 minutes
INSERT_PERCENT: "1" # let's put some realistic write load, 1% of the operations will be insert
UPDATE_PERCENT: "19" # 19% of the operations will be update, so total write load is 20% which is quite high for postgres. We want to see some data loss in this case
CONCURRENT_WRITERS: "10" # Reduce the concurrent writers
Now,
./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config configured
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
Before running this experiment, let's examine the database state.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 13:21:23 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 2d19h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 2d19h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 102s
pod/pg-ha-cluster-1 2/2 Running 0 99s
pod/pg-ha-cluster-2 2/2 Running 0 96s
pod/pg-load-test-job-ztb94 1/1 Running 0 12s
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{print $1}'
pg-ha-cluster-1 # Primary pod
Now let’s go ahead and run the chaos experiment.
➤ kubectl apply -f tests/05-network-partition.yaml
networkchaos.chaos-mesh.org/pg-primary-network-partition created
Your database will stay in Ready state until we detect the network partition. Once it is detected, we shut down the primary in the minority partition. So you will see the database in Ready state for some time, and then it will go to NotReady or Critical state based on some other criteria.
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 2d19h
After some time, you should see the database in NotReady state, as we detected the network partition and shut down the primary in the minority partition to avoid data loss as much as possible.
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 NotReady 2d19h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 2d19h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 3m55s
pod/pg-ha-cluster-1 2/2 Running 0 3m52s
pod/pg-ha-cluster-2 2/2 Running 0 3m49s
pod/pg-load-test-job-ztb94 1/1 Running 0 2m25s
Your database should be in Critical state after some time.
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Critical 2d19h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 2d19h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 4m38s
pod/pg-ha-cluster-1 2/2 Running 0 4m35s
pod/pg-ha-cluster-2 2/2 Running 0 4m32s
pod/pg-load-test-job-ztb94 1/1 Running 0 3m8s
NOTE: There is one scenario where data loss can be avoided even with asynchronous replication; the reason is somewhat counterintuitive but possible. If the standbys were already lagging behind the primary before the network partition happened, no failover is performed, because we know a failover would cause data loss in that state. In this case, your database will stay in NotReady state.
So if you see your database in NotReady state for a longer period, this might be the reason: you have successfully avoided data loss even with asynchronous replication, at the cost of some downtime. If you prefer uptime instead, set .spec.replication.forceFailoverAcceptingDataLossAfter: 30s, which forcefully performs a failover without considering data loss.
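For reference, that knob sits alongside the other replication settings in the Postgres spec. A sketch of the relevant fragment (field path taken from the note above; verify the exact schema against the KubeDB Postgres API reference):

```yaml
spec:
  replication:
    walKeepSize: 5000
    walLimitPolicy: WALKeepSize
    # Force a failover after 30s even if it may lose unreplicated writes
    forceFailoverAcceptingDataLossAfter: 30s
```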
Let's check the logs of the old primary. You should see that the postgres process was shut down immediately after the network partition was detected.
➤ kubectl logs -f -n demo pg-ha-cluster-1
...
2026-04-06 07:23:50.190 UTC [2598] FATAL: the database system is shutting down
2026-04-06 07:23:50.514 UTC [77] LOG: checkpoint complete: wrote 24464 buffers (37.3%); 0 WAL file(s) added, 0 removed, 23 recycled; write=0.441 s, sync=0.036 s, total=0.519 s; sync files=44, longest=0.025 s, average=0.001 s; distance=376832 kB, estimate=376832 kB; lsn=8/48000028, redo lsn=8/48000028
2026-04-06 07:23:50.576 UTC [48] LOG: database system is shut down
Let’s check who is the new primary.
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{print $1}'
pg-ha-cluster-0
Check the logs of the new primary. It shows that it is now accepting connections, so new read/write operations will now go to the new primary.
➤ kubectl logs -f -n demo pg-ha-cluster-0
Defaulted container "postgres" out of: postgres, pg-coordinator, postgres-init-container (init)
...
2026-04-06 07:23:26.417 UTC [116] LOG: database system is ready to accept connections
2026-04-06 07:23:26.864 UTC [160] LOG: checkpoint complete: wrote 23062 buffers (35.2%); 0 WAL file(s) added, 0 removed, 17 recycled; write=0.407 s, sync=0.027 s, total=0.460 s; sync files=47, longest=0.011 s, average=0.001 s; distance=287175 kB, estimate=287175 kB; lsn=8/42873F88, redo lsn=8/42871FF0
2026-04-06 07:23:26.864 UTC [160] LOG: checkpoint starting: immediate force wait
2026-04-06 07:23:26.871 UTC [160] LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.002 s, total=0.007 s; sync files=1, longest=0.002 s, average=0.002 s; distance=8 kB, estimate=258459 kB; lsn=8/42874050, redo lsn=8/42874018
Now wait for the chaos experiment to recover. You can check the status of the chaos experiment by running the kubectl get networkchaos -n chaos-mesh pg-primary-network-partition command.
status:
  conditions:
  - status: "True"
    type: Selected
  - status: "False"
    type: AllInjected
  - status: "True"
    type: AllRecovered
  - status: "False"
    type: Paused
Once AllRecovered is True, you should see the old primary back as a standby and the database in Ready state again.
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 2d19h
Now let’s see how many rows we lost in this case by checking the load test job logs.
Cumulative Statistics:
Total Operations: 2371907 (Reads: 1897709, Inserts: 47445, Updates: 426753)
Total Number of Rows Reads: 189770900, Inserts: 237225, Updates: 426753)
Total Errors: 73
Total Data Transferred: 192743.74 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 58.05 (Reads: 26.12/s, Inserts: 2.90/s, Updates: 29.03/s)
Throughput: 2.81 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 5.23ms, P95: 36.986ms, P99: 48.888ms
Inserts - Avg: 7.183ms, P95: 37.949ms, P99: 49.527ms
Updates - Avg: 5.358ms, P95: 33.606ms, P99: 45.95ms
-----------------------------------------------------------------
Connection Pool:
Active: 9, Max: 100, Available: 91
=================================================================
=================================================================
Performance Summary:
Average Throughput: 3934.44 operations/sec
Read Operations: 1897709 (3147.85/sec avg)
Insert Operations: 47445 (78.70/sec avg)
Update Operations: 426753 (707.88/sec avg)
Error Rate: 0.0031%
Total Data Transferred: 188.23 GB
=================================================================
=================================================================
Checking for Data Loss...
=================================================================
=================================================================
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 270295
Records Found in DB: 253365
Records Lost: 16930
Data Loss Percentage: 6.26%
=================================================================
⚠️ WARNING: 16930 records were inserted but not found in database!
So we incurred data loss. Now the question is: how much? In our case above:
Insert Operations: 47445 (78.70/sec avg) -> 78.70 insert operations per second. Each insert uses a batch size of 5 (BATCH_SIZE),
which gives 78.70 * 5 = 393.5 rows inserted per second.
We lost 16930 rows, so the data loss window is 16930 / 393.5 ≈ 43 seconds.
So we can say the network partition lasted around 4 minutes (chaos.spec.duration),
while the split brain caused by the partition lasted about 43 seconds.
The split-brain detection time is around ~30 seconds (your data loss window), regardless of how long the network partition lasts.
Note: If your network partition window is shorter than 30 seconds, you won't lose any data even in asynchronous mode.
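The arithmetic above can be double-checked with a few lines (numbers taken from the report above):

```python
insert_ops_per_sec = 78.70   # from the performance summary
rows_per_insert = 5          # effective batch size used in the calculation above
rows_lost = 16930            # from the data loss report

rows_per_sec = insert_ops_per_sec * rows_per_insert
loss_window_sec = rows_lost / rows_per_sec
print(f"{rows_per_sec} rows/s, loss window ≈ {loss_window_sec:.0f} s")  # ≈ 43 s
```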
Now let's try to avoid data loss by using synchronous replication. Set db.spec.streamingMode: Synchronous in setup/pg-ha-cluster.yaml.
apiVersion: kubedb.com/v1
kind: Postgres
metadata:
  name: pg-ha-cluster
  namespace: demo
spec:
  clientAuthMode: md5
  deletionPolicy: Delete
  podTemplate:
    spec:
      containers:
        - name: postgres
          resources:
            limits:
              memory: 3Gi
            requests:
              cpu: 2
              memory: 2Gi
  replicas: 3
  replication:
    walKeepSize: 5000
    walLimitPolicy: WALKeepSize
  standbyMode: Hot
  storage:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 50Gi
  storageType: Durable
  streamingMode: Synchronous # Note this line
  version: "16.4"
Before applying this, let's clean up the previous resources, including the Postgres, the load test job, and the chaos experiment.
kubectl delete -f setup/pg-ha-cluster.yaml
postgres.kubedb.com "pg-ha-cluster" deleted
kubectl delete -f k8s/03-job.yaml
job.batch "pg-load-test-job" deleted
kubectl delete -f tests/05-network-partition.yaml
networkchaos.chaos-mesh.org "pg-primary-network-partition" deleted
Now apply the setup/pg-ha-cluster.yaml and wait for the db to be in Ready state.
kubectl apply -f setup/pg-ha-cluster.yaml
Once the db is in Ready state, apply the load test job, wait 1 minute, and then apply the chaos experiment.
./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config configured
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
First delete the previous experiment.
➤ kubectl delete -f tests/05-network-partition.yaml
networkchaos.chaos-mesh.org/pg-primary-network-partition deleted
Now apply the experiment again.
➤ kubectl apply -f tests/05-network-partition.yaml
networkchaos.chaos-mesh.org/pg-primary-network-partition created
You should experience the same scenario as before, but this time there won't be any data loss since we are using synchronous replication.
Let’s wait and verify the logs from the load test job once the test is completed.
Final Results:
=================================================================
Test Duration: 10m3s
-----------------------------------------------------------------
Cumulative Statistics:
Total Operations: 1828426 (Reads: 1463758, Inserts: 36327, Updates: 328341)
Total Number of Rows Reads: 146375800, Inserts: 181635, Updates: 328341)
Total Errors: 42
Total Data Transferred: 151400.85 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 62.77 (Reads: 51.69/s, Inserts: 3.69/s, Updates: 7.38/s)
Throughput: 5.37 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 4.254ms, P95: 6.757ms, P99: 72.028ms
Inserts - Avg: 10.912ms, P95: 59.609ms, P99: 72.482ms
Updates - Avg: 14.94ms, P95: 18.156ms, P99: 70.674ms
-----------------------------------------------------------------
Connection Pool:
Active: 10, Max: 100, Available: 90
=================================================================
=================================================================
Performance Summary:
Average Throughput: 3032.08 operations/sec
Read Operations: 1463758 (2427.35/sec avg)
Insert Operations: 36327 (60.24/sec avg)
Update Operations: 328341 (544.49/sec avg)
Error Rate: 0.0023%
Total Data Transferred: 147.85 GB
=================================================================
=================================================================
Checking for Data Loss...
=================================================================
=================================================================
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 231635
Records Found in DB: 231635
Records Lost: 0
I0406 07:45:57.178325 1 load_generator_v2.go:555] Total records in table: 231635
I0406 07:45:57.178359 1 load_generator_v2.go:556] totalRows in LoadGenerator: 231635
Data Loss Percentage: 0.00%
=================================================================
No data loss detected - all inserted records are present in database
As you can see, this time there is no data loss.
Clean up the chaos experiment.
kubectl delete -f tests/05-network-partition.yaml
networkchaos.chaos-mesh.org "pg-primary-network-partition" deleted
Delete and recreate the Postgres with asynchronous replication if you want to do more experiments. Also revert these changes:
BATCH_SIZE: "100"
TEST_RUN_DURATION: "300" # 5 minutes
INSERT_PERCENT: "10" # 10% of the operations will be insert
UPDATE_PERCENT: "10" # 10% of the operations will be update
CONCURRENT_WRITERS: "20"
Chaos#6: Limit bandwidth of Primary Pod
As you changed .db.spec.streamingMode: Synchronous in the previous experiment, change it back to Asynchronous for this experiment. You can also keep it as is if you want.
Skip the deletion steps below if you want to continue with .db.spec.streamingMode: Synchronous.
Edit setup/pg-ha-cluster.yaml and set .db.spec.streamingMode: Asynchronous.
Now first delete the previous one,
kubectl delete -f setup/pg-ha-cluster.yaml
Now wait until all the pods are gone.
kubectl get pods -n demo | grep pg-ha-cluster
# This should return nothing
Now apply the setup/pg-ha-cluster.yaml,
kubectl apply -f setup/pg-ha-cluster.yaml
Now wait until the database is in Ready state.
➤ kubectl get pg,pods -n demo
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 2m28s
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 2m22s
pod/pg-ha-cluster-1 2/2 Running 0 2m15s
pod/pg-ha-cluster-2 2/2 Running 0 2m8s
For this chaos experiment, we are going to limit the bandwidth of the primary pod. This will cause the replication lag between primary and standby to increase, which can lead to data loss if a failover happens during this time. So this is a good experiment to test the behavior of your cluster under network congestion.
Save this yaml as tests/06-bandwidth-limit.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pg-primary-bandwidth-limit
  namespace: chaos-mesh
spec:
  action: bandwidth
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "pg-ha-cluster"
      "kubedb.com/role": "primary"
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        "app.kubernetes.io/instance": "pg-ha-cluster"
  bandwidth:
    rate: "1mbps"
    limit: 20000
    buffer: 10000
  direction: both
  duration: "2m"
What this chaos does: Restricts the egress/ingress bandwidth of the primary pod to 1 Mbps, simulating a slow network connection and increasing replication lag.
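Replication lag builds up whenever the primary generates WAL faster than the capped link can ship it. A rough backlog estimate for the 2-minute chaos window, assuming a hypothetical WAL generation rate of 0.5 MB/s (the actual rate in your cluster depends on write load):

```python
link_mbps = 1.0                # chaos spec: bandwidth.rate "1mbps"
link_mb_per_s = link_mbps / 8  # 1 Mbps == 0.125 MB/s
wal_mb_per_s = 0.5             # hypothetical WAL generation rate of the primary
duration_s = 120               # chaos duration: "2m"

backlog_mb = max(0.0, (wal_mb_per_s - link_mb_per_s) * duration_s)
print(f"replication backlog after chaos ≈ {backlog_mb:.0f} MB")  # ≈ 45 MB
```

The standbys catch up once the limit is lifted; the risk is a failover happening while this backlog is still unshipped.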
Additionally, we will run the load test with a few config changes:
INSERT_PERCENT: "19"
UPDATE_PERCENT: "1"
BATCH_SIZE: "200"
TEST_RUN_DURATION: "150"
Run the load generating job.
➤ ./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config configured
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
Now let’s watch the pods and postgres.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 17:13:20 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 2d23h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 2d23h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 3h40m
pod/pg-ha-cluster-1 2/2 Running 0 3h38m
pod/pg-ha-cluster-2 2/2 Running 0 3h38m
pod/pg-load-test-job-hf85p 1/1 Running 0 105s
Your database should stay in Ready state the whole time. Once the chaos experiment is completed, check the logs of the load test job to see if there was any data loss.
Final Results:
=================================================================
Test Duration: 3m0s
-----------------------------------------------------------------
Cumulative Statistics:
Total Operations: 24564 (Reads: 19517, Inserts: 4803, Updates: 244)
Total Number of Rows Reads: 1951700, Inserts: 960600, Updates: 244)
Total Errors: 20
Total Data Transferred: 3067.35 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 0.00 (Reads: 0.00/s, Inserts: 0.00/s, Updates: 0.00/s)
Throughput: 0.00 MB/s
Errors/sec: 2.75
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 13.334ms, P95: 49.334ms, P99: 336.744ms
Inserts - Avg: 168.387ms, P95: 324.687ms, P99: 590.029ms
Updates - Avg: 137.242ms, P95: 189.343ms, P99: 350.323ms
-----------------------------------------------------------------
Connection Pool:
Active: 29, Max: 100, Available: 71
=================================================================
=================================================================
Performance Summary:
Average Throughput: 136.47 operations/sec
Read Operations: 19517 (108.43/sec avg)
Insert Operations: 4803 (26.68/sec avg)
Update Operations: 244 (1.36/sec avg)
Error Rate: 0.0814%
Total Data Transferred: 3.00 GB
=================================================================
=================================================================
Checking for Data Loss...
=================================================================
I0406 11:14:41.761915 1 load_generator_v2.go:555] Total records in table: 1014600
I0406 11:14:41.761938 1 load_generator_v2.go:556] totalRows in LoadGenerator: 1010600
=================================================================
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 1014600
Records Found in DB: 1018600
Records Lost: -4000
Data Loss Percentage: -0.39%
=================================================================
No data loss detected - all inserted records are present in database
Clean up the chaos experiment.
kubectl delete -f tests/06-bandwidth-limit.yaml
networkchaos.chaos-mesh.org "pg-primary-bandwidth-limit" deleted
Chaos#7: Network Delay Primary Pod
In this chaos experiment, we are going to introduce network delay to the primary pod. This will cause the replication lag between primary and standby to increase, which can lead to data loss if a failover happens during this time. So this is a good experiment to test the behavior of your cluster under network congestion.
Save this yaml as tests/07-network-delay.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pg-primary-network-delay
  namespace: chaos-mesh
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "pg-ha-cluster"
      "kubedb.com/role": "primary"
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        "app.kubernetes.io/instance": "pg-ha-cluster"
  delay:
    latency: "500ms"
    jitter: "100ms"
    correlation: "50"
  duration: "3m"
  direction: both
What this chaos does: Adds 500ms latency with 100ms jitter to all network packets of the primary pod, simulating high-latency network conditions.
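Note that the target selector matches only the pg-ha-cluster pods, so the delay lands on replication traffic between the primary and the standbys rather than on client connections. In Asynchronous mode this mainly shows up as replication lag; if you were running with streamingMode: Synchronous, each commit would wait for a standby acknowledgment, so per-session commit throughput is capped by the round-trip time. A rough model (illustrative only, ignores pipelining and jitter):

```python
one_way_delay_s = 0.5        # chaos spec: delay.latency "500ms"
rtt_s = 2 * one_way_delay_s  # direction: both -> WAL out plus ack back each pay the delay

# A synchronous commit cannot return before the standby acknowledgment,
# so one session can commit at most ~1/RTT times per second.
max_commits_per_session = 1 / rtt_s
print(f"RTT ≈ {rtt_s:.1f}s -> at most ~{max_commits_per_session:.0f} commit/s per session")
```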
Let’s adjust the load test config before running the load test job.
TEST_RUN_DURATION: "200"
READ_PERCENT: "80"
INSERT_PERCENT: "10"
UPDATE_PERCENT: "10"
BATCH_SIZE: "100"
Let's create the load test job.
➤ ./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config configured
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
Now watch the pods and postgres status.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 18:39:12 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 3d
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 3d
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 83m
pod/pg-ha-cluster-1 2/2 Running 0 83m
pod/pg-ha-cluster-2 2/2 Running 0 83m
pod/pg-load-test-job-89flv 1/1 Running 0 72s
The database should be in Ready state all the time.
kubectl get networkchaos -n chaos-mesh -oyaml
...
status:
  conditions:
  - status: "True"
    type: AllRecovered
  - status: "False"
    type: Paused
  - status: "True"
    type: Selected
  - status: "False"
    type: AllInjected
The AllRecovered condition is True, which means the chaos experiment is done. Now let's check how many rows were inserted.
Final Results:
=================================================================
Test Duration: 3m23s
-----------------------------------------------------------------
Cumulative Statistics:
Total Operations: 253446 (Reads: 202535, Inserts: 25370, Updates: 25541)
Total Number of Rows Reads: 20253500, Inserts: 2537000, Updates: 25541)
Total Errors: 0
Total Data Transferred: 23686.56 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 336.68 (Reads: 202.01/s, Inserts: 84.17/s, Updates: 50.50/s)
Throughput: 30.12 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 8.76ms, P95: 64.215ms, P99: 98.166ms
Inserts - Avg: 54.375ms, P95: 124.26ms, P99: 189.16ms
Updates - Avg: 32.242ms, P95: 99.145ms, P99: 150.899ms
-----------------------------------------------------------------
Connection Pool:
Active: 28, Max: 100, Available: 72
=================================================================
=================================================================
Performance Summary:
Average Throughput: 1250.27 operations/sec
Read Operations: 202535 (999.12/sec avg)
Insert Operations: 25370 (125.15/sec avg)
Update Operations: 25541 (126.00/sec avg)
Error Rate: 0.0000%
Total Data Transferred: 23.13 GB
=================================================================
=================================================================
Checking for Data Loss...
=================================================================
=================================================================
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 2587000
Records Found in DB: 2587000
Records Lost: 0
Data Loss Percentage: 0.00%
=================================================================
No data loss detected - all inserted records are present in database
Cleaning up test data...
Cleaning up test table...
I0406 12:41:39.032079 1 load_generator_v2.go:555] Total records in table: 2587000
I0406 12:41:39.032102 1 load_generator_v2.go:556] totalRows in LoadGenerator: 2587000
=================================================================
As you can see, about 2.6 million rows were inserted and 23GB of data was transferred to the database, with no data loss. So even with a 500ms network delay, our cluster was able to handle the load without losing data.
Clean up the chaos experiment.
kubectl delete -f tests/07-network-delay.yaml
networkchaos.chaos-mesh.org "pg-primary-network-delay" deleted
Chaos#8: Network Loss Primary Pod
In this chaos experiment, we are going to introduce network packet loss to the primary pod. We expect our database to hold Ready state; even though we may see a failover, the end state of the database should be Ready.
Save this yaml as tests/08-network-loss.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pg-primary-packet-loss
  namespace: chaos-mesh
spec:
  action: loss
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "pg-ha-cluster"
      "kubedb.com/role": "primary"
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        "app.kubernetes.io/instance": "pg-ha-cluster"
  loss:
    loss: "100"
    correlation: "100"
  duration: "3m"
  direction: both
What this chaos does: Drops 100% of network packets to/from the primary pod, simulating a complete network blackhole while allowing recovery when the chaos ends.
Let's run the load test job with a change in config.
TEST_RUN_DURATION: "200"
Now create the load test job.
➤ ./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config unchanged
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
Now create the chaos experiment.
➤ kubectl apply -f tests/08-network-loss.yaml
networkchaos.chaos-mesh.org/pg-primary-packet-loss created
Now watch the pods and postgres status.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 19:00:54 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 3d
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 3d
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 104m
pod/pg-ha-cluster-1 2/2 Running 0 104m
pod/pg-ha-cluster-2 2/2 Running 0 104m
pod/pg-load-test-job-44hg8 1/1 Running 0 96s
Postgres should stay in Ready state; even if it briefly switches to Critical, it should return to Ready after the experiment is done.
kubectl get networkchaos -n chaos-mesh -oyaml
...
status:
  conditions:
  - status: "True"
    type: AllRecovered
  - status: "False"
    type: Paused
  - status: "True"
    type: Selected
  - status: "False"
    type: AllInjected
The AllRecovered condition is True, which means the chaos experiment is done. Now let's check how many rows were inserted and whether there was any data loss.
Final Results:
=================================================================
Test Duration: 3m23s
-----------------------------------------------------------------
Cumulative Statistics:
Total Operations: 229680 (Reads: 183614, Inserts: 23016, Updates: 23050)
Total Number of Rows Reads: 18361400, Inserts: 2301600, Updates: 23050)
Total Errors: 0
Total Data Transferred: 21474.94 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 65.82 (Reads: 29.62/s, Inserts: 23.04/s, Updates: 13.16/s)
Throughput: 5.62 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 13.045ms, P95: 50.368ms, P99: 218.188ms
Inserts - Avg: 45.326ms, P95: 119.711ms, P99: 189.261ms
Updates - Avg: 23.338ms, P95: 81.739ms, P99: 142.693ms
-----------------------------------------------------------------
Connection Pool:
Active: 29, Max: 100, Available: 71
=================================================================
=================================================================
Performance Summary:
Average Throughput: 1131.63 operations/sec
Read Operations: 183614 (904.67/sec avg)
Insert Operations: 23016 (113.40/sec avg)
Update Operations: 23050 (113.57/sec avg)
Error Rate: 0.0000%
Total Data Transferred: 20.97 GB
=================================================================
=================================================================
Checking for Data Loss...
=================================================================
I0406 13:02:54.311885 1 load_generator_v2.go:555] Total records in table: 2351600
I0406 13:02:54.311910 1 load_generator_v2.go:556] totalRows in LoadGenerator: 2351600
=================================================================
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 2351600
Records Found in DB: 2351600
Records Lost: 0
Data Loss Percentage: 0.00%
=================================================================
No data loss detected - all inserted records are present in database
The stats clearly show that lots of rows were inserted and reads were performed, with no data loss and no downtime.
Clean up the chaos experiment.
kubectl delete -f tests/08-network-loss.yaml
networkchaos.chaos-mesh.org "pg-primary-packet-loss" deleted
Chaos#9: Network Duplicate to Primary Pod
In this experiment, we will introduce packet duplication to the primary pod. We expect the database to handle the duplication and remain Ready the whole time.
Save this yaml as tests/09-network-duplicate.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pg-primary-packet-duplicate
  namespace: chaos-mesh
spec:
  action: duplicate
  mode: one
  selector:
    namespaces:
      - demo
    labelSelectors:
      "app.kubernetes.io/instance": "pg-ha-cluster"
      "kubedb.com/role": "primary"
  target:
    mode: all
    selector:
      namespaces:
        - demo
      labelSelectors:
        "app.kubernetes.io/instance": "pg-ha-cluster"
  duplicate:
    duplicate: "50"
    correlation: "25"
  duration: "4m"
  direction: both
What this chaos does: Duplicates 50% of network packets to/from the primary pod, creating redundant traffic that can overwhelm or confuse the receiving end.
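Packet duplication is largely absorbed by TCP: the receiver discards segments whose sequence numbers it has already seen, which is why the database tolerates this well. A toy illustration of sequence-number de-duplication (not real TCP, just the idea):

```python
def deliver(segments):
    """Deliver each sequence number once, ignoring duplicates (toy TCP receiver)."""
    seen, delivered = set(), []
    for seq, payload in segments:
        if seq in seen:
            continue  # duplicate segment: drop it
        seen.add(seq)
        delivered.append(payload)
    return delivered

# 50% duplication: some segments arrive twice, the application sees each once
stream = [(1, "a"), (1, "a"), (2, "b"), (3, "c"), (3, "c")]
print(deliver(stream))  # ['a', 'b', 'c']
```

The cost is extra bandwidth and kernel work, not correctness, which matches the Ready-all-the-time expectation above.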
Let's run the load test job with a change in config.
TEST_RUN_DURATION: "240"
➤ ./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config unchanged
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
Now let's create the chaos experiment.
➤ kubectl apply -f tests/09-network-duplicate.yaml
networkchaos.chaos-mesh.org/pg-primary-packet-duplicate created
Now watch the pods and postgres status.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 19:19:44 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 3d1h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 3d1h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 123m
pod/pg-ha-cluster-1 2/2 Running 0 123m
pod/pg-ha-cluster-2 2/2 Running 0 123m
You should see your database is in Ready state all the time despite the packet duplication.
kubectl get networkchaos -n chaos-mesh -oyaml
...
status:
conditions:
- status: "True"
type: AllInjected
- status: "False"
type: AllRecovered
- status: "False"
type: Paused
- status: "True"
type: Selected
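If you prefer to script this check rather than eyeball the YAML, the `status` line preceding each condition `type` can be pulled out with awk; here it runs against the sample condition block above:

```shell
# Extract the status of the AllInjected condition from a sample condition block.
# The awk program remembers the last field of the previous line and prints it
# when it reaches the matching "type:" line.
cat <<'EOF' | awk '/type: AllInjected/ { print prev } { prev = $NF }'
- status: "True"
  type: AllInjected
- status: "False"
  type: AllRecovered
EOF
```

Against a live cluster you could fetch the same field directly with a jsonpath filter, e.g. `kubectl get networkchaos pg-primary-packet-duplicate -n chaos-mesh -o jsonpath='{.status.conditions[?(@.type=="AllInjected")].status}'`.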
Once the experiment is done, check the logs of the load test job to see whether there was any data loss.
Final Results:
=================================================================
Test Duration: 3m23s
-----------------------------------------------------------------
Cumulative Statistics:
Total Operations: 224846 (Reads: 179994, Inserts: 22547, Updates: 22305)
Total Number of Rows Reads: 17999400, Inserts: 2254700, Updates: 22305)
Total Errors: 0
Total Data Transferred: 21050.70 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 151.38 (Reads: 105.97/s, Inserts: 22.71/s, Updates: 22.71/s)
Throughput: 13.61 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 13.234ms, P95: 71.521ms, P99: 237.409ms
Inserts - Avg: 46.457ms, P95: 115.989ms, P99: 193.325ms
Updates - Avg: 24.757ms, P95: 91.271ms, P99: 157.036ms
-----------------------------------------------------------------
Connection Pool:
Active: 14, Max: 100, Available: 86
=================================================================
=================================================================
Performance Summary:
Average Throughput: 1109.38 operations/sec
Read Operations: 179994 (888.08/sec avg)
Insert Operations: 22547 (111.25/sec avg)
Update Operations: 22305 (110.05/sec avg)
Error Rate: 0.0000%
Total Data Transferred: 20.56 GB
=================================================================
=================================================================
Checking for Data Loss...
=================================================================
=================================================================
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 2304700
Records Found in DB: 2304700
Records Lost: 0
Data Loss Percentage: 0.00%
=================================================================
No data loss detected - all inserted records are present in database
Cleaning up test data...
Cleaning up test table...
I0406 13:15:39.100598 1 load_generator_v2.go:555] Total records in table: 2304700
I0406 13:15:39.100624 1 load_generator_v2.go:556] totalRows in LoadGenerator: 2304700
=================================================================
As usual, despite the load on the database and the packet duplication, there was no data loss and the database stayed in the Ready state the whole time.
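A quick sanity check on the numbers in the report above: the row counts are exactly 100x the read and insert operation counts (the load tool works in 100-row batches), and the 21050.70 MB figure matches the 20.56 GB summary:

```shell
# Verify the 100-row batch factor and the MB -> GB conversion from the report
awk 'BEGIN {
  print 17999400 / 179994           # rows per read operation
  print 2254700 / 22547             # rows per insert operation
  printf "%.2f\n", 21050.70 / 1024  # MB -> GB
}'
```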
Clean up the chaos experiment.
kubectl delete -f tests/09-network-duplicate.yaml
networkchaos.chaos-mesh.org "pg-primary-packet-duplicate" deleted
Chaos#10: Network Corruption to Primary Pod
In this experiment, we will introduce packet corruption to the primary pod. We expect the database to be able to handle packet corruption and not lose any data.
Save this yaml as tests/10-network-corrupt.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: pg-primary-packet-corrupt
namespace: chaos-mesh
spec:
action: corrupt
mode: one
selector:
namespaces:
- demo
labelSelectors:
"app.kubernetes.io/instance": "pg-ha-cluster"
"kubedb.com/role": "primary"
target:
mode: all
selector:
namespaces:
- demo
labelSelectors:
"app.kubernetes.io/instance": "pg-ha-cluster"
corrupt:
corrupt: "50"
correlation: "25"
duration: "4m"
direction: both
What this chaos does: Corrupts 50% of network packets to/from the primary pod by flipping random bits in the payload, causing checksums to fail.
Let's change some config and apply the load test creation script.
TEST_RUN_DURATION: "240"
➤ ./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config unchanged
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
Now check whether the database is in the Ready state.
kubectl get pg,petset,pods -n demo
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 3d1h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 3d1h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 126m
pod/pg-ha-cluster-1 2/2 Running 0 126m
pod/pg-ha-cluster-2 2/2 Running 0 126m
pod/pg-load-test-job-lftl8 1/1 Running 0 6s
Now create the chaos experiment.
➤ kubectl apply -f tests/10-network-corrupt.yaml
networkchaos.chaos-mesh.org/pg-primary-packet-corrupt created
Now watch the pods and postgres status.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 19:27:48 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 3d1h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 3d1h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 131m
pod/pg-ha-cluster-1 2/2 Running 0 131m
pod/pg-ha-cluster-2 2/2 Running 0 131m
The database is ready so far, and pg-ha-cluster-0 is the primary.
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-0
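As a side note, the `grep primary | awk '{ print $1 }'` pipeline used throughout simply prints the first column of any line whose label list mentions `primary`; here it is applied to sample `--show-labels` output (the label strings below are illustrative):

```shell
# Pick the pod name from lines whose label list contains "primary"
cat <<'EOF' | grep primary | awk '{ print $1 }'
pg-ha-cluster-0   2/2   Running   0   126m   app.kubernetes.io/instance=pg-ha-cluster,kubedb.com/role=primary
pg-ha-cluster-1   2/2   Running   0   126m   app.kubernetes.io/instance=pg-ha-cluster,kubedb.com/role=standby
EOF
```

An equivalent without grep is a label selector: `kubectl get pods -n demo -l kubedb.com/role=primary -o name`.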
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 19:35:09 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 NotReady 3d1h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 3d1h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 139m
pod/pg-ha-cluster-1 2/2 Running 0 139m
pod/pg-ha-cluster-2 2/2 Running 0 139m
pod/pg-load-test-job-5q4gh 1/1 Running 0 52s
The database turns into the NotReady state as a failover happens due to the corruption.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 19:35:52 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Critical 3d1h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 3d1h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 139m
pod/pg-ha-cluster-1 2/2 Running 0 139m
pod/pg-ha-cluster-2 2/2 Running 0 139m
pod/pg-load-test-job-5q4gh 1/1 Running 0 95s
A new primary is elected and the database moved into the Critical state, which means the new primary is ready to accept connections.
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-1
So pg-ha-cluster-1 is the new primary. Wait for the chaos to be recovered.
kubectl get networkchaos -n chaos-mesh -oyaml
...
status:
conditions:
- status: "False"
type: AllInjected
- status: "True"
type: AllRecovered
- status: "False"
type: Paused
- status: "True"
type: Selected
AllRecovered being "True" means the chaos experiment is over.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 19:36:25 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 3d1h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 3d1h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 140m
pod/pg-ha-cluster-1 2/2 Running 0 140m
pod/pg-ha-cluster-2 2/2 Running 0 140m
pod/pg-load-test-job-5q4gh 1/1 Running 0 2m8s
The database has returned to Ready state.
Now check the stats of data insertion and read.
Final Results:
=================================================================
Test Duration: 3m23s
-----------------------------------------------------------------
Cumulative Statistics:
Total Operations: 241642 (Reads: 193608, Inserts: 23878, Updates: 24156)
Total Number of Rows Reads: 19360800, Inserts: 2387800, Updates: 24156)
Total Errors: 0
Total Data Transferred: 22600.28 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 395.64 (Reads: 237.39/s, Inserts: 138.48/s, Updates: 19.78/s)
Throughput: 39.88 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 10.469ms, P95: 64.429ms, P99: 103.762ms
Inserts - Avg: 51.473ms, P95: 134.532ms, P99: 201.798ms
Updates - Avg: 29.607ms, P95: 98.249ms, P99: 169.741ms
-----------------------------------------------------------------
Connection Pool:
Active: 27, Max: 100, Available: 73
=================================================================
=================================================================
Performance Summary:
Average Throughput: 1192.38 operations/sec
Read Operations: 193608 (955.36/sec avg)
Insert Operations: 23878 (117.83/sec avg)
Update Operations: 24156 (119.20/sec avg)
Error Rate: 0.0000%
Total Data Transferred: 22.07 GB
=================================================================
=================================================================
Checking for Data Loss...
=================================================================
I0406 13:26:04.117504 1 load_generator_v2.go:555] Total records in table: 2437800
I0406 13:26:04.117540 1 load_generator_v2.go:556] totalRows in LoadGenerator: 2437800
=================================================================
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 2437800
Records Found in DB: 2437800
Records Lost: 0
Data Loss Percentage: 0.00%
=================================================================
No data loss detected - all inserted records are present in database
So everything looks alright. No data loss.
Clean up the chaos experiment:
kubectl delete -f tests/10-network-corrupt.yaml
networkchaos.chaos-mesh.org "pg-primary-packet-corrupt" deleted
Chaos#11: Time Offset and DNS Error
We will run two chaos experiments back to back in this case. No load test will be run for either.
Save this yaml as tests/11-time-offset.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
name: pg-primary-time-offset
namespace: chaos-mesh
spec:
mode: one
selector:
namespaces:
- demo
labelSelectors:
"app.kubernetes.io/instance": "pg-ha-cluster"
"kubedb.com/role": "primary"
timeOffset: "-2h"
clockIds:
- CLOCK_REALTIME
duration: "2m"
What this chaos does: Shifts the system clock of the primary pod back by 2 hours, simulating time skew that can cause certificate validation, timestamp-based logic, and replication synchronization issues.
Save this yaml as tests/12-dns-error.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
name: pg-primary-dns-error
namespace: chaos-mesh
spec:
action: error
mode: one
selector:
namespaces:
- demo
labelSelectors:
"app.kubernetes.io/instance": "pg-ha-cluster"
"kubedb.com/role": "primary"
duration: "2m"
What this chaos does: Makes all DNS queries from the primary pod fail with resolution errors, simulating DNS service outage or misconfiguration.
➤ kubectl apply -f tests/11-time-offset.yaml
timechaos.chaos-mesh.org/pg-primary-time-offset created
➤ kubectl apply -f tests/12-dns-error.yaml
dnschaos.chaos-mesh.org/pg-primary-dns-error created
Your database should remain in the Ready state through the whole chaos.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 19:50:14 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 3d1h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 3d1h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 154m
pod/pg-ha-cluster-1 2/2 Running 0 154m
pod/pg-ha-cluster-2 2/2 Running 0 154m
Clean up the chaos experiments.
kubectl delete -f tests/11-time-offset.yaml
timechaos.chaos-mesh.org "pg-primary-time-offset" deleted
kubectl delete -f tests/12-dns-error.yaml
dnschaos.chaos-mesh.org "pg-primary-dns-error" deleted
IO chaos
Postgres Recreation with force failover
For IO-related chaos, if you prioritize high availability over data safety, set
.spec.replication.forceFailoverAcceptingDataLossAfter: 30s. This results in better availability. But if you prefer data safety over high availability, do not set this field.
I will set .spec.replication.forceFailoverAcceptingDataLossAfter: 30s for the IO-related chaos tests.
If you do not want to risk data loss, skip this Recreation step.
You will see that even though we force a failover and accept the possibility of data loss,
in reality the chance of data loss is not very high. We should be able to achieve high availability without losing any data in most cases. Our end goal is to have
the database in the Ready state once the chaos is recovered.
NOTE: If you prefer not to set
.spec.replication.forceFailoverAcceptingDataLossAfter: 30s, you might just face some extra downtime (the database might stay in the NotReady state for a longer period until the chaos is recovered) in some IOChaos cases.
First delete the setup/pg-ha-cluster.yaml
kubectl delete -f setup/pg-ha-cluster.yaml
Wait until all the pods are deleted.
kubectl get pods -n demo | grep pg-ha-cluster
# this should not return anything
Now update your setup/pg-ha-cluster.yaml with the yaml below.
apiVersion: kubedb.com/v1
kind: Postgres
metadata:
name: pg-ha-cluster
namespace: demo
spec:
clientAuthMode: md5
deletionPolicy: Delete
podTemplate:
spec:
containers:
- name: postgres
resources:
limits:
memory: 3Gi
requests:
cpu: 2
memory: 2Gi
replicas: 3
replication:
walKeepSize: 5000
walLimitPolicy: WALKeepSize
forceFailoverAcceptingDataLossAfter: 30s # New added
standbyMode: Hot
storage:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
storageType: Durable
version: "16.4"
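As a side note on walKeepSize: assuming the value is interpreted in megabytes (matching Postgres's wal_keep_size setting), 5000 MB of retained WAL corresponds to roughly 312 of the default 16 MB WAL segments:

```shell
# Number of default-sized (16 MB) WAL segments retained by walKeepSize: 5000,
# assuming the value is in megabytes like Postgres's wal_keep_size
awk 'BEGIN { print 5000 / 16 }'
```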
Run kubectl apply -f setup/pg-ha-cluster.yaml and wait for the database to be in the Ready state.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 20:17:16 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 54s
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 48s
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 48s
pod/pg-ha-cluster-1 2/2 Running 0 43s
pod/pg-ha-cluster-2 2/2 Running 0 38s
Let’s check which pod is the primary.
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-2
Note: If you performed this step, you might need to update k8s/02-secret.yaml -> DB_PASSWORD with the new value.
Chaos#12: IO latency
In this experiment, we will simulate IO latency. Our end goal is to have as low downtime as possible and the database should be in Ready state when chaos is recovered.
Save this yaml as tests/13-io-latency.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
name: pg-primary-io-latency
namespace: chaos-mesh
spec:
action: latency
mode: one
selector:
namespaces:
- demo
labelSelectors:
"app.kubernetes.io/instance": "pg-ha-cluster"
"kubedb.com/role": "primary"
volumePath: /var/pv
path: /var/pv/data/**/*
delay: "500ms"
percent: 100
duration: "5m"
containerNames:
- postgres
What this chaos does: Injects 500ms latency into all disk I/O operations on the primary pod, simulating slow storage that increases replication lag and can trigger failover.
Let's change the load test config.
TEST_RUN_DURATION: "300"
In case your database password has changed (you recreated the Postgres and used the WipeOut deletion policy), you can run the command below to check your database password.
➤ kubectl get secret -n demo pg-ha-cluster-auth -oyaml
apiVersion: v1
data:
password: bVApIWcyYW5PcV9ONXR+bQ==
username: cG9zdGdyZXM=
kind: Secret
...
Check if your database password given in the secret of load test yaml is changed or not. If changed, then update the password and apply the secret again.
DB_PASSWORD: bVApIWcyYW5PcV9ONXR+bQ==
➤ kubectl apply -f k8s/02-secret.yaml
secret/pg-load-test-secret configured
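The secret values are base64-encoded; you can decode the example values shown above locally to confirm what goes into DB_PASSWORD:

```shell
# Decode the credentials from the secret (example values from this tutorial)
echo 'cG9zdGdyZXM=' | base64 -d; echo        # username
echo 'bVApIWcyYW5PcV9ONXR+bQ==' | base64 -d; echo  # password
```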
Now apply the load test yamls.
➤ ./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config configured
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
Now wait 10-20 seconds and apply the chaos experiment.
➤ kubectl apply -f tests/13-io-latency.yaml
iochaos.chaos-mesh.org/pg-primary-io-latency created
Soon after we create the chaos test, the database should be in the NotReady state. The reason is that client calls to the primary pod are timing
out due to slow IO.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 20:37:24 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 NotReady 21m
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 20m
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 5m55s
pod/pg-ha-cluster-1 2/2 Running 0 5m52s
pod/pg-ha-cluster-2 2/2 Running 0 5m49s
pod/pg-load-test-job-62l88 1/1 Running 0 2m10s
Now we might observe some interesting behavior as the IO is not performing correctly. We might see frequent failovers and a possible split brain situation. However, this won’t last long.
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-0
pg-ha-cluster-2
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-0
pg-ha-cluster-2
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-0
pg-ha-cluster-2
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-0
pg-ha-cluster-2
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-2
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-0
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-0
After some time, we should see a stable primary, which in our case is pg-ha-cluster-0.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 20:40:04 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Critical 23m
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 23m
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 8m34s
pod/pg-ha-cluster-1 2/2 Running 0 8m31s
pod/pg-ha-cluster-2 2/2 Running 0 8m28s
pod/pg-load-test-job-62l88 1/1 Running 0 4m49s
Now the database is in the Critical state. We will wait until the chaos is recovered.
➤ kubectl get iochaos -n chaos-mesh -oyaml
...
status:
conditions:
- status: "True"
type: Selected
- status: "False"
type: AllInjected
- status: "True"
type: AllRecovered
- status: "False"
type: Paused
The chaos is recovered. Now the database should be in the Ready state. But if anything goes terribly wrong because of the slow IO, you might find the database in either the NotReady or Critical state. In that case, contact us.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Mon Apr 6 20:48:52 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 32m
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 32m
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 17m
pod/pg-ha-cluster-1 2/2 Running 0 17m
pod/pg-ha-cluster-2 2/2 Running 0 17m
pod/pg-load-test-job-62l88 0/1 Completed 0 13m
So the database transitioned into the Ready state as soon as the chaos was recovered.
Final Results:
=================================================================
Test Duration: 5m3s
-----------------------------------------------------------------
Cumulative Statistics:
Total Operations: 214087 (Reads: 171304, Inserts: 21358, Updates: 21425)
Total Number of Rows Reads: 17130400, Inserts: 2135800, Updates: 21425)
Total Errors: 15175
Total Data Transferred: 20024.37 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 45.03 (Reads: 40.53/s, Inserts: 2.25/s, Updates: 2.25/s)
Throughput: 4.46 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 14.961ms, P95: 55.724ms, P99: 153.02ms
Inserts - Avg: 59.45ms, P95: 126.812ms, P99: 218.533ms
Updates - Avg: 27.391ms, P95: 90.496ms, P99: 180.472ms
-----------------------------------------------------------------
Connection Pool:
Active: 11, Max: 100, Available: 89
=================================================================
=================================================================
Performance Summary:
Average Throughput: 706.23 operations/sec
Read Operations: 171304 (565.10/sec avg)
Insert Operations: 21358 (70.46/sec avg)
Update Operations: 21425 (70.68/sec avg)
Error Rate: 6.6191%
Total Data Transferred: 19.56 GB
=================================================================
=================================================================
Checking for Data Loss...
=================================================================
=================================================================
Data Loss Report:
I0406 14:40:33.621876 1 load_generator_v2.go:555] Total records in table: 2185700
I0406 14:40:33.621913 1 load_generator_v2.go:556] totalRows in LoadGenerator: 2185800
-----------------------------------------------------------------
Total Records Inserted: 2185700
Records Found in DB: 2185600
Records Lost: 100
Data Loss Percentage: 0.00%
=================================================================
⚠️ WARNING: 100 records were inserted but not found in database!
This may indicate:
- Database crash/restart occurred during test
- pg_rewind was triggered due to network partition
- Transaction rollback due to replication issues
A total of 2135800 rows were inserted and 100 rows were lost, so essentially one batch insert query was lost. If you have not set force failover, this data loss will not occur.
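For reference, the 0.00% in the report above is a rounding artifact; 100 lost rows out of the 2185700 tracked in the Data Loss Report is about 0.0046%:

```shell
# Loss percentage: 100 lost rows out of 2185700 inserted (from the report above)
awk 'BEGIN { printf "%.4f%%\n", 100 * 100 / 2185700 }'
# -> 0.0046%
```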
NOTE: The same chaos experiment is run again in the
IO Chaos Tests Without Force Failover section below, without the forceFailoverAcceptingDataLossAfter: 30s API. In that case, no data loss was incurred.
Clean up the chaos experiment.
kubectl delete -f tests/13-io-latency.yaml
iochaos.chaos-mesh.org "pg-primary-io-latency" deleted
Chaos#13: IO Fault to primary
In this experiment, chaos-mesh will inject I/O faults. Our database should handle this chaos and remain in the Ready or Critical state.
Once the chaos is recovered by chaos-mesh, the database should be back in the Ready state.
Save this yaml as tests/14-io-fault.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
name: pg-primary-io-fault
namespace: chaos-mesh
spec:
action: fault
mode: one
selector:
namespaces:
- demo
labelSelectors:
"app.kubernetes.io/instance": "pg-ha-cluster"
"kubedb.com/role": "primary"
volumePath: /var/pv
path: /var/pv/data/**/*
errno: 5 # EIO (Input/output error)
percent: 50
duration: "5m"
containerNames:
- postgres
What this chaos does: Injects I/O errors (EIO) on 50% of disk operations to the primary pod, simulating disk hardware failures or filesystem corruption.
Let's see how our database is doing now:
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 08:00:56 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 11h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 11h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 11h
pod/pg-ha-cluster-1 2/2 Running 0 11h
pod/pg-ha-cluster-2 2/2 Running 0 11h
pod/pg-load-test-job-62l88 0/1 Completed 0 11h
Let’s see who is primary:
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-0
➤ kubectl exec -it -n demo pg-ha-cluster-0 -- bash
Defaulted container "postgres" out of: postgres, pg-coordinator, postgres-init-container (init)
pg-ha-cluster-0:/$ psql
psql (16.4)
Type "help" for help.
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f
(1 row)
Let's now create the load generator job:
➤ ./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config unchanged
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
Wait 15-20 seconds and then apply the io-fault yaml.
➤ kubectl apply -f tests/14-io-fault.yaml
iochaos.chaos-mesh.org/pg-primary-io-fault created
Keep watching the database and pods:
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 08:05:39 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Critical 11h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 11h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 11h
pod/pg-ha-cluster-1 2/2 Running 0 11h
pod/pg-ha-cluster-2 2/2 Running 0 11h
pod/pg-load-test-job-pq4l6 1/1 Running 0 117s
After running for some time, the database went into the Critical state. Let's see if there was a failover.
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-1
➤ kubectl exec -it -n demo pg-ha-cluster-1 -- bash
Defaulted container "postgres" out of: postgres, pg-coordinator, postgres-init-container (init)
pg-ha-cluster-1:/$ psql
psql (16.4)
Type "help" for help.
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f
(1 row)
There is a failover, and we can run queries on the new primary. Things are looking good so far.
Let's look at what happened to the old primary due to the I/O errors.
➤ kubectl logs -n demo pg-ha-cluster-0
Defaulted container "postgres" out of: postgres, pg-coordinator, postgres-init-container (init)
...
2026-04-07 02:04:14.564 UTC [2813] LOG: all server processes terminated; reinitializing
2026-04-07 02:04:14.564 UTC [2813] LOG: could not open directory "base/pgsql_tmp": I/O error
2026-04-07 02:04:14.564 UTC [2813] LOG: could not open directory "base": I/O error
2026-04-07 02:04:14.564 UTC [2813] LOG: could not open directory "pg_tblspc": I/O error
2026-04-07 02:04:14.643 UTC [2813] PANIC: could not open file "global/pg_control": I/O error
/scripts/run.sh: line 61: 2813 Aborted (core dumped) /run_scripts/role/run.sh
removing the initial scripts as server is not running ...
So, as it wasn't able to operate cleanly and communicate with the standbys, a new leader election happened and pg-ha-cluster-1 was promoted to primary.
As we saw earlier, we can run queries on pg-ha-cluster-1, so our cluster remains usable even during the chaos.
Now wait until chaos is recovered.
➤ kubectl get iochaos -n chaos-mesh pg-primary-io-fault -oyaml
...
status:
conditions:
- status: "False"
type: AllInjected
- status: "True"
type: AllRecovered
- status: "False"
type: Paused
- status: "True"
type: Selected
Chaos is recovered by chaos-mesh.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 08:11:28 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 11h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 11h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 11h
pod/pg-ha-cluster-1 2/2 Running 0 11h
pod/pg-ha-cluster-2 2/2 Running 0 11h
pod/pg-load-test-job-pq4l6 0/1 Completed 0 7m47s
Our database has transitioned back into the Ready state.
Final Results:
Total Data Transferred: 24419.35 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 1115.64 (Reads: 893.05/s, Inserts: 111.29/s, Updates: 111.29/s)
Throughput: 104.36 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 16.1ms, P95: 60.276ms, P99: 295.835ms
Inserts - Avg: 42.911ms, P95: 117.868ms, P99: 204.391ms
Updates - Avg: 18.051ms, P95: 65.577ms, P99: 131.106ms
-----------------------------------------------------------------
Connection Pool:
Active: 29, Max: 100, Available: 71
=================================================================
=================================================================
Test Duration: 5m3s
-----------------------------------------------------------------
Cumulative Statistics:
Total Operations: 260920 (Reads: 208927, Inserts: 26033, Updates: 25960)
Total Number of Rows Reads: 20892700, Inserts: 2603300, Updates: 25960)
Total Errors: 242129
Total Data Transferred: 24420.59 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 35.96 (Reads: 32.97/s, Inserts: 3.00/s, Updates: 0.00/s)
Throughput: 3.71 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 16.102ms, P95: 60.291ms, P99: 295.835ms
Inserts - Avg: 42.912ms, P95: 117.868ms, P99: 204.391ms
Updates - Avg: 18.051ms, P95: 65.577ms, P99: 131.106ms
-----------------------------------------------------------------
Connection Pool:
Active: 11, Max: 100, Available: 89
=================================================================
=================================================================
Performance Summary:
Average Throughput: 861.35 operations/sec
Read Operations: 208927 (689.71/sec avg)
Insert Operations: 26033 (85.94/sec avg)
Update Operations: 25960 (85.70/sec avg)
Error Rate: 48.1323%
Total Data Transferred: 23.85 GB
=================================================================
=================================================================
Checking for Data Loss...
=================================================================
=================================================================
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 2653300
Records Found in DB: 2653300
Records Lost: 0
Data Loss Percentage: 0.00%
=================================================================
No data loss detected - all inserted records are present in database
You can see from the statistics that nearly 24 GB of data was transferred in 5 minutes with zero data loss, even though we accepted possible data loss via forceFailoverAcceptingDataLossAfter.
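The 48.1323% error rate in the summary above is computed as errors divided by attempted operations (successful plus failed):

```shell
# Error rate = errors / (successful operations + errors), from the summary above
awk 'BEGIN { printf "%.4f%%\n", 100 * 242129 / (260920 + 242129) }'
# -> 48.1323%
```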
Clean up the chaos experiment.
kubectl delete -f tests/14-io-fault.yaml
iochaos.chaos-mesh.org "pg-primary-io-fault" deleted
Chaos#14: IO attribute overwrite
In this experiment, file I/O attributes will be overridden. We expect our database to remain available (Ready or Critical) during the chaos experiment.
Note: If you are not using
forceFailoverAcceptingDataLossAfter, then you might see the database in the NotReady state during the chaos.
Save this yaml as tests/15-io-attr-override.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
name: pg-primary-io-attr-override
namespace: chaos-mesh
spec:
action: attrOverride
mode: one
selector:
namespaces:
- demo
labelSelectors:
"app.kubernetes.io/instance": "pg-ha-cluster"
"kubedb.com/role": "primary"
volumePath: /var/pv
path: /var/pv/data/**/*
attr:
perm: 444 # Read-only permissions
percent: 100
duration: "4m"
containerNames:
- postgres
What this chaos does: Overrides file permissions on data files to read-only (444), preventing write operations and forcing the database to encounter permission denied errors on all writes.
Let's see how our database is doing now.
kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 08:32:42 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 12h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 12h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 12h
pod/pg-ha-cluster-1 2/2 Running 0 12h
pod/pg-ha-cluster-2 2/2 Running 0 12h
Create the load generation job.
➤ ./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config unchanged
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
Apply the chaos experiment.
➤ kubectl apply -f tests/15-io-attr-override.yaml
iochaos.chaos-mesh.org/pg-primary-io-attr-override created
Keep watching the database resources.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 08:33:45 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 NotReady 12h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 12h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 12h
pod/pg-ha-cluster-1 2/2 Running 0 12h
pod/pg-ha-cluster-2 2/2 Running 0 12h
pod/pg-load-test-job-cgbgt 1/1 Running 0 72s
So the database went into the NotReady state, which means the primary is not responsive. The reason might be that the database process inside the primary pod is not running.
Let’s check this:
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-1
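The grep/awk pipeline above works, but the same lookup can be done with a plain label selector (a sketch; it relies on the `kubedb.com/role=primary` label used throughout this post):

```shell
# Return the current primary pod name in the given namespace
primary_pod() {
  kubectl get pods -n "$1" -l kubedb.com/role=primary \
    -o jsonpath='{.items[0].metadata.name}'
}
primary_pod demo    # e.g. pg-ha-cluster-1
```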
Let’s check the logs from unresponsive primary pg-ha-cluster-1.
➤ kubectl logs -f -n demo pg-ha-cluster-1
Defaulted container "postgres" out of: postgres, pg-coordinator, postgres-init-container (init)
...
2026-04-07 02:33:20.552 UTC [237694] FATAL: the database system is in recovery mode
2026-04-07 02:33:20.553 UTC [2908] LOG: all server processes terminated; reinitializing
2026-04-07 02:33:20.553 UTC [2908] LOG: could not open directory "base/pgsql_tmp": Permission denied
2026-04-07 02:33:20.554 UTC [2908] LOG: could not open directory "base/4": Permission denied
2026-04-07 02:33:20.554 UTC [2908] LOG: could not open directory "base/5": Permission denied
2026-04-07 02:33:20.554 UTC [2908] LOG: could not open directory "base/1": Permission denied
2026-04-07 02:33:20.627 UTC [2908] PANIC: could not open file "global/pg_control": Permission denied
removing the initial scripts as server is not running ...
/scripts/run.sh: line 61: 2908 Aborted (core dumped) /run_scripts/role/run.sh
So you can see the primary was shut down by the I/O chaos. A failover should happen soon.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 08:34:42 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Critical 12h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 12h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 12h
pod/pg-ha-cluster-1 2/2 Running 0 12h
pod/pg-ha-cluster-2 2/2 Running 0 12h
pod/pg-load-test-job-cgbgt 1/1 Running 0 2m9s
Our database moved from the NotReady to the Critical state. Let’s see who the new primary is.
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-0
-----
➤ kubectl exec -it -n demo pg-ha-cluster-0 -- bash
Defaulted container "postgres" out of: postgres, pg-coordinator, postgres-init-container (init)
pg-ha-cluster-0:/$ psql
psql (16.4)
Type "help" for help.
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f
(1 row)
----
➤ kubectl logs -f -n demo pg-ha-cluster-0
Defaulted container "postgres" out of: postgres, pg-coordinator, postgres-init-container (init)
...
2026-04-07 02:34:38.753 UTC [368446] LOG: checkpoint starting: wal
2026-04-07 02:35:09.342 UTC [368446] LOG: checkpoint complete: wrote 20389 buffers (31.1%); 0 WAL file(s) added, 0 removed, 21 recycled; write=29.948 s, sync=0.444 s, total=30.589 s; sync files=11, longest=0.351 s, average=0.041 s; distance=539515 kB, estimate=618146 kB; lsn=4/FACB3C48, redo lsn=4/DD03C9B0
2026-04-07 02:35:12.932 UTC [368446] LOG: checkpoint starting: wal
2026-04-07 02:35:29.121 UTC [368446] LOG: checkpoint complete: wrote 22745 buffers (34.7%); 0 WAL file(s) added, 2 removed, 33 recycled; write=15.541 s, sync=0.535 s, total=16.190 s; sync files=12, longest=0.216 s, average=0.045 s; distance=540559 kB, estimate=610387 kB; lsn=5/1C15E728, redo lsn=4/FE0207D8
So the database is back online. However, the old primary has not yet rejoined the cluster. We will wait until all the chaos is recovered.
➤ kubectl get iochaos -n chaos-mesh pg-primary-io-attr-override -oyaml
...
status:
conditions:
- status: "True"
type: AllRecovered
- status: "False"
type: Paused
- status: "True"
type: Selected
- status: "False"
type: AllInjected
All the generated chaos has been recovered.
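Instead of eyeballing the YAML, the `AllRecovered` condition can be extracted with a JSONPath filter (a sketch based on the condition types shown above):

```shell
# Print the status of the AllRecovered condition ("True" once the chaos is cleaned up)
all_recovered() {
  kubectl get iochaos -n chaos-mesh "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="AllRecovered")].status}'
}
all_recovered pg-primary-io-attr-override
```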
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 08:38:52 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 12h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 12h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 12h
pod/pg-ha-cluster-1 2/2 Running 0 12h
pod/pg-ha-cluster-2 2/2 Running 0 12h
pod/pg-load-test-job-cgbgt 0/1 Completed 0 6m20s
Database moved into Ready state.
Final Results:
=================================================================
Test Duration: 5m3s
-----------------------------------------------------------------
Cumulative Statistics:
Total Operations: 248425 (Reads: 198683, Inserts: 24948, Updates: 24794)
Total Number of Rows (Reads: 19868300, Inserts: 2494800, Updates: 24794)
Total Errors: 232435
Total Data Transferred: 23243.14 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 48.77 (Reads: 36.58/s, Inserts: 4.88/s, Updates: 7.32/s)
Throughput: 4.34 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 16.922ms, P95: 52.069ms, P99: 317.142ms
Inserts - Avg: 44.849ms, P95: 130.692ms, P99: 211.022ms
Updates - Avg: 19.201ms, P95: 66.474ms, P99: 148.452ms
-----------------------------------------------------------------
Connection Pool:
Active: 25, Max: 100, Available: 75
=================================================================
=================================================================
Performance Summary:
Average Throughput: 820.04 operations/sec
Read Operations: 198683 (655.85/sec avg)
Insert Operations: 24948 (82.35/sec avg)
Update Operations: 24794 (81.84/sec avg)
Error Rate: 48.3374%
Total Data Transferred: 22.70 GB
=================================================================
=================================================================
Checking for Data Loss...
=================================================================
=================================================================
Test Duration: 5m13s
-----------------------------------------------------------------
Cumulative Statistics:
Total Operations: 248425 (Reads: 198683, Inserts: 24948, Updates: 24794)
Total Number of Rows (Reads: 19868300, Inserts: 2494800, Updates: 24794)
Total Errors: 232435
Total Data Transferred: 23243.14 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 0.00 (Reads: 0.00/s, Inserts: 0.00/s, Updates: 0.00/s)
Throughput: 0.00 MB/s
Errors/sec: 0.00
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 16.922ms, P95: 52.069ms, P99: 317.142ms
Inserts - Avg: 44.849ms, P95: 130.692ms, P99: 211.022ms
Updates - Avg: 19.201ms, P95: 66.474ms, P99: 148.452ms
-----------------------------------------------------------------
Connection Pool:
Active: 14, Max: 100, Available: 86
=================================================================
I0407 02:37:53.535684 1 load_generator_v2.go:555] Total records in table: 2544800
=================================================================
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 2544800
I0407 02:37:53.535709 1 load_generator_v2.go:556] totalRows in LoadGenerator: 2544800
Records Found in DB: 2544800
Records Lost: 0
Data Loss Percentage: 0.00%
=================================================================
No data loss detected - all inserted records are present in database
We inserted around 23 GB in 5 minutes. No data loss detected.
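A quick sanity check on the Performance Summary above: the reported Error Rate of 48.3374% matches errors divided by total attempts (successful operations plus errors), which suggests how the load generator counts attempts (an inference from the numbers, not documented behavior):

```shell
# Error Rate = errors / (successful ops + errors), inferred from the report
awk 'BEGIN { printf "%.4f%%\n", 232435 * 100 / (248425 + 232435) }'
# ≈ 48.3374%, matching the reported Error Rate
```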
Clean up the chaos experiment.
kubectl delete -f tests/15-io-attr-override.yaml
iochaos.chaos-mesh.org "pg-primary-io-attr-override" deleted
Chaos#15: IO Mistake
In this experiment, chaos-mesh will inject IO mistakes (random corruption of disk writes). We expect the database to be in the Ready state after the chaos is recovered. If you are using the force failover API, your database will stay up even while the chaos is running, but this increases the chance of some data loss (if write operations are in flight during the failover process).
Just to remind you, we are using the forceFailoverAcceptingDataLossAfter API for IO related chaos.
Save this yaml as tests/16-io-mistake.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
name: pg-primary-io-mistake
namespace: chaos-mesh
spec:
action: mistake
mode: one
selector:
namespaces:
- demo
labelSelectors:
"app.kubernetes.io/instance": "pg-ha-cluster"
"kubedb.com/role": "primary"
volumePath: /var/pv
path: /var/pv/data/**/*
mistake:
filling: random
maxOccurrences: 10
maxLength: 100
percent: 50
duration: "5m"
containerNames:
- postgres
What this chaos does: Randomly injects garbage data (random bytes) into file operations on 50% of disk writes, corrupting the data stored on disk.
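The `filling: random` behaviour can be sketched locally: splice some random bytes into a file of known content and observe that the size is unchanged but the content no longer matches (illustration only, no cluster involved):

```shell
f=$(mktemp); g=$(mktemp)
printf 'A%.0s' $(seq 1 4096) > "$f"   # 4 KiB of known bytes
cp "$f" "$g"                          # pristine copy for comparison
# splice up to 100 random bytes at an arbitrary offset (cf. maxLength: 100)
dd if=/dev/urandom of="$f" bs=1 seek=128 count=100 conv=notrunc status=none
stat -c %s "$f"                       # prints 4096: size is unchanged
cmp -s "$f" "$g" || echo "content corrupted"
rm -f "$f" "$g"
```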
Let’s check the database state.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 08:57:53 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 12h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 12h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 12h
pod/pg-ha-cluster-1 2/2 Running 0 12h
pod/pg-ha-cluster-2 2/2 Running 0 12h
Running the load generation job.
➤ ./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config unchanged
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
Let’s check the primary.
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-0
➤ kubectl exec -it -n demo pg-ha-cluster-0 -- bash
Defaulted container "postgres" out of: postgres, pg-coordinator, postgres-init-container (init)
pg-ha-cluster-0:/$ psql
psql (16.4)
Type "help" for help.
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f
(1 row)
Let’s apply the experiment.
➤ kubectl apply -f tests/16-io-mistake.yaml
iochaos.chaos-mesh.org/pg-primary-io-mistake created
Keep watching the database.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 08:59:26 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 NotReady 12h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 12h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 12h
pod/pg-ha-cluster-1 2/2 Running 0 12h
pod/pg-ha-cluster-2 2/2 Running 0 12h
pod/pg-load-test-job-b56q6 1/1 Running 0 75s
The database went into the NotReady state and should shortly move to the Critical state after a forced failover, as we used the forceFailoverAcceptingDataLossAfter API.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 09:00:32 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Critical 12h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 12h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 12h
pod/pg-ha-cluster-1 2/2 Running 0 12h
pod/pg-ha-cluster-2 2/2 Running 0 12h
pod/pg-load-test-job-b56q6 1/1 Running 0 2m21s
The database is now in the Critical state.
➤ kubectl get iochaos -n chaos-mesh pg-primary-io-mistake -oyaml
status:
conditions:
- status: "True"
type: Selected
- status: "False"
type: AllInjected
- status: "True"
type: AllRecovered
- status: "False"
type: Paused
All the chaos has been recovered.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 09:04:55 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 12h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 12h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 12h
pod/pg-ha-cluster-1 2/2 Running 0 12h
pod/pg-ha-cluster-2 2/2 Running 0 12h
pod/pg-load-test-job-b56q6 0/1 Completed 0 6m44s
Database back in Ready state.
...
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 2537700
I0407 03:03:27.372254 1 load_generator_v2.go:556] totalRows in LoadGenerator: 2537700
Records Found in DB: 2537700
Records Lost: 0
Data Loss Percentage: 0.00%
=================================================================
No data loss detected - all inserted records are present in database
No data loss.
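The generator’s data-loss check boils down to comparing inserted vs. found row counts. A minimal sketch of the percentage calculation (hypothetical helper, not part of the actual load generator):

```shell
# Hypothetical helper mirroring the Data Loss Percentage calculation
data_loss_pct() {
  inserted=$1 found=$2
  awk -v i="$inserted" -v f="$found" 'BEGIN { printf "%.2f", (i - f) * 100 / i }'
}
data_loss_pct 2537700 2537700   # prints 0.00
```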
Cleanup:
➤ kubectl delete -f tests/16-io-mistake.yaml
iochaos.chaos-mesh.org "pg-primary-io-mistake" deleted
IO Chaos Tests Without Force Failover
We have seen small data losses in chaos tests with the forceFailoverAcceptingDataLossAfter: 30s API, so we will now try the same chaos, but without this API.
Now save this yaml as setup/pg-ha-cluster.yaml:
apiVersion: kubedb.com/v1
kind: Postgres
metadata:
name: pg-ha-cluster
namespace: demo
spec:
clientAuthMode: md5
deletionPolicy: Delete
podTemplate:
spec:
containers:
- name: postgres
resources:
limits:
memory: 3Gi
requests:
cpu: 2
memory: 2Gi
replicas: 3
replication:
walKeepSize: 3000
walLimitPolicy: WALKeepSize
standbyMode: Hot
storage:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
storageType: Durable
version: "16.4"
Now apply this yaml: kubectl apply -f setup/pg-ha-cluster.yaml.
Watch the resources come up and the database get Ready.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Wed Apr 8 10:15:14 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 68s
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 63s
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 63s
pod/pg-ha-cluster-1 2/2 Running 0 56s
pod/pg-ha-cluster-2 2/2 Running 0 48s
Let’s see who the primary is.
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-0
-----
➤ kubectl exec -it -n demo pg-ha-cluster-0 -- bash
Defaulted container "postgres" out of: postgres, pg-coordinator, postgres-init-container (init)
pg-ha-cluster-0:/$ psql
psql (16.4)
Type "help" for help.
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f
(1 row)
Let’s run the load generation job.
➤ ./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config unchanged
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
Apply the io-latency chaos experiment.
➤ kubectl apply -f tests/13-io-latency.yaml
iochaos.chaos-mesh.org/pg-primary-io-latency created
Now watch the database state.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Wed Apr 8 10:20:38 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 NotReady 6m32s
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 6m27s
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 6m27s
pod/pg-ha-cluster-1 2/2 Running 0 6m20s
pod/pg-ha-cluster-2 2/2 Running 0 6m12s
pod/pg-load-test-job-p7vvw 1/1 Running 0 80s
You should see that the database stays in the NotReady state the whole time. The reasons behind that:
- The primary database is up and running, but as IO latency increased, new connection attempts are timing out.
- All existing connections to the primary keep working fine.
- The primary postgres process is working fine, which is why no failover is triggered.
- So new connections were not possible during this test, and since we did not use force failover, no failover was performed.
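The connection-timeout symptom can be probed directly with pg_isready, which ships in the postgres image (a sketch; host, port and timeout values are illustrative):

```shell
# Report whether a fresh connection to the given pod's postgres succeeds
check_new_connections() {
  if kubectl exec -n demo "$1" -c postgres -- \
      pg_isready -h localhost -p 5432 -t 3 >/dev/null 2>&1; then
    echo "new connections OK"
  else
    echo "new connections failing"
  fi
}
check_new_connections pg-ha-cluster-0
```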
Let’s check the chaos status.
➤ kubectl get iochaos -n chaos-mesh pg-primary-io-latency -oyaml
status:
conditions:
- status: "True"
type: Selected
- status: "False"
type: AllInjected
- status: "True"
type: AllRecovered
- status: "False"
type: Paused
Now the chaos is recovered and our database should eventually reach Ready state.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Wed Apr 8 10:33:12 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 19m
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 19m
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 53s
pod/pg-ha-cluster-1 2/2 Running 0 18m
pod/pg-ha-cluster-2 2/2 Running 0 18m
pod/pg-load-test-job-p7vvw 0/1 Completed 0 13m
The database reached the Ready state.
Final Results:
=================================================================
Test Duration: 6m10s
-----------------------------------------------------------------
Cumulative Statistics:
Total Operations: 72980 (Reads: 58363, Inserts: 7250, Updates: 7367)
Total Number of Rows (Reads: 5836300, Inserts: 725000, Updates: 7367)
Total Errors: 17
Total Data Transferred: 6820.02 MB
-----------------------------------------------------------------
Current Throughput (interval):
Operations/sec: 0.00 (Reads: 0.00/s, Inserts: 0.00/s, Updates: 0.00/s)
Throughput: 0.00 MB/s
Errors/sec: 0.36
-----------------------------------------------------------------
Latency Statistics:
Reads - Avg: 42.497ms, P95: 35.26ms, P99: 48.91ms
Inserts - Avg: 53.254ms, P95: 109.955ms, P99: 266.726ms
Updates - Avg: 93.204ms, P95: 91.333ms, P99: 260.728ms
-----------------------------------------------------------------
Connection Pool:
Active: 14, Max: 100, Available: 86
=================================================================
=================================================================
Performance Summary:
Average Throughput: 197.22 operations/sec
Read Operations: 58363 (157.72/sec avg)
Insert Operations: 7250 (19.59/sec avg)
Update Operations: 7367 (19.91/sec avg)
Error Rate: 0.0233%
Total Data Transferred: 6.66 GB
=================================================================
=================================================================
Checking for Data Loss...
=================================================================
Error getting connection stats: failed to get current connections: pq: canceling statement due to user request
Error getting connection stats: failed to get max_connections: context deadline exceeded
I0408 04:30:19.253637 1 load_generator_v2.go:555] Total records in table: 775000
I0408 04:30:19.253658 1 load_generator_v2.go:556] totalRows in LoadGenerator: 775000
=================================================================
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 775000
Records Found in DB: 775000
Records Lost: 0
Data Loss Percentage: 0.00%
=================================================================
No data loss detected - all inserted records are present in database
From the load generation job, we can see that less data was inserted, as the database was unavailable. But more importantly, no data loss was recorded.
Similarly, you can try the other chaos experiments. You should find no data loss in each of the IO chaos cases.
Cleanup:
➤ kubectl delete -f tests/13-io-latency.yaml
iochaos.chaos-mesh.org "pg-primary-io-latency" deleted
Misc Chaos Tests
Chaos#16: Node Reboot | Stress CPU memory
We will perform these experiments one after another here. We will not run load tests for some of them.
Save this yaml as tests/17-node-reboot.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: pg-cluster-all-pods-kill
namespace: chaos-mesh
spec:
action: pod-kill
mode: all
selector:
namespaces:
- demo
labelSelectors:
"app.kubernetes.io/instance": "pg-ha-cluster"
gracePeriod: 0
duration: "30s"
What this chaos does: Simultaneously kills all PostgreSQL pods in the cluster, simulating a complete node failure where all replicas restart at once.
This simulates a typical node failure scenario where all the pods restart. Let’s check the database state before applying the chaos.
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 09:31:47 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 13h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 13h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 13h
pod/pg-ha-cluster-1 2/2 Running 0 13h
pod/pg-ha-cluster-2 2/2 Running 0 13h
Let’s apply the experiment.
kubectl apply -f tests/17-node-reboot.yaml
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 09:32:12 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Critical 13h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 13h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 5s
pod/pg-ha-cluster-1 2/2 Running 0 2s
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 09:32:24 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 NotReady 13h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 13h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 16s
pod/pg-ha-cluster-1 2/2 Running 0 13s
pod/pg-ha-cluster-2 2/2 Running 0 11s
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 09:32:33 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Critical 13h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 13h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 26s
pod/pg-ha-cluster-1 2/2 Running 0 23s
pod/pg-ha-cluster-2 2/2 Running 0 21s
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 09:32:40 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 13h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 13h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 32s
pod/pg-ha-cluster-1 2/2 Running 0 29s
pod/pg-ha-cluster-2 2/2 Running 0 27s
So the database is back in the Ready state within 30s of applying the chaos. Now let’s apply the next chaos, which will stress the CPU.
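That “~30s back to Ready” figure can be measured rather than eyeballed (a sketch; the `.status.phase` field name is an assumption inferred from the Ready/Critical/NotReady values kubectl prints):

```shell
# Poll the Postgres object until it reports Ready, then print elapsed seconds
wait_ready_seconds() {
  start=$(date +%s)
  until [ "$(kubectl get pg -n demo pg-ha-cluster \
      -o jsonpath='{.status.phase}' 2>/dev/null)" = "Ready" ]; do
    sleep 2
  done
  echo $(( $(date +%s) - start ))
}
```

Run `wait_ready_seconds` right after applying the chaos to time the recovery.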
Save this yaml as tests/18-stress-cpu-primary.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: pg-primary-cpu-stress
namespace: chaos-mesh
spec:
mode: one
selector:
namespaces:
- demo
labelSelectors:
"app.kubernetes.io/instance": "pg-ha-cluster"
"kubedb.com/role": "primary"
stressors:
cpu:
workers: 2
load: 90
duration: "2m"
What this chaos does: Stresses the CPU on the primary pod by running 2 CPU-intensive worker processes at 90% load, consuming system resources and potentially causing slowdowns and failover.
But before running this, we will run the load test job.
➤ ./run-k8s.sh
job.batch "pg-load-test-job" deleted
persistentvolumeclaim "pg-load-test-results" deleted
configmap/pg-load-test-config unchanged
job.batch/pg-load-test-job created
persistentvolumeclaim/pg-load-test-results created
Now let’s apply the chaos experiment.
➤ kubectl apply -f tests/18-stress-cpu-primary.yaml
stresschaos.chaos-mesh.org/pg-primary-cpu-stress created
➤ kubectl get pods -n demo --show-labels | grep primary | awk '{ print $1}'
pg-ha-cluster-1
Let’s check the CPU usage:
Every 2.0s: kubectl top pods --containers -n demo saurov-pc: Tue Apr 7 09:35:42 2026
POD NAME CPU(cores) MEMORY(bytes)
pg-ha-cluster-0 pg-coordinator 29m 40Mi
pg-ha-cluster-0 postgres 244m 621Mi
pg-ha-cluster-1 pg-coordinator 15m 38Mi
pg-ha-cluster-1 postgres 7060m 693Mi
pg-ha-cluster-2 pg-coordinator 16m 38Mi
pg-ha-cluster-2 postgres 217m 629Mi
pg-load-test-job-sfj6z load-test 1594m 216Mi
watch kubectl top pods --containers -n demo
Every 2.0s: kubectl top pods --containers -n demo saurov-pc: Tue Apr 7 09:35:58 2026
POD NAME CPU(cores) MEMORY(bytes)
pg-ha-cluster-0 pg-coordinator 29m 37Mi
pg-ha-cluster-0 postgres 272m 633Mi
pg-ha-cluster-1 pg-coordinator 15m 38Mi
pg-ha-cluster-1 postgres 8509m 941Mi
pg-ha-cluster-2 pg-coordinator 14m 39Mi
pg-ha-cluster-2 postgres 241m 657Mi
pg-load-test-job-sfj6z load-test 1256m 272Mi
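To watch only the primary’s postgres container instead of the whole table, the two commands above can be combined (a sketch relying on the `kubedb.com/role=primary` label):

```shell
# Show the resource usage line for the primary's postgres container only
top_primary() {
  p=$(kubectl get pods -n demo -l kubedb.com/role=primary \
      -o jsonpath='{.items[0].metadata.name}')
  kubectl top pod "$p" -n demo --containers | awk '$2 == "postgres"'
}
top_primary
```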
The database remained in the Ready state as there was sufficient CPU left in the cluster. However, this test case may not pass in every environment; with tighter CPU capacity, the stress could trigger a failover.
watch kubectl get pg,petset,pods -n demo
Every 2.0s: kubectl get pg,petset,pods -n demo saurov-pc: Tue Apr 7 09:36:31 2026
NAME VERSION STATUS AGE
postgres.kubedb.com/pg-ha-cluster 16.4 Ready 13h
NAME AGE
petset.apps.k8s.appscode.com/pg-ha-cluster 13h
NAME READY STATUS RESTARTS AGE
pod/pg-ha-cluster-0 2/2 Running 0 4m24s
pod/pg-ha-cluster-1 2/2 Running 0 4m21s
pod/pg-ha-cluster-2 2/2 Running 0 4m19s
pod/pg-load-test-job-sfj6z 1/1 Running 0 113s
Data Loss Report:
-----------------------------------------------------------------
Total Records Inserted: 3273100
Records Found in DB: 3273100
Records Lost: 0
Data Loss Percentage: 0.00%
=================================================================
I0407 03:40:04.019990 1 load_generator_v2.go:555] Total records in table: 3273100
I0407 03:40:04.020008 1 load_generator_v2.go:556] totalRows in LoadGenerator: 3273100
No data loss detected - all inserted records are present in database
CleanUp:
kubectl delete -f tests/17-node-reboot.yaml
podchaos.chaos-mesh.org "pg-cluster-all-pods-kill" deleted
kubectl delete -f tests/18-stress-cpu-primary.yaml
stresschaos.chaos-mesh.org "pg-primary-cpu-stress" deleted
Chaos Testing Results Summary
Test Results Overview
Below is a comprehensive summary of all chaos engineering experiments conducted on the KubeDB-managed PostgreSQL High-Availability cluster. Each metric shows results in two configurations:
- With Force Failover: Using forceFailoverAcceptingDataLossAfter: 30s
- Without Force Failover: Waiting for data consistency before failover
Note: You might see different results if you have tested under no read/write load.
| # | Experiment | Failure Mode | Failover Time | Data Loss | Downtime | Notes |
|---|---|---|---|---|---|---|
| 1 | Kill Primary Pod | Pod termination | With: ~8s / Without: ~8s | With: 0 / Without: 0 | With: Minimal / Without: Minimal | Immediate failover works in both cases |
| 2 | OOMKill Primary Pod | Memory exhaustion | With: ~3s / Without: ~3s | With: 0 / Without: 0 | With: Minimal / Without: Minimal | Rapid failover, 4.1M rows inserted |
| 3 | Kill Postgres Process | Process crash | With: ~30s / Without: ~30s+ | With: 0 / Without: 0 | With: ~30s / Without: 40s | Blocks failover to prevent data loss in both cases |
| 4 | Primary Pod Failure | Network isolation | With: ~10s / Without: ~10s | With: 0 / Without: 0 | With: Minimal / Without: Minimal | Split-brain handled well |
| 5 | Network Partition | Complete isolation | With: ~30s / Without: ~30s+ | With: ⚠️ Possible / Without: ⚠️ Possible | With: Brief / Without: Extended | Split-brain scenario, data safety challenge in both |
| 6 | Bandwidth Limit (1 Mbps) | Slow network | With: No failover / Without: No failover | With: 0 / Without: 0 | With: 0s / Without: 0s | 2.3M rows inserted, high latency tolerated |
| 7 | Network Delay (500ms) | High latency | With: No failover / Without: No failover | With: 0 / Without: 0 | With: 0s / Without: 0s | 2.5M rows inserted, consistency maintained |
| 8 | Network Loss (100%) | Packet drop | With: No failover / Without: No failover | With: 0 / Without: 0 | With: 0s / Without: 0s | 2.3M rows inserted, no data loss |
| 9 | Network Duplicate (50%) | Redundant traffic | With: No failover / Without: No failover | With: 0 / Without: 0 | With: 0s / Without: 0s | 2.2M rows inserted, gracefully handled |
| 10 | Network Corruption (50%) | Corrupted packets | With: ~15s / Without: ~15s | With: 0 / Without: 0 | With: ~30s / Without: ~30s | 2.1M rows inserted, checksums fail |
| 11 | Time Offset & DNS Error | System time shift | With: No failover / Without: No failover | With: 0 / Without: 0 | With: 0s / Without: 0s | 2.0M rows inserted |
| 12 | IO Latency | Disk I/O delay | With: ~30s / Without: No failover | With: ⚠️ ~1 insert loss / Without: 0 | With: Brief / Without: Extended | Critical difference: force failover causes ~1 insert loss |
| 13 | IO Fault (50%) | I/O errors | With: ~30s / Without: No failover | With: 0 / Without: 0 | With: Brief / Without: Extended | 2.6M rows inserted, 25GB transferred |
| 14 | IO Attribute Override | Filesystem attr change | With: ~30s / Without: No failover | With: 0 / Without: 0 | With: Brief / Without: Extended | 2.5M rows inserted, 23GB transferred |
| 15 | IO Mistake | Random I/O faults | With: ~30s / Without: No failover | With: 0 / Without: 0 | With: Brief / Without: Extended | 2.5M rows inserted, 23GB transferred |
| 16 | Node Reboot (All Pods) | Complete node failure | With: ~30s / Without: ~30s+ | With: 0 / Without: 0 | With: Extended / Without: Extended | 3.2M rows inserted, full cluster restart |
Note: Extended means as long as the chaos runs.
Key Findings
Replication Strategy Impact
| Scenario | With Force Failover (30s) | Without Force Failover | Winner |
|---|---|---|---|
| Availability | High - immediate failover | Lower - waits for consistency | |
| Data Loss Risk | Low-Medium | Zero Risk | |
| IO Chaos Tests | ⚠️ 1 insert lost (rare) | 0 insert lost | |
| Failover Time | 30 seconds or less | Variable (extended if unsafe) | |
| Use Case | High-availability priority | Data integrity priority |
Chaos Test Categories
1. Pod-Level Failures (Chaos #1-4)
- Result: Immediate failovers work well
- Data Loss: Zero in all cases
- Downtime: Minimal (< 30s recovery)
- Best Practice: Default configuration handles these excellently
2. Network Chaos (Chaos #5-11)
- Result: Cluster remains stable without failover
- Data Loss: Zero in all cases (except network partition which forces split-brain)
- Downtime: Minimal to none (connections recover automatically)
- Best Practice: PostgreSQL’s replication is resilient to network impairments
3. IO Chaos with Force Failover (Chaos #12-15, with forceFailoverAcceptingDataLossAfter)
- Result: Cluster stays highly available
- Data Loss: Minimal (~1 insert in worst case = 0.004%)
- Downtime: Minimal (automatic failover keeps cluster up)
- Trade-off: Sacrifices tiny bit of data for high availability
4. IO Chaos Without Force Failover (Chaos #12-15, without forceFailoverAcceptingDataLossAfter)
- Result: Database may enter NotReady state
- Data Loss: Zero across all tests
- Downtime: Extended (until chaos clears or manual intervention)
- Trade-off: Prioritizes data safety over availability
Replication Configuration Recommendations
Choose WITH forceFailoverAcceptingDataLossAfter: 30s if:
- Your application requires high availability
- You can tolerate rare events of < 0.01% data loss
- Your database serves real-time or customer-facing services
Choose WITHOUT forceFailoverAcceptingDataLossAfter if:
- Data integrity is critical
- You can tolerate extended downtime during node failures
- Your database serves compliance-sensitive operations
Choose streamingMode: Synchronous if:
- Data integrity is absolutely critical
- You can trade some availability and write latency for zero data loss
- Your database serves compliance-sensitive operations
Performance Metrics Summary in chaos cases
| Metric | Average | Best | Worst |
|---|---|---|---|
| Rows Inserted | 2.3M | 4.1M | 0.7M |
| Data Transferred | 21.5 GB | 25 GB | 6.6 GB |
| Failover Time | ~20 seconds | ~3 seconds | 30+ seconds |
| Data Loss (with Force Failover) | < 0.01% | 0% | 0.004% |
| Data Loss (without Force Failover) | 0% | 0% | 0% |
| Recovery Time | < 1 minute | ~30 seconds | ~5 minutes |
Important Note: All these metrics were taken during chaos experiments. KubeDB performs notably well in both chaos scenarios and normal scenarios. For example, in normal scenarios where your Kubernetes cluster is behaving normally, you should see a failover happening within ~5 seconds without any data loss, every time, and of course automatically.
Conclusion
The KubeDB-managed PostgreSQL HA cluster demonstrates excellent resilience across all tested failure scenarios.
The cluster achieves the balance between high availability and data consistency, allowing operators to choose their preferred trade-off based on business requirements.
What Next?
Please try the latest release and give us your valuable feedback.
If you want to install KubeDB, please follow the installation instruction from here .
If you want to upgrade KubeDB from a previous version, please follow the upgrade instruction from here .
Support
To speak with us, please leave a message on our website .
To receive product announcements, follow us on Twitter .
If you have found a bug with KubeDB or want to request for new features, please file an issue .