
Kubernetes High Availability: Configure a Resilient Production Cluster

SFEIR Institute

Key Takeaways

  • ✓ Minimum 3 control planes required for a production HA cluster
  • ✓ 94% reduction in downtime with HA architecture (CNCF 2025)
  • ✓ etcd must be distributed on dedicated nodes with API Server load balancing

Kubernetes high availability (HA) is the ability of a cluster to maintain operational services despite individual component failures. For any Kubernetes Cloud Operations Engineer, mastering this architecture is a critical skill: according to Dynatrace's State of Kubernetes 2025 report (source), 78% of organizations now run critical workloads on Kubernetes, making resilience non-negotiable.

TL;DR: Configure an HA cluster with a minimum of 3 control planes, etcd distributed across dedicated nodes, API Server load balancing, and PodDisruptionBudgets. You'll reduce your downtime by 94% according to CNCF 2025 data.

This topic is at the heart of the LFS458 Kubernetes Administration training.

What is Kubernetes Cluster High Availability?

Kubernetes cluster high availability is an architecture where every critical component has redundant replicas, eliminating any single point of failure (SPOF). You must understand this definition before implementing: an HA cluster ensures that the loss of a node, pod, or control plane component doesn't interrupt your services.

The pillars of Kubernetes HA:

| Component          | HA Configuration    | Minimum Recommended |
|--------------------|---------------------|---------------------|
| API Server         | Load balanced       | 3 instances         |
| etcd               | Distributed cluster | 3 or 5 nodes        |
| Controller Manager | Leader election     | 3 instances         |
| Scheduler          | Leader election     | 3 instances         |
| Worker nodes       | Multi-AZ            | 3+ per zone         |

Remember: An HA cluster requires an odd number of etcd nodes (3 or 5) to maintain quorum during leader elections.
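
The recommendation follows from Raft's majority-quorum arithmetic, which is worth seeing once. A quick illustration (standard Raft math, not specific to this article):

```python
# Fault tolerance of an etcd cluster with n members (Raft majority quorum).
def etcd_fault_tolerance(n: int) -> int:
    quorum = n // 2 + 1   # majority needed to elect a leader and commit writes
    return n - quorum     # members that can fail while quorum still holds

for n in (3, 4, 5):
    print(f"{n} members -> tolerates {etcd_fault_tolerance(n)} failure(s)")
```

Clusters of 3 and 4 members both tolerate a single failure, which is why an even-sized cluster adds cost without adding resilience.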

To explore these concepts further, consult our guide on Kubernetes control plane architecture.

Why Must Kubernetes Cloud Operations Engineers Master HA?

As a Kubernetes Cloud Operations Engineer, you're responsible for your clusters' SLA. The business stakes are considerable: Gartner estimates the average cost of one hour of IT downtime at $300,000 in 2025 (source).

You must anticipate three types of failures:

  1. Hardware failures: server, disk, or network outages
  2. Software failures: process crashes, OOM, bugs
  3. Operational failures: configuration errors, failed updates

A Kubernetes Infrastructure Engineer who neglects HA exposes their organization to costly outages. According to the CNCF Annual Survey 2025, organizations with HA clusters report 94% less unplanned downtime.
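
To make availability targets concrete, it helps to translate an SLA percentage into allowed downtime. A quick illustration (the SLA values below are generic examples, not figures from the surveys cited above):

```python
# Allowed downtime per 30-day month for a given availability target.
def monthly_downtime_minutes(sla_percent: float) -> float:
    month_minutes = 30 * 24 * 60  # 43,200 minutes in a 30-day month
    return month_minutes * (1 - sla_percent / 100)

for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% -> {monthly_downtime_minutes(sla):.1f} min/month")
```

Each additional "nine" divides your downtime budget by ten, which is exactly the margin a redundant control plane buys you.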

How to Configure etcd in High Availability?

etcd is the key-value database that stores your cluster's complete state. Its availability determines the availability of all of Kubernetes. You must configure it with particular care.

Deploy etcd on dedicated nodes:

# etcd-cluster.yaml (static pod for node etcd-0; certs follow standard kubeadm paths)
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: etcd
    image: registry.k8s.io/etcd:3.5.12-0
    command:
    - etcd
    - --name=etcd-0
    - --initial-cluster=etcd-0=https://10.0.1.10:2380,etcd-1=https://10.0.1.11:2380,etcd-2=https://10.0.1.12:2380
    - --initial-cluster-state=new
    - --initial-advertise-peer-urls=https://10.0.1.10:2380
    - --listen-peer-urls=https://10.0.1.10:2380
    - --listen-client-urls=https://10.0.1.10:2379,https://127.0.0.1:2379
    - --advertise-client-urls=https://10.0.1.10:2379
    - --data-dir=/var/lib/etcd
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    volumeMounts:
    - name: etcd-data
      mountPath: /var/lib/etcd
    - name: etcd-certs
      mountPath: /etc/kubernetes/pki/etcd
      readOnly: true
  volumes:
  - name: etcd-data
    hostPath:
      path: /var/lib/etcd
  - name: etcd-certs
    hostPath:
      path: /etc/kubernetes/pki/etcd

Check your etcd cluster health:

etcdctl endpoint health --cluster \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key

Remember: Configure automatic etcd snapshots every hour. You'll be able to restore your cluster in less than 5 minutes in case of data corruption.
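
One way to automate hourly snapshots is a CronJob pinned to the control plane nodes. A sketch assuming a kubeadm-style layout with certs under /etc/kubernetes/pki/etcd; the job name, backup path, and node labels are illustrative:

```yaml
# Hypothetical hourly etcd backup CronJob (names and paths are placeholders)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 * * * *"          # every hour
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true      # reach etcd on 127.0.0.1:2379
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.12-0
            command:
            - /bin/sh
            - -c
            - >
              etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db
              --endpoints=https://127.0.0.1:2379
              --cacert=/etc/kubernetes/pki/etcd/ca.crt
              --cert=/etc/kubernetes/pki/etcd/server.crt
              --key=/etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /var/lib/etcd-backup
```

Rotate old snapshots and copy them off the node: a backup stored only on the etcd member it protects is not a backup.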

How to Deploy Redundant API Servers?

The Kubernetes API Server is the entry point for all interactions with your cluster. You must deploy it in high availability behind a load balancer.

Recommended architecture in 2026:

            ┌─────────────────┐
            │  Load Balancer  │
            │   (HAProxy/LB)  │
            └────────┬────────┘
                     │
   ┌─────────────────┼─────────────────┐
   │                 │                 │
┌──┴──────────┐   ┌──┴──────────┐   ┌──┴──────────┐
│ API Server 1│   │ API Server 2│   │ API Server 3│
│  (Node 1)   │   │  (Node 2)   │   │  (Node 3)   │
└─────────────┘   └─────────────┘   └─────────────┘

Configure HAProxy as load balancer:

# /etc/haproxy/haproxy.cfg
frontend kubernetes-api
    bind *:6443
    mode tcp
    option tcplog
    default_backend kubernetes-api-backend

backend kubernetes-api-backend
    mode tcp
    balance roundrobin
    option tcp-check
    server master1 10.0.1.10:6443 check fall 3 rise 2
    server master2 10.0.1.11:6443 check fall 3 rise 2
    server master3 10.0.1.12:6443 check fall 3 rise 2
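
A single HAProxy would simply move the SPOF one hop up. A common pattern is to run two HAProxy instances sharing a floating virtual IP managed by keepalived; a minimal sketch in which the interface name, router ID, and VIP are placeholders:

```conf
# /etc/keepalived/keepalived.conf on the primary HAProxy node
vrrp_script chk_haproxy {
    script "pidof haproxy"   # fail over if the HAProxy process dies
    interval 2
}

vrrp_instance K8S_API {
    state MASTER             # use BACKUP with a lower priority on the second node
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1
    virtual_ipaddress {
        10.0.1.100/24        # VIP that kubelets and kubectl target on :6443
    }
    track_script {
        chk_haproxy
    }
}
```

Point the cluster's controlPlaneEndpoint at the VIP so an HAProxy failure is invisible to clients.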

To understand these components in detail, consult our article on Kubernetes cluster administration.

What Are Kubernetes HA Best Practices for Workloads?

Configuring an HA control plane isn't enough. You must also ensure the resilience of your applications. Kubernetes HA best practices cover several aspects.

1. Use PodDisruptionBudgets (PDB):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server

2. Configure pod anti-affinity:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: api-server
        topologyKey: kubernetes.io/hostname

3. Spread your pods across multiple zones:

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api-server

These configurations ensure that your critical pods remain available even during planned maintenance. Learn to update a Kubernetes cluster without service interruption.

How Does a Kubernetes Infrastructure Engineer Configure HA Storage?

Persistent storage is often the weak point of HA architectures. You must select replicated storage solutions.

HA storage solutions in 2026:

| Solution  | Replication | Latency  | Use Case             |
|-----------|-------------|----------|----------------------|
| Rook-Ceph | 3x minimum  | Medium   | Block/object storage |
| Longhorn  | 2-3x        | Low      | Edge, small clusters |
| Portworx  | 2-3x        | Very low | Enterprise production |
| OpenEBS   | 2-3x        | Variable | Cloud-native         |

Example StorageClass with replication:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-ha
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
reclaimPolicy: Retain
allowVolumeExpansion: true

Remember: Always configure reclaimPolicy: Retain for your critical volumes. You'll avoid accidental data loss when deleting PVCs.

How to Monitor Your HA Cluster Health?

Proactive monitoring is a pillar of Kubernetes HA best practices. You must detect problems before they impact your users.

Critical metrics to monitor:

# Essential Prometheus alerts
groups:
- name: kubernetes-ha
  rules:
  - alert: EtcdMembersDown
    expr: count(etcd_server_has_leader) < 3
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "etcd cluster degraded"

  - alert: APIServerLatencyHigh
    expr: histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m])) > 1
    for: 10m
    labels:
      severity: warning

Quick diagnostic commands:

# Check control plane component status
kubectl get --raw='/healthz?verbose'

# Check nodes
kubectl get nodes -o wide

# Check system pods
kubectl get pods -n kube-system -o wide

To explore these techniques further, consult our guide on diagnosing and resolving network issues in a Kubernetes cluster.

How to Manage Updates Without Interruption?

Updates are a critical moment for availability. As a Kubernetes Cloud Operations Engineer, you must plan each upgrade meticulously.

Recommended HA update process:

  1. Back up etcd before any operation
  2. Update one control plane at a time
  3. Validate health before moving to the next
  4. Cordon and drain workers progressively

# etcd backup
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db

# Update a control plane node
kubeadm upgrade apply v1.30.0

# Drain a worker
kubectl drain node-worker-1 --ignore-daemonsets --delete-emptydir-data

The LFS458 Kubernetes Administration training covers update procedures in production environments in detail.

What Anti-Patterns Should You Avoid for High Availability?

Some errors silently compromise your HA architecture. You must identify and correct them.

Anti-pattern 1: etcd on the same nodes as workloads

A pod consuming too many resources can impact etcd and cause cluster-wide timeouts.
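
The usual fix is to dedicate nodes to etcd via taints, so only pods carrying a matching toleration can schedule there. A sketch in which the taint key/value and node label are illustrative, not standard names:

```yaml
# Taint applied once per etcd node, for example:
#   kubectl taint nodes etcd-node-1 dedicated=etcd:NoSchedule
# Then only pods with this toleration (plus a matching nodeSelector) land there:
tolerations:
- key: dedicated
  operator: Equal
  value: etcd
  effect: NoSchedule
nodeSelector:
  node-role.kubernetes.io/etcd: ""   # illustrative label you apply to etcd nodes
```

Combined with resource requests on etcd itself, this keeps noisy workloads from starving the datastore.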

Anti-pattern 2: No PodDisruptionBudget

Without a PDB, kubectl drain can evict all replicas of a workload simultaneously, taking the service down during routine maintenance.

Anti-pattern 3: Ignoring resource limits

# ❌ Bad practice
resources: {}

# ✅ Good practice
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

Anti-pattern 4: Network single point of failure

Choose a CNI whose control components run redundantly (Calico, Cilium) and avoid single network paths between your nodes.

Consult our guide to resolve the 10 most common problems on a Kubernetes cluster.

How to Test Your Cluster's Resilience?

You cannot guarantee HA without testing it regularly. Chaos engineering validates your configurations.

Chaos engineering tools for Kubernetes:

  • Chaos Mesh: native Kubernetes fault injection
  • Litmus: predefined chaos scenarios
  • Gremlin: enterprise platform

Example test with Chaos Mesh:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-test
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
    - production
    labelSelectors:
      app: api-server
  scheduler:            # Chaos Mesh 1.x syntax; in 2.x, recurring runs use a separate Schedule resource
    cron: "@every 1h"

To discover the fundamentals before implementing HA, explore our page Kubernetes fundamentals for beginners.

How to Secure Your HA Architecture?

High availability and security are inseparable. A security flaw can compromise your HA. You must apply the defense in depth principle.

Secure etcd communications:

# Generate TLS certificates for etcd
kubeadm init phase certs etcd-ca
kubeadm init phase certs etcd-server
kubeadm init phase certs etcd-peer

Enable API Server auditing:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: RequestResponse
  resources:
  - group: ""
    resources: ["secrets", "configmaps"]

Consult our guide to secure a Kubernetes cluster and our Kubernetes Training: Complete Guide.

Remember: Encrypt etcd data at rest with --encryption-provider-config. You'll protect your secrets even in case of storage compromise.
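
A minimal encryption configuration might look like the sketch below; the file path is illustrative and the AES key is a placeholder you generate yourself (for example with head -c 32 /dev/urandom | base64):

```yaml
# /etc/kubernetes/enc/encryption.yaml — referenced by the kube-apiserver flag
# --encryption-provider-config=/etc/kubernetes/enc/encryption.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>
  - identity: {}   # fallback so data written before encryption stays readable
```

After enabling it, rewrite existing secrets (kubectl get secrets -A -o json | kubectl replace -f -) so they are stored encrypted.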

Take Action: Train Your Teams in Kubernetes HA

Kubernetes high availability requires advanced skills that you'll develop through guided practice. SFEIR Institute offers certifying training for every level.

Recommended training:

Contact our advisors to define the path suited to your teams: Request a personalized quote.