Key Takeaways
- ✓ etcd stores 100% of Kubernetes cluster state
- ✓ Use etcdctl snapshot save with ETCDCTL_API=3 for backups
- ✓ Test your restores in staging before production
etcd is the distributed key-value store that holds the entire state of a Kubernetes cluster. Backing up and restoring etcd is a critical skill for any Kubernetes administrator: without a working etcd backup, corruption or data loss means rebuilding the cluster from scratch.
TL;DR: etcd stores 100% of Kubernetes state. Snapshot at least daily. Test your restores in staging. Use etcdctl snapshot save with ETCDCTL_API=3.
Mastery of etcd is covered in the LFS458 Kubernetes Administration training.
Essential etcdctl commands
| Command | Description | Required flags |
|---|---|---|
| etcdctl snapshot save | Creates a snapshot | --endpoints, --cacert, --cert, --key |
| etcdctl snapshot restore | Restores from snapshot | --data-dir, --name |
| etcdctl snapshot status | Checks snapshot integrity | --write-out=table |
| etcdctl member list | Lists cluster members | --write-out=table |
| etcdctl endpoint health | Checks endpoint health | --cluster |
| etcdctl endpoint status | Detailed endpoint status | --cluster, --write-out=table |
Required environment variables
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
Key takeaway: Always set ETCDCTL_API=3 (it is the default since etcd 3.4, but being explicit avoids surprises on older tooling). The v2 API is deprecated and Kubernetes requires the v3 API.
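Before running any etcdctl command, it can help to confirm that all of the variables above are actually exported. A minimal sketch; the function name is illustrative, not part of etcdctl:

```shell
# Hypothetical helper: report which ETCDCTL_* variables are still unset.
check_etcd_env() {
  missing=""
  for v in ETCDCTL_API ETCDCTL_CACERT ETCDCTL_CERT ETCDCTL_KEY ETCDCTL_ENDPOINTS; do
    eval "val=\${$v:-}"
    [ -n "$val" ] || missing="$missing $v"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing" >&2
    return 1
  fi
  echo "etcdctl environment OK"
}
```

Run it once per shell session before touching snapshots; a non-zero exit tells you which variables to export.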
etcd backup: complete procedure
Manual snapshot
# Create a snapshot
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Check integrity
etcdctl snapshot status /backup/etcd-20260228-143000.db --write-out=table
Expected output:
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 3c5e8d2a |   284519 |       1847 |     5.2 MB |
+----------+----------+------------+------------+
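The save-then-verify steps above can be wrapped in a single function. A hedged sketch: the backup directory argument, the 7-day retention, and the ETCDCTL override (handy for dry runs) are assumptions, and the ETCDCTL_* variables from the previous section must still be set for a real cluster:

```shell
# Sketch: timestamped snapshot + integrity check + retention sweep.
etcd_backup() {
  backup_dir="${1:-/backup}"     # assumption: where snapshots live
  retention_days="${2:-7}"       # assumption: keep one week of snapshots
  snap="$backup_dir/etcd-$(date +%Y%m%d-%H%M%S).db"
  "${ETCDCTL:-etcdctl}" snapshot save "$snap" || return 1
  "${ETCDCTL:-etcdctl}" snapshot status "$snap" --write-out=table || return 1
  # Prune snapshots older than the retention window
  find "$backup_dir" -name 'etcd-*.db' -mtime +"$retention_days" -delete
  echo "$snap"
}
```

For example, `etcd_backup /backup 7` from cron on a control-plane node gives the same result as the manual procedure above.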
etcd backup Kubernetes CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.12-0
            command:
            - /bin/sh
            - -c
            - |
              etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db
              find /backup -mtime +7 -delete
            env:
            - name: ETCDCTL_API
              value: "3"
            - name: ETCDCTL_CACERT
              value: /etc/kubernetes/pki/etcd/ca.crt
            - name: ETCDCTL_CERT
              value: /etc/kubernetes/pki/etcd/server.crt
            - name: ETCDCTL_KEY
              value: /etc/kubernetes/pki/etcd/server.key
            - name: ETCDCTL_ENDPOINTS
              value: https://127.0.0.1:2379
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /var/backup/etcd
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
According to the CNCF Annual Survey 2025, 82% of container users run Kubernetes in production. A robust Kubernetes cluster etcd backup strategy is essential.
etcd restore: step-by-step procedure
Step 1: Stop control plane components
# On each control plane node
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Verify shutdown
sudo crictl ps | grep -E "etcd|kube-api"
Step 2: Restore the snapshot
# Delete existing etcd data
sudo rm -rf /var/lib/etcd
# Restore the snapshot into the now-empty data directory
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-20260228-143000.db \
  --data-dir=/var/lib/etcd \
  --name=control-plane-1 \
  --initial-cluster=control-plane-1=https://192.168.1.10:2380 \
  --initial-advertise-peer-urls=https://192.168.1.10:2380
# Fix permissions (only if etcd runs as the etcd user; kubeadm static pods run as root)
sudo chown -R etcd:etcd /var/lib/etcd
Step 3: Restart components
# Restore manifests
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/
# Verify the cluster
kubectl get nodes
kubectl get pods -A
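After moving the manifests back, a quick structural check on the restored data directory can catch a botched restore before the control plane comes up. A minimal sketch; the function name and the default path are assumptions:

```shell
# Sketch: a successful snapshot restore creates member/snap and member/wal.
verify_restore() {
  dir="${1:-/var/lib/etcd}"
  if [ -d "$dir/member/snap" ] && [ -d "$dir/member/wal" ]; then
    echo "restored data dir looks valid: $dir"
  else
    echo "missing member data under $dir" >&2
    return 1
  fi
}
```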
Key takeaway: Test restoration in staging before production. A snapshot restore generates new cluster and member IDs, so the restored cluster is not byte-identical to the original.
To deepen these critical procedures, consult the LFS458 Kubernetes Administration training which prepares for CKA certification.
etcd maintenance: diagnostic commands
Health check
# Health of all endpoints
etcdctl endpoint health --cluster
# Output: https://192.168.1.10:2379 is healthy: successfully committed proposal
# Detailed status
etcdctl endpoint status --cluster --write-out=table
Expected output:
+---------------------------+------------------+---------+---------+-----------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER |
+---------------------------+------------------+---------+---------+-----------+
| https://192.168.1.10:2379 | 8e9e05c52164694d | 3.5.12 | 5.2MB | true |
| https://192.168.1.11:2379 | 2d3c8a5e7b1f4c92 | 3.5.12 | 5.2MB | false |
| https://192.168.1.12:2379 | 4f6d9c8b2a1e3d70 | 3.5.12 | 5.2MB | false |
+---------------------------+------------------+---------+---------+-----------+
Defragmentation (regular maintenance)
# Check disk usage before
etcdctl endpoint status --write-out=table
# Defragment (one member at a time)
etcdctl defrag --endpoints=https://192.168.1.10:2379
# Check after
etcdctl endpoint status --write-out=table
History compaction
# Get current revision
rev=$(etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')
# Compact up to this revision
etcdctl compact $rev
# Defragment after compaction
etcdctl defrag --endpoints=https://192.168.1.10:2379
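Because defragmentation briefly blocks the member being defragmented, it should run against one endpoint at a time. A small loop keeps that discipline; the endpoint list and the ETCDCTL override are assumptions for illustration:

```shell
# Sketch: defragment members sequentially; abort on the first failure.
defrag_all() {
  for ep in "$@"; do
    echo "defragmenting $ep"
    "${ETCDCTL:-etcdctl}" defrag --endpoints="$ep" || return 1
  done
}
# Example:
# defrag_all https://192.168.1.10:2379 https://192.168.1.11:2379 https://192.168.1.12:2379
```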
Frequent errors and solutions
| Error | Cause | Solution |
|---|---|---|
| Error: context deadline exceeded | Endpoint unreachable | Check certificates and firewall ports 2379/2380 |
| Error: etcdserver: mvcc: database space exceeded | Quota reached (2 GB default) | Compact, defragment, and increase --quota-backend-bytes |
| Error: member has already been bootstrapped | Data dir not empty | Delete /var/lib/etcd before restoring |
| Error: authentication required | Missing certificates | Set ETCDCTL_CACERT, ETCDCTL_CERT, ETCDCTL_KEY |
| raft: stopped | Majority lost (quorum) | Restore from snapshot onto a new cluster |
Key takeaway: etcd quorum requires (n/2)+1 members. A 3-node cluster tolerates 1 failure. A 5-node cluster tolerates 2 failures.
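The quorum arithmetic above can be sketched directly (helper names are illustrative):

```shell
# floor(n/2)+1 votes are needed; the remaining members can fail without
# losing the cluster.
quorum() { echo $(( $1 / 2 + 1 )); }
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 3 5 7; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(tolerated_failures "$n")"
done
```

Note that an even member count adds no fault tolerance: 4 members still tolerate only 1 failure, which is why 3- and 5-node clusters are the norm.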
Backup/restore checklist for Kubernetes system administrators preparing for CKS certification
# ✅ BEFORE backup
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster --write-out=table
# ✅ BACKUP
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db
etcdctl snapshot status /backup/etcd-*.db --write-out=table
# ✅ VALIDATION
ls -la /backup/etcd-*.db
etcdctl snapshot status <file> --write-out=table
# ✅ RESTORE (test in staging)
etcdctl snapshot restore <file> --data-dir=/tmp/etcd-test
ls -la /tmp/etcd-test/member/
# ✅ AFTER restore
kubectl get nodes
kubectl get pods -A
kubectl get --raw='/readyz?verbose'   # kubectl get cs (componentstatuses) is deprecated
According to the Linux Foundation, CKA certification requires a 66% score and lasts 2 hours. etcd operations represent a significant part of the exam.
Additional resources
To go further in Kubernetes Cluster Administration, consult:
- kubectl cheatsheet: essential commands for Kubernetes cluster administration
- Complete guide: install a multi-node Kubernetes cluster with kubeadm
- Kubernetes Cluster Administration Training in Paris
- Kubernetes Training Thematic Map
- Kubernetes Training: Complete Guide
Next steps: certifications and training
According to the CNCF Training Report, more than 104,000 professionals have taken the CKA exam (49% growth in one year). Mastery of etcd is essential for Kubernetes infrastructure engineers preparing for CKS certification.
Recommended training (for more depth, see our Kubernetes cluster administration training):
| Training | Duration | Certification prepared |
|---|---|---|
| LFS458 Kubernetes Administration | 4 days | CKA |
| LFS460 Kubernetes Security Essentials | 4 days | CKS |
| Kubernetes Fundamentals | 1 day | Discovery |
Contact our advisors to plan your Kubernetes certification path.