
etcd Cheatsheet: Backup, Restore, and Cluster Maintenance

SFEIR Institute

Key Takeaways

  • etcd stores 100% of Kubernetes cluster state
  • Use etcdctl snapshot save with ETCDCTL_API=3 for backups
  • Test your restores in staging before production

etcd is the distributed key-value store that holds the entire state of a Kubernetes cluster. Backing up and restoring etcd is therefore a critical skill for any Kubernetes administrator: without a working etcd backup, corruption or data loss means rebuilding the cluster from scratch.

TL;DR: etcd stores 100% of Kubernetes state. Snapshot at least daily, test your restores in staging, and use etcdctl snapshot save with ETCDCTL_API=3.

Mastery of etcd is covered in the LFS458 Kubernetes Administration training.


Essential etcdctl commands

| Command | Description | Required flags |
|---|---|---|
| etcdctl snapshot save | Creates a snapshot | --endpoints, --cacert, --cert, --key |
| etcdctl snapshot restore | Restores from a snapshot | --data-dir, --name |
| etcdctl snapshot status | Checks snapshot integrity | --write-out=table |
| etcdctl member list | Lists cluster members | --write-out=table |
| etcdctl endpoint health | Checks endpoint health | --cluster |
| etcdctl endpoint status | Detailed endpoint status | --cluster --write-out=table |
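When the environment variables from the next section are not exported, the TLS flags must accompany every command. A hypothetical wrapper function (the name `etcdctl_k8s` is an assumption; the paths are the kubeadm defaults used throughout this cheatsheet) keeps invocations short:

```shell
# Hypothetical convenience wrapper: injects the kubeadm-default TLS flags so
# any command from the table can be run as e.g. `etcdctl_k8s member list`.
etcdctl_k8s() {
  etcdctl \
    --endpoints="${ETCDCTL_ENDPOINTS:-https://127.0.0.1:2379}" \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    "$@"
}
```

This is just argument plumbing; it changes nothing about what etcdctl does.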

Required environment variables

export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379

Key takeaway: Always set ETCDCTL_API=3 (it is the default since etcd 3.4, but being explicit protects you on older tooling). The v2 API is deprecated, and Kubernetes data is only readable through the v3 API.

etcd backup: complete procedure

Manual snapshot

# Create a snapshot
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Check integrity
etcdctl snapshot status /backup/etcd-20260228-143000.db --write-out=table

Expected output:

+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 3c5e8d2a |   284519 |       1847 |     5.2 MB |
+----------+----------+------------+------------+
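Beyond etcdctl snapshot status, a cheap sanity check catches truncated or empty snapshot files before they are shipped off-node. A minimal sketch; the 1 MB floor is an assumption, tune it to your cluster's typical snapshot size:

```shell
# Sanity-check a snapshot file: it must exist and exceed a minimum size.
# min_bytes=1048576 (1 MB) is an assumed floor, not an etcd requirement.
check_snapshot() {
  file=$1
  min_bytes=${2:-1048576}
  [ -f "$file" ] || { echo "FAIL: missing $file"; return 1; }
  size=$(wc -c < "$file")
  if [ "$size" -lt "$min_bytes" ]; then
    echo "FAIL: $file is only $size bytes"
    return 1
  fi
  echo "OK: $file ($size bytes)"
}
```

Run it right after `snapshot save`, before rotating older backups away.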

Automated etcd backup with a Kubernetes CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.12-0
            command:
            - /bin/sh
            - -c
            - |
              etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db
              find /backup -mtime +7 -delete
            env:
            - name: ETCDCTL_API
              value: "3"
            # Without these, etcdctl cannot authenticate to etcd.
            # The certificates are mounted read-only from the host below.
            - name: ETCDCTL_CACERT
              value: /etc/kubernetes/pki/etcd/ca.crt
            - name: ETCDCTL_CERT
              value: /etc/kubernetes/pki/etcd/server.crt
            - name: ETCDCTL_KEY
              value: /etc/kubernetes/pki/etcd/server.key
            - name: ETCDCTL_ENDPOINTS
              value: https://127.0.0.1:2379
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /var/backup/etcd
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
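The `find /backup -mtime +7 -delete` line in the CronJob is the entire retention policy. Sketched as a standalone function (the directory, the `etcd-*.db` pattern, and the 7-day window mirror the CronJob but are assumptions to adapt):

```shell
# Delete snapshots older than $days in $dir. Same idea as the CronJob's
# `find /backup -mtime +7 -delete`, but restricted to etcd-*.db files so
# nothing else in the directory can be removed by accident.
prune_old_backups() {
  dir=${1:-/backup}
  days=${2:-7}
  find "$dir" -maxdepth 1 -type f -name 'etcd-*.db' -mtime +"$days" -delete
}
```

Restricting the pattern is deliberate: an unqualified `-delete` on a shared host path will happily remove unrelated files.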

According to the CNCF Annual Survey 2025, 82% of container users run Kubernetes in production. A robust Kubernetes cluster etcd backup strategy is essential.


etcd restore: step-by-step procedure

Step 1: Stop control plane components

# On each control plane node
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/

# Verify shutdown
sudo crictl ps | grep -E "etcd|kube-api"
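The four mv commands above can be wrapped in one loop so a node is never left with a partially stopped control plane. A sketch, parameterized so it can be rehearsed against scratch directories (the defaults match the paths above):

```shell
# Move control plane static-pod manifests out of the kubelet's watch
# directory; the kubelet stops each pod once its manifest disappears.
stop_control_plane() {
  src=${1:-/etc/kubernetes/manifests}
  dst=${2:-/tmp}
  for m in kube-apiserver kube-controller-manager kube-scheduler etcd; do
    [ -f "$src/$m.yaml" ] && mv "$src/$m.yaml" "$dst/" && echo "stopped: $m"
  done
  return 0
}
```

Reversing the mv (step 3 below) restarts everything, so keep the destination directory intact.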

Step 2: Restore the snapshot

# Delete existing etcd data
sudo rm -rf /var/lib/etcd

# Restore to a new directory
etcdctl snapshot restore /backup/etcd-20260228-143000.db \
  --data-dir=/var/lib/etcd \
  --name=control-plane-1 \
  --initial-cluster=control-plane-1=https://192.168.1.10:2380 \
  --initial-advertise-peer-urls=https://192.168.1.10:2380

# Fix permissions (only needed when etcd runs as a dedicated etcd user;
# kubeadm static pods run etcd as root)
sudo chown -R etcd:etcd /var/lib/etcd

Step 3: Restart components

# Restore manifests
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/

# Verify the cluster
kubectl get nodes
kubectl get pods -A

Key takeaway: Test restoration in staging before production. etcdctl snapshot restore bootstraps a brand-new cluster: member and cluster IDs change, so old members cannot rejoin.

To deepen these critical procedures, consult the LFS458 Kubernetes Administration training which prepares for CKA certification.


etcd maintenance: diagnostic commands

Health check

# Health of all endpoints
etcdctl endpoint health --cluster
# Output: https://192.168.1.10:2379 is healthy: successfully committed proposal

# Detailed status
etcdctl endpoint status --cluster --write-out=table

Expected output:

+---------------------------+------------------+---------+---------+-----------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER |
+---------------------------+------------------+---------+---------+-----------+
| https://192.168.1.10:2379 | 8e9e05c52164694d |  3.5.12 |   5.2MB |      true |
| https://192.168.1.11:2379 | 2d3c8a5e7b1f4c92 |  3.5.12 |   5.2MB |     false |
| https://192.168.1.12:2379 | 4f6d9c8b2a1e3d70 |  3.5.12 |   5.2MB |     false |
+---------------------------+------------------+---------+---------+-----------+

Defragmentation (regular maintenance)

# Check disk usage before
etcdctl endpoint status --write-out=table

# Defragment (one member at a time)
etcdctl defrag --endpoints=https://192.168.1.10:2379

# Check after
etcdctl endpoint status --write-out=table

History compaction

# Get current revision
rev=$(etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')

# Compact up to this revision
etcdctl compaction "$rev"

# Defragment after compaction
etcdctl defrag --endpoints=https://192.168.1.10:2379
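Compacting to the very latest revision, as above, discards all history; it is common to keep a window of recent revisions so watchers can resync. The arithmetic is simple (the keep=1000 window is an illustrative assumption, not an etcd default):

```shell
# Compute a compaction target: current revision minus a retention window,
# clamped so it never drops below revision 1.
compaction_target() {
  current=$1
  keep=${2:-1000}   # assumed window; tune to your watch latency needs
  target=$((current - keep))
  [ "$target" -lt 1 ] && target=1
  echo "$target"
}
# Usage sketch: etcdctl compaction "$(compaction_target "$rev" 1000)"
```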

Frequent errors and solutions

| Error | Cause | Solution |
|---|---|---|
| Error: context deadline exceeded | Endpoint unreachable | Check certificates and firewall ports 2379/2380 |
| Error: etcdserver: mvcc: database space exceeded | Quota reached (2 GB default) | Compact, defragment, and raise --quota-backend-bytes |
| Error: member has already been bootstrapped | Data dir not empty | Delete /var/lib/etcd before restoring |
| Error: authentication required | Missing certificates | Set ETCDCTL_CACERT, ETCDCTL_CERT, ETCDCTL_KEY |
| raft: stopped | Quorum (majority) lost | Restore from snapshot onto a new cluster |

Key takeaway: etcd quorum requires (n/2)+1 members. A 3-node cluster tolerates 1 failure. A 5-node cluster tolerates 2 failures.
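The quorum rule in the takeaway can be checked with two lines of shell, since integer division matches the floor(n/2)+1 formula:

```shell
# quorum(n) = floor(n/2) + 1; a cluster of n members survives
# n - quorum(n) simultaneous member failures.
quorum()          { echo $(( $1 / 2 + 1 )); }
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }
# A 3-node cluster needs quorum 2 and tolerates 1 failure;
# a 5-node cluster needs quorum 3 and tolerates 2.
```

Note that a 4-node cluster still only tolerates 1 failure, which is why odd member counts are recommended.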

Backup/restore checklist for Kubernetes system administrators preparing for CKS certification

# ✅ BEFORE backup
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster --write-out=table

# ✅ BACKUP
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db
etcdctl snapshot status /backup/etcd-*.db --write-out=table

# ✅ VALIDATION
ls -la /backup/etcd-*.db
etcdctl snapshot status <file> --write-out=table   # verify TOTAL KEYS is plausible

# ✅ RESTORE (test in staging)
etcdctl snapshot restore <file> --data-dir=/tmp/etcd-test
ls -la /tmp/etcd-test/member/

# ✅ AFTER restore
kubectl get nodes
kubectl get pods -A
kubectl get cs   # componentstatuses is deprecated since v1.19; informational only

According to the Linux Foundation, CKA certification requires a 66% score and lasts 2 hours. etcd operations represent a significant part of the exam.




Next steps: certifications and training

According to the CNCF Training Report, more than 104,000 professionals have taken the CKA exam (49% growth in one year). Mastery of etcd is essential for Kubernetes infrastructure engineers preparing for CKS certification.

Recommended training is summarized below. For more depth, consult our Kubernetes cluster administration training.

| Training | Duration | Certification prepared |
|---|---|---|
| LFS458 Kubernetes Administration | 4 days | CKA |
| LFS460 Kubernetes Security Essentials | 4 days | CKS |
| Kubernetes Fundamentals | 1 day | Discovery |

Contact our advisors to plan your Kubernetes certification path.