Key Takeaways
- ✓ etcd stores 100% of Kubernetes cluster state
- ✓ Use etcdctl snapshot save with ETCDCTL_API=3 for backups
- ✓ Test your restores in staging before production
etcd is the distributed key-value store that holds the entire state of a Kubernetes cluster. Backing up and restoring etcd is a critical skill for any Kubernetes administrator: without a working etcd backup, corruption or data loss means rebuilding the cluster from scratch.
TL;DR: etcd stores 100% of Kubernetes state. Snapshot at least daily. Test your restores in staging. Use etcdctl snapshot save with ETCDCTL_API=3.
Mastery of etcd is covered in the LFS458 Kubernetes Administration training.
Essential etcdctl commands
| Command | Description | Required flags |
|---|---|---|
| etcdctl snapshot save | Creates a snapshot | --endpoints, --cacert, --cert, --key |
| etcdctl snapshot restore | Restores from snapshot | --data-dir, --name |
| etcdctl snapshot status | Checks snapshot integrity | --write-out=table |
| etcdctl member list | Lists cluster members | --write-out=table |
| etcdctl endpoint health | Checks endpoint health | --cluster |
| etcdctl endpoint status | Detailed endpoint status | --cluster, --write-out=table |
Required environment variables
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
Key takeaway: Always set ETCDCTL_API=3 (it is the default since etcd 3.4, but being explicit avoids surprises on older tooling). The v2 API is deprecated and Kubernetes requires the v3 API.
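Before running any etcdctl command, it can help to confirm that all of the variables above are actually exported. A minimal sketch; the function name is illustrative, not part of etcdctl:

```shell
# Hypothetical helper: report which ETCDCTL_* variables are still unset.
check_etcd_env() {
  missing=""
  for v in ETCDCTL_API ETCDCTL_CACERT ETCDCTL_CERT ETCDCTL_KEY ETCDCTL_ENDPOINTS; do
    eval "val=\${$v:-}"
    [ -n "$val" ] || missing="$missing $v"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing" >&2
    return 1
  fi
  echo "etcdctl environment OK"
}
```

Run it once per shell session before touching snapshots; a non-zero exit tells you which variables to export.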
etcd backup: complete procedure
Manual snapshot
# Create a snapshot
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Check integrity
etcdctl snapshot status /backup/etcd-20260228-143000.db --write-out=table
Expected output:
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 3c5e8d2a |   284519 |       1847 |     5.2 MB |
+----------+----------+------------+------------+
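The save-then-verify steps above can be wrapped in a single function. A hedged sketch: the backup directory argument, the 7-day retention, and the ETCDCTL override (handy for dry runs) are assumptions, and the ETCDCTL_* variables from the previous section must still be set for a real cluster:

```shell
# Sketch: timestamped snapshot + integrity check + retention sweep.
etcd_backup() {
  backup_dir="${1:-/backup}"     # assumption: where snapshots live
  retention_days="${2:-7}"       # assumption: keep one week of snapshots
  snap="$backup_dir/etcd-$(date +%Y%m%d-%H%M%S).db"
  "${ETCDCTL:-etcdctl}" snapshot save "$snap" || return 1
  "${ETCDCTL:-etcdctl}" snapshot status "$snap" --write-out=table || return 1
  # Prune snapshots older than the retention window
  find "$backup_dir" -name 'etcd-*.db' -mtime +"$retention_days" -delete
  echo "$snap"
}
```

For example, `etcd_backup /backup 7` from cron on a control-plane node gives the same result as the manual procedure above.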
etcd backup Kubernetes CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.12-0
            command:
            - /bin/sh
            - -c
            - |
              etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db
              find /backup -mtime +7 -delete
            env:
            - name: ETCDCTL_API
              value: "3"
            - name: ETCDCTL_CACERT
              value: /etc/kubernetes/pki/etcd/ca.crt
            - name: ETCDCTL_CERT
              value: /etc/kubernetes/pki/etcd/server.crt
            - name: ETCDCTL_KEY
              value: /etc/kubernetes/pki/etcd/server.key
            - name: ETCDCTL_ENDPOINTS
              value: https://127.0.0.1:2379
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /var/backup/etcd
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
According to the CNCF Annual Survey 2025, 82% of container users run Kubernetes in production. A robust Kubernetes cluster etcd backup strategy is essential.
etcd restore: step-by-step procedure
Step 1: Stop control plane components
# On each control plane node
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Verify shutdown
sudo crictl ps | grep -E "etcd|kube-api"
Step 2: Restore the snapshot
# Delete existing etcd data
sudo rm -rf /var/lib/etcd
# Restore the snapshot into the now-empty data directory
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-20260228-143000.db \
  --data-dir=/var/lib/etcd \
  --name=control-plane-1 \
  --initial-cluster=control-plane-1=https://192.168.1.10:2380 \
  --initial-advertise-peer-urls=https://192.168.1.10:2380
# Fix permissions (only if etcd runs as the etcd user; kubeadm static pods run as root)
sudo chown -R etcd:etcd /var/lib/etcd
Step 3: Restart components
# Restore manifests
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/
# Verify the cluster
kubectl get nodes
kubectl get pods -A
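After moving the manifests back, a quick structural check on the restored data directory can catch a botched restore before the control plane comes up. A minimal sketch; the function name and the default path are assumptions:

```shell
# Sketch: a successful snapshot restore creates member/snap and member/wal.
verify_restore() {
  dir="${1:-/var/lib/etcd}"
  if [ -d "$dir/member/snap" ] && [ -d "$dir/member/wal" ]; then
    echo "restored data dir looks valid: $dir"
  else
    echo "missing member data under $dir" >&2
    return 1
  fi
}
```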
Key takeaway: Test restoration in staging before production. A snapshot restore generates new cluster and member IDs, so the restored cluster is not byte-identical to the original.
To deepen these critical procedures, consult the LFS458 Kubernetes Administration training which prepares for CKA certification.
etcd maintenance: diagnostic commands
Health check
# Health of all endpoints
etcdctl endpoint health --cluster
# Output: https://192.168.1.10:2379 is healthy: successfully committed proposal
# Detailed status
etcdctl endpoint status --cluster --write-out=table
Expected output:
+---------------------------+------------------+---------+---------+-----------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER |
+---------------------------+------------------+---------+---------+-----------+
| https://192.168.1.10:2379 | 8e9e05c52164694d | 3.5.12 | 5.2MB | true |
| https://192.168.1.11:2379 | 2d3c8a5e7b1f4c92 | 3.5.12 | 5.2MB | false |
| https://192.168.1.12:2379 | 4f6d9c8b2a1e3d70 | 3.5.12 | 5.2MB | false |
+---------------------------+------------------+---------+---------+-----------+
Defragmentation (regular maintenance)
# Check disk usage before
etcdctl endpoint status --write-out=table
# Defragment (one member at a time)
etcdctl defrag --endpoints=https://192.168.1.10:2379
# Check after
etcdctl endpoint status --write-out=table
History compaction
# Get current revision
rev=$(etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')
# Compact up to this revision
etcdctl compact $rev
# Defragment after compaction
etcdctl defrag --endpoints=https://192.168.1.10:2379
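Because defragmentation briefly blocks the member being defragmented, it should run against one endpoint at a time. A small loop keeps that discipline; the endpoint list and the ETCDCTL override are assumptions for illustration:

```shell
# Sketch: defragment members sequentially; abort on the first failure.
defrag_all() {
  for ep in "$@"; do
    echo "defragmenting $ep"
    "${ETCDCTL:-etcdctl}" defrag --endpoints="$ep" || return 1
  done
}
# Example:
# defrag_all https://192.168.1.10:2379 https://192.168.1.11:2379 https://192.168.1.12:2379
```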
Frequent errors and solutions
| Error | Cause | Solution |
|---|---|---|
| Error: context deadline exceeded | Endpoint unreachable | Check certificates and firewall ports 2379/2380 |
| Error: etcdserver: mvcc: database space exceeded | Quota reached (2 GB default) | Compact, defragment, and increase --quota-backend-bytes |
| Error: member has already been bootstrapped | Data dir not empty | Delete /var/lib/etcd before restoring |
| Error: authentication required | Missing certificates | Set ETCDCTL_CACERT, ETCDCTL_CERT, ETCDCTL_KEY |
| raft: stopped | Majority lost (quorum) | Restore from snapshot onto a new cluster |
Key takeaway: etcd quorum requires (n/2)+1 members. A 3-node cluster tolerates 1 failure. A 5-node cluster tolerates 2 failures.
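The quorum arithmetic above can be sketched directly (helper names are illustrative):

```shell
# floor(n/2)+1 votes are needed; the remaining members can fail without
# losing the cluster.
quorum() { echo $(( $1 / 2 + 1 )); }
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 3 5 7; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(tolerated_failures "$n")"
done
```

Note that an even member count adds no fault tolerance: 4 members still tolerate only 1 failure, which is why 3- and 5-node clusters are the norm.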
Backup/restore checklist for Kubernetes system administrators preparing for CKS certification
# ✅ BEFORE backup
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster --write-out=table
# ✅ BACKUP
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db
etcdctl snapshot status /backup/etcd-*.db --write-out=table
# ✅ VALIDATION
ls -la /backup/etcd-*.db
etcdctl snapshot status <file> --write-out=table
# ✅ RESTORE (test in staging)
etcdctl snapshot restore <file> --data-dir=/tmp/etcd-test
ls -la /tmp/etcd-test/member/
# ✅ AFTER restore
kubectl get nodes
kubectl get pods -A
kubectl get --raw='/readyz?verbose'   # kubectl get cs (componentstatuses) is deprecated
According to the Linux Foundation, CKA certification requires a 66% score and lasts 2 hours. etcd operations represent a significant part of the exam.
Additional resources
To go further in Kubernetes Cluster Administration, consult:
- kubectl cheatsheet: essential commands for Kubernetes cluster administration
- Complete guide: install a multi-node Kubernetes cluster with kubeadm
- Kubernetes Cluster Administration Training in Paris
- Kubernetes Training Thematic Map
- Kubernetes Training: Complete Guide
Next steps: certifications and training
According to the CNCF Training Report, more than 104,000 professionals have taken the CKA exam (49% growth in one year). Mastery of etcd is essential for Kubernetes infrastructure engineers preparing for CKS certification.
Recommended training (for more depth, see our Kubernetes cluster administration training):
| Training | Duration | Certification prepared |
|---|---|---|
| LFS458 Kubernetes Administration | 4 days | CKA |
| LFS460 Kubernetes Security Essentials | 4 days | CKS |
| Kubernetes Fundamentals | 1 day | Discovery |
Contact our advisors to plan your Kubernetes certification path.