Key Takeaways
- ✓ IT teams spend 34 days/year on Kubernetes troubleshooting
- ✓ 60% of cluster management time is dedicated to troubleshooting
- ✓ kubectl describe, logs, and get events resolve 90% of problems
Resolving common Kubernetes deployment errors is a critical skill for any cloud operations engineer. According to Cloud Native Now, IT teams spend an average of 34 working days per year resolving Kubernetes problems.
More than 60% of cluster management time is spent on troubleshooting according to Spectro Cloud. This guide gives you the exact commands and proven solutions to diagnose and fix CrashLoopBackOff, ImagePullBackOff, and other common errors.
TL;DR: Kubernetes deployment errors follow predictable patterns. This guide covers the 7 most frequent errors with their diagnostic commands, root causes, and solutions. Master `kubectl describe`, `kubectl logs --previous`, and `kubectl get events` to resolve 90% of issues.
To master Kubernetes pod debugging under real-world conditions, follow the LFD459 Kubernetes for Application Developers training.
Quick Symptom Index
| Pod Status | Meaning | Section |
|---|---|---|
| CrashLoopBackOff | Container restarts in a loop | CrashLoopBackOff |
| ImagePullBackOff | Image not found or inaccessible | ImagePullBackOff |
| Pending | Pod not scheduled to a node | Pending |
| CreateContainerConfigError | Configuration problem | ConfigError |
| OOMKilled | Memory exceeded | OOMKilled |
| Running but unhealthy | Failing probes | Probes |
| Evicted | Node under pressure | Eviction |
Remember: 82% of container users run Kubernetes in production according to the CNCF Annual Survey 2025. Mastering troubleshooting is essential.
CrashLoopBackOff: Container Restarts in a Loop {#crashloopbackoff}
CrashLoopBackOff indicates your container starts, crashes, and Kubernetes tries to restart it with exponential backoff. This status represents 40% of Kubernetes support tickets.
Symptom
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
api-server-1 0/1 CrashLoopBackOff 5 (2m ago) 10m
Step 1: Examine Crashed Container Logs
# Current instance logs
kubectl logs api-server-1
# Previous instance logs (after crash)
kubectl logs api-server-1 --previous
# Follow logs in real-time
kubectl logs api-server-1 -f
Step 2: Analyze Pod Events
kubectl describe pod api-server-1 | grep -A20 "Events:"
Causes and Solutions
| Cause | Diagnostic Indicator | Solution |
|---|---|---|
| Application crash at startup | Stack trace in logs | Fix code, verify dependencies |
| Missing environment variable | KeyError, undefined | Add variable in Deployment |
| Missing config file | FileNotFoundError | Verify ConfigMaps and Secrets mounts |
| Port already in use | Address already in use | Modify containerPort or kill process |
| Invalid command | exec format error | Check command: and args: in spec |
# Fix example: adding a missing variable
spec:
  containers:
  - name: api
    env:
    - name: DATABASE_URL
      valueFrom:
        secretKeyRef:
          name: db-credentials
          key: url
ImagePullBackOff: Image Not Found {#imagepullbackoff}
ImagePullBackOff means Kubernetes cannot download the specified container image. This problem occurs in 25% of first deployments.
Symptom
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
web-app-1 0/1 ImagePullBackOff 0 5m
Diagnosis
# See exact error message
kubectl describe pod web-app-1 | grep -A5 "Events:"
# Check image name
kubectl get pod web-app-1 -o jsonpath='{.spec.containers[*].image}'
Causes and Solutions
| Cause | Error Message | Solution |
|---|---|---|
| Non-existent image | manifest unknown | Verify image name and tag |
| Private registry | unauthorized | Create an imagePullSecret |
| Invalid tag | tag not found | Use an existing tag (latest, v1.2.3) |
| Blocked network | connection refused | Check firewall rules |
# Create secret for private registry
kubectl create secret docker-registry regcred \
--docker-server=registry.example.com \
--docker-username=user \
--docker-password=pass \
--docker-email=user@example.com
# Reference in pod
kubectl patch serviceaccount default \
-p '{"imagePullSecrets": [{"name": "regcred"}]}'
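Instead of patching the default service account, the pull secret can also be referenced directly in the pod template. A minimal sketch, reusing the `regcred` secret created above (the image name and container name are illustrative):

```yaml
# Reference the pull secret in the Deployment's pod template
spec:
  template:
    spec:
      imagePullSecrets:
      - name: regcred
      containers:
      - name: web
        image: registry.example.com/web-app:v1.2.3
```

Referencing the secret in the pod spec keeps the dependency explicit per workload, while patching the service account applies it to every pod using that account.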
To go deeper on these techniques, see the advanced pod and container debugging guide.
Pending: Pod Not Scheduled {#pending}
A Pending pod hasn't been assigned to a node by the scheduler. The cause is usually a lack of resources or scheduling constraints that cannot be satisfied.
Diagnosis
# Identify the pending reason
kubectl describe pod my-pod | grep -A10 "Events:"
# See available resources on nodes
kubectl describe nodes | grep -A5 "Allocated resources"
# List node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
Causes and Solutions
| Cause | Event Message | Solution |
|---|---|---|
| Insufficient resources | Insufficient cpu/memory | Reduce requests or add nodes |
| Impossible NodeSelector | node(s) didn't match selector | Add label to node or modify selector |
| Untolerated taints | node(s) had taints | Add tolerations to pod |
| Unbound PVC | persistentvolumeclaim not bound | Check PV and StorageClass |
# Example: adjust requests to avoid pending
resources:
  requests:
    memory: "128Mi"  # Reduce if too high
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"
Remember: Always configure realistic requests. Overestimated requests block scheduling even if the node has real capacity.
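For the untolerated-taints case in the table above, a toleration lets the pod schedule onto the tainted node. A minimal sketch, assuming a hypothetical node tainted with `dedicated=gpu:NoSchedule`:

```yaml
# Pod spec fragment: tolerate the (hypothetical) dedicated=gpu:NoSchedule taint
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
```

Note that a toleration only permits scheduling on the tainted node; combine it with a nodeSelector or affinity rule if the pod must land there.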
CreateContainerConfigError: Configuration Problem {#createcontainerconfigerror}
CreateContainerConfigError indicates an error in container configuration before it even starts. The problem often comes from referenced Secrets or ConfigMaps.
Diagnosis
kubectl describe pod my-pod | grep -A3 "Warning"
# Verify Secret exists
kubectl get secret my-secret
# Verify key exists in Secret
kubectl get secret my-secret -o jsonpath='{.data}'
Causes and Solutions
| Cause | Solution |
|---|---|
| Non-existent Secret | Create Secret before Deployment |
| Missing key in Secret | Add key with kubectl edit secret |
| Missing referenced ConfigMap | Create required ConfigMap |
| Incorrect subPath | Verify path spelling |
# Create missing secret
kubectl create secret generic app-secret \
--from-literal=API_KEY=abc123
# Check references in deployment
kubectl get deployment my-app -o yaml | grep -A5 "secretKeyRef"
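When a missing key should not block container startup, Kubernetes allows a Secret or ConfigMap reference to be marked as optional. A sketch, reusing the `app-secret` created above:

```yaml
# Env fragment: the pod starts even if the secret or key is absent
env:
- name: API_KEY
  valueFrom:
    secretKeyRef:
      name: app-secret
      key: API_KEY
      optional: true
```

Use this sparingly: it trades a fast, visible CreateContainerConfigError for a runtime failure if the application actually needs the value.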
ConfigMaps and Secrets management is covered in detail in our Kubernetes Application Development guides.
OOMKilled: Memory Exceeded {#oomkilled}
OOMKilled means the container exceeded its memory limit and was killed by the Linux kernel's OOM killer. This is a protection that prevents the entire node from becoming unstable.
Diagnosis
# See termination reason
kubectl describe pod my-pod | grep -A3 "Last State"
# See restart history
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState}'
# Monitor memory consumption
kubectl top pod my-pod
Solution
# Increase memory limit
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"  # Double if OOMKilled is frequent
# Analyze consumption before increasing
kubectl exec my-pod -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
Remember: Never set `limits.memory` too close to `requests.memory`. Leave roughly a 50% margin for consumption peaks.
Liveness and Readiness Probes: Silent Failures {#probes-liveness-and-readiness}
Failing probes cause subtle behaviors: a failing liveness probe kills the container, while a failing readiness probe removes the pod from the Service endpoints without killing it.
Diagnosis
# See probe failures
kubectl describe pod my-pod | grep -E "(Liveness|Readiness)"
# Manually test endpoint
kubectl exec my-pod -- curl -s localhost:8080/health
kubectl exec my-pod -- wget -qO- localhost:8080/ready
Common Errors
| Problem | Symptom | Solution |
|---|---|---|
| initialDelaySeconds too short | Container killed at startup | Increase to 30-60s |
| timeoutSeconds too short | Intermittent failures | Change from 1s to 5s |
| Incorrect port | Connection refused | Check containerPort |
| Incorrect path | 404 Not Found | Fix endpoint path |
# Robust probe configuration
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
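For slow-starting applications, a `startupProbe` is often cleaner than inflating `initialDelaySeconds`: liveness and readiness checks are suspended until the startup probe succeeds. A sketch reusing the `/health` endpoint above (the threshold values are illustrative):

```yaml
# Allows up to periodSeconds * failureThreshold = 300s for startup
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30
```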
Eviction: Pod Expelled from Node {#eviction}
Eviction occurs when a node runs out of critical resources. The kubelet expels pods to protect the node.
Diagnosis
# See node conditions
kubectl describe node my-node | grep -A5 "Conditions:"
# See evicted pods
kubectl get pods --field-selector=status.phase=Failed | grep Evicted
# Clean up evicted pods
kubectl delete pods --field-selector=status.phase=Failed
| Condition | Default Threshold | Cause |
|---|---|---|
| MemoryPressure | < 100Mi available | Pods consuming too much |
| DiskPressure | < 10% free space | Logs or images too large |
| PIDPressure | < 100 free PIDs | Too many processes |
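The thresholds in the table are kubelet defaults and can be tuned through the kubelet configuration file. A sketch of the relevant fields (the values shown are illustrative, not recommendations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
  pid.available: "100"
```

Raising these thresholds makes eviction more aggressive but protects node stability; lowering them delays eviction at the risk of kernel-level OOM kills.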
Universal Diagnostic Toolkit
# Essential commands for all debugging
kubectl get pods -o wide # Extended view with node
kubectl describe pod <pod> # Complete details
kubectl logs <pod> --previous # Previous crash logs
kubectl get events --sort-by=.lastTimestamp # Recent events
kubectl exec -it <pod> -- /bin/sh # Shell in container
kubectl top pods # Resource consumption
Quick Diagnostic Script
#!/bin/bash
# Usage: ./diagnose.sh <pod-name>
POD="$1"
[ -z "$POD" ] && { echo "Usage: $0 <pod-name>" >&2; exit 1; }
echo "=== Status ==="
kubectl get pod "$POD" -o wide
echo "=== Events ==="
kubectl describe pod "$POD" | grep -A15 "Events:"
echo "=== Logs (last 50 lines) ==="
kubectl logs "$POD" --tail=50
echo "=== Previous logs ==="
kubectl logs "$POD" --previous --tail=20 2>/dev/null || echo "No previous logs"
Prevention: Avoid Errors Before Deployment
Validate your manifests before applying. 70% of organizations use Kubernetes with Helm according to Orca Security 2025. Use validation tools.
# Validate YAML syntax
kubectl apply --dry-run=client -f deployment.yaml
# Server-side validation (detects more errors)
kubectl apply --dry-run=server -f deployment.yaml
# With Helm
helm template my-release ./chart | kubectl apply --dry-run=server -f -
To go further with best practices, explore the differences between Helm and Kustomize and cloud-native development patterns.
Remember: Systematically test with `--dry-run=server` before every production deployment. This mode detects configuration errors that `--dry-run=client` misses, since validation runs against the API server.
Training to Master Kubernetes Troubleshooting
As a CTO highlights in the Spectro Cloud State of Kubernetes 2025: "Just given the capabilities that exist with Kubernetes, and the company's desire to consume more AI tools, we will use Kubernetes more in future."
The Kubernetes Fundamentals training lets you discover debugging basics in 1 day. For complete mastery, the LFD459 Kubernetes for Developers training covers advanced troubleshooting over 3 days and prepares for CKAD certification. Infrastructure engineers preparing for CKA will find in-depth techniques in the LFS458 Kubernetes Administration training. For more, check our Kubernetes Application Development enterprise training for CTOs leading Cloud-Native transformation.
Check our Kubernetes training complete guide to identify the path suited to your profile, or discover training for system administrators and Kubernetes security challenges.