Key Takeaways
- 23% of Kubernetes production incidents are related to CrashLoopBackOff (Komodor, State of Kubernetes 2024)
- kubectl logs --previous displays the logs of the previous, crashed container
- Exponential backoff: Kubernetes applies increasing delays between restart attempts
Debugging a pod stuck in CrashLoopBackOff is one of the most in-demand Kubernetes troubleshooting skills. According to Komodor's State of Kubernetes 2024 report, CrashLoopBackOff accounts for 23% of production incidents. This guide details the causes, a diagnostic methodology, and a solution for each scenario. These are techniques any backend developer or software engineer must master to keep applications stable.
TL;DR: CrashLoopBackOff means the container starts, crashes, and Kubernetes restarts it with exponential backoff. The main causes are an application error, missing configuration, insufficient resources, or an image problem. Use kubectl describe and kubectl logs --previous to diagnose.
To master Kubernetes troubleshooting, follow the LFS458 Kubernetes Administration training.
What Exactly is CrashLoopBackOff?
CrashLoopBackOff is a pod state indicating that the main container crashes repeatedly. Kubernetes applies an exponential restart delay (backoff) between attempts: 10s, 20s, 40s, up to a maximum of 5 minutes.
This technical definition hides a frustrating operational reality: the pod never runs long enough to be debugged from inside.
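The doubling pattern is easy to visualize with a few lines of shell. This is only a sketch of the schedule described above (10s base, doubling, 5-minute cap), not kubelet code:

```shell
# CrashLoopBackOff delay: starts at 10s, doubles after each restart, capped at 300s.
# Prints 10s, 20s, 40s, 80s, 160s, then 300s from the sixth restart onward.
delay=10
for attempt in 1 2 3 4 5 6 7; do
  echo "restart $attempt: back off ${delay}s"
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```

This is why a pod that has been crashing for a while can sit idle for up to five minutes between attempts.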
```bash
# Identify pods in CrashLoopBackOff
kubectl get pods -A | grep CrashLoopBackOff

# Example output
NAMESPACE    NAME                       READY   STATUS             RESTARTS   AGE
production   checkout-7d4b5c6f9-x2k4n   0/1     CrashLoopBackOff   15         12m
```
Key takeaway: The RESTARTS counter indicates the number of restarts. A high number (>10) suggests a persistent problem requiring thorough investigation.
How to Debug a Pod in CrashLoopBackOff: Methodology
Troubleshooting a pod that keeps restarting follows a systematic three-step approach: collect basic information, examine the previous container's logs, and analyze namespace events.
Step 1: Collect Basic Information
```bash
# Complete pod details
kubectl describe pod checkout-7d4b5c6f9-x2k4n -n production

# Key points to examine in the output:
# - Events (end of output)
# - State / Last State
# - Exit Code
# - Reason
```
The exit code often reveals the cause:
| Exit Code | Meaning | Probable Cause |
|---|---|---|
| 0 | Success | Container terminated normally (not expected for a server) |
| 1 | Application error | Unhandled exception, config error |
| 137 | SIGKILL (OOM) | Memory limit exceeded |
| 139 | SIGSEGV | Segmentation fault |
| 143 | SIGTERM | Graceful termination failed |
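Codes above 128 follow the POSIX shell convention exit code = 128 + signal number. A quick local sketch (no cluster needed) reproduces the 137 reported for an OOM-killed container:

```shell
# A process killed by SIGKILL (signal 9) exits with 128 + 9 = 137,
# the same code Kubernetes reports for an OOMKilled container.
sleep 30 &
PID=$!
kill -9 "$PID"                 # stand-in for the kernel OOM killer
wait "$PID" && CODE=$? || CODE=$?
echo "exit=$CODE signal=$((CODE - 128))"   # exit=137 signal=9
```

The same arithmetic explains 139 (128 + 11, SIGSEGV) and 143 (128 + 15, SIGTERM).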
Step 2: Examine Previous Container Logs
```bash
# Logs from the previous crash
kubectl logs checkout-7d4b5c6f9-x2k4n -n production --previous

# If the pod has multiple containers
kubectl logs checkout-7d4b5c6f9-x2k4n -n production -c main --previous
```
This command retrieves logs from the container before its crash, essential for understanding the error.
Step 3: Analyze Namespace Events
```bash
# Events sorted by timestamp
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
```
Events reveal scheduling problems, image pulls, or volume mounting issues.
For a global monitoring vision, see the Monitoring and Troubleshooting Kubernetes module.
Main Causes of CrashLoopBackOff and Their Solutions
Cause 1: Application Error at Startup
The container starts but the application crashes immediately. This is the most common cause (45% of cases according to Komodor).
Symptoms:
```
Exit Code: 1
Reason: Error
```
Diagnosis:
```bash
# Application logs
kubectl logs checkout-7d4b5c6f9-x2k4n --previous

# Example output
Error: Cannot connect to database at postgres:5432
```
Solutions:
```yaml
# 1. Add an init container to wait for dependencies
initContainers:
- name: wait-for-db
  image: busybox:1.36
  command: ['sh', '-c', 'until nc -z postgres 5432; do sleep 2; done']

# 2. Configure readiness/liveness probes correctly
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
```
Cause 2: Missing Configuration (ConfigMap/Secret)
The container tries to read an environment variable or configuration file that doesn't exist.
Symptoms:
```
State: Waiting
Reason: CreateContainerConfigError
```
Diagnosis:
```bash
# Check referenced ConfigMaps and environment variables
kubectl describe pod checkout-7d4b5c6f9-x2k4n | grep -A5 "Environment"

# Verify the ConfigMap exists
kubectl get configmap checkout-config -n production
```
Solutions:
# Make variable optional
```yaml
# Make the variable optional
env:
- name: DATABASE_URL
  valueFrom:
    configMapKeyRef:
      name: checkout-config
      key: database-url
      optional: true  # pod starts even if the key is absent
```
Key takeaway: Use optional: true for non-critical configurations. Validate required configurations in an init container.
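A sketch of that validation pattern, reusing the ConfigMap name and key from the example above (the image tag is illustrative): the key stays optional so container creation succeeds, and the init container fails with an explicit message instead of an opaque CreateContainerConfigError.

```yaml
initContainers:
- name: check-config
  image: busybox:1.36
  # Fail fast with a readable error if the required key is missing
  command: ['sh', '-c', 'test -n "$DATABASE_URL" || { echo "FATAL: DATABASE_URL missing in checkout-config"; exit 1; }']
  env:
  - name: DATABASE_URL
    valueFrom:
      configMapKeyRef:
        name: checkout-config
        key: database-url
        optional: true
```

The failure then shows up in kubectl logs for the init container, which is far easier to diagnose.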
Cause 3: OOMKilled (Memory Exceeded)
The container exceeds its memory limit and is killed by the kernel.
Symptoms:
Exit Code: 137
Reason: OOMKilled
Last State: Terminated
Diagnosis:
```bash
# Check current memory consumption (requires metrics-server)
kubectl top pod checkout-7d4b5c6f9-x2k4n --containers

# Compare with the configured limits
kubectl get pod checkout-7d4b5c6f9-x2k4n -o jsonpath='{.spec.containers[0].resources}'
```
Solutions:
```yaml
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"  # increase if necessary
```
For a detailed guide, see Resolve OOMKilled errors.
Cause 4: Container Image Problem
The image cannot be pulled or the entrypoint is incorrect.
Symptoms:
State: Waiting
Reason: ImagePullBackOff
# or
Reason: CrashLoopBackOff with Exit Code: 127 (command not found)
Diagnosis:
```bash
# Check pull events
kubectl describe pod checkout-7d4b5c6f9-x2k4n | grep -A3 "Events"

# Test the image locally
docker run --rm myregistry/checkout:v1.2.3 /bin/sh -c "echo test"
```
Solutions:
```yaml
# Check imagePullSecrets
imagePullSecrets:
- name: registry-credentials

# Fix the command/entrypoint
command: ["/app/checkout"]  # absolute path
args: ["--port=8080"]
```
Cause 5: Misconfigured Probes
Liveness probes kill the container before it's ready.
Symptoms:
```
Events:
  Liveness probe failed: connection refused
  Container checkout-container failed liveness probe, will be restarted
```
Diagnosis: If your application takes 30 seconds to start, your liveness probe must not fire before those 30 seconds have elapsed. Overly aggressive probes are a leading cause of self-inflicted CrashLoopBackOff.
```bash
# Check probe timing
kubectl get pod checkout-7d4b5c6f9-x2k4n -o yaml | grep -A10 livenessProbe
```
Solutions:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60  # wait for startup
  periodSeconds: 10
  failureThreshold: 3      # 3 failures before restart
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5   # starts sooner than liveness
  periodSeconds: 5
```
Key takeaway: readinessProbe should be faster than livenessProbe. Start with conservative values then optimize.
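As a sanity check, the worst-case delay before the kubelet restarts an unhealthy container is roughly initialDelaySeconds + failureThreshold × periodSeconds. With the values shown above:

```shell
# Values from the livenessProbe example above
INITIAL_DELAY=60
PERIOD=10
FAILURES=3
echo "liveness restart after ~$((INITIAL_DELAY + FAILURES * PERIOD))s"   # ~90s
```

If your application can legitimately take longer than that to respond on /health, raise the threshold or period before blaming the application.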
Advanced Kubernetes Debugging Techniques
Using kubectl debug (Kubernetes 1.25+)
Ephemeral containers allow attaching a debug container to a running or crashed pod.
```bash
# Attach an ephemeral debug container to the target container
kubectl debug -it checkout-7d4b5c6f9-x2k4n --image=busybox:1.36 --target=checkout

# Debug with network tools
kubectl debug -it checkout-7d4b5c6f9-x2k4n --image=nicolaka/netshoot
```
Copy Pod for Debugging
```bash
# Create a copy of the pod with a modified command
kubectl debug checkout-7d4b5c6f9-x2k4n -it --copy-to=checkout-debug \
  --container=checkout -- /bin/sh

# The debug pod remains active for investigation
```
Examine Container Runtime Logs
```bash
# On the node (requires SSH access)
crictl logs <container-id>

# Find the container ID
kubectl get pod checkout-7d4b5c6f9-x2k4n -o jsonpath='{.status.containerStatuses[0].containerID}'
```
Quick Troubleshooting Checklist
Use this checklist for systematic diagnosis:
```bash
#!/bin/bash
# debug-crashloop.sh <pod-name> [namespace]
POD=$1
NS=${2:-default}

echo "=== 1. Pod State ==="
kubectl get pod "$POD" -n "$NS"

echo "=== 2. Description ==="
kubectl describe pod "$POD" -n "$NS" | tail -30

echo "=== 3. Previous Logs ==="
kubectl logs "$POD" -n "$NS" --previous --tail=50 2>/dev/null || echo "No previous logs"

echo "=== 4. Events ==="
kubectl get events -n "$NS" --field-selector involvedObject.name="$POD"

echo "=== 5. Resources ==="
kubectl top pod "$POD" -n "$NS" --containers 2>/dev/null || echo "Metrics not available"
```
Also see the guide Resolve Kubernetes deployment failures for a complementary approach.
Preventing CrashLoopBackOff in Production
Configuration Best Practices
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: checkout
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: checkout
        image: myregistry/checkout:v1.2.3
        # Explicit resources
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        # Well-calibrated probes
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          failureThreshold: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          failureThreshold: 3
```
Key takeaway: startupProbe (GA since Kubernetes 1.20) replaces a long initialDelaySeconds for slow-starting applications: with failureThreshold: 30 and periodSeconds: 10, the application gets up to 300 seconds to start, and liveness checks only begin once the startup probe succeeds.
Proactive Monitoring
Configure alerts before the problem affects users:
```yaml
# PrometheusRule excerpt
- alert: PodCrashLooping
  expr: |
    increase(kube_pod_container_status_restarts_total[1h]) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.pod }} in CrashLoop"
```
The Kubernetes observability checklist in production details these configurations.
Network Issues Causing Crashes
Network issues can cause CrashLoopBackOff indirectly: the application times out waiting on a dependency, then crashes.
Symptoms:
- Logs showing connection timeouts
- Exit code 1 after delay
Diagnosis:
```bash
# From a throwaway debug pod
kubectl run debug --rm -it --image=nicolaka/netshoot -- /bin/bash

# Network tests (inside the debug pod)
nslookup kubernetes.default
curl -v http://checkout-service.production.svc.cluster.local:8080/health
```
See the guide Network problems diagnosis and resolution for more detail.
When to Escalate and Ask for Help
Some CrashLoopBackOff situations require advanced expertise:
- Exit code 139 (SIGSEGV): memory bug in application, requires profiling
- Intermittent problems: may indicate race conditions or node issues
- After cluster update: possible API incompatibilities
Kubernetes deployment and production covers rollback strategies for problematic deployments.
Trainings to Master Kubernetes Troubleshooting
Troubleshooting pod errors and restarts is a key skill evaluated in the CKA and CKAD certifications.
To develop your debugging expertise:
- LFS458 Kubernetes Administration: advanced troubleshooting for administrators (4 days, CKA preparation)
- LFD459 Kubernetes for Developers: application debugging and logs (3 days, CKAD preparation)
- Kubernetes Fundamentals: introduction to pod debugging (1 day)