
Debug a Pod in CrashLoopBackOff on Kubernetes: Causes and Solutions

SFEIR Institute

Key Takeaways

  • 23% of Kubernetes production incidents are related to CrashLoopBackOff (Komodor 2024)
  • kubectl logs --previous displays logs from the previous crashed container
  • Exponential backoff: increasing delays between restart attempts

Debugging a pod in CrashLoopBackOff is one of the most in-demand Kubernetes troubleshooting skills. According to Komodor's State of Kubernetes 2024 report, CrashLoopBackOff accounts for 23% of production incidents. This guide details the causes, a diagnostic methodology, and solutions for each scenario. A backend developer or software engineer must master these techniques to keep applications stable.

TL;DR: CrashLoopBackOff means the container starts, crashes, and Kubernetes tries to restart it with exponential backoff. Main causes are: application error, missing configuration, insufficient resources, or image problem. Use kubectl describe and kubectl logs --previous to diagnose.

To master Kubernetes troubleshooting, follow the LFS458 Kubernetes Administration training.

What Exactly is CrashLoopBackOff?

CrashLoopBackOff is a pod state indicating that the main container crashes repeatedly. Kubernetes applies an exponential restart delay (backoff) between attempts: 10s, 20s, 40s, up to a maximum of 5 minutes.
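The schedule can be sketched as a delay that doubles after each crash, capped at five minutes. This is an illustrative loop, not Kubernetes code:

```shell
# Sketch of the restart backoff schedule: delay doubles per crash, capped at 300s
delay=10
for attempt in 1 2 3 4 5 6; do
  echo "restart attempt ${attempt}: wait ${delay}s"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
# prints waits of 10s, 20s, 40s, 80s, 160s, 300s
```

Note that kubelet resets this backoff once a container has run successfully for a while, so a pod that crashes only occasionally may never reach the 5-minute cap.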

This technical definition hides a frustrating operational reality: the pod never runs long enough to be debugged from inside.

# Identify pods in CrashLoopBackOff
kubectl get pods -A | grep CrashLoopBackOff

# Example output
NAMESPACE   NAME                      READY   STATUS             RESTARTS   AGE
production  checkout-7d4b5c6f9-x2k4n  0/1     CrashLoopBackOff   15         12m

Key takeaway: The RESTARTS counter indicates the number of restarts. A high number (>10) suggests a persistent problem requiring thorough investigation.

How to Debug a Pod in CrashLoopBackOff: Methodology

Troubleshooting a pod stuck in a restart loop follows a systematic three-step approach.

Step 1: Collect Basic Information

# Complete pod details
kubectl describe pod checkout-7d4b5c6f9-x2k4n -n production

# Key points to examine in output:
# - Events (end of output)
# - State / Last State
# - Exit Code
# - Reason

The exit code often reveals the cause:

Exit Code | Meaning           | Probable Cause
0         | Success           | Container terminated normally (not expected for a server)
1         | Application error | Unhandled exception, config error
137       | SIGKILL (OOM)     | Memory limit exceeded
139       | SIGSEGV           | Segmentation fault
143       | SIGTERM           | Graceful termination failed
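Exit codes above 128 follow a convention: the container was killed by a signal, and the signal number is the exit code minus 128. A quick check:

```shell
# Exit codes > 128 encode the killing signal: signal = exit_code - 128
exit_code=137
signal=$(( exit_code - 128 ))
echo "exit ${exit_code} => signal ${signal}"   # 9 = SIGKILL, typical of an OOM kill
```

The same arithmetic maps 139 to signal 11 (SIGSEGV) and 143 to signal 15 (SIGTERM).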

Step 2: Examine Previous Container Logs

# Previous crash logs
kubectl logs checkout-7d4b5c6f9-x2k4n -n production --previous

# If multiple containers
kubectl logs checkout-7d4b5c6f9-x2k4n -n production -c main --previous

This command retrieves logs from the container before its crash, essential for understanding the error.

Step 3: Analyze Namespace Events

# Events sorted by timestamp
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20

Events reveal scheduling problems, image pulls, or volume mounting issues.

For a global monitoring vision, see the Monitoring and Troubleshooting Kubernetes module.

Main Causes and Solutions for a Pod in CrashLoopBackOff

Cause 1: Application Error at Startup

The container starts but the application crashes immediately. This is the most common cause (45% of cases according to Komodor).

Symptoms:

Exit Code: 1
Reason: Error

Diagnosis:

# Application logs
kubectl logs checkout-7d4b5c6f9-x2k4n --previous

# Example output
Error: Cannot connect to database at postgres:5432

Solutions:

# 1. Add an init container to wait for dependencies
initContainers:
- name: wait-for-db
  image: busybox:1.36
  command: ['sh', '-c', 'until nc -z postgres 5432; do sleep 2; done']

# 2. Configure readiness/liveness probes correctly
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

Cause 2: Missing Configuration (ConfigMap/Secret)

The container tries to read an environment variable or configuration file that doesn't exist.

Symptoms:

State: Waiting
Reason: CreateContainerConfigError

Diagnosis:

# Check referenced ConfigMaps
kubectl describe pod checkout-7d4b5c6f9-x2k4n | grep -A5 "Environment"

# Verify ConfigMap exists
kubectl get configmap checkout-config -n production

Solutions:

# Make variable optional
env:
- name: DATABASE_URL
  valueFrom:
    configMapKeyRef:
      name: checkout-config
      key: database-url
      optional: true  # Pod starts even if absent

Key takeaway: Use optional: true for non-critical configurations. Validate required configurations in an init container.
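Validating required configuration in an init container can look like the following minimal sketch, assuming DATABASE_URL is the variable the application cannot start without:

```
# Fail fast, with a clear message, if a required variable is absent
initContainers:
- name: check-config
  image: busybox:1.36
  command: ['sh', '-c', 'test -n "$DATABASE_URL" || { echo "DATABASE_URL missing"; exit 1; }']
  env:
  - name: DATABASE_URL
    valueFrom:
      configMapKeyRef:
        name: checkout-config
        key: database-url
        optional: true
```

The pod still ends up failing, but the init container's log states exactly which key is missing instead of leaving you to reverse-engineer an application stack trace.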

Cause 3: OOMKilled (Memory Exceeded)

The container exceeds its memory limit and is killed by the kernel.

Symptoms:

Exit Code: 137
Reason: OOMKilled
Last State: Terminated

Diagnosis:

# Check memory consumption before crash
kubectl top pod checkout-7d4b5c6f9-x2k4n --containers

# Compare with limits
kubectl get pod checkout-7d4b5c6f9-x2k4n -o jsonpath='{.spec.containers[0].resources}'

Solutions:

resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"  # Increase if necessary
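To raise the limit without editing the full manifest, a strategic-merge patch file can be applied to the Deployment. This is a sketch, assuming the container is named checkout as in the running example:

```
# patch-memory.yaml — strategic merge patch raising only the memory limit
spec:
  template:
    spec:
      containers:
      - name: checkout
        resources:
          limits:
            memory: "1Gi"
```

Apply it with kubectl patch deployment checkout -n production --patch-file patch-memory.yaml, which triggers a rolling restart with the new limit.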

For a detailed guide, see Resolve OOMKilled errors.

Cause 4: Container Image Problem

The image cannot be pulled or the entrypoint is incorrect.

Symptoms:

State: Waiting
Reason: ImagePullBackOff
# or
Reason: CrashLoopBackOff with Exit Code: 127 (command not found)

Diagnosis:

# Check pull events
kubectl describe pod checkout-7d4b5c6f9-x2k4n | grep -A3 "Events"

# Test locally
docker run --rm myregistry/checkout:v1.2.3 /bin/sh -c "echo test"

Solutions:

# Check imagePullSecret
imagePullSecrets:
- name: registry-credentials

# Fix command/entrypoint
command: ["/app/checkout"]  # Absolute path
args: ["--port=8080"]

Cause 5: Misconfigured Probes

Liveness probes kill the container before it's ready.

Symptoms:

Events:
Liveness probe failed: connection refused
Container checkout-container failed liveness probe, will be restarted

Diagnosis: If your application takes 30 seconds to start, your liveness probe must not fire before those 30 seconds have elapsed (set initialDelaySeconds accordingly). Aggressive probes are the leading cause of self-inflicted CrashLoopBackOff.

# Check probe timing
kubectl get pod checkout-7d4b5c6f9-x2k4n -o yaml | grep -A10 livenessProbe

Solutions:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60  # Wait for startup
  periodSeconds: 10
  failureThreshold: 3      # 3 failures before restart

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5   # Faster than liveness
  periodSeconds: 5

Key takeaway: readinessProbe should be faster than livenessProbe. Start with conservative values then optimize.
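With the liveness values above, the minimum time the container gets before a probe-triggered restart is roughly initialDelaySeconds + failureThreshold × periodSeconds. A quick sanity check of that budget:

```shell
# Approximate grace before a liveness-triggered restart:
# initialDelaySeconds + failureThreshold * periodSeconds
initial_delay=60; period=10; failure_threshold=3
grace=$(( initial_delay + failure_threshold * period ))
echo "container has roughly ${grace}s before a liveness restart"
```

If that number is smaller than your application's worst-case startup time, the probe, not the application, is causing the crash loop.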

Advanced Kubernetes Debugging Techniques

Using kubectl debug (Kubernetes 1.25+)

Ephemeral containers allow attaching a debug container to a running or crashed pod.

# Attach debug container
kubectl debug -it checkout-7d4b5c6f9-x2k4n --image=busybox:1.36 --target=checkout

# Debug with network tools
kubectl debug -it checkout-7d4b5c6f9-x2k4n --image=nicolaka/netshoot

Copy Pod for Debugging

# Create copy with modified command
kubectl debug checkout-7d4b5c6f9-x2k4n -it --copy-to=checkout-debug \
--container=checkout -- /bin/sh

# Debug pod remains active for investigation

Examine Container Runtime Logs

# On the node (requires SSH access)
crictl logs <container-id>

# Find container ID
kubectl get pod checkout-7d4b5c6f9-x2k4n -o jsonpath='{.status.containerStatuses[0].containerID}'
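The containerID returned by that jsonpath includes a runtime scheme prefix (e.g. containerd://) that crictl does not expect, so strip it first. The ID below is a hypothetical example value:

```shell
# containerID comes back as "<runtime>://<id>"; crictl wants only the <id> part
container_id="containerd://3f2a9c1e4b7d"  # hypothetical example value
short_id=${container_id#*//}              # strip everything up to and including "//"
echo "$short_id"
```

The stripped ID can then be passed directly: crictl logs "$short_id".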

Quick Troubleshooting Checklist

Use this checklist for systematic diagnosis:

#!/bin/bash
# debug-crashloop.sh <pod-name> <namespace>

POD=$1
NS=${2:-default}

echo "=== 1. Pod State ==="
kubectl get pod $POD -n $NS

echo "=== 2. Description ==="
kubectl describe pod $POD -n $NS | tail -30

echo "=== 3. Previous Logs ==="
kubectl logs $POD -n $NS --previous --tail=50 2>/dev/null || echo "No previous logs"

echo "=== 4. Events ==="
kubectl get events -n $NS --field-selector involvedObject.name=$POD

echo "=== 5. Resources ==="
kubectl top pod $POD -n $NS --containers 2>/dev/null || echo "Metrics not available"

Also see the guide Resolve Kubernetes deployment failures for a complementary approach.

Preventing CrashLoopBackOff in Production

Configuration Best Practices

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: checkout
        image: myregistry/checkout:v1.2.3

        # Explicit resources
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi

        # Well-calibrated probes
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          failureThreshold: 30
          periodSeconds: 10

        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          periodSeconds: 10
          failureThreshold: 3

        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          failureThreshold: 3

Key takeaway: startupProbe (K8s 1.20+) replaces initialDelaySeconds for slow-starting applications. It prevents liveness from killing the container during startup.
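The startup budget granted by that startupProbe is failureThreshold × periodSeconds; the liveness probe only takes over once the startup probe succeeds. A quick check of the values above:

```shell
# Maximum startup time allowed by the startupProbe before the container is restarted
failure_threshold=30; period=10
budget=$(( failure_threshold * period ))
echo "application may take up to ${budget}s to start"
```

Five minutes is deliberately generous; tighten failureThreshold once you have measured real startup times.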

Proactive Monitoring

Configure alerts before the problem affects users:

# PrometheusRule
- alert: PodCrashLooping
expr: |
increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.pod }} in CrashLoop"

The Kubernetes observability checklist in production details these configurations.

Network Issues Causing Crashes

Network issues can cause indirect CrashLoopBackOff (application that times out and crashes).

Symptoms:

  • Logs showing connection timeouts
  • Exit code 1 after delay

Diagnosis:

# From a debug pod
kubectl run debug --rm -it --image=nicolaka/netshoot -- /bin/bash

# Network tests
nslookup kubernetes.default
curl -v http://checkout-service.production.svc.cluster.local:8080/health

See the guide Network problems diagnosis and resolution for more detail.

When to Escalate and Ask for Help

Some CrashLoopBackOff situations require advanced expertise:

  • Exit code 139 (SIGSEGV): memory bug in application, requires profiling
  • Intermittent problems: may indicate race conditions or node issues
  • After cluster update: possible API incompatibilities

Kubernetes deployment and production covers rollback strategies for problematic deployments.

Trainings to Master Kubernetes Troubleshooting

Troubleshooting pod restart errors is a key skill evaluated in the CKA and CKAD certifications.

To develop your debugging expertise, check upcoming sessions or contact us for a custom path.