
Resolve Common Kubernetes Deployment Errors

SFEIR Institute

Key Takeaways

  • IT teams spend 34 days/year on Kubernetes troubleshooting
  • 60% of cluster management time is dedicated to troubleshooting
  • kubectl describe, logs, and get events resolve 90% of problems

Resolving common Kubernetes deployment errors represents a critical skill for any Cloud Operations Kubernetes engineer. IT teams spend an average of 34 working days per year resolving Kubernetes problems according to Cloud Native Now.

More than 60% of cluster management time is spent on troubleshooting according to Spectro Cloud. This guide gives you the exact commands and proven solutions to diagnose and fix CrashLoopBackOff, ImagePullBackOff, and other common errors.

TL;DR: Kubernetes deployment errors follow predictable patterns. This guide covers the 7 most frequent errors with their diagnostic commands, root causes, and solutions. Master kubectl describe, kubectl logs --previous, and kubectl get events to resolve 90% of issues.

To master Kubernetes pod debugging in real conditions, follow the LFD459 Kubernetes for Application Developers training.

Quick Symptom Index

| Pod Status | Meaning | Section |
|---|---|---|
| CrashLoopBackOff | Container restarts in a loop | CrashLoopBackOff |
| ImagePullBackOff | Image not found or inaccessible | ImagePullBackOff |
| Pending | Pod not scheduled to a node | Pending |
| CreateContainerConfigError | Configuration problem | ConfigError |
| OOMKilled | Memory limit exceeded | OOMKilled |
| Running but unhealthy | Failing probes | Probes |
| Evicted | Node under pressure | Eviction |
Remember: 82% of container users run Kubernetes in production according to the CNCF Annual Survey 2025. Mastering troubleshooting is essential.

CrashLoopBackOff: Container Restarts in a Loop {#crashloopbackoff}

CrashLoopBackOff indicates that your container starts, crashes, and is restarted by Kubernetes with exponential backoff. This status accounts for 40% of Kubernetes support tickets.

Symptom

$ kubectl get pods
NAME           READY   STATUS             RESTARTS      AGE
api-server-1   0/1     CrashLoopBackOff   5 (2m ago)    10m

Step 1: Examine Crashed Container Logs

# Current instance logs
kubectl logs api-server-1

# Previous instance logs (after crash)
kubectl logs api-server-1 --previous

# Follow logs in real-time
kubectl logs api-server-1 -f

Step 2: Analyze Pod Events

kubectl describe pod api-server-1 | grep -A20 "Events:"

Causes and Solutions

| Cause | Diagnostic Indicator | Solution |
|---|---|---|
| Application crash at startup | Stack trace in logs | Fix code, verify dependencies |
| Missing environment variable | KeyError, undefined | Add variable in Deployment |
| Missing config file | FileNotFoundError | Verify ConfigMap and Secret mounts |
| Port already in use | Address already in use | Modify containerPort or kill the process |
| Invalid command | exec format error | Check command: and args: in spec |
# Fix example: adding a missing variable
spec:
  containers:
  - name: api
    env:
    - name: DATABASE_URL
      valueFrom:
        secretKeyRef:
          name: db-credentials
          key: url

ImagePullBackOff: Image Not Found {#imagepullbackoff}

ImagePullBackOff means Kubernetes cannot download the specified container image. This problem occurs in 25% of first deployments.

Symptom

$ kubectl get pods
NAME         READY   STATUS             RESTARTS   AGE
web-app-1    0/1     ImagePullBackOff   0          5m

Diagnosis

# See exact error message
kubectl describe pod web-app-1 | grep -A5 "Events:"

# Check image name
kubectl get pod web-app-1 -o jsonpath='{.spec.containers[*].image}'

Causes and Solutions

| Cause | Error Message | Solution |
|---|---|---|
| Non-existent image | manifest unknown | Verify image name and tag |
| Private registry | unauthorized | Create an imagePullSecret |
| Invalid tag | tag not found | Use an existing tag (latest, v1.2.3) |
| Blocked network | connection refused | Check firewall rules |
# Create secret for private registry
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass \
  --docker-email=user@example.com

# Reference it on the default service account
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
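As an alternative to patching the default service account, the pull secret can be referenced directly in the pod spec. A minimal sketch, assuming the regcred secret created above; the image path is hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-app-1
spec:
  imagePullSecrets:
  - name: regcred  # secret created with kubectl create secret docker-registry
  containers:
  - name: web
    image: registry.example.com/team/web-app:v1.2.3  # hypothetical image path
```

Referencing the secret in the spec scopes it to this pod, while patching the service account applies it to every pod using that account.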

To go deeper on these techniques, see the advanced pod and container debugging guide.

Pending: Pod Not Scheduled {#pending}

A Pending pod has not been assigned to a node by the scheduler. The cause is usually a lack of resources or scheduling constraints that cannot be satisfied.

Diagnosis

# Identify the pending reason
kubectl describe pod my-pod | grep -A10 "Events:"

# See available resources on nodes
kubectl describe nodes | grep -A5 "Allocated resources"

# List node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

Causes and Solutions

| Cause | Event Message | Solution |
|---|---|---|
| Insufficient resources | Insufficient cpu/memory | Reduce requests or add nodes |
| Impossible NodeSelector | node(s) didn't match selector | Add label to node or modify selector |
| Untolerated taints | node(s) had taints | Add tolerations to pod |
| Unbound PVC | persistentvolumeclaim not bound | Check PV and StorageClass |
# Example: adjust requests to avoid Pending
resources:
  requests:
    memory: "128Mi"  # Reduce if too high
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"
Remember: Always configure realistic requests. Overestimated requests block scheduling even if the node has real capacity.
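For the untolerated-taints case, a matching toleration goes in the pod spec. A sketch assuming a hypothetical dedicated=gpu:NoSchedule taint on the target node:

```yaml
# Pod spec fragment tolerating a (hypothetical) node taint
# applied with: kubectl taint nodes my-node dedicated=gpu:NoSchedule
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```

A toleration only allows scheduling onto the tainted node; combine it with a nodeSelector if the pod must land there.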

CreateContainerConfigError: Configuration Problem {#createcontainerconfigerror}

CreateContainerConfigError indicates an error in container configuration before it even starts. The problem often comes from referenced Secrets or ConfigMaps.

Diagnosis

kubectl describe pod my-pod | grep -A3 "Warning"

# Verify Secret exists
kubectl get secret my-secret

# Verify key exists in Secret
kubectl get secret my-secret -o jsonpath='{.data}'

Causes and Solutions

| Cause | Solution |
|---|---|
| Non-existent Secret | Create Secret before Deployment |
| Missing key in Secret | Add key with kubectl edit secret |
| Missing referenced ConfigMap | Create required ConfigMap |
| Incorrect subPath | Verify path spelling |
# Create the missing secret
kubectl create secret generic app-secret \
  --from-literal=API_KEY=abc123

# Check references in the deployment
kubectl get deployment my-app -o yaml | grep -A5 "secretKeyRef"
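When a Secret reference is genuinely optional, marking it as such prevents a missing Secret or key from blocking container creation. A sketch reusing the app-secret created above:

```yaml
# Container env fragment: optional Secret reference
env:
- name: API_KEY
  valueFrom:
    secretKeyRef:
      name: app-secret
      key: API_KEY
      optional: true  # container starts even if the Secret or key is missing
```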

ConfigMaps and Secrets management is covered in detail in our Kubernetes Application Development guides.

OOMKilled: Memory Exceeded {#oomkilled}

OOMKilled means the container exceeded its memory limit and was killed by the Linux kernel. This is a protection to prevent the entire node from becoming unstable.

Diagnosis

# See termination reason
kubectl describe pod my-pod | grep -A3 "Last State"

# See restart history
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState}'

# Monitor memory consumption
kubectl top pod my-pod

Solution

# Increase the memory limit
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"  # Double it if OOMKilled is frequent

# Analyze consumption before increasing (cgroup v1 path;
# on cgroup v2 nodes read /sys/fs/cgroup/memory.current instead)
kubectl exec my-pod -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes

Remember: Never set limits.memory too close to requests.memory. Leave roughly a 50% margin for peaks.

Liveness and Readiness Probes: Silent Failures {#probes-liveness-and-readiness}

Failing probes cause subtle behaviors. A failing liveness probe kills and restarts the container. A failing readiness probe removes the pod from the Service endpoints without killing it.

Diagnosis

# See probe failures
kubectl describe pod my-pod | grep -E "(Liveness|Readiness)"

# Manually test endpoint
kubectl exec my-pod -- curl -s localhost:8080/health
kubectl exec my-pod -- wget -qO- localhost:8080/ready

Common Errors

| Problem | Symptom | Solution |
|---|---|---|
| initialDelaySeconds too short | Container killed at startup | Increase to 30-60s |
| timeoutSeconds too short | Intermittent failures | Change from 1s to 5s |
| Incorrect port | Connection refused | Check containerPort |
| Incorrect path | 404 Not Found | Fix endpoint path |
# Robust probe configuration
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
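For slow-starting applications, a startupProbe is often a better fix than a large initialDelaySeconds: liveness and readiness checks are held off until it succeeds. A sketch reusing the same /health endpoint and port:

```yaml
# Startup probe: allows up to 30 × 10s = 300s for the app to come up
# before the liveness and readiness probes take over
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```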

Eviction: Pod Expelled from Node {#eviction}

Eviction occurs when a node runs out of critical resources. The kubelet expels pods to protect the node.

Diagnosis

# See node conditions
kubectl describe node my-node | grep -A5 "Conditions:"

# See evicted pods
kubectl get pods --field-selector=status.phase=Failed | grep Evicted

# Clean up evicted pods
kubectl delete pods --field-selector=status.phase=Failed

| Condition | Default Threshold | Cause |
|---|---|---|
| MemoryPressure | < 100Mi available | Pods consuming too much |
| DiskPressure | < 10% free space | Logs or images too large |
| PIDPressure | < 100 free PIDs | Too many processes |
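The thresholds above are kubelet defaults; they can be tuned through the kubelet configuration file. A sketch using the kubelet.config.k8s.io/v1beta1 API; the values shown are illustrative, not recommendations:

```yaml
# KubeletConfiguration fragment: hard eviction thresholds
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"  # evict when less than 200Mi of memory remains
  nodefs.available: "10%"    # evict when node filesystem drops below 10% free
  pid.available: "5%"        # evict when available PIDs drop below 5%
```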

Universal Diagnostic Toolkit

# Essential commands for all debugging
kubectl get pods -o wide                    # Extended view with node
kubectl describe pod <pod>                  # Complete details
kubectl logs <pod> --previous               # Previous crash logs
kubectl get events --sort-by=.lastTimestamp # Recent events
kubectl exec -it <pod> -- /bin/sh           # Shell in container
kubectl top pods                            # Resource consumption

Quick Diagnostic Script

#!/bin/bash
# Usage: ./diagnose.sh <pod-name>
POD="$1"
echo "=== Status ==="
kubectl get pod "$POD" -o wide
echo "=== Events ==="
kubectl describe pod "$POD" | grep -A15 "Events:"
echo "=== Logs (last 50 lines) ==="
kubectl logs "$POD" --tail=50
echo "=== Previous logs ==="
kubectl logs "$POD" --previous --tail=20 2>/dev/null || echo "No previous logs"

Prevention: Avoid Errors Before Deployment

Validate your manifests before applying them. 70% of organizations deploy Kubernetes with Helm according to Orca Security 2025. Use validation tools.

# Validate YAML syntax
kubectl apply --dry-run=client -f deployment.yaml

# Server-side validation (detects more errors)
kubectl apply --dry-run=server -f deployment.yaml

# With Helm
helm template my-release ./chart | kubectl apply --dry-run=server -f -

To go further with best practices, explore the differences between Helm and Kustomize and cloud-native development patterns.

Remember: Systematically test with --dry-run=server before every production deployment. This command detects configuration errors that --dry-run=client misses.

Training to Master Kubernetes Troubleshooting

As a CTO highlights in the Spectro Cloud State of Kubernetes 2025: "Just given the capabilities that exist with Kubernetes, and the company's desire to consume more AI tools, we will use Kubernetes more in future."

The Kubernetes Fundamentals training lets you discover debugging basics in 1 day. For complete mastery, the LFD459 Kubernetes for Developers training covers advanced troubleshooting over 3 days and prepares for CKAD certification. Infrastructure engineers preparing for CKA will find in-depth techniques in the LFS458 Kubernetes Administration training. For more, check our Kubernetes Application Development enterprise training for CTOs leading Cloud-Native transformation.

Check our Kubernetes training complete guide to identify the path suited to your profile, or discover training for system administrators and Kubernetes security challenges.