
Network Problem Diagnosis and Resolution on Kubernetes

SFEIR Institute

Key Takeaways

  • 60% of IT time spent on network troubleshooting according to Spectro Cloud 2025
  • 6 categories: DNS, Services, Pod-to-Pod, CNI, Ingress, NetworkPolicies
  • Start with kubectl get events and verify DNS first

IT teams spend an average of 60% of their cluster management time on troubleshooting, with a significant portion concerning network issues. This guide provides a structured methodology for identifying and resolving the most common network problems on Kubernetes.

TL;DR: Kubernetes network problems fall into 6 categories: DNS, Services, Pod-to-Pod, CNI, Ingress, and NetworkPolicies. Each category has specific diagnostic commands. Always start with kubectl get events and verify DNS before investigating further.

To master network diagnosis in depth, follow the LFS458 Kubernetes Administration training.

Network Symptom Index

| Symptom | Quick Diagnostic Command | Section |
|---|---|---|
| could not resolve host | kubectl exec -it <pod> -- nslookup kubernetes | DNS |
| connection refused on a Service | kubectl get endpoints | Services |
| Pods can't communicate with each other | kubectl exec -it <pod> -- ping <target-ip> | Pod-to-Pod |
| NetworkPlugin cni failed | kubectl describe node | CNI |
| 502/504 from outside | kubectl get ingress -o wide | Ingress |
| Connections blocked without error | kubectl get networkpolicies -A | NetworkPolicies |
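As a quick triage aid, the symptom-to-command mapping above can be wrapped in a small shell helper. A sketch only; the `triage` function name and the `<pod>` placeholder are illustrative, not part of any standard tooling:

```shell
# Hypothetical triage helper: map an error message to the first
# diagnostic command from the symptom index above.
triage() {
  case "$1" in
    *"resolve host"*)       echo "kubectl exec -it <pod> -- nslookup kubernetes" ;;
    *"connection refused"*) echo "kubectl get endpoints" ;;
    *"cni failed"*)         echo "kubectl describe node" ;;
    *"502"*|*"504"*)        echo "kubectl get ingress -o wide" ;;
    *)                      echo "kubectl get events --sort-by=.lastTimestamp" ;;
  esac
}

triage "could not resolve host 'my-database'"
```

The fallback branch reflects the article's general advice: when no symptom matches, start with kubectl get events.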

How to Diagnose DNS Problems on Kubernetes?

DNS is the leading cause of network problems on Kubernetes. A pod that cannot resolve internal service names becomes immediately unusable.

Typical Symptoms

Error: getaddrinfo ENOTFOUND my-service
Error: could not resolve host 'my-database.default.svc.cluster.local'

Step 1: Check CoreDNS Pod

# CoreDNS pod state
kubectl get pods -n kube-system -l k8s-app=kube-dns

# CoreDNS logs to identify errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
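To script the health check above, the READY column of the kubectl output (e.g. `1/1`) can be parsed with plain shell. A sketch; `coredns_ready` is a hypothetical helper name:

```shell
# Hypothetical helper: classify a pod's READY column value ("ready/total").
coredns_ready() {
  ready="${1%%/*}"   # containers ready
  total="${1##*/}"   # containers total
  if [ "$ready" = "$total" ] && [ "$ready" != "0" ]; then
    echo "healthy"
  else
    echo "degraded"
  fi
}

# Feed it the READY column from "kubectl get pods" output
coredns_ready "1/1"   # healthy
coredns_ready "0/1"   # degraded
```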

Step 2: Test Resolution from a Pod

# Create a debug pod
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- nslookup kubernetes.default

# Expected result
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local

Causes and Solutions

| Cause | Diagnosis | Solution |
|---|---|---|
| CoreDNS crashed | kubectl get pods -n kube-system shows CrashLoopBackOff | Check CoreDNS pod memory resources |
| Corrupted ConfigMap | kubectl get configmap coredns -n kube-system -o yaml | Restore default configuration |
| kube-dns Service missing | kubectl get svc -n kube-system kube-dns | Recreate Service with kubectl apply |
| Pod using wrong DNS | Pod's /etc/resolv.conf incorrect | Check dnsPolicy in PodSpec |
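For the "Pod using wrong DNS" case, the relevant PodSpec fields look like this (a minimal sketch; the nameserver and search values are illustrative):

```yaml
# PodSpec excerpt: controls how the pod's /etc/resolv.conf is generated
spec:
  dnsPolicy: ClusterFirst   # default: resolve through cluster DNS (CoreDNS)
  # Or override explicitly:
  # dnsPolicy: None
  # dnsConfig:
  #   nameservers:
  #     - 10.96.0.10        # illustrative: your cluster DNS Service IP
  #   searches:
  #     - default.svc.cluster.local
```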
Remember: Always test DNS with nslookup kubernetes.default before investigating other network layers. This command validates the complete chain: Pod → Service → CoreDNS.

For deeper CNI DNS Kubernetes network debugging, check our guide on debugging CrashLoopBackOff pods.

How to Resolve Service Connectivity Issues?

A Kubernetes Service acts as an internal load balancer. When it malfunctions, clients can no longer reach the backend pods.

Typical Symptoms

curl: (7) Failed to connect to my-service port 80: Connection refused

Complete Diagnosis

# 1. Verify Service exists and exposes correct ports
kubectl get svc my-service -o wide

# 2. Check Endpoints (associated pods)
kubectl get endpoints my-service

# 3. If Endpoints empty, check selector
kubectl describe svc my-service | grep Selector
kubectl get pods --selector=app=my-app

Common Causes

Empty Endpoints: The Service selector doesn't match any pod. Fix pod labels or Service selector.

# Service
spec:
  selector:
    app: my-app  # Must match pod labels

# Pod
metadata:
  labels:
    app: my-app  # Verify exact match

Port mismatch: Service targetPort doesn't match container exposed port.

# Check listening port in container
kubectl exec -it my-pod -- netstat -tlnp
# (or "ss -tlnp" if the image doesn't include netstat)
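The port mismatch the command above checks for comes down to a few fields that must line up (a sketch; names and port numbers are illustrative):

```yaml
# Service: "port" is what clients call, "targetPort" is what the container listens on
spec:
  selector:
    app: my-app
  ports:
    - port: 80          # curl my-service:80
      targetPort: 8080  # must match the containerPort below

# Deployment pod template excerpt
containers:
  - name: my-app
    ports:
      - containerPort: 8080  # the port the process actually binds
```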

According to Spectro Cloud, configuration errors represent the majority of network incidents. An experienced Kubernetes infrastructure engineer systematically checks Endpoints before investigating further.

Remember: Empty Endpoints = selector problem. Run kubectl get endpoints as a first reflex.
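That first reflex can be scripted by classifying the ENDPOINTS column from `kubectl get endpoints` output. A sketch; `check_endpoints` is a hypothetical helper:

```shell
# Hypothetical helper: interpret the ENDPOINTS column value.
check_endpoints() {
  case "$1" in
    ""|"<none>") echo "selector problem: no pods match the Service selector" ;;
    *)           echo "ok: backends registered ($1)" ;;
  esac
}

check_endpoints "<none>"
check_endpoints "10.244.1.5:8080,10.244.2.7:8080"
```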

How to Diagnose Pod-to-Pod Communication Failures?

Inter-pod communication relies on the CNI plugin and cluster network configuration. Failures at this level often indicate infrastructure issues.

Connectivity Tests

# Get target pod IP
kubectl get pod target-pod -o jsonpath='{.status.podIP}'

# Test from another pod
kubectl exec -it source-pod -- ping -c 3 <target-pod-ip>
kubectl exec -it source-pod -- curl -v http://<target-pod-ip>:8080

Advanced Network Diagnostics

# Check routes in a pod
kubectl exec -it my-pod -- ip route

# Check network interfaces
kubectl exec -it my-pod -- ip addr

# Trace network path
kubectl exec -it my-pod -- traceroute <target-ip>

Diagnostic Matrix

| Test That Fails | Test That Succeeds | Probable Cause |
|---|---|---|
| Ping pod IP | Ping localhost | CNI or node problem |
| Curl on port | Ping pod IP | Application not started or wrong port |
| Pods on different nodes | Pods on same node | Overlay network problem |
| All pods | None | Restrictive NetworkPolicy |
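The matrix can be encoded as a small decision helper (a sketch; the `diagnose` function and its `ok`/`fail` arguments are illustrative):

```shell
# Hypothetical helper: apply the first two rows of the diagnostic matrix.
# Args: result of "ping pod IP" and result of "curl pod port" (ok|fail).
diagnose() {
  ping_ip="$1"; curl_port="$2"
  if [ "$ping_ip" = "fail" ]; then
    echo "CNI or node problem"
  elif [ "$curl_port" = "fail" ]; then
    echo "Application not started or wrong port"
  else
    echo "connectivity OK at this layer"
  fi
}

diagnose fail fail   # CNI or node problem
diagnose ok fail     # Application not started or wrong port
```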

Kubernetes Cloud operations engineers must master these diagnostics to reduce resolution time. IT teams lose an average of 34 working days per year on Kubernetes incidents.

How to Resolve CNI Plugin Errors?

The CNI (Container Network Interface) plugin manages IP address assignment to pods and network routing. CNI errors block pod startup.

Symptoms

Warning  FailedCreatePodSandBox  NetworkPlugin cni failed to set up pod network

Diagnosis

# CNI state on nodes
kubectl describe node <node-name> | grep -A5 "Conditions:"

# CNI pod logs (Calico example)
kubectl logs -n kube-system -l k8s-app=calico-node --tail=100

# Check CNI configuration (run on the node itself)
ls -la /etc/cni/net.d/
cat /etc/cni/net.d/10-calico.conflist

Solutions by CNI

| CNI | Common Issue | Solution |
|---|---|---|
| Calico | calico-node in CrashLoop | Check etcd/datastore connectivity |
| Flannel | Pods without IP | Restart flanneld on affected node |
| Cilium | cilium-agent degraded | Validate kernel version (≥4.9.17 required) |
| Weave | Network fragmentation | Reset with weave reset |

# Force restart CNI pods
kubectl rollout restart daemonset -n kube-system calico-node

Modern CNIs like Cilium leverage eBPF for improved performance and observability.

Remember: CNI problems affect all pods on a node. If multiple pods fail simultaneously on the same node, suspect the CNI.

How to Troubleshoot Ingress and External Access Issues?

Ingress controls HTTP/HTTPS access from outside the cluster. 502/504 errors or timeouts indicate configuration or backend problems.

Ingress Diagnosis

# Ingress state
kubectl get ingress my-ingress -o wide

# Check Ingress Controller
kubectl get pods -n ingress-nginx
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100

# Check external address
kubectl get svc -n ingress-nginx ingress-nginx-controller

Troubleshooting Checklist

  1. Verify backend Service exists and has Endpoints
  2. Validate Service name in Ingress matches exactly
  3. Confirm specified port is correct
  4. Test direct Service access (bypass Ingress)

# Direct backend test (bypass Ingress)
kubectl port-forward svc/my-service 8080:80
curl localhost:8080
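Checklist items 2 and 3 come down to two fields in the Ingress manifest that must match the Service exactly (a minimal sketch; names and ports are illustrative):

```yaml
# networking.k8s.io/v1 Ingress excerpt
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service  # must match the Service name exactly
                port:
                  number: 80      # must match a Service port (not targetPort)
```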

For a structured troubleshooting approach, refer to our Kubernetes production observability checklist.

TLS Errors

# Check TLS secret
kubectl get secret my-tls-secret -o yaml

# Validate certificate
kubectl get secret my-tls-secret -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout
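Expired certificates are a common cause of TLS errors; the decoded certificate's validity window can be checked with openssl. A sketch using a throwaway self-signed certificate for illustration (assumes openssl is installed; against a real cluster you would run the check on the tls.crt decoded from the secret):

```shell
# Generate a throwaway self-signed cert to stand in for the decoded secret
openssl req -x509 -newkey rsa:2048 -keyout /tmp/tls.key -out /tmp/tls.crt \
  -days 30 -nodes -subj "/CN=example.test" 2>/dev/null

# The same check applies to the tls.crt decoded from the Kubernetes secret:
# an expiry date in the past means clients will reject the connection
openssl x509 -enddate -noout -in /tmp/tls.crt
```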

How to Identify Blocking NetworkPolicies?

NetworkPolicies control traffic between pods. An overly restrictive policy silently blocks connections without an explicit error message.

Symptoms

Connections that timeout without a clear error. Pods seem to work but cannot communicate.

Diagnosis

# List all NetworkPolicies
kubectl get networkpolicies -A

# Analyze a specific policy
kubectl describe networkpolicy my-policy -n my-namespace

# Check which pods are affected
kubectl get pods -n my-namespace --show-labels

Connectivity Test with Policy

# Identify policies whose podSelector targets a given pod label
kubectl get networkpolicies -n my-namespace -o json | jq '.items[] | select(.spec.podSelector.matchLabels["app"] == "my-app")'

Troubleshooting Rule

Temporarily disable NetworkPolicies to confirm they're the cause:

# Backup then delete
kubectl get networkpolicy my-policy -o yaml > policy-backup.yaml
kubectl delete networkpolicy my-policy

# Test connectivity
# Then restore
kubectl apply -f policy-backup.yaml
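The kind of policy that produces these silent timeouts is typically a default-deny, a standard pattern shown here for illustration:

```yaml
# Deny all ingress traffic to every pod in the namespace;
# any connection not explicitly allowed by another policy is silently dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-namespace
spec:
  podSelector: {}        # empty selector: applies to all pods
  policyTypes:
    - Ingress            # no ingress rules listed => all inbound traffic denied
```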

Remember: If a connection fails without an error message, check NetworkPolicies. They're the most common cause of silent blocks.

The LFS460 Kubernetes Security Fundamentals training covers secure NetworkPolicy configuration in detail, an essential skill for system administrators preparing for CKS certification.

How to Prevent Recurring Network Problems?

Proactive Monitoring

According to Grafana Labs, 75% of Kubernetes users adopt Prometheus and Grafana for monitoring. Configure alerts on critical network metrics.

# Example Prometheus alert for CoreDNS
- alert: CoreDNSDown
  expr: up{job="coredns"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "CoreDNS is down"

Prevention Checklist

| Action | Frequency | Tool |
|---|---|---|
| Check CoreDNS health | Continuous | Prometheus |
| Validate Endpoints | Before each deployment | CI/CD |
| Test inter-node connectivity | Weekly | Automated script |
| Audit NetworkPolicies | Monthly | kubectl + documentation |
| Update CNI | Quarterly | Follow releases |

Check our complete guide on Kubernetes Monitoring and Troubleshooting to implement a full observability strategy.

Network Documentation

Systematically document:

  • Cluster network architecture (pod, Service, and node CIDRs)
  • Applied NetworkPolicies and their justification
  • Custom DNS configuration
  • External firewall rules

For an overview of deployment failures related to networking, check our Kubernetes deployment troubleshooting guide.

Remember: Prevention costs less than troubleshooting. Teams that document their network architecture resolve incidents 3x faster.

Develop Your Kubernetes Network Diagnosis Skills

Kubernetes network diagnosis requires deep understanding of each layer: DNS, Services, CNI, Ingress, and NetworkPolicies. With 82% of container users running Kubernetes in production, these skills have become essential.

As Chris Aniszczyk, CNCF CTO, states: "Kubernetes is no longer experimental but foundational."

SFEIR offers official Linux Foundation trainings to master Kubernetes administration and troubleshooting:

Contact our advisors to build your Kubernetes networking skill development path.