Key Takeaways
- ✓ 60% of IT teams' cluster management time is spent on troubleshooting (Spectro Cloud, 2025)
- ✓ 6 categories of network problems: DNS, Services, Pod-to-Pod, CNI, Ingress, NetworkPolicies
- ✓Start with kubectl get events and verify DNS first
IT teams spend an average of 60% of their cluster management time on troubleshooting, a significant portion of it on network issues. This guide provides a structured methodology for identifying and resolving the most common network problems in Kubernetes.
TL;DR: Kubernetes network problems fall into 6 categories: DNS, Services, Pod-to-Pod, CNI, Ingress, and NetworkPolicies. Each category has specific diagnostic commands. Always start with kubectl get events and verify DNS before investigating further.
To master network diagnosis in depth, follow the LFS458 Kubernetes Administration training.
Network Symptom Index
| Symptom | Quick Diagnostic Command | Section |
|---|---|---|
| could not resolve host | kubectl exec -it <pod> -- nslookup <service> | DNS |
| connection refused on a Service | kubectl get endpoints | Services |
| Pods can't communicate with each other | kubectl exec -it <pod> -- ping <target-pod-ip> | Pod-to-Pod |
| NetworkPlugin cni failed | kubectl describe node | CNI |
| 502/504 from outside | kubectl get ingress -o wide | Ingress |
| Connections blocked without error | kubectl get networkpolicies -A | NetworkPolicies |
How to Diagnose DNS Problems on Kubernetes?
DNS is the leading cause of network problems on Kubernetes. A pod that cannot resolve internal service names becomes immediately unusable.
Typical Symptoms
Error: getaddrinfo ENOTFOUND my-service
Error: could not resolve host 'my-database.default.svc.cluster.local'
Step 1: Check CoreDNS Pod
# CoreDNS pod state
kubectl get pods -n kube-system -l k8s-app=kube-dns
# CoreDNS logs to identify errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
Step 2: Test Resolution from a Pod
# Create a debug pod
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- nslookup kubernetes.default
# Expected result
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
Causes and Solutions
| Cause | Diagnosis | Solution |
|---|---|---|
| CoreDNS crashed | kubectl get pods -n kube-system shows CrashLoopBackOff | Check CoreDNS pod memory resources |
| Corrupted ConfigMap | kubectl get configmap coredns -n kube-system -o yaml | Restore default configuration |
| kube-dns Service missing | kubectl get svc -n kube-system kube-dns | Recreate Service with kubectl apply |
| Pod using wrong DNS | Pod's /etc/resolv.conf incorrect | Check dnsPolicy in PodSpec |
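As a sketch of that last fix, a pod's DNS behavior is controlled by dnsPolicy (and optionally dnsConfig) in the PodSpec. The pod name and ndots value below are illustrative:

```yaml
# Hypothetical pod with explicit DNS settings
apiVersion: v1
kind: Pod
metadata:
  name: dns-example
spec:
  dnsPolicy: ClusterFirst      # default: resolve through CoreDNS first
  dnsConfig:
    options:
      - name: ndots
        value: "2"             # fewer search-domain lookups for external names
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
```

Inspect the resulting resolver configuration with kubectl exec dns-example -- cat /etc/resolv.conf.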
Remember: Always test DNS with nslookup kubernetes.default before investigating other network layers. This command validates the complete chain: Pod → Service → CoreDNS.
For deeper DNS and CNI network debugging, check our guide on debugging CrashLoopBackOff pods.
How to Resolve Service Connectivity Issues?
A Kubernetes Service acts as an internal load balancer. When it malfunctions, clients can no longer reach the backend pods.
Typical Symptoms
curl: (7) Failed to connect to my-service port 80: Connection refused
Complete Diagnosis
# 1. Verify Service exists and exposes correct ports
kubectl get svc my-service -o wide
# 2. Check Endpoints (associated pods)
kubectl get endpoints my-service
# 3. If Endpoints empty, check selector
kubectl describe svc my-service | grep Selector
kubectl get pods --selector=app=my-app
Common Causes
Empty Endpoints: The Service selector doesn't match any pod. Fix pod labels or Service selector.
# Service
spec:
  selector:
    app: my-app  # Must match pod labels

# Pod
metadata:
  labels:
    app: my-app  # Verify exact match
Port mismatch: Service targetPort doesn't match container exposed port.
# Check listening port in container
kubectl exec -it my-pod -- netstat -tlnp
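To illustrate the mismatch, here is a minimal Service where port and targetPort differ; the names are placeholders. The targetPort must match what the container actually listens on:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - port: 80          # port the Service exposes inside the cluster
      targetPort: 8080  # must match the containerPort the app listens on
```

If the container listens on 8080 but targetPort says 80, every connection through the Service is refused.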
According to Spectro Cloud, configuration errors represent the majority of network incidents. An experienced Kubernetes infrastructure engineer systematically checks Endpoints before investigating further.
Remember: Empty Endpoints = selector problem. Run kubectl get endpoints as your first check.
How to Diagnose Pod-to-Pod Communication Failures?
Inter-pod communication relies on the CNI plugin and cluster network configuration. Failures at this level often indicate infrastructure issues.
Connectivity Tests
# Get target pod IP
kubectl get pod target-pod -o jsonpath='{.status.podIP}'
# Test from another pod
kubectl exec -it source-pod -- ping -c 3 <target-pod-ip>
kubectl exec -it source-pod -- curl -v http://<target-pod-ip>:8080
Advanced Network Diagnostics
# Check routes in a pod
kubectl exec -it my-pod -- ip route
# Check network interfaces
kubectl exec -it my-pod -- ip addr
# Trace network path
kubectl exec -it my-pod -- traceroute <target-ip>
Diagnostic Matrix
| Failing Test | Passing Test | Probable Cause |
|---|---|---|
| Ping IP | Ping localhost | CNI or node problem |
| Curl port | Ping IP | Application not started or wrong port |
| Pods on different nodes | Pods on same node | Overlay network problem |
| All pods | None | Restrictive NetworkPolicy |
Cloud operations engineers working with Kubernetes must master these diagnostics to reduce resolution time: IT teams lose an average of 34 working days per year on Kubernetes incidents.
How to Resolve CNI Plugin Errors?
The CNI (Container Network Interface) manages IP address assignment to pods and network routing. CNI errors block pod startup.
Symptoms
Warning FailedCreatePodSandBox NetworkPlugin cni failed to set up pod network
Diagnosis
# CNI state on nodes
kubectl describe node <node-name> | grep -A5 "Conditions:"
# CNI pod logs (Calico example)
kubectl logs -n kube-system -l k8s-app=calico-node --tail=100
# Check CNI configuration (run these directly on the node)
ls -la /etc/cni/net.d/
cat /etc/cni/net.d/10-calico.conflist
Solutions by CNI
| CNI | Common Issue | Solution |
|---|---|---|
| Calico | calico-node in CrashLoop | Check etcd/datastore connectivity |
| Flannel | Pods without IP | Restart flanneld on affected node |
| Cilium | cilium-agent degraded | Validate kernel version (≥4.9.17 required) |
| Weave | Network fragmentation | Reset with weave reset |
# Force restart CNI pods
kubectl rollout restart daemonset -n kube-system calico-node
Modern CNIs like Cilium leverage eBPF for improved performance and observability.
Remember: CNI problems affect all pods on a node. If multiple pods fail simultaneously on the same node, suspect the CNI.
How to Troubleshoot Ingress and External Access Issues?
Ingress controls HTTP/HTTPS access from outside the cluster. 502/504 errors or timeouts indicate configuration or backend problems.
Ingress Diagnosis
# Ingress state
kubectl get ingress my-ingress -o wide
# Check Ingress Controller
kubectl get pods -n ingress-nginx
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100
# Check external address
kubectl get svc -n ingress-nginx ingress-nginx-controller
Troubleshooting Checklist
- Verify backend Service exists and has Endpoints
- Validate Service name in Ingress matches exactly
- Confirm specified port is correct
- Test direct Service access (bypass Ingress)
# Direct backend test (bypass Ingress)
kubectl port-forward svc/my-service 8080:80
curl localhost:8080
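For reference, a minimal networking.k8s.io/v1 Ingress is sketched below; the host and names are placeholders. The backend service name and port must match the Service exactly, per the checklist above:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service   # must match the Service name exactly
                port:
                  number: 80       # must match a port exposed by the Service
```

A typo in the service name or a wrong port number is enough to produce 502/503 responses from the controller.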
For a structured troubleshooting approach, refer to our Kubernetes production observability checklist.
TLS Errors
# Check TLS secret
kubectl get secret my-tls-secret -o yaml
# Validate certificate
kubectl get secret my-tls-secret -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout
How to Identify Blocking NetworkPolicies?
NetworkPolicies control traffic between pods. An overly restrictive policy silently blocks connections without an explicit error message.
Symptoms
Connections that timeout without a clear error. Pods seem to work but cannot communicate.
Diagnosis
# List all NetworkPolicies
kubectl get networkpolicies -A
# Analyze a specific policy
kubectl describe networkpolicy my-policy -n my-namespace
# Check which pods are affected
kubectl get pods -n my-namespace --show-labels
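As an illustration of a silent block, a default-deny policy like the sketch below (namespace is a placeholder) drops all inbound pod traffic in its namespace without producing any error on the client side, only timeouts:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-namespace
spec:
  podSelector: {}   # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress       # no ingress rules listed, so all inbound traffic is dropped
```

Once any policy selects a pod, only explicitly allowed traffic gets through.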
Connectivity Test with Policy
# Identify policies whose podSelector targets label app=my-app
kubectl get networkpolicies -n my-namespace -o json | jq '.items[] | select(.spec.podSelector.matchLabels.app == "my-app")'
Troubleshooting Rule
Temporarily disable NetworkPolicies to confirm they're the cause:
# Backup then delete
kubectl get networkpolicy my-policy -o yaml > policy-backup.yaml
kubectl delete networkpolicy my-policy
# Test connectivity
# Then restore
kubectl apply -f policy-backup.yaml
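A less disruptive alternative to deleting the policy is to temporarily apply a broad allow rule alongside it; since NetworkPolicies are additive, any policy that allows the traffic wins. This is a sketch, adjust the namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: temp-allow-all
  namespace: my-namespace
spec:
  podSelector: {}
  ingress:
    - {}            # a single empty rule: allow all ingress traffic
  policyTypes:
    - Ingress
```

Delete this policy as soon as the test is done, since it defeats the namespace's isolation.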
Remember: If a connection fails without an error message, check NetworkPolicies. They're the most common cause of silent blocks.
The LFS460 Kubernetes Security Fundamentals training covers secure NetworkPolicy configuration in detail, an essential skill for system administrators preparing for CKS certification.
How to Prevent Recurring Network Problems?
Proactive Monitoring
According to Grafana Labs, 75% of Kubernetes users adopt Prometheus and Grafana for monitoring. Configure alerts on critical network metrics.
# Example Prometheus alert for CoreDNS
- alert: CoreDNSDown
  expr: up{job="coredns"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "CoreDNS is down"
Prevention Checklist
| Action | Frequency | Tool |
|---|---|---|
| Check CoreDNS health | Continuous | Prometheus |
| Validate Endpoints | Before each deployment | CI/CD |
| Test inter-node connectivity | Weekly | Automated script |
| Audit NetworkPolicies | Monthly | kubectl + documentation |
| Update CNI | Quarterly | Follow releases |
Check our complete guide on Kubernetes Monitoring and Troubleshooting to implement a full observability strategy.
Network Documentation
Systematically document:
- Cluster network architecture (pods, services, nodes CIDR)
- Applied NetworkPolicies and their justification
- Custom DNS configuration
- External firewall rules
For an overview of deployment failures related to networking, check our Kubernetes deployment troubleshooting guide.
Remember: Prevention costs less than troubleshooting. Teams that document their network architecture resolve incidents 3x faster.
Develop Your Kubernetes Network Diagnosis Skills
Kubernetes network diagnosis requires deep understanding of each layer: DNS, Services, CNI, Ingress, and NetworkPolicies. With 82% of container users running Kubernetes in production, these skills have become essential.
As Chris Aniszczyk, CNCF CTO, states: "Kubernetes is no longer experimental but foundational."
SFEIR offers official Linux Foundation trainings to master Kubernetes administration and troubleshooting:
- LFS458 Kubernetes Administration: 4 days to master network diagnosis, cluster management, and prepare for CKA certification
- LFS460 Kubernetes Security Fundamentals: 4 days on network security, NetworkPolicies, and CKS preparation
- Kubernetes Fundamentals: 1 day to discover essential concepts before going deeper
Contact our advisors to build your Kubernetes networking skill development path.