
Network Problem Diagnosis and Resolution on Kubernetes

SFEIR Institute

Key Takeaways

  • 60% of IT time spent on network troubleshooting according to Spectro Cloud 2025
  • 6 categories: DNS, Services, Pod-to-Pod, CNI, Ingress, NetworkPolicies
  • Start with kubectl get events and verify DNS first

IT teams spend an average of 60% of their cluster management time on troubleshooting, with a significant portion concerning network issues. This guide provides a structured methodology for identifying and resolving the most common network problems on Kubernetes.

TL;DR: Kubernetes network problems fall into 6 categories: DNS, Services, Pod-to-Pod, CNI, Ingress, and NetworkPolicies. Each category has specific diagnostic commands. Always start with kubectl get events and verify DNS before investigating further.

To master network diagnosis in depth, follow the LFS458 Kubernetes Administration training.

Network Symptom Index

| Symptom | Quick Diagnostic Command | Section |
|---|---|---|
| could not resolve host | kubectl exec -it <pod> -- nslookup kubernetes | DNS |
| connection refused on a Service | kubectl get endpoints | Services |
| Pods can't communicate with each other | kubectl exec -it <pod> -- ping <target-ip> | Pod-to-Pod |
| NetworkPlugin cni failed | kubectl describe node | CNI |
| 502/504 from outside | kubectl get ingress -o wide | Ingress |
| Connections blocked without error | kubectl get networkpolicies -A | NetworkPolicies |
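As a quick triage aid, the symptom-to-command mapping above can be wrapped in a small shell helper. A sketch only; the `triage` function name and the `<pod>` placeholder are illustrative, not part of any standard tooling:

```shell
# Hypothetical triage helper: map an error message to the first
# diagnostic command from the symptom index above.
triage() {
  case "$1" in
    *"resolve host"*)       echo "kubectl exec -it <pod> -- nslookup kubernetes" ;;
    *"connection refused"*) echo "kubectl get endpoints" ;;
    *"cni failed"*)         echo "kubectl describe node" ;;
    *"502"*|*"504"*)        echo "kubectl get ingress -o wide" ;;
    *)                      echo "kubectl get events --sort-by=.lastTimestamp" ;;
  esac
}

triage "could not resolve host 'my-database'"
```

The fallback branch reflects the article's general advice: when no symptom matches, start with kubectl get events.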

How to Diagnose DNS Problems on Kubernetes?

DNS is the leading cause of network problems on Kubernetes. A pod that cannot resolve internal service names becomes immediately unusable.

Typical Symptoms

Error: getaddrinfo ENOTFOUND my-service
Error: could not resolve host 'my-database.default.svc.cluster.local'

Step 1: Check CoreDNS Pod

# CoreDNS pod state
kubectl get pods -n kube-system -l k8s-app=kube-dns

# CoreDNS logs to identify errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
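To script the health check above, the READY column of the kubectl output (e.g. `1/1`) can be parsed with plain shell. A sketch; `coredns_ready` is a hypothetical helper name:

```shell
# Hypothetical helper: classify a pod's READY column value ("ready/total").
coredns_ready() {
  ready="${1%%/*}"   # containers ready
  total="${1##*/}"   # containers total
  if [ "$ready" = "$total" ] && [ "$ready" != "0" ]; then
    echo "healthy"
  else
    echo "degraded"
  fi
}

# Feed it the READY column from "kubectl get pods" output
coredns_ready "1/1"   # healthy
coredns_ready "0/1"   # degraded
```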

Step 2: Test Resolution from a Pod

# Create a debug pod
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- nslookup kubernetes.default

# Expected result
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local

Causes and Solutions

| Cause | Diagnosis | Solution |
|---|---|---|
| CoreDNS crashed | kubectl get pods -n kube-system shows CrashLoopBackOff | Check CoreDNS pod memory resources |
| Corrupted ConfigMap | kubectl get configmap coredns -n kube-system -o yaml | Restore default configuration |
| kube-dns Service missing | kubectl get svc -n kube-system kube-dns | Recreate Service with kubectl apply |
| Pod using wrong DNS | Pod's /etc/resolv.conf incorrect | Check dnsPolicy in PodSpec |
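For the "Pod using wrong DNS" case, the relevant PodSpec fields look like this (a minimal sketch; the nameserver and search values are illustrative):

```yaml
# PodSpec excerpt: controls how the pod's /etc/resolv.conf is generated
spec:
  dnsPolicy: ClusterFirst   # default: resolve through cluster DNS (CoreDNS)
  # Or override explicitly:
  # dnsPolicy: None
  # dnsConfig:
  #   nameservers:
  #     - 10.96.0.10        # illustrative: your cluster DNS Service IP
  #   searches:
  #     - default.svc.cluster.local
```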
Remember: Always test DNS with nslookup kubernetes.default before investigating other network layers. This command validates the complete chain: Pod → Service → CoreDNS.

For deeper CNI DNS Kubernetes network debugging, check our guide on debugging CrashLoopBackOff pods.

How to Resolve Service Connectivity Issues?

A Kubernetes Service acts as an internal load balancer. When it malfunctions, clients can no longer reach the backend pods.

Typical Symptoms

curl: (7) Failed to connect to my-service port 80: Connection refused

Complete Diagnosis

# 1. Verify Service exists and exposes correct ports
kubectl get svc my-service -o wide

# 2. Check Endpoints (associated pods)
kubectl get endpoints my-service

# 3. If Endpoints empty, check selector
kubectl describe svc my-service | grep Selector
kubectl get pods --selector=app=my-app

Common Causes

Empty Endpoints: The Service selector doesn't match any pod. Fix pod labels or Service selector.

# Service
spec:
  selector:
    app: my-app  # Must match pod labels

# Pod
metadata:
  labels:
    app: my-app  # Verify exact match

Port mismatch: Service targetPort doesn't match container exposed port.

# Check listening port in container
kubectl exec -it my-pod -- netstat -tlnp
# (or "ss -tlnp" if the image doesn't include netstat)
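The port mismatch the command above checks for comes down to a few fields that must line up (a sketch; names and port numbers are illustrative):

```yaml
# Service: "port" is what clients call, "targetPort" is what the container listens on
spec:
  selector:
    app: my-app
  ports:
    - port: 80          # curl my-service:80
      targetPort: 8080  # must match the containerPort below

# Deployment pod template excerpt
containers:
  - name: my-app
    ports:
      - containerPort: 8080  # the port the process actually binds
```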

According to Spectro Cloud, configuration errors represent the majority of network incidents. An experienced Kubernetes infrastructure engineer systematically checks Endpoints before investigating further.

Remember: Empty Endpoints = selector problem. Run kubectl get endpoints as a first reflex.
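That first reflex can be scripted by classifying the ENDPOINTS column from `kubectl get endpoints` output. A sketch; `check_endpoints` is a hypothetical helper:

```shell
# Hypothetical helper: interpret the ENDPOINTS column value.
check_endpoints() {
  case "$1" in
    ""|"<none>") echo "selector problem: no pods match the Service selector" ;;
    *)           echo "ok: backends registered ($1)" ;;
  esac
}

check_endpoints "<none>"
check_endpoints "10.244.1.5:8080,10.244.2.7:8080"
```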

How to Diagnose Pod-to-Pod Communication Failures?

Inter-pod communication relies on the CNI plugin and cluster network configuration. Failures at this level often indicate infrastructure issues.

Connectivity Tests

# Get target pod IP
kubectl get pod target-pod -o jsonpath='{.status.podIP}'

# Test from another pod
kubectl exec -it source-pod -- ping -c 3 <target-pod-ip>
kubectl exec -it source-pod -- curl -v http://<target-pod-ip>:8080

Advanced Network Diagnostics

# Check routes in a pod
kubectl exec -it my-pod -- ip route

# Check network interfaces
kubectl exec -it my-pod -- ip addr

# Trace network path
kubectl exec -it my-pod -- traceroute <target-ip>

Diagnostic Matrix

| Test That Fails | Test That Succeeds | Probable Cause |
|---|---|---|
| Ping pod IP | Ping localhost | CNI or node problem |
| Curl on port | Ping pod IP | Application not started or wrong port |
| Pods on different nodes | Pods on same node | Overlay network problem |
| All pods | None | Restrictive NetworkPolicy |
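The matrix can be encoded as a small decision helper (a sketch; the `diagnose` function and its `ok`/`fail` arguments are illustrative):

```shell
# Hypothetical helper: apply the first two rows of the diagnostic matrix.
# Args: result of "ping pod IP" and result of "curl pod port" (ok|fail).
diagnose() {
  ping_ip="$1"; curl_port="$2"
  if [ "$ping_ip" = "fail" ]; then
    echo "CNI or node problem"
  elif [ "$curl_port" = "fail" ]; then
    echo "Application not started or wrong port"
  else
    echo "connectivity OK at this layer"
  fi
}

diagnose fail fail   # CNI or node problem
diagnose ok fail     # Application not started or wrong port
```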

Kubernetes Cloud operations engineers must master these diagnostics to reduce resolution time. IT teams lose an average of 34 working days per year on Kubernetes incidents.

How to Resolve CNI Plugin Errors?

The CNI (Container Network Interface) plugin manages IP address assignment to pods and network routing. CNI errors block pod startup.

Symptoms

Warning  FailedCreatePodSandBox  NetworkPlugin cni failed to set up pod network

Diagnosis

# CNI state on nodes
kubectl describe node <node-name> | grep -A5 "Conditions:"

# CNI pod logs (Calico example)
kubectl logs -n kube-system -l k8s-app=calico-node --tail=100

# Check CNI configuration (run on the node itself)
ls -la /etc/cni/net.d/
cat /etc/cni/net.d/10-calico.conflist

Solutions by CNI

| CNI | Common Issue | Solution |
|---|---|---|
| Calico | calico-node in CrashLoop | Check etcd/datastore connectivity |
| Flannel | Pods without IP | Restart flanneld on affected node |
| Cilium | cilium-agent degraded | Validate kernel version (≥4.9.17 required) |
| Weave | Network fragmentation | Reset with weave reset |

# Force restart CNI pods
kubectl rollout restart daemonset -n kube-system calico-node

Modern CNIs like Cilium leverage eBPF for improved performance and observability.

Remember: CNI problems affect all pods on a node. If multiple pods fail simultaneously on the same node, suspect the CNI.

How to Troubleshoot Ingress and External Access Issues?

Ingress controls HTTP/HTTPS access from outside the cluster. 502/504 errors or timeouts indicate configuration or backend problems.

Ingress Diagnosis

# Ingress state
kubectl get ingress my-ingress -o wide

# Check Ingress Controller
kubectl get pods -n ingress-nginx
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100

# Check external address
kubectl get svc -n ingress-nginx ingress-nginx-controller

Troubleshooting Checklist

  1. Verify backend Service exists and has Endpoints
  2. Validate Service name in Ingress matches exactly
  3. Confirm specified port is correct
  4. Test direct Service access (bypass Ingress)

# Direct backend test (bypass Ingress)
kubectl port-forward svc/my-service 8080:80
curl localhost:8080
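Checklist items 2 and 3 come down to two fields in the Ingress manifest that must match the Service exactly (a minimal sketch; names and ports are illustrative):

```yaml
# networking.k8s.io/v1 Ingress excerpt
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service  # must match the Service name exactly
                port:
                  number: 80      # must match a Service port (not targetPort)
```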

For a structured troubleshooting approach, refer to our Kubernetes production observability checklist.

TLS Errors

# Check TLS secret
kubectl get secret my-tls-secret -o yaml

# Validate certificate
kubectl get secret my-tls-secret -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout
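Expired certificates are a common cause of TLS errors; the decoded certificate's validity window can be checked with openssl. A sketch using a throwaway self-signed certificate for illustration (assumes openssl is installed; against a real cluster you would run the check on the tls.crt decoded from the secret):

```shell
# Generate a throwaway self-signed cert to stand in for the decoded secret
openssl req -x509 -newkey rsa:2048 -keyout /tmp/tls.key -out /tmp/tls.crt \
  -days 30 -nodes -subj "/CN=example.test" 2>/dev/null

# The same check applies to the tls.crt decoded from the Kubernetes secret:
# an expiry date in the past means clients will reject the connection
openssl x509 -enddate -noout -in /tmp/tls.crt
```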

How to Identify Blocking NetworkPolicies?

NetworkPolicies control traffic between pods. An overly restrictive policy silently blocks connections without an explicit error message.

Symptoms

Connections that timeout without a clear error. Pods seem to work but cannot communicate.

Diagnosis

# List all NetworkPolicies
kubectl get networkpolicies -A

# Analyze a specific policy
kubectl describe networkpolicy my-policy -n my-namespace

# Check which pods are affected
kubectl get pods -n my-namespace --show-labels

Connectivity Test with Policy

# Identify policies whose podSelector targets a given pod label
kubectl get networkpolicies -n my-namespace -o json | jq '.items[] | select(.spec.podSelector.matchLabels["app"] == "my-app")'

Troubleshooting Rule

Temporarily disable NetworkPolicies to confirm they're the cause:

# Backup then delete
kubectl get networkpolicy my-policy -o yaml > policy-backup.yaml
kubectl delete networkpolicy my-policy

# Test connectivity
# Then restore
kubectl apply -f policy-backup.yaml
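The kind of policy that produces these silent timeouts is typically a default-deny, a standard pattern shown here for illustration:

```yaml
# Deny all ingress traffic to every pod in the namespace;
# any connection not explicitly allowed by another policy is silently dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-namespace
spec:
  podSelector: {}        # empty selector: applies to all pods
  policyTypes:
    - Ingress            # no ingress rules listed => all inbound traffic denied
```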

Remember: If a connection fails without an error message, check NetworkPolicies. They're the most common cause of silent blocks.

The LFS460 Kubernetes Security Fundamentals training covers secure NetworkPolicy configuration in detail, an essential skill for system administrators preparing for CKS certification.

How to Prevent Recurring Network Problems?

Proactive Monitoring

According to Grafana Labs, 75% of Kubernetes users adopt Prometheus and Grafana for monitoring. Configure alerts on critical network metrics.

# Example Prometheus alert for CoreDNS
- alert: CoreDNSDown
  expr: up{job="coredns"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "CoreDNS is down"

Prevention Checklist

| Action | Frequency | Tool |
|---|---|---|
| Check CoreDNS health | Continuous | Prometheus |
| Validate Endpoints | Before each deployment | CI/CD |
| Test inter-node connectivity | Weekly | Automated script |
| Audit NetworkPolicies | Monthly | kubectl + documentation |
| Update CNI | Quarterly | Follow releases |

Check our complete guide on Kubernetes Monitoring and Troubleshooting to implement a full observability strategy.

Network Documentation

Systematically document:

  • Cluster network architecture (pod, Service, and node CIDRs)
  • Applied NetworkPolicies and their justification
  • Custom DNS configuration
  • External firewall rules

For an overview of deployment failures related to networking, check our Kubernetes deployment troubleshooting guide.

Remember: Prevention costs less than troubleshooting. Teams that document their network architecture resolve incidents 3x faster.

Develop Your Kubernetes Network Diagnosis Skills

Kubernetes network diagnosis requires deep understanding of each layer: DNS, Services, CNI, Ingress, and NetworkPolicies. With 82% of container users running Kubernetes in production, these skills have become essential.

As Chris Aniszczyk, CNCF CTO, states: "Kubernetes is no longer experimental but foundational."

SFEIR offers official Linux Foundation trainings to master Kubernetes administration and troubleshooting:

Contact our advisors to build your Kubernetes networking skill development path.