
Diagnose and Resolve Network Issues in a Kubernetes Cluster

SFEIR Institute

Key Takeaways

  • 34 working days per year spent on Kubernetes issues according to Cloud Native Now
  • 4-step methodology: pod-to-pod, DNS, Services, Network Policies
  • Essential tools: kubectl, nslookup, tcpdump

Every Kubernetes infrastructure engineer preparing for CKA certification spends a significant portion of their time on network troubleshooting. According to Cloud Native Now, IT teams spend 34 working days per year resolving Kubernetes problems. Kubernetes network debugging represents a differentiating skill for any professional.

TL;DR: Kubernetes DNS Services troubleshooting follows a structured methodology: verify pod-to-pod connectivity, validate DNS resolution, test Services, then inspect Network Policies. kubectl, nslookup, and tcpdump are your allies.

Professionals who want to go further can follow the LFS458 Kubernetes Administration training.

Why Must the Kubernetes Infrastructure Engineer Master Network Diagnosis?

Kubernetes networking relies on multiple abstraction layers. A malfunction can occur at the CNI, Services, DNS, or Network Policies level.

According to the CNCF Annual Survey 2025, 82% of container users run Kubernetes in production. This massive adoption multiplies potential network incidents.

| Layer | Components | Common Issues |
|---|---|---|
| L2/L3 | CNI (Calico, Cilium) | Pods without IP, missing routes |
| L4 | Services, kube-proxy | Inaccessible ClusterIP, corrupted iptables |
| L7 | Ingress, DNS | TLS certificates, DNS resolution |
| Policies | NetworkPolicy | Unintentionally blocked traffic |

Remember: 80% of Kubernetes network problems come from misconfigurations, not software bugs. Check manifests first before suspecting infrastructure.

How to Resolve Kubernetes Pod Network Issues at Connectivity Level?

Basic Connectivity Test

# Create a diagnostic pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash

# From the pod, test connectivity to another pod
ping <pod-ip>
curl <pod-ip>:<port>

# Test connectivity to a Service
curl <service-name>.<namespace>.svc.cluster.local:<port>

Check CNI State

The CNI assigns IP addresses to pods. A failing CNI leaves pods stuck in the ContainerCreating state.

# Check CNI pods
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl get pods -n kube-system -l k8s-app=cilium

# Inspect CNI logs
kubectl logs -n kube-system -l k8s-app=calico-node --tail=100

# Check routes on a node
ip route show

For complete network configuration, see Configure a Kubernetes cluster network: CNI, Services, Ingress.

Diagnosis with tcpdump

# Capture traffic on pod interface
kubectl debug -it <pod-name> --image=nicolaka/netshoot -- tcpdump -i eth0 -nn port 80

# Analyze ICMP traffic
kubectl debug -it <pod-name> --image=nicolaka/netshoot -- tcpdump -i eth0 icmp

Remember: A pod without an IP after 30 seconds indicates a CNI problem. Check the CNI DaemonSet logs first.

How to Master Kubernetes DNS Services Troubleshooting?

DNS is critical. Every Service resolution goes through CoreDNS. A failing DNS paralyzes the cluster.

Check CoreDNS

# CoreDNS pod state
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl get deployment coredns -n kube-system

# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# CoreDNS metrics (exposed on port 9153)
kubectl get --raw /api/v1/namespaces/kube-system/pods/coredns-xxx:9153/proxy/metrics

DNS Resolution Tests

# From a test pod
kubectl run dnstest --rm -it --image=busybox:1.36 -- nslookup kubernetes.default

# Full resolution
kubectl run dnstest --rm -it --image=nicolaka/netshoot -- dig kubernetes.default.svc.cluster.local

# Check pod's resolv.conf file
kubectl exec <pod-name> -- cat /etc/resolv.conf

Expected DNS configuration:

nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
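With ndots:5, any name containing fewer than five dots is first expanded through the search list before being tried as an absolute name, which multiplies DNS queries. The expansion can be sketched in plain shell (the name and search domains below are illustrative, taken from a typical pod resolv.conf):

```shell
# Simulate the resolver's search-list expansion for a short name under ndots:5
# (name and search domains are illustrative examples)
name="backend"
ndots=5
search_domains="default.svc.cluster.local svc.cluster.local cluster.local"

# Count the dots in the name: below the ndots threshold,
# every search domain is tried before the absolute name
dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)

candidates=""
if [ "$dots" -lt "$ndots" ]; then
  for domain in $search_domains; do
    candidates="$candidates $name.$domain"
  done
fi
candidates="$candidates $name."  # the absolute (root-qualified) name is tried last

# Print each lookup the resolver would attempt, in order
for c in $candidates; do echo "$c"; done
```

Four DNS queries for a single short name: that is why lowering ndots or using fully-qualified names (ending in a dot) speeds up resolution.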

Common DNS Issues

| Symptom | Probable Cause | Solution |
|---|---|---|
| NXDOMAIN | Non-existent service | Check with kubectl get svc |
| SERVFAIL | CoreDNS overloaded | Scale the Deployment |
| Timeout | NetworkPolicy blocks UDP 53 | Allow DNS traffic |
| Slow resolution | ndots:5 too high | Reduce ndots or use FQDNs |

# Optimize DNS queries in the pod spec
apiVersion: v1
kind: Pod
metadata:
  name: optimized-dns
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"
    - name: single-request-reopen
  containers:
  - name: app
    image: myapp:v1

For fundamentals, see Kubernetes fundamentals for beginners.

How to Debug Kubernetes Services That Don't Work?

A Service exposes pods via a stable IP. Issues generally come from selectors or endpoints.

Check Endpoints

# List Service endpoints
kubectl get endpoints <service-name>
kubectl describe endpoints <service-name>

# A Service without endpoints indicates:
# - No pods with matching labels
# - Non-Ready pods
# - Pods on different ports

Validate Selectors

# Compare Service and pod labels
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'
kubectl get pods -l <selector> --show-labels

# Example mismatch
# Service selector: app=frontend
# Pod labels: app=front-end  # MISMATCH!
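To make the match explicit, here is a minimal Service and Deployment pair whose selector and pod labels agree (names, image, and ports are illustrative):

```yaml
# The Service selector must match the pod template labels exactly
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:
    app: frontend        # must equal the pod labels below
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend    # matched by the Service selector
    spec:
      containers:
      - name: app
        image: myapp:v1
        ports:
        - containerPort: 8080
```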

Diagnose kube-proxy

kube-proxy manages iptables/IPVS rules to route traffic to pods.

# Check kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# Inspect iptables rules (iptables mode)
iptables -t nat -L KUBE-SERVICES -n | grep <service-name>

# Inspect IPVS (ipvs mode)
ipvsadm -Ln | grep <cluster-ip>

# kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50

Remember: A ClusterIP Service is only accessible from inside the cluster. Use kubectl port-forward to test it from your workstation.

For essential commands, see kubectl cheat sheet: essential administration commands.

How to Identify Blocks Caused by Network Policies?

Network Policies are additive: as soon as any policy selects a pod, traffic in the covered direction is denied by default and only explicitly allowed flows pass. Legitimate traffic can therefore be blocked by an overly restrictive policy.

Network Policy Audit

# List all Network Policies
kubectl get networkpolicies -A

# Inspect a specific policy
kubectl describe networkpolicy <policy-name> -n <namespace>

# Identify impacted pods
kubectl get pods -n <namespace> -l <policy-selector>

Example of Blocking Policy

# This policy blocks ALL incoming traffic to app=backend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  # No ingress rules = all ingress traffic blocked

Unblock Necessary Traffic

# Allow traffic from frontend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080

Modern CNIs like Cilium offer advanced observability on blocked traffic thanks to eBPF.
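With Cilium, Hubble surfaces the policy verdict for each flow. Assuming Hubble is enabled in the cluster and the hubble CLI is installed, a sketch of how dropped traffic can be inspected:

```shell
# List recent flows dropped by policy (requires Hubble enabled in the cluster)
hubble observe --verdict DROPPED --last 20

# Narrow to traffic destined for the backend pods
hubble observe --verdict DROPPED --to-label app=backend
```

Each flow line shows source, destination, and the verdict, which pinpoints the policy at fault far faster than reading manifests.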

For deeper security, see Secure a Kubernetes cluster: best practices.

What Tools to Use for Advanced Kubernetes Network Debugging Diagnosis?

Built-in Tools

# kubectl debug to inject a debug container
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>

# kubectl exec for existing pods
kubectl exec -it <pod-name> -- /bin/sh

# kubectl port-forward to test Services
kubectl port-forward svc/<service-name> 8080:80

Specialized Tools

| Tool | Usage | Command |
|---|---|---|
| netshoot | All-in-one image (curl, dig, tcpdump, nmap) | kubectl run netshoot --rm -it --image=nicolaka/netshoot |
| ksniff | Wireshark capture from a pod | kubectl sniff |
| kubeshark | Real-time API traffic analysis | kubeshark tap |
| cilium connectivity test | Cilium connectivity validation | cilium connectivity test |

# Example with kubeshark
kubeshark tap -n default
# Opens a web interface to analyze HTTP/gRPC traffic

# Cilium connectivity test (run with the cilium CLI from your workstation)
cilium connectivity test

Remember: The nicolaka/netshoot image contains over 30 network tools. Keep it in your troubleshooting arsenal.

How Does the Kubernetes Infrastructure Engineer Approach Exam Troubleshooting?

The CKA exam allocates 30% of the score to troubleshooting, with a significant portion concerning networking. According to Linux Foundation, the exam lasts 2 hours with a 66% passing score.

Quick Diagnosis Methodology

  1. Identify the exact symptom (timeout, connection refused, DNS failure)
  2. Isolate the problematic layer (pod, Service, DNS, policy)
  3. Check relevant logs and events
  4. Test connectivity at each level
  5. Fix and validate
# Typical diagnosis workflow
# 1. Pod state
kubectl get pods -o wide

# 2. Recent events
kubectl get events --sort-by=.metadata.creationTimestamp

# 3. Application logs
kubectl logs <pod-name> --tail=50

# 4. Connectivity test
kubectl run test --rm -it --image=busybox -- wget -qO- <service>:<port>

For common problems, see Resolve the 10 most common Kubernetes cluster problems.

As TealHQ reminds us: "Don't let your knowledge remain theoretical - set up a real Kubernetes environment to solidify your skills."

How to Document and Prevent Recurring Network Incidents?

Diagnostic Runbook

Create a structured document for each incident type:

## Incident: Pods cannot resolve DNS

### Symptoms
- Applications timeout on HTTP calls
- Logs: "could not resolve host"

### Diagnosis
1. kubectl get pods -n kube-system -l k8s-app=kube-dns
2. kubectl logs -n kube-system coredns-xxx
3. kubectl run test --rm -it --image=busybox -- nslookup kubernetes

### Solutions
- Scale CoreDNS: kubectl scale deployment coredns -n kube-system --replicas=3
- Check NetworkPolicy on UDP/53
- Restart CoreDNS pods if needed

Proactive Monitoring

# PrometheusRule to alert on DNS errors
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: coredns-alerts
spec:
  groups:
  - name: coredns
    rules:
    - alert: CoreDNSErrorsHigh
      expr: rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]) > 0.1
      for: 5m
      labels:
        severity: warning

For environment comparisons, see Managed Kubernetes (EKS, AKS, GKE) vs self-hosted: comparison.

Remember: Every resolved incident should enrich your knowledge base. Time invested in documentation pays off with each future occurrence.

Develop Your Troubleshooting Skills with SFEIR Institute

Kubernetes network diagnosis distinguishes junior administrators from experts. With 71% of Fortune 100 companies running Kubernetes in production, these skills are highly valued.

SFEIR Institute training courses cover advanced troubleshooting in depth.

According to Linux Foundation, CKA and CKS certifications are valid for 2 years. Validate your skills with training guided by practitioners.

Contact our teams to build your path to certification.