
Diagnose and Resolve Network Issues in a Kubernetes Cluster

SFEIR Institute

Key Takeaways

  • 34 working days per year spent on Kubernetes issues according to Cloud Native Now
  • 4-step methodology: pod-to-pod, DNS, Services, Network Policies
  • Essential tools: kubectl, nslookup, tcpdump

Every Kubernetes infrastructure engineer preparing for CKA certification spends a significant portion of their time on network troubleshooting. According to Cloud Native Now, IT teams spend 34 working days per year resolving Kubernetes problems. Kubernetes network debugging represents a differentiating skill for any professional.

TL;DR: Kubernetes DNS Services troubleshooting follows a structured methodology: verify pod-to-pod connectivity, validate DNS resolution, test Services, then inspect Network Policies. kubectl, nslookup, and tcpdump are your allies.

Professionals who want to go further can follow the LFS458 Kubernetes Administration training.

Why Must the Kubernetes Infrastructure Engineer Master Network Diagnosis?

Kubernetes networking relies on multiple abstraction layers. A malfunction can occur at the CNI, Services, DNS, or Network Policies level.

According to the CNCF Annual Survey 2025, 82% of container users run Kubernetes in production. This massive adoption multiplies potential network incidents.

| Layer | Components | Common Issues |
|---|---|---|
| L2/L3 | CNI (Calico, Cilium) | Pods without IP, missing routes |
| L4 | Services, kube-proxy | Inaccessible ClusterIP, corrupted iptables |
| L7 | Ingress, DNS | TLS certificates, DNS resolution |
| Policies | NetworkPolicy | Unintentionally blocked traffic |

Remember: 80% of Kubernetes network problems come from misconfigurations, not software bugs. Check manifests first before suspecting infrastructure.

How to Resolve Kubernetes Pod Network Issues at Connectivity Level?

Basic Connectivity Test

# Create a diagnostic pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash

# From the pod, test connectivity to another pod
ping <pod-ip>
curl <pod-ip>:<port>

# Test connectivity to a Service
curl <service-name>.<namespace>.svc.cluster.local:<port>

Check CNI State

The CNI assigns IP addresses to pods. A failing CNI leaves pods stuck in the ContainerCreating state.

# Check CNI pods
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl get pods -n kube-system -l k8s-app=cilium

# Inspect CNI logs
kubectl logs -n kube-system -l k8s-app=calico-node --tail=100

# Check routes on a node
ip route show

For complete network configuration, see Configure a Kubernetes cluster network: CNI, Services, Ingress.

Diagnosis with tcpdump

# Capture traffic on pod interface
kubectl debug -it <pod-name> --image=nicolaka/netshoot -- tcpdump -i eth0 -nn port 80

# Analyze ICMP traffic
kubectl debug -it <pod-name> --image=nicolaka/netshoot -- tcpdump -i eth0 icmp

Remember: A pod without an IP after 30 seconds indicates a CNI problem. Check the CNI DaemonSet logs first.

How to Master Kubernetes DNS Services Troubleshooting?

DNS is critical. Every Service resolution goes through CoreDNS. A failing DNS paralyzes the cluster.

Check CoreDNS

# CoreDNS pod state
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl get deployment coredns -n kube-system

# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# CoreDNS metrics (exposed on port 9153)
kubectl get --raw /api/v1/namespaces/kube-system/pods/coredns-xxx:9153/proxy/metrics

DNS Resolution Tests

# From a test pod
kubectl run dnstest --rm -it --image=busybox:1.36 -- nslookup kubernetes.default

# Full resolution
kubectl run dnstest --rm -it --image=nicolaka/netshoot -- dig kubernetes.default.svc.cluster.local

# Check pod's resolv.conf file
kubectl exec <pod-name> -- cat /etc/resolv.conf

Expected DNS configuration:

nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
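With ndots:5, any name containing fewer than five dots is first expanded through the search list before being tried as an absolute name, which multiplies DNS queries. The expansion can be sketched in plain shell (the name and search domains below are illustrative, taken from a typical pod resolv.conf):

```shell
# Simulate the resolver's search-list expansion for a short name under ndots:5
# (name and search domains are illustrative examples)
name="backend"
ndots=5
search_domains="default.svc.cluster.local svc.cluster.local cluster.local"

# Count the dots in the name: below the ndots threshold,
# every search domain is tried before the absolute name
dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)

candidates=""
if [ "$dots" -lt "$ndots" ]; then
  for domain in $search_domains; do
    candidates="$candidates $name.$domain"
  done
fi
candidates="$candidates $name."  # the absolute (root-qualified) name is tried last

# Print each lookup the resolver would attempt, in order
for c in $candidates; do echo "$c"; done
```

Four DNS queries for a single short name: that is why lowering ndots or using fully-qualified names (ending in a dot) speeds up resolution.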

Common DNS Issues

| Symptom | Probable Cause | Solution |
|---|---|---|
| NXDOMAIN | Non-existent service | Check with kubectl get svc |
| SERVFAIL | CoreDNS overloaded | Scale the Deployment |
| Timeout | NetworkPolicy blocks UDP 53 | Allow DNS traffic |
| Slow resolution | ndots:5 too high | Reduce ndots or use FQDNs |

# Optimize DNS queries in the pod spec
apiVersion: v1
kind: Pod
metadata:
  name: optimized-dns
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"
    - name: single-request-reopen
  containers:
  - name: app
    image: myapp:v1

For fundamentals, see Kubernetes fundamentals for beginners.

How to Debug Kubernetes Services That Don't Work?

A Service exposes pods via a stable IP. Issues generally come from selectors or endpoints.

Check Endpoints

# List Service endpoints
kubectl get endpoints <service-name>
kubectl describe endpoints <service-name>

# A Service without endpoints indicates:
# - No pods with matching labels
# - Non-Ready pods
# - Pods on different ports

Validate Selectors

# Compare Service and pod labels
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'
kubectl get pods -l <selector> --show-labels

# Example mismatch
# Service selector: app=frontend
# Pod labels: app=front-end  # MISMATCH!
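To make the match explicit, here is a minimal Service and Deployment pair whose selector and pod labels agree (names, image, and ports are illustrative):

```yaml
# The Service selector must match the pod template labels exactly
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:
    app: frontend        # must equal the pod labels below
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend    # matched by the Service selector
    spec:
      containers:
      - name: app
        image: myapp:v1
        ports:
        - containerPort: 8080
```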

Diagnose kube-proxy

kube-proxy manages iptables/IPVS rules to route traffic to pods.

# Check kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# Inspect iptables rules (iptables mode)
iptables -t nat -L KUBE-SERVICES -n | grep <service-name>

# Inspect IPVS (ipvs mode)
ipvsadm -Ln | grep <cluster-ip>

# kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50

Remember: A ClusterIP Service is only accessible from inside the cluster. Use kubectl port-forward to test it from your workstation.

For essential commands, see kubectl cheat sheet: essential administration commands.

How to Identify Blocks Caused by Network Policies?

Network Policies are additive: as soon as any policy selects a pod, traffic in the covered direction is denied by default and only explicitly allowed flows pass. Legitimate traffic can therefore be blocked by an overly restrictive policy.

Network Policy Audit

# List all Network Policies
kubectl get networkpolicies -A

# Inspect a specific policy
kubectl describe networkpolicy <policy-name> -n <namespace>

# Identify impacted pods
kubectl get pods -n <namespace> -l <policy-selector>

Example of Blocking Policy

# This policy blocks ALL incoming traffic to app=backend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  # No ingress rules = all ingress traffic blocked

Unblock Necessary Traffic

# Allow traffic from frontend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080

Modern CNIs like Cilium offer advanced observability on blocked traffic thanks to eBPF.
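With Cilium, Hubble surfaces the policy verdict for each flow. Assuming Hubble is enabled in the cluster and the hubble CLI is installed, a sketch of how dropped traffic can be inspected:

```shell
# List recent flows dropped by policy (requires Hubble enabled in the cluster)
hubble observe --verdict DROPPED --last 20

# Narrow to traffic destined for the backend pods
hubble observe --verdict DROPPED --to-label app=backend
```

Each flow line shows source, destination, and the verdict, which pinpoints the policy at fault far faster than reading manifests.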

For deeper security, see Secure a Kubernetes cluster: best practices.

What Tools to Use for Advanced Kubernetes Network Debugging Diagnosis?

Built-in Tools

# kubectl debug to inject a debug container
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>

# kubectl exec for existing pods
kubectl exec -it <pod-name> -- /bin/sh

# kubectl port-forward to test Services
kubectl port-forward svc/<service-name> 8080:80

Specialized Tools

| Tool | Usage | Command |
|---|---|---|
| netshoot | All-in-one image (curl, dig, tcpdump, nmap) | kubectl run netshoot --rm -it --image=nicolaka/netshoot |
| ksniff | Wireshark capture from a pod | kubectl sniff |
| kubeshark | Real-time API traffic analysis | kubeshark tap |
| cilium connectivity test | Cilium connectivity validation | cilium connectivity test |

# Example with kubeshark
kubeshark tap -n default
# Opens a web interface to analyze HTTP/gRPC traffic

# Cilium connectivity test (run with the cilium CLI from your workstation)
cilium connectivity test

Remember: The nicolaka/netshoot image contains over 30 network tools. Keep it in your troubleshooting arsenal.

How Does the Kubernetes Infrastructure Engineer Approach Exam Troubleshooting?

The CKA exam allocates 30% of the score to troubleshooting, with a significant portion concerning networking. According to Linux Foundation, the exam lasts 2 hours with a 66% passing score.

Quick Diagnosis Methodology

  1. Identify the exact symptom (timeout, connection refused, DNS failure)
  2. Isolate the problematic layer (pod, Service, DNS, policy)
  3. Check relevant logs and events
  4. Test connectivity at each level
  5. Fix and validate
# Typical diagnosis workflow
# 1. Pod state
kubectl get pods -o wide

# 2. Recent events
kubectl get events --sort-by=.metadata.creationTimestamp

# 3. Application logs
kubectl logs <pod-name> --tail=50

# 4. Connectivity test
kubectl run test --rm -it --image=busybox -- wget -qO- <service>:<port>

For common problems, see Resolve the 10 most common Kubernetes cluster problems.

As TealHQ reminds us: "Don't let your knowledge remain theoretical - set up a real Kubernetes environment to solidify your skills."

How to Document and Prevent Recurring Network Incidents?

Diagnostic Runbook

Create a structured document for each incident type:

## Incident: Pods cannot resolve DNS

### Symptoms
- Applications timeout on HTTP calls
- Logs: "could not resolve host"

### Diagnosis
1. kubectl get pods -n kube-system -l k8s-app=kube-dns
2. kubectl logs -n kube-system coredns-xxx
3. kubectl run test --rm -it --image=busybox -- nslookup kubernetes

### Solutions
- Scale CoreDNS: kubectl scale deployment coredns -n kube-system --replicas=3
- Check NetworkPolicy on UDP/53
- Restart CoreDNS pods if needed

Proactive Monitoring

# PrometheusRule to alert on DNS errors
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: coredns-alerts
spec:
  groups:
  - name: coredns
    rules:
    - alert: CoreDNSErrorsHigh
      expr: rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]) > 0.1
      for: 5m
      labels:
        severity: warning

For environment comparisons, see Managed Kubernetes (EKS, AKS, GKE) vs self-hosted: comparison.

Remember: Every resolved incident should enrich your knowledge base. Time invested in documentation pays off with each future occurrence.

Develop Your Troubleshooting Skills with SFEIR Institute

Kubernetes network diagnosis distinguishes junior administrators from experts. With 71% of Fortune 100 companies running Kubernetes in production, these skills are highly valued.

SFEIR Institute training courses cover advanced troubleshooting in depth.

According to Linux Foundation, CKA and CKS certifications are valid for 2 years. Validate your skills with training guided by practitioners.

Contact our teams to build your path to certification.