Key Takeaways
- 34 working days per year spent resolving Kubernetes issues, according to Cloud Native Now
- A 4-step methodology: pod-to-pod connectivity, DNS, Services, Network Policies
- Essential tools: kubectl, nslookup, tcpdump
Every Kubernetes infrastructure engineer preparing for CKA certification spends a significant portion of their time on network troubleshooting. According to Cloud Native Now, IT teams spend 34 working days per year resolving Kubernetes problems. Kubernetes network debugging represents a differentiating skill for any professional.
TL;DR: Kubernetes DNS Services troubleshooting follows a structured methodology: verify pod-to-pod connectivity, validate DNS resolution, test Services, then inspect Network Policies. kubectl, nslookup, and tcpdump are your allies.
Professionals who want to go further follow the LFS458 Kubernetes Administration training.
Why Must the Kubernetes Infrastructure Engineer Master Network Diagnosis?
Kubernetes networking relies on multiple abstraction layers. A malfunction can occur at the CNI, Services, DNS, or Network Policies level.
According to the CNCF Annual Survey 2025, 82% of container users run Kubernetes in production. This massive adoption multiplies potential network incidents.
| Layer | Components | Common Issues |
|---|---|---|
| L2/L3 | CNI (Calico, Cilium) | Pods without IP, missing routes |
| L4 | Services, kube-proxy | Inaccessible ClusterIP, corrupted iptables |
| L7 | Ingress, DNS | TLS certificates, DNS resolution |
| Policies | NetworkPolicy | Unintentionally blocked traffic |
Remember: 80% of Kubernetes network problems come from misconfigurations, not software bugs. Check manifests first before suspecting infrastructure.
How to Resolve Kubernetes Pod Network Issues at Connectivity Level?
Basic Connectivity Test
```bash
# Create a diagnostic pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash

# From the pod, test connectivity to another pod
ping <pod-ip>
curl <pod-ip>:<port>

# Test connectivity to a Service
curl <service-name>.<namespace>.svc.cluster.local:<port>
```
Check CNI State
The CNI assigns IP addresses to pods. A failing CNI leaves pods stuck in the ContainerCreating state.
```bash
# Check CNI pods
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl get pods -n kube-system -l k8s-app=cilium

# Inspect CNI logs
kubectl logs -n kube-system -l k8s-app=calico-node --tail=100

# Check routes on a node
ip route show
```
For complete network configuration, see Configure a Kubernetes cluster network: CNI, Services, Ingress.
Diagnosis with tcpdump
```bash
# Capture traffic on the pod's interface
kubectl debug -it <pod-name> --image=nicolaka/netshoot -- tcpdump -i eth0 -nn port 80

# Analyze ICMP traffic
kubectl debug -it <pod-name> --image=nicolaka/netshoot -- tcpdump -i eth0 icmp
```
Remember: A pod without an IP after 30 seconds indicates a CNI problem. Check CNI DaemonSet logs first.
How to Master Kubernetes DNS Services Troubleshooting?
DNS is critical. Every Service resolution goes through CoreDNS. A failing DNS paralyzes the cluster.
Check CoreDNS
```bash
# CoreDNS pod state
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl get deployment coredns -n kube-system

# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# CoreDNS metrics
kubectl get --raw /api/v1/namespaces/kube-system/pods/coredns-xxx/proxy/metrics
```
DNS Resolution Tests
```bash
# From a test pod
kubectl run dnstest --rm -it --image=busybox:1.36 -- nslookup kubernetes.default

# Full resolution
kubectl run dnstest --rm -it --image=nicolaka/netshoot -- dig kubernetes.default.svc.cluster.local

# Check the pod's resolv.conf file
kubectl exec <pod-name> -- cat /etc/resolv.conf
```
Expected DNS configuration:
```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```
Common DNS Issues
| Symptom | Probable Cause | Solution |
|---|---|---|
| NXDOMAIN | Non-existent Service | Check with kubectl get svc |
| SERVFAIL | CoreDNS overloaded | Scale the CoreDNS Deployment |
| Timeout | NetworkPolicy blocks UDP 53 | Allow DNS traffic |
| Slow resolution | ndots:5 too high | Reduce or use FQDN |
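The ndots:5 row deserves a closer look. The resolver rule behind it is simple: a name containing fewer dots than ndots is first tried with every search suffix appended, so each lookup can cost several extra round-trips. A minimal bash sketch of that decision rule (an illustration only, not real resolver code):

```bash
# Sketch of the resolver's ndots rule: a name with fewer dots than ndots
# is tried with every "search" suffix before being queried as-is.
tries_search_first() {  # usage: tries_search_first <name> <ndots>
  local dots="${1//[^.]/}"            # keep only the dots
  if (( ${#dots} < $2 )); then echo yes; else echo no; fi
}

tries_search_first "api.example.com" 5   # yes: 2 dots < 5, so every cluster
                                         # search suffix is tried (and fails) first
tries_search_first "api.example.com" 2   # no: with ndots:2 the external name
                                         # is queried directly
```

A trailing dot (api.example.com.) marks the name as absolute and always skips the search list, whatever ndots is set to, which is why the table recommends FQDNs as an alternative to lowering ndots.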
```yaml
# Optimize DNS queries in a pod
apiVersion: v1
kind: Pod
metadata:
  name: optimized-dns
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
      - name: single-request-reopen
  containers:
    - name: app
      image: myapp:v1
```
For fundamentals, see Kubernetes fundamentals for beginners.
How to Debug Kubernetes Services That Don't Work?
A Service exposes pods via a stable IP. Issues generally come from selectors or endpoints.
Check Endpoints
```bash
# List Service endpoints
kubectl get endpoints <service-name>
kubectl describe endpoints <service-name>

# A Service without endpoints indicates:
# - No pods with matching labels
# - Non-Ready pods
# - Pods listening on different ports
```
Validate Selectors
```bash
# Compare Service and pod labels
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'
kubectl get pods -l <selector> --show-labels

# Example mismatch:
# Service selector: app=frontend
# Pod labels:       app=front-end   # MISMATCH!
```
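Endpoint selection is plain key=value matching: every pair in the Service selector must appear among the pod's labels. A quick local sketch with hypothetical label values shows how a one-character typo silently empties the endpoint list:

```bash
# Illustrative comparison (hypothetical values): a pod backs a Service only
# if every key=value pair of the selector appears in the pod's labels.
# This sketch checks a single-key selector; real matching tests each pair.
selector="app=frontend"
pod_labels="app=front-end,tier=web"    # typo: front-end vs frontend

case ",${pod_labels}," in
  *",${selector},"*) echo "MATCH: pod becomes a Service endpoint" ;;
  *)                 echo "MISMATCH: pod is invisible to the Service" ;;
esac
```

Because no error is raised for an empty selection, the only symptom is an endpoint list with zero entries, which is why checking endpoints comes before checking kube-proxy.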
Diagnose kube-proxy
kube-proxy manages iptables/IPVS rules to route traffic to pods.
```bash
# Check the kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# Inspect iptables rules (iptables mode)
iptables -t nat -L KUBE-SERVICES -n | grep <service-name>

# Inspect IPVS (ipvs mode)
ipvsadm -Ln | grep <cluster-ip>

# kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50
```
Remember: A ClusterIP-type Service is only accessible from inside the cluster. Use kubectl port-forward to test it from your workstation.
For essential commands, see kubectl cheat sheet: essential administration commands.
How to Identify Blocks Caused by Network Policies?
Once a Network Policy selects a pod, any traffic not explicitly allowed is denied. Legitimate traffic can therefore be blocked by an overly restrictive policy.
Network Policy Audit
```bash
# List all Network Policies
kubectl get networkpolicies -A

# Inspect a specific policy
kubectl describe networkpolicy <policy-name> -n <namespace>

# Identify impacted pods
kubectl get pods -n <namespace> -l <policy-selector>
```
Example of Blocking Policy
```yaml
# This policy blocks ALL incoming traffic to app=backend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  # No ingress rule = all traffic blocked
```
Unblock Necessary Traffic
```yaml
# Allow traffic from frontend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```
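A frequent variant of the same problem: once a policy restricts Egress, pods can no longer reach CoreDNS, which shows up as the UDP/53 timeouts listed in the DNS issues table. A companion rule along these lines keeps resolution working (a sketch; it assumes CoreDNS pods carry the standard k8s-app=kube-dns label):

```yaml
# Sketch: allow DNS egress so policy-restricted pods can still resolve names.
# Assumption: CoreDNS pods are labeled k8s-app=kube-dns (the usual default).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```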
Modern CNIs like Cilium offer advanced eBPF-based observability on blocked traffic, for example via hubble observe --verdict DROPPED.
For deeper security, see Secure a Kubernetes cluster: best practices.
What Tools to Use for Advanced Kubernetes Network Debugging Diagnosis?
Built-in Tools
```bash
# kubectl debug to inject a debug container
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>

# kubectl exec for existing pods
kubectl exec -it <pod-name> -- /bin/sh

# kubectl port-forward to test Services
kubectl port-forward svc/<service-name> 8080:80
```
Specialized Tools
| Tool | Usage | Command |
|---|---|---|
| netshoot | All-in-one image (curl, dig, tcpdump, nmap) | kubectl run netshoot --rm -it --image=nicolaka/netshoot |
| ksniff | Wireshark capture from a pod | kubectl sniff |
| kubeshark | Real-time API traffic analysis | kubeshark tap |
| cilium connectivity test | Cilium connectivity validation | cilium connectivity test |
```bash
# Example with kubeshark
kubeshark tap -n default
# Opens a web interface to analyze HTTP/gRPC traffic

# Cilium connectivity test (run with the cilium CLI from your workstation)
cilium connectivity test
```
Remember: The nicolaka/netshoot image contains over 30 network tools. Keep it in your troubleshooting arsenal.
How Does the Kubernetes Infrastructure Engineer Approach Exam Troubleshooting?
The CKA exam allocates 30% of the score to troubleshooting, with a significant portion concerning networking. According to Linux Foundation, the exam lasts 2 hours with a 66% passing score.
Quick Diagnosis Methodology
- Identify the exact symptom (timeout, connection refused, DNS failure)
- Isolate the problematic layer (pod, Service, DNS, policy)
- Check relevant logs and events
- Test connectivity at each level
- Fix and validate
```bash
# Typical diagnosis workflow
# 1. Pod state
kubectl get pods -o wide

# 2. Recent events
kubectl get events --sort-by=.metadata.creationTimestamp

# 3. Application logs
kubectl logs <pod-name> --tail=50

# 4. Connectivity test
kubectl run test --rm -it --image=busybox -- wget -qO- <service>:<port>
```
For common problems, see Resolve the 10 most common Kubernetes cluster problems.
As TealHQ reminds us: "Don't let your knowledge remain theoretical - set up a real Kubernetes environment to solidify your skills."
How to Document and Prevent Recurring Network Incidents?
Diagnostic Runbook
Create a structured document for each incident type:
```markdown
## Incident: Pods cannot resolve DNS

### Symptoms
- Applications time out on HTTP calls
- Logs: "could not resolve host"

### Diagnosis
1. kubectl get pods -n kube-system -l k8s-app=kube-dns
2. kubectl logs -n kube-system coredns-xxx
3. kubectl run test --rm -it --image=busybox -- nslookup kubernetes

### Solutions
- Scale CoreDNS: kubectl scale deployment coredns -n kube-system --replicas=3
- Check NetworkPolicy rules on UDP/53
- Restart CoreDNS pods if needed
```
Proactive Monitoring
```yaml
# PrometheusRule to alert on DNS errors
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: coredns-alerts
spec:
  groups:
    - name: coredns
      rules:
        - alert: CoreDNSErrorsHigh
          expr: rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]) > 0.1
          for: 5m
          labels:
            severity: warning
```
For environment comparisons, see Managed Kubernetes (EKS, AKS, GKE) vs self-hosted: comparison.
Remember: Every resolved incident should enrich your knowledge base. Time invested in documentation pays off with each future occurrence.
Develop Your Troubleshooting Skills with SFEIR Institute
Kubernetes network diagnosis distinguishes junior administrators from experts. With 71% of Fortune 100 companies running Kubernetes in production, these skills are highly valued.
SFEIR Institute trainings cover advanced troubleshooting:
- LFS458 Kubernetes Administration: 4 days including network diagnosis, CNI configuration, and production problem resolution. Complete CKA preparation.
- Kubernetes Fundamentals: 1 day to understand network architecture before diving deeper into troubleshooting.
- LFS460 Kubernetes Security Fundamentals: 4 days to master Network Policies and network security.
According to Linux Foundation, CKA and CKS certifications are valid for 2 years. Validate your skills with training guided by practitioners.
Contact our teams to build your path to certification.