Key Takeaways
- ✓ Observability relies on three pillars: metrics, logs, and traces
- ✓ Prometheus + Grafana dominate with 75% adoption
- ✓ Instrument during development, not in production
Observability and application monitoring on Kubernetes represents a critical challenge for any Cloud operations engineer. Without visibility into metrics, logs and traces, diagnosing a production incident becomes an expensive guessing game. This guide details essential practices for instrumenting your Kubernetes applications with Prometheus, Grafana and OpenTelemetry standards.
TL;DR: Kubernetes observability relies on three pillars: metrics (Prometheus), logs (Loki/EFK) and traces (Jaeger/Tempo). Prometheus + Grafana dominates with 75% adoption. Instrument during development, not in production.
This skill is at the heart of the LFD459 Kubernetes for Application Developers training.
What is observability and application monitoring on Kubernetes?
Observability is the ability to understand a system's internal state from its external outputs. On Kubernetes, this encompasses metrics, logs and distributed traces.
Monitoring is collecting and analyzing data to detect anomalies. Observability goes further: it enables investigating unknown problems.
| Concept | Definition | Kubernetes tools |
|---|---|---|
| Metrics | Timestamped numerical values | Prometheus, Datadog |
| Logs | Textual event records | Loki, Elasticsearch |
| Traces | Request tracking across services | Jaeger, Tempo |
According to Grafana Labs, 75% of organizations use Prometheus + Grafana for Kubernetes monitoring.
Key takeaway: The three pillars (metrics, logs, traces) are complementary. Metrics detect, logs explain, traces locate.
Why is observability and application monitoring on Kubernetes critical?
Kubernetes orchestrates hundreds of ephemeral containers. Without structured observability, identifying the root cause of an incident becomes impossible.
Kubernetes-specific challenges:
- Pod ephemerality: a crashed pod disappears with its local logs
- Complex networking: Services, Ingress, NetworkPolicies obscure the flow
- Dynamic scaling: the number of instances constantly changes
The increasing complexity of workloads demands mature observability. For anyone looking to deepen their Kubernetes application development skills, instrumentation is a prerequisite.
How to configure Prometheus for Kubernetes monitoring?
Prometheus is the de facto standard for Kubernetes metrics collection. Its pull model scrapes /metrics endpoints from applications.
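To make the pull model concrete, here is a minimal sketch of what a /metrics endpoint returns, using only the Python standard library. It is illustrative only: the metric name, label, and port are examples, and a real application would use an official client library such as prometheus_client rather than hand-rolling the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative in-memory counter; real applications would use an official
# client library such as prometheus_client instead of hand-rolling this.
REQUEST_COUNT = {"GET": 0}

def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    for method, value in REQUEST_COUNT.items():
        lines.append(f'http_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    """Serve /metrics for Prometheus to scrape (pull model)."""
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)

# To expose the endpoint:
# HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Prometheus periodically requests this endpoint and stores each sample with a timestamp, which is what makes the pull model work without any push logic in the application.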
Installation with Helm
```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
```
ServiceMonitor configuration
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-application
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
Essential metrics to collect:
- container_cpu_usage_seconds_total: CPU consumption
- container_memory_usage_bytes: memory utilization
- kube_pod_status_phase: pod state
- http_requests_total: application requests
Key takeaway: Configure ServiceMonitors rather than static configurations. Prometheus automatically discovers new pods.
To properly manage configurations, master Kubernetes ConfigMaps and Secrets.
What Grafana dashboards for a Cloud operations engineer?
Grafana visualizes Prometheus metrics via interactive dashboards. The Cloud operations engineer configures views adapted to workloads.
Essential dashboards
| Dashboard | Grafana ID | Usage |
|---|---|---|
| Kubernetes Cluster | 315 | Global cluster view |
| Node Exporter | 1860 | System metrics |
| Nginx Ingress | 9614 | Inbound traffic |
| Application RED | Custom | Latency, errors, throughput |
Example PromQL panel
```promql
# HTTP 5xx error rate by service
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  /
sum(rate(http_requests_total[5m])) by (service)
  * 100
```
Dashboard best practices:
- Structure by level: cluster → namespace → deployment → pod
- Use variables: $namespace, $deployment to filter
- Define thresholds: red > 80% CPU, orange > 60%
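Dashboard variables plug directly into panel queries. As an illustrative sketch (it assumes $namespace and $deployment are defined as dashboard variables, and that pod names start with the deployment name):

```promql
# CPU usage per pod, filtered by the dashboard's $namespace and $deployment variables
sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$deployment.*"}[5m])) by (pod)
```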
Kubernetes cluster administration requires infrastructure-oriented dashboards, while developers focus on application metrics.
How to implement Kubernetes metrics, logs and traces?
The three observability pillars are implemented differently but must be correlated.
Application metrics with Prometheus client
```python
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests',
                        ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency',
                            ['endpoint'])

@REQUEST_LATENCY.labels(endpoint='/api/users').time()
def get_users():
    REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status='200').inc()
    # business logic

# Expose the /metrics endpoint that Prometheus scrapes
start_http_server(8000)
```
Structured logs with Kubernetes labels
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app: my-app
    version: v1.2.3
spec:
  containers:
    - name: app
      env:
        - name: LOG_FORMAT
          value: "json"
        - name: LOG_LEVEL
          value: "info"
```
Recommended log format (JSON):
```json
{
  "timestamp": "2026-02-28T10:30:00Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123",
  "message": "Payment failed",
  "error_code": "INSUFFICIENT_FUNDS"
}
```
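Such JSON logs can be produced with the standard logging module alone; here is a minimal sketch (the field names mirror the format above, the service name is hard-coded for brevity, and a real setup would typically rely on a library such as python-json-logger or structlog):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "payment-api",  # usually injected from config or env
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)

logger = logging.getLogger("payment-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# trace_id is passed via `extra` so the log line can be correlated with traces
logger.error("Payment failed", extra={"trace_id": "abc123"})
```

Because every line is a self-contained JSON object, Loki or Elasticsearch can index the trace_id field directly, which is what enables jumping from a log line to the matching trace.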
Distributed traces with OpenTelemetry
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: app
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector:4317"
            - name: OTEL_SERVICE_NAME
              value: "payment-service"
```
Key takeaway: Correlation is mandatory. Include trace_id in each log to link metrics, logs and traces.
What alerts to configure for Prometheus Grafana Kubernetes monitoring?
Alerts transform passive monitoring into proactive detection. Configure Prometheus PrometheusRule resources for critical scenarios.
Essential alerts
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
spec:
  groups:
    - name: application
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              / sum(rate(http_requests_total[5m])) by (service) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Error rate > 5% on {{ $labels.service }}"
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently"
```
Severity matrix:
| Severity | Response time | Example |
|---|---|---|
| Critical | < 15 min | Service down, errors > 10% |
| Warning | < 1h | High latency, restarts |
| Info | Next business day | Certificate expires in 30d |
To diagnose alerts, see the guide on resolving Kubernetes deployment errors.
How to integrate observability into a CI/CD pipeline?
Observability should be integrated from development onward, not only in production. CI/CD pipelines for Kubernetes applications include metrics-based quality gates.
Metrics validation in staging
```yaml
# .gitlab-ci.yml
deploy-staging:
  script:
    - kubectl apply -f k8s/
    - sleep 60  # warm-up
    - |
      ERROR_RATE=$(curl -s "prometheus:9090/api/v1/query?query=..." | jq -r '.data.result[0].value[1]')
      if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
        echo "Error rate too high: $ERROR_RATE"
        kubectl rollout undo deployment/app
        exit 1
      fi
```
Automated checks:
- Error rate < 1% after deployment
- p99 latency < defined threshold
- No OOMKilled in the first 5 minutes
- Readiness probes pass
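The same gate logic can live in a small script instead of inline shell. Here is a sketch of the decision step only (the payload shape follows the Prometheus /api/v1/query instant-query response; the 1% threshold matches the checks above, and fetching the payload over HTTP is left out):

```python
def error_rate_from_response(payload: dict) -> float:
    """Extract the instant-query value from a Prometheus /api/v1/query response."""
    result = payload["data"]["result"]
    if not result:
        return 0.0  # no matching series: treat as zero errors
    # Each value is a pair [unix_timestamp, "stringified number"]
    return float(result[0]["value"][1])

def gate_passes(payload: dict, threshold: float = 0.01) -> bool:
    """Return True when the measured error rate stays under the threshold."""
    return error_rate_from_response(payload) <= threshold

# Example payload mimicking a Prometheus instant-query response
sample = {"data": {"result": [{"metric": {}, "value": [1735600000, "0.003"]}]}}
print(gate_passes(sample))  # a 0.3% error rate passes the 1% gate
```

Keeping the gate in a testable function makes the threshold explicit and lets the pipeline step shrink to fetch-then-decide.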
According to Chris Aniszczyk, CNCF CTO: "Kubernetes is no longer experimental but foundational. Soon, it will be essential to AI as well."
Which tools to choose for Kubernetes observability in 2026?
The choice depends on organization size and constraints (cloud, on-premise, budget).
| Criterion | OSS Stack | Commercial Stack |
|---|---|---|
| Metrics | Prometheus | Datadog, New Relic |
| Logs | Loki | Splunk, Elastic Cloud |
| Traces | Jaeger/Tempo | Dynatrace, Honeycomb |
| Cost | Infrastructure only | License + volume |
| Maintenance | Internal team | Managed |
Recommendation by context:
- Startup / SMB: Prometheus + Grafana + Loki (LGTM stack)
- Enterprise on-premise: Elastic Stack or Splunk
- Managed cloud-native: Datadog or native cloud service (CloudWatch, Google Cloud Monitoring)
The LFD459 training covers application instrumentation regardless of chosen stack.
Key takeaway: OpenTelemetry is the emerging standard. It unifies metrics, logs and traces collection with a single SDK.
How to measure the ROI of Kubernetes observability?
Investment in observability is measured in reduced resolution time (MTTR) and incident prevention.
ROI metrics:
- MTTR: mean time to resolution (target < 30 min)
- MTTD: mean time to detection (target < 5 min)
- Incident frequency: reduction through proactive detection
- Avoided downtime cost: preserved revenue
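These indicators are straightforward to compute from incident records; a minimal sketch with illustrative data (here MTTD is measured from incident start to detection, and MTTR from start to resolution):

```python
from datetime import datetime

# Illustrative incident records: when the issue started, was detected, was resolved
incidents = [
    {"start": datetime(2026, 3, 1, 10, 0), "detected": datetime(2026, 3, 1, 10, 3),
     "resolved": datetime(2026, 3, 1, 10, 25)},
    {"start": datetime(2026, 3, 8, 14, 0), "detected": datetime(2026, 3, 8, 14, 7),
     "resolved": datetime(2026, 3, 8, 14, 41)},
]

def mean_minutes(deltas):
    """Average a list of timedeltas, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mttd = mean_minutes([i["detected"] - i["start"] for i in incidents])
mttr = mean_minutes([i["resolved"] - i["start"] for i in incidents])
print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")  # MTTD: 5 min, MTTR: 33 min
```

Tracking these two numbers release over release is the simplest way to show whether the observability investment pays off against the < 5 min and < 30 min targets above.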
With 71% of Fortune 100 companies running Kubernetes in production, observability is no longer optional.
The Kubernetes application development training in Paris includes practical instrumentation exercises.
Take action: instrument your Kubernetes applications
Master observability with SFEIR Institute trainings.
Recommended trainings:
- LFD459 Kubernetes for Application Developers training: instrumentation and application debugging (3 days)
- LFS458 Kubernetes Administration training: cluster and infrastructure monitoring (4 days)
- Kubernetes Fundamentals: discovery for beginners (1 day). To go deeper, see our Kubernetes software engineer training.
Contact our advisors to build your Kubernetes learning path.