Key Takeaways
- Observability relies on three pillars: metrics, logs, and traces.
- Charity Majors: the ability to ask arbitrary questions without anticipating them in advance.
- Correlating the three pillars enables incident diagnosis in minutes.
Kubernetes observability with metrics, logs, and traces represents the foundation of any reliable cluster operation in production. According to Charity Majors, CTO of Honeycomb and pioneer of the observability movement: "Observability is the ability to ask arbitrary questions about your system without having to anticipate those questions in advance." This definition distinguishes observability from traditional monitoring, which is limited to predefined metrics.
TL;DR: Kubernetes observability relies on three complementary pillars: metrics (quantitative state), logs (textual context), and traces (request paths). Their correlation enables diagnosing any incident in minutes.
Professionals who want to go further follow the LFS458 Kubernetes Administration training.
What is Kubernetes Observability with Metrics, Logs, and Traces?
Observability is the property of a system that allows understanding its internal state from its external outputs. In Kubernetes, these outputs are metrics, logs, and distributed traces. A Kubernetes infrastructure engineer must master these three dimensions to operate effectively.
The distinction from monitoring is crucial. Monitoring answers known questions: "Is CPU usage exceeding 80%?" Observability enables answering unknown questions: "Why are requests to the payment service failing for 3% of users in the Asia region?"
Remember: An observable system exposes enough data to diagnose any problem without code modification or redeployment.
The Pillars of Kubernetes Monitoring Observability
| Pillar | Question | Characteristics | Typical Tools |
|---|---|---|---|
| Metrics | How much? | Numeric, aggregated, time series | Prometheus, Datadog |
| Logs | Why? | Textual, event-based, unstructured | Loki, Elasticsearch |
| Traces | How? | Correlated, distributed, causal | Jaeger, Zipkin |
Kubernetes production monitoring architecture details the implementation of each pillar.
How Do Kubernetes Metrics Work?
Metrics are numeric measurements collected at regular intervals. They allow observing the quantitative state of the cluster and applications. Prometheus, a CNCF graduated project, has established itself as the de facto standard with over 500 available exporters.
Types of Kubernetes Metrics
Kubernetes exposes several categories of metrics:
Infrastructure metrics (kubelet, cAdvisor):
```promql
# Memory usage by container
container_memory_usage_bytes{namespace="production"}

# CPU by node
node_cpu_seconds_total{mode="idle"}
```
Control plane metrics (API server, etcd, scheduler):
```promql
# P99 API request latency over the last 5 minutes
histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m]))

# etcd health: 1 when the member has a leader
etcd_server_has_leader
```
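The `histogram_quantile` call above estimates a quantile by linearly interpolating within cumulative bucket counts. A pure-Python sketch of that estimation, with illustrative bucket boundaries:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: sorted list of (upper_bound, cumulative_count) pairs ending
    with float('inf'), the shape a Prometheus histogram exposes.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            # Linear interpolation inside the bucket holding the target rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Illustrative latency buckets: 90 requests <= 0.1s, 99 <= 0.5s, 100 total
buckets = [(0.1, 90), (0.5, 99), (float("inf"), 100)]
p99 = histogram_quantile(0.99, buckets)  # lands in the (0.1, 0.5] bucket
```

This also shows why quantiles from bucketed histograms are estimates: precision depends on how finely the buckets are spaced.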
Application metrics (exposed by your services):
```promql
# HTTP request rate
rate(http_requests_total{service="checkout"}[5m])
```
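For `rate()` to have data to work with, the service must expose its counters in the Prometheus text exposition format. A minimal stdlib-only sketch of that format (the metric and label names are illustrative):

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs for this metric.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

exposition = render_counter(
    "http_requests_total",
    "Total HTTP requests served.",
    [({"service": "checkout", "code": "200"}, 1027),
     ({"service": "checkout", "code": "500"}, 3)],
)
print(exposition)
```

In practice a client library (e.g., `prometheus_client`) renders this for you; the sketch only shows what Prometheus scrapes from `/metrics`.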
The Prometheus installation guide explains how to collect these metrics.
Best Practices for Metrics
Name your metrics according to Prometheus conventions:
- Use descriptive base names (e.g., `http_request_duration_seconds`)
- Use standard suffixes: `_total` (counter), `_seconds` (duration), `_bytes` (size)
Remember: Collect USE metrics (Utilization, Saturation, Errors) for infrastructure and RED metrics (Rate, Errors, Duration) for services. This approach covers 90% of diagnostic needs.
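The RED signals reduce to a few lines of arithmetic; a sketch over hypothetical in-memory request records:

```python
from datetime import timedelta

# Hypothetical request records for one service: (status_code, duration_seconds)
requests = [(200, 0.12), (500, 0.90), (200, 0.08), (200, 0.30)]
window = timedelta(minutes=5)

# Rate: requests per second over the observation window
rate = len(requests) / window.total_seconds()

# Errors: fraction of requests that failed (5xx)
error_ratio = sum(1 for status, _ in requests if status >= 500) / len(requests)

# Duration: a simple nearest-rank P99 over the sorted latencies
durations = sorted(d for _, d in requests)
p99 = durations[min(len(durations) - 1, int(0.99 * len(durations)))]
```

Real systems compute these server-side with PromQL (`rate()`, `histogram_quantile()`); the sketch only makes the three definitions concrete.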
How Do Kubernetes Logs Work?
Logs are textual records of events produced by containers and Kubernetes components. Unlike metrics, they are not aggregated and preserve the complete context of each event. A Kubernetes Cloud Operations engineer consults logs to understand the "why" after identifying the "what" via metrics.
Kubernetes Logging Architecture
Kubernetes does not provide a native logging solution. Container logs are written to stdout/stderr and captured by the runtime (containerd, CRI-O). Three architecture patterns exist:
1. Node-level logging agent (recommended):
```yaml
# Fluent Bit DaemonSet: one log-collecting pod scheduled on every node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.2
          volumeMounts:
            - name: varlog
              mountPath: /var/log
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```
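The DaemonSet only schedules the agent; Fluent Bit also needs a pipeline configuration telling it where to read logs and where to ship them. A minimal sketch, assuming a Loki backend at an illustrative in-cluster address:

```ini
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Parser  cri
    Tag     kube.*

[OUTPUT]
    Name    loki
    Match   kube.*
    Host    loki.monitoring.svc
    Port    3100
```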
2. Sidecar container: For specific parsing or routing needs.
3. Application direct push: The application sends logs directly to the backend (generally discouraged, as it couples application code to the logging infrastructure).
Structuring Logs
Tom Wilkie, VP Product at Grafana Labs and co-creator of Loki, recommends: "Log in JSON with consistent fields. Parsing unstructured logs represents 40% of your logging pipeline cost."
```json
{
  "timestamp": "2026-02-27T10:15:30Z",
  "level": "error",
  "service": "checkout",
  "trace_id": "abc123",
  "message": "Payment failed",
  "error_code": "INSUFFICIENT_FUNDS"
}
```
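With Python's standard `logging` module, such JSON lines can be produced by a custom formatter; a minimal sketch (field names follow the example above, and `service`/`trace_id` are passed via `extra`):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying a trace_id for correlation."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# 'extra' attaches arbitrary attributes to the record for the formatter to read
logger.error("Payment failed", extra={"service": "checkout", "trace_id": "abc123"})
```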
For more depth, see the Kubernetes Monitoring and Troubleshooting module.
How Do Distributed Traces Work?
Distributed traces follow the path of a request through the services of a microservices architecture. They answer the question: "How did this request traverse the system?" A unique trace_id links all spans (segments) of the request.
Anatomy of a Trace
A trace consists of hierarchical spans:
```text
Trace ID: abc123
└── Span 1: API Gateway (100 ms)
    ├── Span 2: Auth Service (20 ms)
    └── Span 3: Checkout Service (75 ms)
        ├── Span 4: Inventory DB (30 ms)
        └── Span 5: Payment Gateway (40 ms)
```
Each span contains:
- Operation name: the unit of work the span represents (e.g., an RPC or a DB query)
- Start/End timestamps: exact duration
- Tags: metadata (http.status_code, db.type)
- Logs: events during execution
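The hierarchy above can be modeled with a small data structure. A sketch distinguishing a span's total duration from its self time (names and durations mirror the illustration; the `Span` class is ours, not an OpenTelemetry type):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Minimal span: an operation name, a duration, and child spans."""
    name: str
    duration_ms: float
    children: list = field(default_factory=list)

    def self_time_ms(self):
        # Time spent in this span itself, excluding time in child spans
        return self.duration_ms - sum(c.duration_ms for c in self.children)

trace = Span("API Gateway", 100, [
    Span("Auth Service", 20),
    Span("Checkout Service", 75, [
        Span("Inventory DB", 30),
        Span("Payment Gateway", 40),
    ]),
])
```

Self time is what trace viewers highlight: the API Gateway span lasts 100 ms but only 5 ms is spent in the gateway itself.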
Instrumentation with OpenTelemetry
OpenTelemetry unifies the collection of all three pillars. Its adoption increased by 287% in 2024 according to the CNCF Survey.
```python
# Python instrumentation with OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("order.id", order_id)  # order_id supplied by the caller
    process_payment()
```
Remember: Start by instrumenting entry points (ingress, API gateway) then propagate trace context via standard HTTP headers (W3C Trace Context).
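The W3C Trace Context `traceparent` header has the shape `version-traceid-spanid-flags`. A stdlib-only sketch of building and parsing it (the helper names are illustrative; OpenTelemetry propagators do this for you):

```python
import re
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a W3C Trace Context traceparent header (version 00)."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 lowercase hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 lowercase hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header):
    """Extract trace_id, span_id, and the sampled flag from a traceparent."""
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if m is None:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {"trace_id": m.group(1), "span_id": m.group(2),
            "sampled": m.group(3) == "01"}
```

Each service that forwards this header keeps the trace_id and replaces the span_id with its own, which is what stitches spans into one trace.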
Correlating Metrics, Logs, and Traces for Diagnosis
The power of observability lies in correlating the three pillars. A typical diagnostic workflow:
1. Metric alert: "P99 latency > 2s on checkout-service"
2. Trace exploration: identify slow spans via Jaeger/Tempo
3. Log analysis: consult the logs of the problematic span using its trace_id
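Applied to in-memory data, the final step of this workflow is just a filter on the shared trace_id; a minimal sketch with hypothetical records (field names are illustrative):

```python
# Hypothetical signal stores
slow_traces = [{"trace_id": "abc123", "duration_s": 2.4}]  # from the tracing backend
logs = [
    {"trace_id": "abc123", "level": "error", "message": "Payment failed"},
    {"trace_id": "def456", "level": "info", "message": "order shipped"},
]

def logs_for_trace(trace_id, log_stream):
    """Keep only the log lines that share the given trace_id."""
    return [line for line in log_stream if line.get("trace_id") == trace_id]

suspect_logs = logs_for_trace(slow_traces[0]["trace_id"], logs)
```

This only works if every log line carries the trace_id, which is why structured logging and context propagation come first.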
Implementing Correlation
Inject the trace_id in all logs:
```ini
# Fluent Bit filter enriching log records via a Lua script
[FILTER]
    Name    lua
    Match   kube.*
    script  /scripts/add_trace_id.lua
    # 'call' names the Lua function to invoke (illustrative here)
    call    add_trace_id
```
Grafana dashboards allow navigating between metrics, logs, and traces via data links.
Example of Correlation in Practice
```shell
# 1. Identify the trace_id from Prometheus, via an exemplar on the latency
#    histogram (trace IDs are too high-cardinality to store as labels)
#    alert: P99 of http_request_duration_seconds > 2s, exemplar trace_id=abc123

# 2. Search the logs in Loki
logcli query '{app="checkout"} |= "abc123"'

# 3. Visualize the trace in Jaeger
#    URL: /trace/abc123
```
For complete implementation, the get started with monitoring in 15 minutes tutorial provides a test environment.
2026 Kubernetes Observability Trends
Observability is evolving rapidly. Three major trends are emerging according to the 2026 trends analysis:
1. eBPF for instrumentation-free observability: Cilium and Pixie collect metrics and traces at the kernel level, without modifying application code.
2. AI-assisted observability: Tools use ML to detect anomalies, automatically correlate signals, and suggest root causes.
3. OpenTelemetry as universal standard: Convergence toward OTel simplifies instrumentation and interoperability between tools.
Choosing Your Kubernetes Observability Tools
Tool choice depends on your context: cluster size, budget, internal skills. Here's a comparison of popular stacks:
| Stack | Strengths | Limitations | Cost |
|---|---|---|---|
| Prometheus + Loki + Tempo | Open source, Grafana integration | Complex scaling | Infra only |
| Datadog | Unified solution, SaaS | High cost at scale | Per host + ingestion |
| Elastic (ELK) | Powerful search | Resource-intensive | License + infra |
| Grafana Cloud | Managed OSS stack | Vendor lock-in | Pay-as-you-go |
Remember: Start with the Prometheus/Loki/Tempo stack to master concepts. Migrate to a managed solution when internal operation becomes a bottleneck.
Training to Master Kubernetes Observability
Available training in France covers observability through different approaches:
- Kubernetes Monitoring and Troubleshooting training in Paris
- Kubernetes Monitoring and Troubleshooting training in Bordeaux
- Kubernetes Monitoring and Troubleshooting training in Lille
Develop Your Observability Skills
To become a Kubernetes Cloud Operations engineer capable of effective diagnosis:
- The LFS458 Kubernetes Administration training covers monitoring and troubleshooting in a CKA context (4 days)
- The LFD459 Kubernetes for Developers training includes application instrumentation for CKAD (3 days)
- Kubernetes Fundamentals introduces basic concepts in 1 day
Contact us to identify the path suited to your profile and objectives.