
Understanding Kubernetes Observability: Metrics, Logs and Traces

SFEIR Institute

Key Takeaways

  • Observability relies on three pillars: metrics, logs, and traces.
  • Charity Majors defines observability as the ability to ask arbitrary questions of a system without anticipating them in advance.
  • Correlation of the three pillars enables incident diagnosis in minutes.

Kubernetes observability with metrics, logs, and traces represents the foundation of any reliable cluster operation in production. According to Charity Majors, CTO of Honeycomb and pioneer of the observability movement: "Observability is the ability to ask arbitrary questions about your system without having to anticipate those questions in advance." This definition distinguishes observability from traditional monitoring, which is limited to predefined metrics.

TL;DR: Kubernetes observability relies on three complementary pillars: metrics (quantitative state), logs (textual context), and traces (request paths). Their correlation enables diagnosing any incident in minutes.

Professionals who want to go further can follow the LFS458 Kubernetes Administration training.

What is Kubernetes Observability with Metrics, Logs, and Traces?

Observability is the property of a system that allows understanding its internal state from its external outputs. In Kubernetes, these outputs are metrics, logs, and distributed traces. A Kubernetes infrastructure engineer must master these three dimensions to operate effectively.

The distinction from monitoring is crucial. Monitoring answers known questions: "Is CPU usage above 80%?" Observability enables answering unknown questions: "Why are requests to the payment service failing for 3% of users in the Asia region?"

Remember: An observable system exposes enough data to diagnose any problem without code modification or redeployment.

The Pillars of Kubernetes Monitoring Observability

| Pillar  | Question  | Characteristics                    | Typical Tools       |
|---------|-----------|------------------------------------|---------------------|
| Metrics | How much? | Numeric, aggregated, time series   | Prometheus, Datadog |
| Logs    | Why?      | Textual, event-based, unstructured | Loki, Elasticsearch |
| Traces  | How?      | Correlated, distributed, causal    | Jaeger, Zipkin      |

Kubernetes production monitoring architecture details the implementation of each pillar.

How Do Kubernetes Metrics Work?

Metrics are numeric measurements collected at regular intervals. They allow observing the quantitative state of the cluster and applications. Prometheus, a CNCF graduated project, has established itself as the de facto standard with over 500 available exporters.

Types of Kubernetes Metrics

Kubernetes exposes several categories of metrics:

Infrastructure metrics (kubelet, cAdvisor):

# Memory usage by container
container_memory_usage_bytes{namespace="production"}

# CPU by node
node_cpu_seconds_total{mode="idle"}

Control plane metrics (API server, etcd, scheduler):

# API request latency
histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m]))

# etcd health
etcd_server_has_leader

Application metrics (exposed by your services):

# HTTP request rate
rate(http_requests_total{service="checkout"}[5m])

The Prometheus installation guide explains how to collect these metrics.
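
The rate() function used above returns the average per-second increase of a counter over the window. As a mental model (a simplification, not Prometheus's exact extrapolation logic), here is a pure-Python sketch that computes such a rate from raw samples, including counter resets:

```python
def counter_rate(samples):
    """Per-second rate of a monotonic counter from (timestamp, value) samples.

    Counters only go up; a decrease means the process restarted and the
    counter reset to zero, so the post-reset value counts from zero.
    This mirrors the idea behind PromQL's rate(), without its boundary
    extrapolation.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        increase += (v1 - v0) if v1 >= v0 else v1  # reset: count from 0
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# 300 s window: counter climbs 100 -> 400, resets, then climbs to 150
samples = [(0, 100), (150, 400), (151, 0), (300, 150)]
print(counter_rate(samples))  # (300 + 0 + 150) / 300 = 1.5
```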

Best Practices for Metrics

Name your metrics according to Prometheus conventions:

  • Structure names as prefix_name_unit (e.g., http_request_duration_seconds)
  • Use standard suffixes: _total (counter), _seconds (duration), _bytes (size)

Remember: Collect USE metrics (Utilization, Saturation, Errors) for infrastructure and RED metrics (Rate, Errors, Duration) for services. This approach covers 90% of diagnostic needs.
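
To make the RED model concrete, here is a minimal, dependency-free sketch that derives Rate, Errors, and Duration from raw request data (the Request record is hypothetical, e.g. parsed from access logs; in production these come from an instrumented metrics client):

```python
from dataclasses import dataclass

@dataclass
class Request:  # hypothetical record for illustration
    duration_seconds: float
    status_code: int

def red_metrics(requests, window_seconds):
    """Compute the three RED signals for a service over a time window."""
    n = len(requests)
    errors = sum(1 for r in requests if r.status_code >= 500)
    durations = sorted(r.duration_seconds for r in requests)
    p99 = durations[min(n - 1, int(n * 0.99))] if n else 0.0
    return {
        "rate_rps": n / window_seconds,           # Rate: requests per second
        "error_ratio": errors / n if n else 0.0,  # Errors: share of 5xx
        "p99_duration_seconds": p99,              # Duration: 99th percentile
    }

reqs = [Request(0.05, 200)] * 98 + [Request(1.2, 200), Request(0.3, 503)]
print(red_metrics(reqs, window_seconds=60))
```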

How Do Kubernetes Logs Work?

Logs are textual records of events produced by containers and Kubernetes components. Unlike metrics, they are not aggregated and preserve the complete context of each event. A Kubernetes Cloud Operations engineer consults logs to understand the "why" after identifying the "what" via metrics.

Kubernetes Logging Architecture

Kubernetes does not provide a native logging solution. Container logs are written to stdout/stderr and captured by the runtime (containerd, CRI-O). Three architecture patterns exist:

1. Node-level logging agent (recommended):

# Fluent Bit DaemonSet: one collector pod per node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:2.2
        volumeMounts:
        - name: varlog
          mountPath: /var/log
      volumes:
      - name: varlog
        hostPath:
          path: /var/log

2. Sidecar container: For specific parsing or routing needs.

3. Application direct push: The application sends directly to the backend (less recommended).

Structuring Logs

Tom Wilkie, VP Product at Grafana Labs and co-creator of Loki, recommends: "Log in JSON with consistent fields. Parsing unstructured logs represents 40% of your logging pipeline cost."

{
  "timestamp": "2026-02-27T10:15:30Z",
  "level": "error",
  "service": "checkout",
  "trace_id": "abc123",
  "message": "Payment failed",
  "error_code": "INSUFFICIENT_FUNDS"
}

For more depth, see the Kubernetes Monitoring and Troubleshooting module.

How Do Distributed Traces Work?

Distributed traces follow the path of a request through the services of a microservices architecture. They answer the question: "How did this request traverse the system?" A unique trace_id links all spans (segments) of the request.

Anatomy of a Trace

A trace consists of hierarchical spans:

Trace ID: abc123
└── Span 1: API Gateway (100ms)
    ├── Span 2: Auth Service (20ms)
    └── Span 3: Checkout Service (75ms)
        ├── Span 4: Inventory DB (30ms)
        └── Span 5: Payment Gateway (40ms)

Each span contains:

  • Operation name: name of the operation
  • Start/End timestamps: exact duration
  • Tags: metadata (http.status_code, db.type)
  • Logs: events during execution
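
The structure above can be modeled in a few lines. This sketch (simplified hypothetical types, not the OpenTelemetry data model verbatim, with invented tag values) builds the example trace and walks it to find the slowest leaf span:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_ms: int
    tags: dict = field(default_factory=dict)      # e.g. http.status_code
    children: list = field(default_factory=list)  # child spans

def slowest_leaf(span):
    """Walk the span tree and return the leaf with the longest duration."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children),
               key=lambda s: s.duration_ms)

# The trace from the diagram above (tag values are illustrative)
trace = Span("API Gateway", 100, children=[
    Span("Auth Service", 20),
    Span("Checkout Service", 75, children=[
        Span("Inventory DB", 30, tags={"db.type": "postgres"}),
        Span("Payment Gateway", 40, tags={"http.status_code": 200}),
    ]),
])
print(slowest_leaf(trace).name)  # Payment Gateway
```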

Instrumentation with OpenTelemetry

OpenTelemetry unifies the collection of all three pillars. Its adoption increased by 287% in 2024 according to the CNCF Survey.

# Python instrumentation with OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("order.id", order_id)
    process_payment()

Remember: Start by instrumenting entry points (ingress, API gateway) then propagate trace context via standard HTTP headers (W3C Trace Context).
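
Propagation via W3C Trace Context boils down to one HTTP header. Here is a sketch of building and parsing the traceparent header (format version-trace_id-parent_id-flags, using the example IDs from the W3C specification); in practice an OpenTelemetry propagator does this for you:

```python
import re

def build_traceparent(trace_id, span_id, sampled=True):
    """Build a W3C traceparent header: version-trace_id-parent_id-flags."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})"
    r"-(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Return the header's fields as a dict, or None if malformed."""
    m = TRACEPARENT_RE.match(header)
    return m.groupdict() if m else None

header = build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(parse_traceparent(header)["trace_id"])
```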

Correlating Metrics, Logs, and Traces for Diagnosis

The power of observability lies in correlating the three pillars. A typical diagnostic workflow:

  1. Metric alert: "P99 latency > 2s on checkout-service"
  2. Trace exploration: Identify slow spans via Jaeger/Tempo
  3. Log analysis: Consult the logs of the problematic span with the trace_id
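
The three-step workflow above can be sketched end to end. In this toy example, in-memory dictionaries stand in for the metrics, tracing, and logging backends, and a shared trace_id ties the pillars together (field names like self_ms are invented for illustration):

```python
# Toy stand-ins for the three backends, correlated by trace_id
slow_requests = [  # 1. the alert: requests that breached the latency SLO
    {"trace_id": "abc123", "duration_s": 2.4},
]
traces = {         # 2. spans per trace; self_ms = time spent in the span itself
    "abc123": [
        {"span": "API Gateway", "self_ms": 100},
        {"span": "Payment Gateway", "self_ms": 2300},
    ],
}
logs = [           # 3. structured logs carrying the trace_id
    {"trace_id": "abc123", "level": "error", "message": "Payment failed"},
    {"trace_id": "zzz999", "level": "info", "message": "OK"},
]

def diagnose(threshold_s=2.0):
    """Follow one slow request from metric alert to culprit span to logs."""
    for req in slow_requests:
        if req["duration_s"] <= threshold_s:
            continue
        tid = req["trace_id"]
        culprit = max(traces[tid], key=lambda s: s["self_ms"])
        related = [line for line in logs if line["trace_id"] == tid]
        return {"trace_id": tid, "slow_span": culprit["span"], "logs": related}

print(diagnose())
```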

Implementing Correlation

Inject the trace_id in all logs:

# Fluentbit configuration to enrich logs
[FILTER]
    Name    lua
    Match   kube.*
    script  /scripts/add_trace_id.lua
    call    add_trace_id

Grafana dashboards allow navigating between metrics, logs, and traces via data links.

Example of Correlation in Practice

# 1. Identify the trace_id from Prometheus (e.g., via exemplars
#    attached to http_request_duration_seconds in Grafana)

# 2. Search in Loki
logcli query '{app="checkout"} |= "abc123"'

# 3. Visualize in Jaeger
# URL: /trace/abc123

For complete implementation, the get started with monitoring in 15 minutes tutorial provides a test environment.

Kubernetes Observability Trends

Observability is evolving rapidly. Three major trends are emerging according to the 2026 trends analysis:

1. eBPF for instrumentation-free observability: Cilium and Pixie collect metrics and traces at the kernel level, without modifying application code.

2. AI-assisted observability: Tools use ML to detect anomalies, automatically correlate signals, and suggest root causes.

3. OpenTelemetry as universal standard: Convergence toward OTel simplifies instrumentation and interoperability between tools.

Choosing Your Kubernetes Observability Tools

Tool choice depends on your context: cluster size, budget, internal skills. Here's a comparison of popular stacks:

| Stack                     | Strengths                        | Limitations        | Cost                 |
|---------------------------|----------------------------------|--------------------|----------------------|
| Prometheus + Loki + Tempo | Open source, Grafana integration | Complex scaling    | Infra only           |
| Datadog                   | Unified solution, SaaS           | High cost at scale | Per host + ingestion |
| Elastic (ELK)             | Powerful search                  | Resource-intensive | License + infra      |
| Grafana Cloud             | Managed OSS stack                | Vendor lock-in     | Pay-as-you-go        |

Remember: Start with the Prometheus/Loki/Tempo stack to master concepts. Migrate to a managed solution when internal operation becomes a bottleneck.

Training to Master Kubernetes Observability

Available training in France covers observability through several complementary approaches.

Develop Your Observability Skills

To become a Kubernetes Cloud Operations engineer capable of effective diagnosis, contact us to identify the path suited to your profile and objectives.