
Cheatsheet: Essential Kubernetes Metrics to Monitor

SFEIR Institute

Key Takeaways

  • 75% of teams use Prometheus and Grafana to monitor Kubernetes
  • kube_node_status_condition indicates node health status
  • PromQL queries allow real-time metric querying

This cheatsheet gathers the essential Kubernetes monitoring metrics that every Kubernetes infrastructure engineer, system administrator, or CKS certification holder must track. With 75% of teams using Prometheus and Grafana for Kubernetes monitoring, mastering these metrics has become a core skill.

TL;DR: This cheatsheet lists critical metrics by category (cluster, nodes, pods, network) with corresponding PromQL queries and recommended alert thresholds.

This skill is at the core of the LFS458 Kubernetes Administration training.

What Are the Essential Kubernetes Monitoring Metrics at Cluster Level?

Cluster metrics provide an overall view of your Kubernetes infrastructure health.

Cluster Health Metrics

| Metric | Description | Critical Threshold |
|---|---|---|
| kube_node_status_condition | Node status | != Ready |
| apiserver_request_total | API server requests | > 5000/s |
| etcd_server_has_leader | etcd leadership | = 0 |
| scheduler_pending_pods | Pending pods | > 10 for 5 min |

Essential PromQL Queries

# Number of Ready nodes
sum(kube_node_status_condition{condition="Ready",status="true"})

# API server error rate
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
/ sum(rate(apiserver_request_total[5m])) * 100

# API request latency (P99)
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le))
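The P99 query above relies on histogram_quantile, which interpolates a quantile from cumulative "le" buckets. A minimal Python sketch of that interpolation, using hypothetical bucket data (not real API-server output):

```python
# Simplified sketch of Prometheus histogram_quantile over
# cumulative ("le") buckets. Bucket bounds/counts are hypothetical.

def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside the bucket, as Prometheus does.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 1000 requests: 800 under 0.1 s, 950 under 0.5 s, 990 under 1 s.
buckets = [(0.1, 800), (0.5, 950), (1.0, 990), (5.0, 1000)]
p99 = histogram_quantile(0.99, buckets)  # -> 1.0 second
```

This also shows why a P99 from a histogram is an estimate: its precision depends on how finely the buckets are laid out around the quantile.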

Definition: The Control Plane includes the API server, scheduler, controller-manager, and etcd, critical components whose metrics must be monitored as a priority.

Remember: A healthy cluster shows an API error rate < 1% and P99 latency < 1 second.

See the Kubernetes monitoring architecture in production for complete implementation.

How to Monitor Essential Kubernetes Node Metrics?

Node metrics detect capacity and performance issues.

Quick Reference Table

| Resource | Source | Metric | Alert if |
|---|---|---|---|
| CPU used | node-exporter | node_cpu_seconds_total | > 85% |
| Memory available | node-exporter | node_memory_MemAvailable_bytes | < 15% |
| Disk available | node-exporter | node_filesystem_avail_bytes | < 10% |
| Network I/O | node-exporter | node_network_receive_bytes_total | Anomaly |

Node Monitoring Queries

# CPU per node (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Available memory (%)
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk pressure
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100

# Pods per node vs capacity
sum by(node) (kube_pod_info) / on(node) kube_node_status_allocatable{resource="pods"} * 100
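The CPU query above works backwards from the idle-mode counter: whatever share of CPU time was not idle over the window is utilization. A sketch of that arithmetic in Python, with hypothetical counter samples:

```python
# Sketch of how the node CPU query derives utilization from the
# idle-mode counter: 100 - avg(rate(idle)) * 100. Values hypothetical.

def cpu_utilization_pct(idle_start, idle_end, window_seconds, num_cpus):
    """idle_* are node_cpu_seconds_total{mode="idle"} summed over CPUs."""
    idle_rate_per_cpu = (idle_end - idle_start) / window_seconds / num_cpus
    return 100 - idle_rate_per_cpu * 100

# 4 CPUs, 300 s window: the idle counter advanced 900 s in total,
# i.e. 0.75 s idle per CPU per second -> 25% used.
util = cpu_utilization_pct(idle_start=10_000, idle_end=10_900,
                           window_seconds=300, num_cpus=4)  # -> 25.0
```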

According to the CNCF Annual Survey 2025, 82% of organizations run Kubernetes in production, making node monitoring critical.

Definition: Memory pressure (MemoryPressure) is a condition indicating that the node is running low on available memory and may soon evict pods.

What Metrics to Monitor at Pod and Container Level?

Pod metrics detect application issues before they impact users.

Container Resource Metrics

# Memory used per pod
sum by(pod, namespace) (container_memory_usage_bytes{container!=""})

# CPU used per pod
sum by(pod, namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))

# Memory used / limit ratio
container_memory_usage_bytes / container_spec_memory_limit_bytes

# Container restarts
sum by(pod) (kube_pod_container_status_restarts_total)

| Metric | Warning | Critical |
|---|---|---|
| Memory / Limit | > 80% | > 90% |
| CPU / Limit | > 70% | > 85% |
| Restarts/hour | > 3 | > 10 |
| Non-Ready pods | > 0 for 5 min | > 0 for 15 min |

Remember: Monitor the usage/limit ratio to prevent OOMKilled events before they occur.
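The usage/limit thresholds above translate directly into a classification rule. A minimal sketch, using hypothetical pod figures:

```python
# Sketch: flag pods whose memory usage/limit ratio crosses the
# warning (80%) and critical (90%) thresholds. Pod data hypothetical.

def classify(usage_bytes, limit_bytes, warn=0.80, crit=0.90):
    ratio = usage_bytes / limit_bytes
    if ratio > crit:
        return "critical"
    if ratio > warn:
        return "warning"
    return "ok"

pods = {
    "api-7f9c": (450 * 2**20, 512 * 2**20),   # ~88% of limit -> warning
    "worker-1": (480 * 2**20, 512 * 2**20),   # ~94% of limit -> critical
    "cache-0":  (200 * 2**20, 512 * 2**20),   # ~39% of limit -> ok
}
statuses = {name: classify(u, l) for name, (u, l) in pods.items()}
```

In Alertmanager the same rule is expressed with two alert definitions at different severities rather than code.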

The kubectl debugging commands cheatsheet complements these metrics with manual commands.

How to Monitor Kubernetes Network Metrics?

Network problems are among the most difficult to diagnose without proper metrics.

Essential Network Metrics

# Bandwidth received per pod
sum by(pod) (rate(container_network_receive_bytes_total[5m]))

# Bandwidth transmitted per pod
sum by(pod) (rate(container_network_transmit_bytes_total[5m]))

# Network errors
sum by(pod) (rate(container_network_receive_errors_total[5m]))

# Dropped packets
sum by(pod) (rate(container_network_receive_packets_dropped_total[5m]))
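All four network queries wrap a counter in rate(), which converts monotonically increasing byte/packet counters into per-second values and tolerates counter resets (e.g. after a pod restart). A simplified sketch of that logic, with hypothetical samples:

```python
# Simplified sketch of rate(): per-second increase of a counter,
# with Prometheus-style reset handling. Sample points hypothetical.

def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:          # counter reset: it restarted near 0
            increase += value
        else:
            increase += value - prev
        prev = value
    window = samples[-1][0] - samples[0][0]
    return increase / window

# Bytes counter resets between t=60 and t=120 (pod restart):
samples = [(0, 1000), (60, 4000), (120, 500)]
bps = simple_rate(samples)  # (3000 + 500) / 120 bytes/second
```

The real rate() also extrapolates to the window edges, which this sketch omits.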

Service Mesh Indicators (Istio)

| Metric | Description | Objective |
|---|---|---|
| istio_requests_total | Request count | Baseline |
| istio_request_duration_milliseconds | Latency | P99 < 500 ms |
| istio_tcp_connections_opened_total | TCP connections | Monitoring |

Kubernetes observability covers the integration of metrics, logs, and traces.

What Application Metrics to Monitor?

Application metrics (Golden Signals) are the ultimate indicators of service health.

The 4 Golden Signals

| Signal | Definition | Typical Metric |
|---|---|---|
| Latency | Response time | http_request_duration_seconds |
| Traffic | Request volume | http_requests_total |
| Errors | Error rate | http_requests_total{status=~"5.."} |
| Saturation | Resource usage | container_memory_usage_bytes |

Example Queries

# HTTP error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Requests per second
sum(rate(http_requests_total[5m]))
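The error-rate query above is just the 5xx share of all requests over the window. A sketch of the same computation against an SLO, with hypothetical request counts:

```python
# Sketch mirroring the HTTP error-rate query: share of 5xx requests
# over all requests in the window. Counts are hypothetical.

def error_rate_pct(status_counts):
    total = sum(status_counts.values())
    errors = sum(c for s, c in status_counts.items() if s.startswith("5"))
    return errors / total * 100

# 10,000 requests in the window, 50 of them 5xx:
counts = {"200": 9_850, "404": 100, "500": 40, "503": 10}
err_pct = error_rate_pct(counts)  # -> 0.5, within the < 1% SLO
```

Note that 4xx responses are excluded on purpose: they usually indicate client errors, not service failures.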

Definition: Golden Signals, defined by Google SRE, are the four fundamental metrics for evaluating distributed system health.

Remember: A typical SLO targets < 1% errors and P95 latency < 200ms.

How to Configure Critical Alerts?

Well-configured alerts transform metrics into actions.

Prometheus Alert Rules

groups:
- name: kubernetes-critical
  rules:
  - alert: KubernetesNodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} not ready"

  - alert: KubernetesPodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} restarting"

  - alert: KubernetesContainerOomKilled
    expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
    for: 0m
    labels:
      severity: critical
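The `for: 5m` clause is what separates a flapping condition from a real incident: the alert fires only once the expression has been continuously true for the whole duration. A simplified sketch of that semantics, with hypothetical scrape samples:

```python
# Sketch of Prometheus `for:` semantics: fire only if the alert
# expression has held true continuously for the duration.
# Scrape timestamps/values are hypothetical.

def alert_fires(samples, hold_seconds):
    """samples: list of (timestamp_seconds, condition_true), oldest first."""
    if not samples or not samples[-1][1]:
        return False
    last_t = samples[-1][0]
    true_since = last_t
    for t, ok in reversed(samples):
        if not ok:
            break
        true_since = t
    return last_t - true_since >= hold_seconds

# Node seen NotReady at every 60 s scrape from t=60 to t=420:
samples = [(i * 60, i >= 1) for i in range(8)]
fires = alert_fires(samples, hold_seconds=300)  # true for 360 s -> fires
```

This is why `for: 0m` is used for OOMKilled above: a single occurrence is already actionable, so no hold period is wanted.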

The Kubernetes observability checklist in production details the complete configuration.

Quick Reference: kubectl Monitoring Commands

Complement Prometheus with these kubectl commands for real-time debugging.

# Top pods by CPU
kubectl top pods -A --sort-by=cpu | head -20

# Top pods by memory
kubectl top pods -A --sort-by=memory | head -20

# Resources by node
kubectl top nodes

# Recent cluster events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Non-Running pods
kubectl get pods -A --field-selector=status.phase!=Running

See the Kubernetes system administrator page for the complete path.

Take Action: Master Kubernetes Monitoring

Monitoring is a key skill for the CKA and CKS exams. According to the Linux Foundation, the CKS exam requires a passing score of 67% within 2 hours, along with a valid CKA certification.

As Chris Aniszczyk, CNCF CTO states: "Kubernetes is no longer experimental but foundational. Soon, it will be essential to AI as well."

The LFS458 Kubernetes Administration training covers production monitoring over 4 days. For advanced security, see the LFS460 Kubernetes Security training.

Contact our advisors to define your certification path.