
Cheatsheet: Essential Kubernetes Metrics to Monitor

SFEIR Institute

Key Takeaways

  • 75% of teams use Prometheus and Grafana to monitor Kubernetes
  • kube_node_status_condition indicates node health status
  • PromQL queries allow real-time metric querying

This cheatsheet gathers the essential Kubernetes monitoring metrics that every Kubernetes infrastructure engineer, system administrator, or CKS certification holder must track. With 75% of teams using Prometheus and Grafana for Kubernetes monitoring, mastering these metrics has become a core skill.

TL;DR: This cheatsheet lists critical metrics by category (cluster, nodes, pods, network) with corresponding PromQL queries and recommended alert thresholds.

This skill is at the core of the LFS458 Kubernetes Administration training.

What Are the Essential Kubernetes Monitoring Metrics at Cluster Level?

Cluster metrics provide an overall view of your Kubernetes infrastructure health.

Cluster Health Metrics

| Metric | Description | Critical Threshold |
|---|---|---|
| kube_node_status_condition | Node status | != Ready |
| apiserver_request_total | API server requests | > 5000/s |
| etcd_server_has_leader | etcd leadership | = 0 |
| scheduler_pending_pods | Pending pods | > 10 for 5 min |

Essential PromQL Queries

# Number of Ready nodes
sum(kube_node_status_condition{condition="Ready",status="true"})

# API server error rate
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
/ sum(rate(apiserver_request_total[5m])) * 100

# API request latency (P99)
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le))
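The P99 query above relies on histogram_quantile, which interpolates a quantile from cumulative "le" buckets. A minimal Python sketch of that interpolation, using hypothetical bucket data (not real API-server output):

```python
# Simplified sketch of Prometheus histogram_quantile over
# cumulative ("le") buckets. Bucket bounds/counts are hypothetical.

def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside the bucket, as Prometheus does.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 1000 requests: 800 under 0.1 s, 950 under 0.5 s, 990 under 1 s.
buckets = [(0.1, 800), (0.5, 950), (1.0, 990), (5.0, 1000)]
p99 = histogram_quantile(0.99, buckets)  # -> 1.0 second
```

This also shows why a P99 from a histogram is an estimate: its precision depends on how finely the buckets are laid out around the quantile.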

Definition: The Control Plane includes the API server, scheduler, controller-manager, and etcd, critical components whose metrics must be monitored as a priority.

Remember: A healthy cluster shows an API error rate < 1% and P99 latency < 1 second.

See the Kubernetes monitoring architecture in production for complete implementation.

How to Monitor Essential Kubernetes Node Metrics?

Node metrics detect capacity and performance issues.

Quick Reference Table

| Resource | Source | Metric | Alert if |
|---|---|---|---|
| CPU used | node-exporter | node_cpu_seconds_total | > 85% |
| Memory available | node-exporter | node_memory_MemAvailable_bytes | < 15% |
| Disk available | node-exporter | node_filesystem_avail_bytes | < 10% |
| Network I/O | node-exporter | node_network_receive_bytes_total | Anomaly |

Node Monitoring Queries

# CPU per node (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Available memory (%)
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk pressure
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100

# Pods per node vs capacity
sum by(node) (kube_pod_info) / on(node) kube_node_status_allocatable{resource="pods"} * 100
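The CPU query above works backwards from the idle-mode counter: whatever share of CPU time was not idle over the window is utilization. A sketch of that arithmetic in Python, with hypothetical counter samples:

```python
# Sketch of how the node CPU query derives utilization from the
# idle-mode counter: 100 - avg(rate(idle)) * 100. Values hypothetical.

def cpu_utilization_pct(idle_start, idle_end, window_seconds, num_cpus):
    """idle_* are node_cpu_seconds_total{mode="idle"} summed over CPUs."""
    idle_rate_per_cpu = (idle_end - idle_start) / window_seconds / num_cpus
    return 100 - idle_rate_per_cpu * 100

# 4 CPUs, 300 s window: the idle counter advanced 900 s in total,
# i.e. 0.75 s idle per CPU per second -> 25% used.
util = cpu_utilization_pct(idle_start=10_000, idle_end=10_900,
                           window_seconds=300, num_cpus=4)  # -> 25.0
```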

According to the CNCF Annual Survey 2025, 82% of organizations run Kubernetes in production, making node monitoring critical.

Definition: Memory pressure (MemoryPressure) is a condition indicating that the node is running low on available memory and may soon evict pods.

What Metrics to Monitor at Pod and Container Level?

Pod metrics detect application issues before they impact users.

Container Resource Metrics

# Memory used per pod
sum by(pod, namespace) (container_memory_usage_bytes{container!=""})

# CPU used per pod
sum by(pod, namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))

# Memory used / limit ratio
container_memory_usage_bytes / container_spec_memory_limit_bytes

# Container restarts
sum by(pod) (kube_pod_container_status_restarts_total)

| Metric | Warning | Critical |
|---|---|---|
| Memory / Limit | > 80% | > 90% |
| CPU / Limit | > 70% | > 85% |
| Restarts/hour | > 3 | > 10 |
| Non-Ready pods | > 0 for 5 min | > 0 for 15 min |

Remember: Monitor the usage/limit ratio to prevent OOMKilled events before they occur.
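The usage/limit thresholds above translate directly into a classification rule. A minimal sketch, using hypothetical pod figures:

```python
# Sketch: flag pods whose memory usage/limit ratio crosses the
# warning (80%) and critical (90%) thresholds. Pod data hypothetical.

def classify(usage_bytes, limit_bytes, warn=0.80, crit=0.90):
    ratio = usage_bytes / limit_bytes
    if ratio > crit:
        return "critical"
    if ratio > warn:
        return "warning"
    return "ok"

pods = {
    "api-7f9c": (450 * 2**20, 512 * 2**20),   # ~88% of limit -> warning
    "worker-1": (480 * 2**20, 512 * 2**20),   # ~94% of limit -> critical
    "cache-0":  (200 * 2**20, 512 * 2**20),   # ~39% of limit -> ok
}
statuses = {name: classify(u, l) for name, (u, l) in pods.items()}
```

In Alertmanager the same rule is expressed with two alert definitions at different severities rather than code.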

The kubectl debugging commands cheatsheet complements these metrics with manual commands.

How to Monitor Kubernetes Network Metrics?

Network problems are among the most difficult to diagnose without proper metrics.

Essential Network Metrics

# Bandwidth received per pod
sum by(pod) (rate(container_network_receive_bytes_total[5m]))

# Bandwidth transmitted per pod
sum by(pod) (rate(container_network_transmit_bytes_total[5m]))

# Network errors
sum by(pod) (rate(container_network_receive_errors_total[5m]))

# Dropped packets
sum by(pod) (rate(container_network_receive_packets_dropped_total[5m]))
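All four network queries wrap a counter in rate(), which converts monotonically increasing byte/packet counters into per-second values and tolerates counter resets (e.g. after a pod restart). A simplified sketch of that logic, with hypothetical samples:

```python
# Simplified sketch of rate(): per-second increase of a counter,
# with Prometheus-style reset handling. Sample points hypothetical.

def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:          # counter reset: it restarted near 0
            increase += value
        else:
            increase += value - prev
        prev = value
    window = samples[-1][0] - samples[0][0]
    return increase / window

# Bytes counter resets between t=60 and t=120 (pod restart):
samples = [(0, 1000), (60, 4000), (120, 500)]
bps = simple_rate(samples)  # (3000 + 500) / 120 bytes/second
```

The real rate() also extrapolates to the window edges, which this sketch omits.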

Service Mesh Indicators (Istio)

| Metric | Description | Objective |
|---|---|---|
| istio_requests_total | Request count | Baseline |
| istio_request_duration_milliseconds | Latency | P99 < 500 ms |
| istio_tcp_connections_opened_total | TCP connections | Monitoring |

Kubernetes observability covers the integration of metrics, logs, and traces.

What Application Metrics to Monitor?

Application metrics (Golden Signals) are the ultimate indicators of service health.

The 4 Golden Signals

| Signal | Definition | Typical Metric |
|---|---|---|
| Latency | Response time | http_request_duration_seconds |
| Traffic | Request volume | http_requests_total |
| Errors | Error rate | http_requests_total{status=~"5.."} |
| Saturation | Resource usage | container_memory_usage_bytes |

Example Queries

# HTTP error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Requests per second
sum(rate(http_requests_total[5m]))
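The error-rate query above is just the 5xx share of all requests over the window. A sketch of the same computation against an SLO, with hypothetical request counts:

```python
# Sketch mirroring the HTTP error-rate query: share of 5xx requests
# over all requests in the window. Counts are hypothetical.

def error_rate_pct(status_counts):
    total = sum(status_counts.values())
    errors = sum(c for s, c in status_counts.items() if s.startswith("5"))
    return errors / total * 100

# 10,000 requests in the window, 50 of them 5xx:
counts = {"200": 9_850, "404": 100, "500": 40, "503": 10}
err_pct = error_rate_pct(counts)  # -> 0.5, within the < 1% SLO
```

Note that 4xx responses are excluded on purpose: they usually indicate client errors, not service failures.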

Definition: Golden Signals, defined by Google SRE, are the four fundamental metrics for evaluating distributed system health.

Remember: A typical SLO targets < 1% errors and P95 latency < 200ms.

How to Configure Critical Alerts?

Well-configured alerts transform metrics into actions.

Prometheus Alert Rules

groups:
- name: kubernetes-critical
  rules:
  - alert: KubernetesNodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} not ready"

  - alert: KubernetesPodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} restarting"

  - alert: KubernetesContainerOomKilled
    expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
    for: 0m
    labels:
      severity: critical
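The `for: 5m` clause is what separates a flapping condition from a real incident: the alert fires only once the expression has been continuously true for the whole duration. A simplified sketch of that semantics, with hypothetical scrape samples:

```python
# Sketch of Prometheus `for:` semantics: fire only if the alert
# expression has held true continuously for the duration.
# Scrape timestamps/values are hypothetical.

def alert_fires(samples, hold_seconds):
    """samples: list of (timestamp_seconds, condition_true), oldest first."""
    if not samples or not samples[-1][1]:
        return False
    last_t = samples[-1][0]
    true_since = last_t
    for t, ok in reversed(samples):
        if not ok:
            break
        true_since = t
    return last_t - true_since >= hold_seconds

# Node seen NotReady at every 60 s scrape from t=60 to t=420:
samples = [(i * 60, i >= 1) for i in range(8)]
fires = alert_fires(samples, hold_seconds=300)  # true for 360 s -> fires
```

This is why `for: 0m` is used for OOMKilled above: a single occurrence is already actionable, so no hold period is wanted.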

The Kubernetes observability checklist in production details the complete configuration.

Quick Reference: kubectl Monitoring Commands

Complement Prometheus with these kubectl commands for real-time debugging.

# Top pods by CPU
kubectl top pods -A --sort-by=cpu | head -20

# Top pods by memory
kubectl top pods -A --sort-by=memory | head -20

# Resources by node
kubectl top nodes

# Recent cluster events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Non-Running pods
kubectl get pods -A --field-selector=status.phase!=Running

See the Kubernetes system administrator page for the complete path.

Take Action: Master Kubernetes Monitoring

Monitoring is a key skill for the CKA and CKS exams. According to the Linux Foundation, the CKS exam requires a passing score of 67% within 2 hours, along with a valid CKA certification.

As Chris Aniszczyk, CNCF CTO states: "Kubernetes is no longer experimental but foundational. Soon, it will be essential to AI as well."

The LFS458 Kubernetes Administration training covers production monitoring over 4 days. For advanced security, see the LFS460 Kubernetes Security training.

Contact our advisors to define your certification path.