Key Takeaways
- ✓ 75% of organizations use Prometheus and Grafana to monitor Kubernetes
- ✓ Automated deployment in under 30 minutes with the kube-prometheus stack
Deploying the kube-prometheus stack in production is a strategic step for any team managing Kubernetes clusters.
With 75% of organizations using Prometheus and Grafana for Kubernetes monitoring according to Grafana Labs, this stack has become the de facto standard. This guide details installation, advanced configuration, and best practices for a production-ready deployment.
TL;DR: The kube-prometheus stack combines Prometheus Operator, Grafana, Alertmanager, and preconfigured dashboards. It automates complete monitoring of a Kubernetes cluster with less than 30 minutes of initial configuration.
System administrators and infrastructure engineers preparing for the LFS458 Kubernetes Administration training will master these skills as part of that program.
What Is the kube-prometheus Stack?
The kube-prometheus stack is an integrated set of open source components for Kubernetes monitoring. It includes Prometheus Operator, Prometheus Server, Alertmanager, Grafana, node-exporter, and kube-state-metrics. This stack collects, stores, visualizes, and alerts on your cluster metrics.
Main components:
| Component | Role | Default Port |
|---|---|---|
| Prometheus Operator | Manages Prometheus resources via CRDs | - |
| Prometheus Server | Collects and stores metrics | 9090 |
| Alertmanager | Routes and manages alerts | 9093 |
| Grafana | Visualization and dashboards | 3000 |
| node-exporter | System metrics from nodes | 9100 |
| kube-state-metrics | Kubernetes object metrics | 8080 |
Remember: The kube-prometheus stack uses Custom Resource Definitions (CRDs) to manage Prometheus natively in Kubernetes.
The Kubernetes Monitoring and Troubleshooting section covers observability concepts in depth.
Why Deploy the kube-prometheus Stack in Production?
Advantages Over Manual Installation
Deploying the kube-prometheus stack offers several benefits:
- Automated configuration: ServiceMonitors and PodMonitors automatically discover targets
- High availability: Native multi-replica configuration
- Preconfigured dashboards: 20+ ready-to-use Grafana dashboards
- Standard alerts: Alert rules covering common incidents
According to the CNCF Annual Survey 2025, 82% of container users run Kubernetes in production. These environments require robust monitoring.
Production Use Cases
Natively covered scenarios:
- Node and pod health monitoring
- Alerting on critical resources (CPU, memory, disk)
- Application performance analysis
- Scheduling anomaly detection
- Tracking business metrics exposed by your applications
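Exposing business metrics to Prometheus simply means serving them in the text exposition format on a `/metrics` endpoint. A minimal stdlib-only sketch of what that payload looks like (the metric name and counter value are hypothetical; in practice you would use a client library such as prometheus_client):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

ORDERS_TOTAL = 1042  # hypothetical business counter, updated by the application

def render_metrics() -> str:
    """Render metrics in the Prometheus text exposition format."""
    return (
        "# HELP myapp_orders_total Total orders processed.\n"
        "# TYPE myapp_orders_total counter\n"
        f"myapp_orders_total {ORDERS_TOTAL}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the exposition payload on /metrics, as Prometheus expects."""
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

# To serve: HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Once the application serves this endpoint, a ServiceMonitor (covered below) is all that is needed for Prometheus to discover and scrape it.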
IT teams spend an average of 34 working days per year resolving Kubernetes issues according to Cloud Native Now. Effective monitoring reduces this time by 40 to 60%.
How to Prepare for kube-prometheus Stack Deployment?
Technical Prerequisites
Before deploying the kube-prometheus stack, validate these elements:
```bash
# Minimum Kubernetes version (the --short flag was removed in kubectl v1.28)
kubectl version
# Required: v1.25+

# Check available resources
kubectl top nodes
# Recommended: 4 vCPU and 8 GB RAM available

# Check Helm
helm version
# Required: v3.10+
```
Production Sizing
| Cluster Size | Prometheus RAM | Retention | Storage |
|---|---|---|---|
| Small (<50 pods) | 2 Gi | 7 days | 50 Gi |
| Medium (50-200 pods) | 4 Gi | 15 days | 100 Gi |
| Large (200+ pods) | 8-16 Gi | 30 days | 200+ Gi |
Remember: Prometheus stores roughly 1-2 bytes per sample on disk; at a 30-second scrape interval that works out to about 3-6 KB per series per day. A 100-pod cluster generates approximately 10,000 series.
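These sizing rules can be turned into a quick back-of-the-envelope estimate (the 30-second scrape interval and ~2 bytes per sample are assumptions; adjust them for your cluster):

```python
def prometheus_disk_estimate(series: int, retention_days: int,
                             scrape_interval_s: int = 30,
                             bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB disk usage in GiB (excludes WAL, churn, and headroom)."""
    samples_per_day = 86_400 / scrape_interval_s          # samples per series per day
    total_bytes = series * samples_per_day * bytes_per_sample * retention_days
    return total_bytes / 2**30                            # bytes -> GiB

# Medium cluster from the sizing table: ~10,000 series, 15-day retention
print(f"{prometheus_disk_estimate(series=10_000, retention_days=15):.1f} GiB")
```

The raw TSDB figure is far below the 100 Gi recommended in the table: production volumes are deliberately overprovisioned to absorb the WAL, series churn, metric growth, and compaction overhead.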
Namespace and RBAC
```yaml
# namespace-monitoring.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    name: monitoring
    pod-security.kubernetes.io/enforce: privileged
```

```bash
kubectl apply -f namespace-monitoring.yaml
```
The privileged Pod Security level is required because node-exporter mounts host paths to read system metrics from the nodes.
How to Deploy the kube-prometheus Stack with Helm?
Installation via kube-prometheus-stack
The kube-prometheus-stack Helm chart simplifies complete deployment.
```bash
# Add the Prometheus Community repository
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

# Deploy the stack
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --version 65.0.0 \
  --values production-values.yaml
```
Recommended Production Configuration
Create a production-values.yaml file:
```yaml
# production-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: "90GB"
    resources:
      requests:
        cpu: "500m"
        memory: "2Gi"
      limits:
        cpu: "2000m"
        memory: "8Gi"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: standard-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    replicas: 2
    podAntiAffinity: hard

alertmanager:
  alertmanagerSpec:
    replicas: 3
    resources:
      requests:
        cpu: "100m"
        memory: "256Mi"
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: standard-ssd
          resources:
            requests:
              storage: 10Gi

grafana:
  replicas: 2
  persistence:
    enabled: true
    size: 10Gi
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"
  resources:
    requests:
      cpu: "250m"
      memory: "512Mi"

# node-exporter subchart values live under the prometheus-node-exporter key
prometheus-node-exporter:
  resources:
    requests:
      cpu: "50m"
      memory: "64Mi"
```
Verify the Deployment
```bash
# Check pods
kubectl get pods -n monitoring
# Expected output:
# alertmanager-kube-prometheus-alertmanager-0     2/2   Running
# kube-prometheus-grafana-xxx                     3/3   Running
# kube-prometheus-kube-state-metrics-xxx          1/1   Running
# kube-prometheus-operator-xxx                    1/1   Running
# kube-prometheus-prometheus-node-exporter-xxx    1/1   Running
# prometheus-kube-prometheus-prometheus-0         2/2   Running

# Check installed CRDs
kubectl get crd | grep monitoring.coreos.com
```
Remember: Prometheus CRDs (ServiceMonitor, PodMonitor, PrometheusRule) allow declarative monitoring configuration.
How to Configure the kube-prometheus Stack for High Availability?
Multi-Replica Architecture
For production deployment, enable high availability:
```yaml
# ha-config.yaml
prometheus:
  prometheusSpec:
    replicas: 2
    podAntiAffinity: hard
    # Sharding for very large clusters
    # shards: 2

alertmanager:
  alertmanagerSpec:
    replicas: 3
    # Automatic clustering between replicas
    clusterAdvertiseAddress: false
```
Persistent Storage Configuration
Recommended StorageClass for production:
```yaml
# storageclass-prometheus.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: prometheus-ssd
provisioner: kubernetes.io/gce-pd  # Adapt to your cloud provider
parameters:
  type: pd-ssd
  replication-type: regional-pd
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```
General Kubernetes storage best practices apply to monitoring workloads as well.
Production Network Policies
```yaml
# networkpolicy-prometheus.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-ingress
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              monitoring-access: "true"
      ports:
        - protocol: TCP
          port: 9090
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana
      ports:
        - protocol: TCP
          port: 9090
```
How to Configure ServiceMonitors for Your Applications?
Create a ServiceMonitor
A ServiceMonitor is a CRD that automatically configures Prometheus to scrape metrics from a Service.
```yaml
# servicemonitor-myapp.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-metrics
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  selector:
    matchLabels:
      app: myapp
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
      # tlsConfig applies only when scheme is https
      # tlsConfig:
      #   insecureSkipVerify: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
```
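This ServiceMonitor only selects Services in the `production` namespace carrying the `app: myapp` label and exposing a port named `metrics`. A hypothetical Service it would match:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp            # matched by spec.selector.matchLabels
spec:
  selector:
    app: myapp
  ports:
    - name: metrics       # matched by endpoints[0].port (by name, not number)
      port: 8080
      targetPort: 8080
```

A missing label or an unnamed port is the most common reason a target never appears in Prometheus.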
PodMonitor for Pods Without a Service
```yaml
# podmonitor-batch-jobs.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-jobs
  namespace: monitoring
spec:
  selector:
    matchLabels:
      type: batch-job
  podMetricsEndpoints:
    - port: metrics
      interval: 60s
```
Verify Target Discovery
```bash
# Port-forward to Prometheus
kubectl port-forward -n monitoring \
  svc/kube-prometheus-prometheus 9090:9090

# Open http://localhost:9090/targets
# and verify your ServiceMonitor targets appear
```
The Prometheus installation guide details these configurations.
How to Configure Alertmanager for Production?
Configure Alert Routes
```yaml
# alertmanager-config.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: production-alerts
  namespace: monitoring
spec:
  route:
    groupBy: ['alertname', 'cluster', 'service']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    receiver: 'default-receiver'
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: 'pagerduty-critical'
        continue: true
      - matchers:
          - name: severity
            value: warning
        receiver: 'slack-warnings'
  receivers:
    - name: 'default-receiver'
      slackConfigs:
        - apiURL:
            name: slack-webhook
            key: url
          channel: '#alerts-default'
    - name: 'pagerduty-critical'
      pagerdutyConfigs:
        - routingKey:
            name: pagerduty-key
            key: routing-key
          severity: critical
    - name: 'slack-warnings'
      slackConfigs:
        - apiURL:
            name: slack-webhook
            key: url
          channel: '#alerts-warnings'
```
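The routing tree above can be paraphrased in a few lines. A simplified sketch of how Alertmanager walks these routes (real matching also supports regex matchers, nested routes, and grouping):

```python
# Mirrors the AlertmanagerConfig above: severity decides the receiver,
# and `continue: true` lets evaluation fall through to later routes.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": True},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings", "continue": False},
]
DEFAULT_RECEIVER = "default-receiver"

def receivers_for(alert_labels: dict) -> list:
    """Return the receivers an alert is routed to."""
    matched = []
    for route in ROUTES:
        if all(alert_labels.get(k) == v for k, v in route["match"].items()):
            matched.append(route["receiver"])
            if not route["continue"]:
                break
    return matched or [DEFAULT_RECEIVER]

print(receivers_for({"severity": "critical"}))  # ['pagerduty-critical']
```

Note the role of `continue: true` on the critical route: without it, evaluation would stop at the first match and any later sibling routes would never be considered.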
Custom Alert Rules
```yaml
# prometheusrule-custom.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  groups:
    - name: kubernetes-apps
      rules:
        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Pod has restarted {{ $value }} times in the last 15 minutes"
        - alert: PodNotReady
          expr: |
            kube_pod_status_ready{condition="true"} == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not ready"
```
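The PodCrashLooping expression converts a per-second restart rate into restarts per five-minute period (the `* 60 * 5` factor). The arithmetic, applied to two hypothetical counter samples taken 15 minutes apart:

```python
def restarts_per_5m(counter_start: float, counter_end: float,
                    window_s: float = 900) -> float:
    """Mimic rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5."""
    per_second = (counter_end - counter_start) / window_s  # rate() over the window
    return per_second * 60 * 5                             # scale to a 5-minute period

# 3 restarts observed over 15 minutes -> 1 restart per 5 minutes -> alert fires (> 0)
print(restarts_per_5m(10, 13))  # 1.0
```

Any nonzero restart rate sustained for the `for: 15m` hold period fires the alert, which is why even slow crash loops are caught.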
The CrashLoopBackOff pod debugging guide complements these alerts.
How to Optimize Grafana for Production?
Essential Dashboards
The stack includes preconfigured dashboards. Priority dashboards to enable:
- Kubernetes / Compute Resources / Cluster: global view
- Kubernetes / Compute Resources / Namespace (Pods): detail by namespace
- Node Exporter Full: detailed system metrics
- Prometheus Stats: Prometheus health itself
SSO and RBAC Configuration
```yaml
# grafana-values.yaml
grafana:
  grafana.ini:
    server:
      root_url: https://grafana.example.com
    auth.generic_oauth:
      enabled: true
      name: Corporate SSO
      allow_sign_up: true
      client_id: ${OAUTH_CLIENT_ID}
      client_secret: ${OAUTH_CLIENT_SECRET}
      scopes: openid profile email
      auth_url: https://sso.example.com/authorize
      token_url: https://sso.example.com/token
      api_url: https://sso.example.com/userinfo
    users:
      auto_assign_org_role: Viewer
```
Remember: Limit Editor access to SRE teams and Admin to platform engineers.
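Role assignment can also be derived from the SSO identity instead of being managed by hand. One common approach, assuming your IdP returns a `groups` claim (the group names here are hypothetical), uses Grafana's `role_attribute_path` JMESPath expression:

```yaml
grafana:
  grafana.ini:
    auth.generic_oauth:
      # Map SSO groups to Grafana roles; anyone else falls back to Viewer
      role_attribute_path: >-
        contains(groups[*], 'platform-engineers') && 'Admin' ||
        contains(groups[*], 'sre') && 'Editor' || 'Viewer'
```

This keeps role assignment in your identity provider rather than in Grafana's user administration UI.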
Automatic Dashboard Provisioning
```yaml
# dashboards-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards-custom
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  # File-based provisioning expects the raw dashboard model
  # (as exported from the Grafana UI), not the API's {"dashboard": ...} wrapper
  custom-overview.json: |
    {
      "title": "Custom Overview",
      "panels": [...]
    }
```
How to Diagnose kube-prometheus Stack Issues?
Common Issues and Solutions
Prometheus not scraping targets:
```bash
# Check ServiceMonitors
kubectl get servicemonitor -n monitoring

# Check selection labels
kubectl describe servicemonitor myapp-metrics -n monitoring

# Prometheus Operator logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-operator
```
Alertmanager not routing alerts:
```bash
# Check configuration
kubectl exec -n monitoring alertmanager-kube-prometheus-alertmanager-0 \
  -- amtool config show

# Test an alert
kubectl exec -n monitoring alertmanager-kube-prometheus-alertmanager-0 \
  -- amtool alert add test severity=warning
```
The Kubernetes network diagnostics guide helps resolve connectivity issues.
Health Metrics to Monitor
```promql
# Prometheus health
up{job="prometheus"}

# Scrape error rate
sum(rate(prometheus_target_scrape_sample_failed_total[5m])) by (job)

# Prometheus memory usage
process_resident_memory_bytes{job="prometheus"}

# Query latency
histogram_quantile(0.99, rate(prometheus_http_request_duration_seconds_bucket[5m]))
```
The production monitoring architecture details these patterns.
How to Update the kube-prometheus Stack?
Update Process
```bash
# Check available versions
helm search repo prometheus-community/kube-prometheus-stack --versions

# Back up configurations
kubectl get prometheusrule -n monitoring -o yaml > backup-rules.yaml
kubectl get servicemonitor -n monitoring -o yaml > backup-servicemonitors.yaml

# Preview changes (requires the helm-diff plugin)
helm diff upgrade kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --version 66.0.0 \
  --values production-values.yaml

# Apply the update
helm upgrade kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --version 66.0.0 \
  --values production-values.yaml
```
Points of Caution
- CRD updates: CRDs sometimes need manual updates
- Breaking changes: consult the CHANGELOG between versions
- Data retention: Prometheus data is preserved if PVC is retained
Remember: Always test updates on a staging cluster before production.
The Kubernetes observability guide contextualizes these practices.
Summary: Production Checklist for kube-prometheus
Before going to production, validate:
- [ ] Persistent storage configured with appropriate retention
- [ ] HA replicas for Prometheus (2+) and Alertmanager (3+)
- [ ] Restrictive Network Policies applied
- [ ] Alert routes tested end-to-end
- [ ] Custom dashboards provisioned
- [ ] Grafana SSO and RBAC configured
- [ ] Configuration backup automated
- [ ] Runbooks documented for each critical alert
Master Kubernetes Monitoring in Production
Want to deepen your skills in Kubernetes monitoring and administration?
Infrastructure engineers preparing for the LFS458 Kubernetes Administration training benefit from a comprehensive program covering production observability.
Recommended SFEIR Institute trainings:
- LFS458 Kubernetes Administration: 4 days including advanced monitoring and troubleshooting
- LFD459 Kubernetes for Developers: 3 days with application observability focus
- Kubernetes Fundamentals: 1 day to discover essential concepts
Contact our advisors for information on upcoming sessions and financing options.