Key Takeaways
- ✓ 75% of organizations use Prometheus and Grafana to monitor Kubernetes
- ✓ Automated deployment in under 30 minutes with the kube-prometheus stack
Deploying the kube-prometheus stack in production is a strategic step for any team managing Kubernetes clusters.
With 75% of organizations using Prometheus and Grafana for Kubernetes monitoring according to Grafana Labs, this stack has become the de facto standard. This guide details installation, advanced configuration, and best practices for a production-ready deployment.
TL;DR: The kube-prometheus stack combines Prometheus Operator, Grafana, Alertmanager, and preconfigured dashboards. It automates complete monitoring of a Kubernetes cluster with less than 30 minutes of initial configuration.
System administrators and infrastructure engineers preparing for the LFS458 Kubernetes Administration training will master these skills as part of that program.
What Is the kube-prometheus Stack?
The kube-prometheus stack is an integrated set of open source components for Kubernetes monitoring. It includes Prometheus Operator, Prometheus Server, Alertmanager, Grafana, node-exporter, and kube-state-metrics. This stack collects, stores, visualizes, and alerts on your cluster metrics.
Main components:
| Component | Role | Default Port |
|---|---|---|
| Prometheus Operator | Manages Prometheus resources via CRDs | - |
| Prometheus Server | Collects and stores metrics | 9090 |
| Alertmanager | Routes and manages alerts | 9093 |
| Grafana | Visualization and dashboards | 3000 |
| node-exporter | System metrics from nodes | 9100 |
| kube-state-metrics | Kubernetes object metrics | 8080 |
Remember: The kube-prometheus stack uses Custom Resource Definitions (CRDs) to manage Prometheus natively in Kubernetes.
The Kubernetes Monitoring and Troubleshooting section covers observability concepts in depth.
Why Deploy the kube-prometheus Stack in Production?
Advantages Over Manual Installation
Deploying the kube-prometheus stack offers several benefits:
- Automated configuration: ServiceMonitors and PodMonitors automatically discover targets
- High availability: Native multi-replica configuration
- Preconfigured dashboards: 20+ ready-to-use Grafana dashboards
- Standard alerts: Alert rules covering common incidents
According to the CNCF Annual Survey 2025, 82% of container users run Kubernetes in production. These environments require robust monitoring.
Production Use Cases
Natively covered scenarios:
- Node and pod health monitoring
- Alerting on critical resources (CPU, memory, disk)
- Application performance analysis
- Scheduling anomaly detection
- Tracking business metrics exposed by your applications
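Exposing business metrics to Prometheus simply means serving them in the text exposition format on a `/metrics` endpoint. A minimal stdlib-only sketch of what that payload looks like (the metric name and counter value are hypothetical; in practice you would use a client library such as prometheus_client):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

ORDERS_TOTAL = 1042  # hypothetical business counter, updated by the application

def render_metrics() -> str:
    """Render metrics in the Prometheus text exposition format."""
    return (
        "# HELP myapp_orders_total Total orders processed.\n"
        "# TYPE myapp_orders_total counter\n"
        f"myapp_orders_total {ORDERS_TOTAL}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the exposition payload on /metrics, as Prometheus expects."""
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

# To serve: HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Once the application serves this endpoint, a ServiceMonitor (covered below) is all that is needed for Prometheus to discover and scrape it.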
IT teams spend an average of 34 working days per year resolving Kubernetes issues according to Cloud Native Now. Effective monitoring reduces this time by 40 to 60%.
How to Prepare for kube-prometheus Stack Deployment?
Technical Prerequisites
Before deploying the kube-prometheus stack, validate these elements:
```bash
# Minimum Kubernetes version (the --short flag was removed in kubectl v1.28)
kubectl version
# Required: v1.25+

# Check available resources
kubectl top nodes
# Recommended: 4 vCPU and 8 GB RAM available

# Check Helm
helm version
# Required: v3.10+
```
Production Sizing
| Cluster Size | Prometheus RAM | Retention | Storage |
|---|---|---|---|
| Small (<50 pods) | 2 Gi | 7 days | 50 Gi |
| Medium (50-200 pods) | 4 Gi | 15 days | 100 Gi |
| Large (200+ pods) | 8-16 Gi | 30 days | 200+ Gi |
Remember: Prometheus stores roughly 1-2 bytes per sample on disk; at a 30-second scrape interval that works out to about 3-6 KB per series per day. A 100-pod cluster generates approximately 10,000 series.
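These sizing rules can be turned into a quick back-of-the-envelope estimate (the 30-second scrape interval and ~2 bytes per sample are assumptions; adjust them for your cluster):

```python
def prometheus_disk_estimate(series: int, retention_days: int,
                             scrape_interval_s: int = 30,
                             bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB disk usage in GiB (excludes WAL, churn, and headroom)."""
    samples_per_day = 86_400 / scrape_interval_s          # samples per series per day
    total_bytes = series * samples_per_day * bytes_per_sample * retention_days
    return total_bytes / 2**30                            # bytes -> GiB

# Medium cluster from the sizing table: ~10,000 series, 15-day retention
print(f"{prometheus_disk_estimate(series=10_000, retention_days=15):.1f} GiB")
```

The raw TSDB figure is far below the 100 Gi recommended in the table: production volumes are deliberately overprovisioned to absorb the WAL, series churn, metric growth, and compaction overhead.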
Namespace and RBAC
```yaml
# namespace-monitoring.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    name: monitoring
    pod-security.kubernetes.io/enforce: privileged
```

```bash
kubectl apply -f namespace-monitoring.yaml
```
The privileged Pod Security level is required because node-exporter mounts host paths to read system metrics from the nodes.
How to Deploy the kube-prometheus Stack with Helm?
Installation via kube-prometheus-stack
The kube-prometheus-stack Helm chart simplifies complete deployment.
```bash
# Add the Prometheus Community repository
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

# Deploy the stack
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --version 65.0.0 \
  --values production-values.yaml
```
Recommended Production Configuration
Create a production-values.yaml file:
```yaml
# production-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: "90GB"
    resources:
      requests:
        cpu: "500m"
        memory: "2Gi"
      limits:
        cpu: "2000m"
        memory: "8Gi"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: standard-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    replicas: 2
    podAntiAffinity: hard

alertmanager:
  alertmanagerSpec:
    replicas: 3
    resources:
      requests:
        cpu: "100m"
        memory: "256Mi"
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: standard-ssd
          resources:
            requests:
              storage: 10Gi

grafana:
  replicas: 2
  persistence:
    enabled: true
    size: 10Gi
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"
  resources:
    requests:
      cpu: "250m"
      memory: "512Mi"

# node-exporter subchart values live under the prometheus-node-exporter key
prometheus-node-exporter:
  resources:
    requests:
      cpu: "50m"
      memory: "64Mi"
```
Verify the Deployment
```bash
# Check pods
kubectl get pods -n monitoring
# Expected output:
# alertmanager-kube-prometheus-alertmanager-0     2/2   Running
# kube-prometheus-grafana-xxx                     3/3   Running
# kube-prometheus-kube-state-metrics-xxx          1/1   Running
# kube-prometheus-operator-xxx                    1/1   Running
# kube-prometheus-prometheus-node-exporter-xxx    1/1   Running
# prometheus-kube-prometheus-prometheus-0         2/2   Running

# Check installed CRDs
kubectl get crd | grep monitoring.coreos.com
```
Remember: Prometheus CRDs (ServiceMonitor, PodMonitor, PrometheusRule) allow declarative monitoring configuration.
How to Configure the kube-prometheus Stack for High Availability?
Multi-Replica Architecture
For production deployment, enable high availability:
```yaml
# ha-config.yaml
prometheus:
  prometheusSpec:
    replicas: 2
    podAntiAffinity: hard
    # Sharding for very large clusters
    # shards: 2

alertmanager:
  alertmanagerSpec:
    replicas: 3
    # Automatic clustering between replicas
    clusterAdvertiseAddress: false
```
Persistent Storage Configuration
Recommended StorageClass for production:
```yaml
# storageclass-prometheus.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: prometheus-ssd
provisioner: kubernetes.io/gce-pd  # Adapt to your cloud provider
parameters:
  type: pd-ssd
  replication-type: regional-pd
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```
General Kubernetes storage best practices apply to monitoring workloads as well.
Production Network Policies
```yaml
# networkpolicy-prometheus.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-ingress
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              monitoring-access: "true"
      ports:
        - protocol: TCP
          port: 9090
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana
      ports:
        - protocol: TCP
          port: 9090
```
How to Configure ServiceMonitors for Your Applications?
Create a ServiceMonitor
A ServiceMonitor is a CRD that automatically configures Prometheus to scrape metrics from a Service.
```yaml
# servicemonitor-myapp.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-metrics
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  selector:
    matchLabels:
      app: myapp
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
      # tlsConfig applies only when scheme is https
      # tlsConfig:
      #   insecureSkipVerify: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
```
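This ServiceMonitor only selects Services in the `production` namespace carrying the `app: myapp` label and exposing a port named `metrics`. A hypothetical Service it would match:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp            # matched by spec.selector.matchLabels
spec:
  selector:
    app: myapp
  ports:
    - name: metrics       # matched by endpoints[0].port (by name, not number)
      port: 8080
      targetPort: 8080
```

A missing label or an unnamed port is the most common reason a target never appears in Prometheus.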
PodMonitor for Pods Without a Service
```yaml
# podmonitor-batch-jobs.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-jobs
  namespace: monitoring
spec:
  selector:
    matchLabels:
      type: batch-job
  podMetricsEndpoints:
    - port: metrics
      interval: 60s
```
Verify Target Discovery
```bash
# Port-forward to Prometheus
kubectl port-forward -n monitoring \
  svc/kube-prometheus-prometheus 9090:9090

# Open http://localhost:9090/targets
# and verify your ServiceMonitor targets appear
```
The Prometheus installation guide details these configurations.
How to Configure Alertmanager for Production?
Configure Alert Routes
```yaml
# alertmanager-config.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: production-alerts
  namespace: monitoring
spec:
  route:
    groupBy: ['alertname', 'cluster', 'service']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    receiver: 'default-receiver'
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: 'pagerduty-critical'
        continue: true
      - matchers:
          - name: severity
            value: warning
        receiver: 'slack-warnings'
  receivers:
    - name: 'default-receiver'
      slackConfigs:
        - apiURL:
            name: slack-webhook
            key: url
          channel: '#alerts-default'
    - name: 'pagerduty-critical'
      pagerdutyConfigs:
        - routingKey:
            name: pagerduty-key
            key: routing-key
          severity: critical
    - name: 'slack-warnings'
      slackConfigs:
        - apiURL:
            name: slack-webhook
            key: url
          channel: '#alerts-warnings'
```
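The routing tree above can be paraphrased in a few lines. A simplified sketch of how Alertmanager walks these routes (real matching also supports regex matchers, nested routes, and grouping):

```python
# Mirrors the AlertmanagerConfig above: severity decides the receiver,
# and `continue: true` lets evaluation fall through to later routes.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": True},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings", "continue": False},
]
DEFAULT_RECEIVER = "default-receiver"

def receivers_for(alert_labels: dict) -> list:
    """Return the receivers an alert is routed to."""
    matched = []
    for route in ROUTES:
        if all(alert_labels.get(k) == v for k, v in route["match"].items()):
            matched.append(route["receiver"])
            if not route["continue"]:
                break
    return matched or [DEFAULT_RECEIVER]

print(receivers_for({"severity": "critical"}))  # ['pagerduty-critical']
```

Note the role of `continue: true` on the critical route: without it, evaluation would stop at the first match and any later sibling routes would never be considered.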
Custom Alert Rules
```yaml
# prometheusrule-custom.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  groups:
    - name: kubernetes-apps
      rules:
        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Pod has restarted {{ $value }} times in the last 15 minutes"
        - alert: PodNotReady
          expr: |
            kube_pod_status_ready{condition="true"} == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not ready"
```
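The PodCrashLooping expression converts a per-second restart rate into restarts per five-minute period (the `* 60 * 5` factor). The arithmetic, applied to two hypothetical counter samples taken 15 minutes apart:

```python
def restarts_per_5m(counter_start: float, counter_end: float,
                    window_s: float = 900) -> float:
    """Mimic rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5."""
    per_second = (counter_end - counter_start) / window_s  # rate() over the window
    return per_second * 60 * 5                             # scale to a 5-minute period

# 3 restarts observed over 15 minutes -> 1 restart per 5 minutes -> alert fires (> 0)
print(restarts_per_5m(10, 13))  # 1.0
```

Any nonzero restart rate sustained for the `for: 15m` hold period fires the alert, which is why even slow crash loops are caught.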
The CrashLoopBackOff pod debugging guide complements these alerts.
How to Optimize Grafana for Production?
Essential Dashboards
The stack includes preconfigured dashboards. Priority dashboards to enable:
- Kubernetes / Compute Resources / Cluster: global view
- Kubernetes / Compute Resources / Namespace (Pods): detail by namespace
- Node Exporter Full: detailed system metrics
- Prometheus Stats: Prometheus health itself
SSO and RBAC Configuration
```yaml
# grafana-values.yaml
grafana:
  grafana.ini:
    server:
      root_url: https://grafana.example.com
    auth.generic_oauth:
      enabled: true
      name: Corporate SSO
      allow_sign_up: true
      client_id: ${OAUTH_CLIENT_ID}
      client_secret: ${OAUTH_CLIENT_SECRET}
      scopes: openid profile email
      auth_url: https://sso.example.com/authorize
      token_url: https://sso.example.com/token
      api_url: https://sso.example.com/userinfo
    users:
      auto_assign_org_role: Viewer
```
Remember: Limit Editor access to SRE teams and Admin to platform engineers.
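Role assignment can also be derived from the SSO identity instead of being managed by hand. One common approach, assuming your IdP returns a `groups` claim (the group names here are hypothetical), uses Grafana's `role_attribute_path` JMESPath expression:

```yaml
grafana:
  grafana.ini:
    auth.generic_oauth:
      # Map SSO groups to Grafana roles; anyone else falls back to Viewer
      role_attribute_path: >-
        contains(groups[*], 'platform-engineers') && 'Admin' ||
        contains(groups[*], 'sre') && 'Editor' || 'Viewer'
```

This keeps role assignment in your identity provider rather than in Grafana's user administration UI.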
Automatic Dashboard Provisioning
```yaml
# dashboards-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards-custom
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  # File-based provisioning expects the raw dashboard model
  # (as exported from the Grafana UI), not the API's {"dashboard": ...} wrapper
  custom-overview.json: |
    {
      "title": "Custom Overview",
      "panels": [...]
    }
```
How to Diagnose kube-prometheus Stack Issues?
Common Issues and Solutions
Prometheus not scraping targets:
```bash
# Check ServiceMonitors
kubectl get servicemonitor -n monitoring

# Check selection labels
kubectl describe servicemonitor myapp-metrics -n monitoring

# Prometheus Operator logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-operator
```
Alertmanager not routing alerts:
```bash
# Check configuration
kubectl exec -n monitoring alertmanager-kube-prometheus-alertmanager-0 \
  -- amtool config show

# Test an alert
kubectl exec -n monitoring alertmanager-kube-prometheus-alertmanager-0 \
  -- amtool alert add test severity=warning
```
The Kubernetes network diagnostics guide helps resolve connectivity issues.
Health Metrics to Monitor
```promql
# Prometheus health
up{job="prometheus"}

# Scrape error rate
sum(rate(prometheus_target_scrape_sample_failed_total[5m])) by (job)

# Prometheus memory usage
process_resident_memory_bytes{job="prometheus"}

# Query latency
histogram_quantile(0.99, rate(prometheus_http_request_duration_seconds_bucket[5m]))
```
The production monitoring architecture details these patterns.
How to Update the kube-prometheus Stack?
Update Process
```bash
# Check available versions
helm search repo prometheus-community/kube-prometheus-stack --versions

# Back up configurations
kubectl get prometheusrule -n monitoring -o yaml > backup-rules.yaml
kubectl get servicemonitor -n monitoring -o yaml > backup-servicemonitors.yaml

# Preview changes (requires the helm-diff plugin)
helm diff upgrade kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --version 66.0.0 \
  --values production-values.yaml

# Apply the update
helm upgrade kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --version 66.0.0 \
  --values production-values.yaml
```
Points of Caution
- CRD updates: CRDs sometimes need manual updates
- Breaking changes: consult the CHANGELOG between versions
- Data retention: Prometheus data is preserved if PVC is retained
Remember: Always test updates on a staging cluster before production.
The Kubernetes observability guide contextualizes these practices.
Summary: Production Checklist for kube-prometheus
Before going to production, validate:
- [ ] Persistent storage configured with appropriate retention
- [ ] HA replicas for Prometheus (2+) and Alertmanager (3+)
- [ ] Restrictive Network Policies applied
- [ ] Alert routes tested end-to-end
- [ ] Custom dashboards provisioned
- [ ] Grafana SSO and RBAC configured
- [ ] Configuration backup automated
- [ ] Runbooks documented for each critical alert
Master Kubernetes Monitoring in Production
Want to deepen your skills in Kubernetes monitoring and administration?
Infrastructure engineers preparing for the LFS458 Kubernetes Administration training benefit from a comprehensive program covering production observability.
Recommended SFEIR Institute trainings:
- LFS458 Kubernetes Administration: 4 days including advanced monitoring and troubleshooting
- LFD459 Kubernetes for Developers: 3 days with application observability focus
- Kubernetes Fundamentals: 1 day to discover essential concepts
Contact our advisors for information on upcoming sessions and financing options.