
Kubernetes Node Management: Add, Maintain, Drain and Autoscaling

SFEIR Institute

Key Takeaways

  • kubectl drain evacuates pods before system maintenance
  • Organizations run 20+ clusters on average, making autoscaling essential (Spectrocloud 2025)
  • kubectl cordon prevents scheduling of new pods on a node

Kubernetes node management is a fundamental skill for any production cluster administrator. With 82% of container users running Kubernetes in production and an average of 20+ clusters per organization, mastering drain, cordon, and autoscaling operations is essential. This guide walks you step by step through node management, from adding a worker node to configuring automatic node scaling.

TL;DR: To effectively manage your Kubernetes nodes, use kubectl cordon to prevent scheduling, kubectl drain to evacuate pods before maintenance, and configure the Cluster Autoscaler to automatically adjust capacity. Each operation should be verified with kubectl get nodes and kubectl describe node.

These skills are at the core of the LFS458 Kubernetes Administration training.

Prerequisites: required environment and tools

Before starting this practical guide, ensure you have the following:

Infrastructure:

  • A working Kubernetes cluster (v1.28+)
  • SSH access to cluster nodes
  • Administrator rights (ClusterRole cluster-admin)

Installed tools:

# Check kubectl version
kubectl version --client
# Client Version: v1.29.0

# Check cluster access
kubectl cluster-info
# Kubernetes control plane is running at https://192.168.1.10:6443

Key takeaway: Without cluster-admin access, you won't be able to perform drain operations or modify node taints. There is no "drain" verb in RBAC; check the underlying permissions with kubectl auth can-i create pods/eviction and kubectl auth can-i patch nodes.

Step 1: Add a new worker node to the cluster

Adding a worker node is done in three phases: server preparation, join token generation, and integration verification.

1.1 Prepare the new server

Install dependencies on the new node:

# Disable swap (required for Kubernetes)
sudo swapoff -a
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab

# Install containerd
sudo apt-get update && sudo apt-get install -y containerd

# Configure containerd (enable the systemd cgroup driver expected by kubeadm)
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd

# Install kubelet and kubeadm (assumes the pkgs.k8s.io apt repository is configured)
sudo apt-get install -y kubelet kubeadm

1.2 Generate the join token from the control plane

On the master node, create a new token:

kubeadm token create --print-join-command
# Expected result:
# kubeadm join 192.168.1.10:6443 --token abcdef.0123456789abcdef \
#     --discovery-token-ca-cert-hash sha256:xyz123...

1.3 Execute the join command on the worker

On the new worker node:

sudo kubeadm join 192.168.1.10:6443 --token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:xyz123...
# [preflight] Running pre-flight checks
# [kubelet-start] Starting the kubelet
# This node has joined the cluster

Verification from master:

kubectl get nodes
# NAME      STATUS   ROLES           AGE   VERSION
# master    Ready    control-plane   30d   v1.29.0
# worker1   Ready    <none>          25d   v1.29.0
# worker2   Ready    <none>          10s   v1.29.0

For a complete multi-node installation, consult the kubeadm installation guide.

Step 2: Use cordon to isolate a node

The kubectl cordon command marks a node as non-schedulable. Existing pods continue running, but no new pods will be placed on this node.

2.1 Mark a node as non-schedulable

kubectl cordon worker2
# node/worker2 cordoned

Check node status:

kubectl get nodes
# NAME      STATUS                     ROLES           AGE   VERSION
# master    Ready                      control-plane   30d   v1.29.0
# worker1   Ready                      <none>          25d   v1.29.0
# worker2   Ready,SchedulingDisabled   <none>          1h    v1.29.0

2.2 Examine applied taints

kubectl describe node worker2 | grep -A5 Taints
# Taints:             node.kubernetes.io/unschedulable:NoSchedule

Key takeaway: Cordon is ideal for planned maintenance. Existing pods are not impacted, allowing you to prepare maintenance without service interruption.

2.3 Restore scheduling

Use uncordon to reactivate the node:

kubectl uncordon worker2
# node/worker2 uncordoned

kubectl get nodes worker2
# NAME      STATUS   ROLES    AGE   VERSION
# worker2   Ready    <none>   1h    v1.29.0
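In automation, cordoned nodes can be detected from the STATUS column of kubectl get nodes. A minimal sketch of that filtering logic, with the cluster output simulated by a heredoc so it runs without a live cluster (the node names are the examples from above):

```shell
# List cordoned nodes by filtering the STATUS column of `kubectl get nodes`.
# The cluster output is simulated here so the logic runs standalone.
list_cordoned() {
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ {print $1}'
}

# Prints "worker2", the only node whose STATUS contains SchedulingDisabled
list_cordoned <<'EOF'
NAME      STATUS                     ROLES           AGE   VERSION
master    Ready                      control-plane   30d   v1.29.0
worker1   Ready                      <none>          25d   v1.29.0
worker2   Ready,SchedulingDisabled   <none>          1h    v1.29.0
EOF
```

On a live cluster, pipe the real output through the same function: kubectl get nodes | list_cordoned.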

Step 3: Perform a safe drain for maintenance

kubectl drain evacuates all pods from a node before maintenance. This operation is essential for Kubernetes production node maintenance.

3.1 Standard drain with safety options

kubectl drain worker2 --ignore-daemonsets --delete-emptydir-data
# node/worker2 cordoned
# evicting pod default/nginx-deployment-abc123
# evicting pod kube-system/coredns-xyz789
# pod/nginx-deployment-abc123 evicted
# pod/coredns-xyz789 evicted
# node/worker2 drained

Essential options:

Option                    Description
--ignore-daemonsets       Skip DaemonSet-managed pods (a drain cannot evict them; the DaemonSet would recreate them immediately)
--delete-emptydir-data    Allow eviction of pods using emptyDir volumes (their data is lost)
--force                   Evict standalone pods not managed by a controller
--grace-period=30         Grace period in seconds for pod shutdown
--timeout=300s            Abort the operation after this duration
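These options can be combined in maintenance scripts. A minimal bash sketch that assembles the full command (the node name and flag values are example assumptions; the command is printed rather than executed):

```shell
#!/usr/bin/env bash
# Assemble a drain command from the safety options above.
# NODE and the flag values are examples, not fixed recommendations.
NODE="worker2"
drain_cmd=(kubectl drain "$NODE"
  --ignore-daemonsets
  --delete-emptydir-data
  --grace-period=30
  --timeout=300s)

# Dry run: print the assembled command instead of executing it
echo "${drain_cmd[@]}"
# prints: kubectl drain worker2 --ignore-daemonsets --delete-emptydir-data --grace-period=30 --timeout=300s
```

Removing the echo (running "${drain_cmd[@]}" directly) executes the drain against the cluster.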

3.2 Handle PodDisruptionBudgets

PodDisruptionBudgets (PDB) can block a drain:

kubectl drain worker2 --ignore-daemonsets
# evicting pod default/nginx-deployment-abc123
# error when evicting pods/"nginx-deployment-abc123" -n "default" (will retry after 5s):
# Cannot evict pod as it would violate the pod's disruption budget.

Check PDBs:

kubectl get pdb -A
# NAMESPACE   NAME           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# default     nginx-pdb      2               N/A               0                     10d

Configure PDBs only for critical workloads, and make sure ALLOWED DISRUPTIONS is at least 1 before draining: here it is 0, so every eviction is blocked.
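For reference, a PodDisruptionBudget like the nginx-pdb shown above could be created with a manifest along these lines (a sketch; the app: nginx label selector is an assumption about the Deployment's pod labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
  namespace: default
spec:
  minAvailable: 2        # matches the MIN AVAILABLE column above
  selector:
    matchLabels:
      app: nginx         # assumed label on the nginx Deployment's pods
```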

3.3 Post-drain verification

kubectl get pods -o wide | grep worker2
# (no results - all pods have been evacuated)

kubectl get nodes worker2
# NAME      STATUS                     ROLES    AGE   VERSION
# worker2   Ready,SchedulingDisabled   <none>   1h    v1.29.0

Key takeaway: Always check PDBs before a planned drain. A blocked drain can delay critical maintenance and create emergency situations.

Step 4: Configure node autoscaling

Kubernetes cluster node autoscaling allows automatic adaptation of cluster capacity to load. According to ScaleOps, 65%+ of workloads use less than half their allocated resources.

4.1 Install the Cluster Autoscaler

Deploy the Cluster Autoscaler (AWS EKS example):

kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

Configure parameters:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --nodes=2:10:my-node-group
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --skip-nodes-with-local-storage=false

4.2 Verify operation

kubectl get pods -n kube-system -l app=cluster-autoscaler
# NAME                                  READY   STATUS    RESTARTS   AGE
# cluster-autoscaler-7c4d5f8d9-abcde    1/1     Running   0          5m

kubectl logs -n kube-system -l app=cluster-autoscaler --tail=20
# I0215 10:30:00.123456  1 scale_up.go:300] Scaled up node group my-node-group: 3 -> 4

The LFS458 training covers autoscaling configuration for different cloud providers in detail.

4.3 Configure scaling metrics

The HorizontalPodAutoscaler scales pods rather than nodes: when new replicas no longer fit on the existing nodes, the resulting pending pods are what triggers the Cluster Autoscaler. Save the following as nginx-hpa.yaml:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

kubectl apply -f nginx-hpa.yaml
kubectl get hpa
# NAME        REFERENCE          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# nginx-hpa   Deployment/nginx   45%/70%   3         10        3          1m

Step 5: Upgrade a node without interruption

Node upgrades combine the cordon and drain operations covered above and are a routine part of production node management.

5.1 Rolling upgrade process

# 1. Drain the node
kubectl drain worker2 --ignore-daemonsets --delete-emptydir-data

# 2. Perform maintenance (OS update, kubelet, etc.)
sudo apt-get update && sudo apt-get upgrade -y kubelet kubeadm
sudo systemctl restart kubelet

# 3. Verify version
kubectl get nodes worker2
# NAME      STATUS                     ROLES    AGE   VERSION
# worker2   Ready,SchedulingDisabled   <none>   30d   v1.29.1

# 4. Reactivate the node
kubectl uncordon worker2

5.2 Post-maintenance health check

kubectl get nodes -o wide
# All nodes should be Ready

kubectl get pods -A -o wide | grep -v Running
# No pods should be in non-Running state
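To repeat the drain / upgrade / uncordon sequence across a pool of workers, the steps can be wrapped in a loop. A dry-run bash sketch (node names and SSH access are assumptions; each command is echoed rather than executed):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Roll the three-step upgrade across nodes, one at a time.
# The echo prefix makes this a dry run; remove it to execute for real.
upgrade_node() {
  local node="$1"
  echo kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  echo ssh "$node" 'sudo apt-get update && sudo apt-get upgrade -y kubelet kubeadm && sudo systemctl restart kubelet'
  echo kubectl uncordon "$node"
}

# Example node names; replace with your own worker pool
for node in worker1 worker2; do
  upgrade_node "$node"
done
```

Upgrading one node at a time keeps the rest of the pool available to absorb the evicted pods.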

To deepen network aspects of your cluster, consult the guide on CNI, Services and Ingress network configuration.

Troubleshooting: common problems and solutions

Drain remains blocked

Symptom: The kubectl drain command doesn't complete.

kubectl drain worker2 --ignore-daemonsets --timeout=60s
# evicting pod default/stuck-pod
# error: timed out waiting for pod to be deleted

Solution:

# Identify the blocking pod
kubectl get pods -o wide | grep worker2

# Force deletion if necessary
kubectl delete pod stuck-pod --force --grace-period=0

# Retry drain
kubectl drain worker2 --ignore-daemonsets --force

Node remains NotReady after maintenance

Diagnosis:

kubectl describe node worker2 | grep -A10 Conditions
# Type             Status
# MemoryPressure   False
# DiskPressure     False
# PIDPressure      False
# Ready            False

# Check kubelet logs
sudo journalctl -u kubelet -f

Common solution:

sudo systemctl restart kubelet
sudo systemctl restart containerd

Key takeaway: Document each maintenance operation in a runbook. The ecosystem evolves constantly; your procedures must keep up.

Autoscaler doesn't trigger scale-up

Verification:

kubectl describe configmap cluster-autoscaler-status -n kube-system
# Check non-scaling reasons

kubectl get events -n kube-system | grep autoscaler

For advanced node management and certification preparation, explore Kubernetes cluster administration and etcd backup techniques.

Next steps: validate your skills

Kubernetes node management is a key skill for any administrator preparing the CKA certification. With 104,000 people having taken the CKA and 49% annual growth, these certifications validate your expertise to employers.

As a company CTO confirms: "Just given the capabilities that exist with Kubernetes, and the company's desire to consume more AI tools, we will use Kubernetes more in future."

Recommended training: the LFS458 Kubernetes Administration course mentioned throughout this guide.

To explore all Kubernetes skills, consult the Complete Kubernetes Training Guide.