How Auto Scaling works in Kubernetes and how to do it

Overview

In this blog, we’ll learn how auto scaling works in Kubernetes via the Horizontal Pod Autoscaler (HPA) and how to configure it.

Why? What? and How?

The first thing to understand is WHY.

Simple answer: when traffic increases beyond normal, we need to make sure that all requests are still served. How do we do that? Just increase the number of application instances running.

In Kubernetes this is done by the Horizontal Pod Autoscaler (HPA).

Understanding how the HPA works

In simple terms, the HPA periodically asks the Metrics Server for the CPU/memory usage of the pods. If the usage is above the target we defined, it adds more pods; if it is below, it scales back down.

What is this Metrics Server?

It is a cluster-level aggregator of resource usage data. The Metrics Server collects node and pod resource usage from each kubelet and makes it available through the Kubernetes Metrics API (metrics.k8s.io), which is what the HPA queries.

Control Loop Flow

┌─────────────────────────────────────────────────────────────────┐
│                    HPA Control Loop (every 15s)                 │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                 ┌────────────────────────┐
                 │ 1. Fetch HPA spec from │
                 │    kube-apiserver      │
                 └───────────┬────────────┘
                             │
                             ▼
                 ┌────────────────────────┐
                 │ 2. Query current       │
                 │    metrics from        │
                 │    Metrics API         │
                 └───────────┬────────────┘
                             │
                             ▼
                 ┌────────────────────────┐
                 │ 3. Calculate desired   │
                 │    replicas using      │
                 │    scaling algorithm   │
                 └───────────┬────────────┘
                             │
                             ▼
                 ┌────────────────────────┐
                 │ 4. Apply scaling       │
                 │    decision            │
                 │    (if needed)         │
                 └───────────┬────────────┘
                             │
                             ▼
                 ┌────────────────────────┐
                 │ 5. Update HPA status   │
                 │    (current replicas,  │
                 │     current metrics)   │
                 └────────────────────────┘
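The five steps above can be sketched in a few lines of Python. This is a minimal simulation, not the real controller: `fetch_hpa_spec` and `query_metric` are hypothetical stubs standing in for calls to the kube-apiserver and the Metrics API, and the numbers are illustrative.

```python
import math

def fetch_hpa_spec():
    # Step 1 (stub): in reality the spec is fetched from kube-apiserver
    return {"target": 50, "min": 2, "max": 10}

def query_metric():
    # Step 2 (stub): in reality this comes from the Metrics API
    return 80  # current average utilization, in percent

def reconcile(current_replicas):
    spec = fetch_hpa_spec()
    current = query_metric()
    # Step 3: desiredReplicas = ceil(current * (currentMetric / targetMetric))
    desired = math.ceil(current_replicas * (current / spec["target"]))
    # ...clamped to the configured min/max bounds
    desired = max(spec["min"], min(spec["max"], desired))
    # Step 4: apply the scaling decision only if it changed
    if desired != current_replicas:
        current_replicas = desired
    # Step 5: the real controller would now write status back to the HPA object
    return current_replicas

print(reconcile(2))  # 4
```

The real controller repeats this reconcile on a timer (15s by default) rather than reacting to events, which is why scaling always lags the metric slightly.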

Scaling algorithm

desiredReplicas = ceil(currentReplicas × (currentMetricValue / targetMetricValue))

Example Calculation

Given:

- currentReplicas = 2
- currentMetricValue = 80 (e.g. 80% average utilization)
- targetMetricValue = 50

Calculation: desiredReplicas = ceil(2 × (80 / 50)) = ceil(2 × 1.6) = ceil(3.2) = 4

Result: Scale from 2 → 4 pods

When multiple metrics are defined, the HPA calculates a desired replica count for each metric and takes the max().
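The formula and the max() rule above fit in one small function. This is a simplified sketch (function name and the min/max clamping parameters are our own; the real controller also applies tolerances and stabilization windows):

```python
import math

def desired_replicas(current_replicas, metrics, min_replicas, max_replicas):
    """metrics: list of (currentValue, targetValue) pairs.
    Compute a desired count per metric, take the max, then clamp."""
    desired = max(
        math.ceil(current_replicas * (current / target))
        for current, target in metrics
    )
    return max(min_replicas, min(max_replicas, desired))

# The example above: 2 replicas, metric at 80 against a target of 50
print(desired_replicas(2, [(80, 50)], min_replicas=2, max_replicas=10))  # 4

# Two metrics: one wants 4 replicas, the other only 2 -> max() wins
print(desired_replicas(2, [(80, 50), (30, 50)], min_replicas=2, max_replicas=10))  # 4
```

Taking the max is a deliberately conservative choice: the deployment is scaled for whichever metric is under the most pressure.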

The Interesting part: How

The HPA can scale a Deployment, ReplicaSet, or StatefulSet, but not a bare Pod, because a Pod has no replicas field.

Before we do this, make sure the Metrics Server is running in the cluster.

Just identify what needs to be scaled and use the manifest below to configure it:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector-hpa
  namespace: monitoring
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 50

Here metadata.name is the HPA's name. In spec.scaleTargetRef we define what to target for scaling, then we set the max and min replica counts, and finally we define on what basis it should scale (here, 50% average memory utilization).

That’s it, just a simple config. Apply it with `kubectl apply -f <file>.yaml` and watch it work with `kubectl get hpa -n monitoring -w`.

Notes

Further Readings