Configuring HPA on On-Prem Kubernetes: A Practical Guide

TL;DR

Set up metrics-server in your Kubernetes cluster, define resource requests, apply an HPA, and let Kubernetes scale your workloads automatically. Visualize everything in Grafana.

Introduction

This article focuses on configuring the Horizontal Pod Autoscaler (HPA) in an on-premises Kubernetes environment using Resource Metrics, specifically CPU and memory usage collected by the metrics-server.

Kubernetes also supports Custom Metrics (application-specific metrics exposed inside the cluster, often collected via the Prometheus Adapter) and External Metrics (metrics from outside the cluster, such as queue length, cloud services, or external APIs). This guide, however, walks you through the most common and foundational approach: autoscaling based on resource usage.

If you need to scale using custom or external metrics, additional setup is required. However, for most on-premises clusters, starting with resource metrics is the essential first step.

Types of Metrics for HPA Autoscaling

  • Resource Metrics (CPU, memory): Focus of this article
    Collected by metrics-server and used for most standard autoscaling needs.
  • Custom Metrics:
    Application-specific metrics (e.g., requests per second), typically gathered via Prometheus Adapter.
  • External Metrics:
    Metrics from outside the cluster (e.g., queue length from SQS, messages in Kafka, or other cloud services). This requires additional configuration beyond resource or in-cluster metrics.
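
Each of these maps to a different metrics entry in an autoscaling/v2 HPA spec. The sketch below shows all three side by side; the custom and external metric names (http_requests_per_second, queue_messages_ready) are hypothetical, and the last two entries require the extra adapters mentioned above:

metrics:
- type: Resource              # from metrics-server (this article)
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 50
- type: Pods                  # custom metric, e.g. via Prometheus Adapter
  pods:
    metric:
      name: http_requests_per_second   # hypothetical metric name
    target:
      type: AverageValue
      averageValue: "100"
- type: External              # metric from outside the cluster
  external:
    metric:
      name: queue_messages_ready       # hypothetical metric name
    target:
      type: AverageValue
      averageValue: "30"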

Installing metrics-server

Before proceeding, this guide assumes you already have a working on-premises Kubernetes cluster and Prometheus properly installed and configured. Now, let’s deploy the metrics-server using the official manifest:

$ wget https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml -O metrics-server.yaml
$ kubectl apply -f metrics-server.yaml
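
If you manage cluster add-ons with Helm, metrics-server also publishes an official chart, which lets you set the flags discussed below as chart values instead of editing the deployment by hand. A minimal sketch:

$ helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
$ helm upgrade --install metrics-server metrics-server/metrics-server \
    --namespace kube-system \
    --set 'args={--kubelet-insecure-tls}'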

metrics-server requires the kubelet's serving certificate to be signed by the cluster's Certificate Authority. If it isn't (common in on-premises clusters, where kubelets often serve self-signed certificates), you need to disable certificate validation by passing the --kubelet-insecure-tls flag. In most on-premises scenarios, it's common to add this flag directly to the metrics-server deployment:

$ kubectl -n kube-system edit deployment metrics-server
containers:
- name: metrics-server
  args:
    # append these to the existing args list; do not replace it
    - --kubelet-insecure-tls
    - --kubelet-preferred-address-types=InternalIP
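
If you prefer a non-interactive change (handy for bootstrap scripts), the same flags can be appended with a JSON patch. This sketch assumes metrics-server is the first container in the pod spec:

$ kubectl -n kube-system patch deployment metrics-server --type=json -p='[
    {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"},
    {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-preferred-address-types=InternalIP"}]'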

Make sure the metrics-server deployment is running correctly:

$ kubectl get deployment metrics-server -n kube-system
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server   1/1     1            1           18d

and

$ kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
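
That call returns a JSON NodeMetricsList. Two quick spot checks, assuming jq is installed on your workstation:

$ kubectl get apiservice v1beta1.metrics.k8s.io        # AVAILABLE must be True
$ kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" \
    | jq '.items[] | {node: .metadata.name, cpu: .usage.cpu, memory: .usage.memory}'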

After confirming that metrics-server is working, you can check live resource usage across your cluster:

$ kubectl top nodes
NAME                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
kub-controlplane.localdomain  156m         3%     1828Mi          23%
kub-node1.localdomain         753m         4%     3113Mi          9%
kub-node2.localdomain         825m         5%     2103Mi          6%
kub-node3.localdomain         444m         2%     3578Mi          11%

and

$ kubectl top pods -A
NAMESPACE               NAME                                                  CPU(cores)   MEMORY(bytes)
appX-qa                 redis-c8878bc85-bvxm6                                 9m           14Mi
argocd                  argocd-application-controller-0                       7m           149Mi
argocd                  argocd-applicationset-controller-64f6bd6456-xfj45     4m           23Mi
[...]

Configuring HPA for Your Deployment

With metrics-server installed and reporting usage correctly, you can now configure an HPA (Horizontal Pod Autoscaler) for your application.

Before configuring HPA, your Deployment must define CPU and memory resource requests (limits are optional but recommended). Utilization is calculated as a percentage of the request, so without requests the HPA cannot compute it at all.

Here’s an example of what to add or change in your deployment:

resources:
  requests:
    cpu: "1000m"
    memory: "2048Mi"
  limits:
    cpu: "2000m"
    memory: "2560Mi"
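
For context, this block lives under each container in the pod template. A minimal sketch of where it fits in a hypothetical image-processor Deployment (image name assumed, selector and labels omitted for brevity):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: image-processor
spec:
  template:
    spec:
      containers:
      - name: image-processor
        image: registry.example.com/image-processor:1.0   # assumed image
        resources:
          requests:
            cpu: "1000m"
            memory: "2048Mi"
          limits:
            cpu: "2000m"
            memory: "2560Mi"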

Below is an example HPA named image-processor-hpa that targets a Deployment named image-processor and scales it based on CPU utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: image-processor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: image-processor
  minReplicas: 3
  maxReplicas: 12
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
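
Save the manifest and apply it (the namespace is assumed to be image-processor):

$ kubectl apply -f image-processor-hpa.yaml -n image-processor

or, equivalently, create the same autoscaler imperatively:

$ kubectl autoscale deployment image-processor --cpu-percent=50 --min=3 --max=12 -n image-processor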

Note that the averageUtilization target (50%) is calculated as a percentage of the CPU request (1000m) defined in your deployment, not of the limit.
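
Under the hood, the controller uses the formula from the Kubernetes documentation:

desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]

With a 1000m request and a 50% target, the per-pod target is 500m. So if 4 replicas are averaging 800m each, the HPA computes ceil(4 × 800 / 500) = 7 and scales the Deployment to 7 replicas (capped by maxReplicas).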

⚠️ If your Deployment manifest specifies a fixed number of replicas, remove that field. Otherwise, every re-apply of the manifest will reset the replica count and conflict with the Horizontal Pod Autoscaler, which takes control of it.
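
For a Deployment that is already running, simply deleting replicas and re-applying can momentarily scale it down to the default of one replica. The Kubernetes docs describe a gentler migration via kubectl apply edit-last-applied; a sketch of that sequence:

$ kubectl apply edit-last-applied deployment/image-processor -n image-processor
# in the editor: delete the spec.replicas line, save, and exit
# then remove spec.replicas from the manifest in your repository before the next apply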

You can verify that the HPA is working correctly by checking the current resource usage of your pods:

$ kubectl top pods -n image-processor
NAME                                 CPU(cores)   MEMORY(bytes)
redis-7cd85f6cdf-85kf8               4m           3Mi
image-processor-86dfdc5945-5czbf     2m           520Mi
image-processor-86dfdc5945-6s8kz     1m           498Mi
image-processor-86dfdc5945-d69js     1m           530Mi
image-processor-86dfdc5945-gcf8p     1m           505Mi
image-processor-86dfdc5945-q4fzs     2m           490Mi
image-processor-86dfdc5945-q54kx     1m           518Mi

and

$ kubectl describe hpa -n image-processor image-processor-hpa
Name:                                                  image-processor-hpa
Namespace:                                             image-processor
Labels:                                                app.kubernetes.io/instance=image-processor
Annotations:                                           <none>
CreationTimestamp:                                     Mon, 26 May 2025 10:24:41 -0300
Reference:                                             Deployment/image-processor
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  0% (1m) / 50%
Min replicas:                                          3
Max replicas:                                          12
Deployment pods:                                       4 current / 4 desired
Conditions:
  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
  ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range
Events:
  Type    Reason             Age                    From                       Message
  ----    ------             ----                   ----                       -------
  Normal  SuccessfulRescale  18m (x490 over 17d)    horizontal-pod-autoscaler  New size: 3; reason: All metrics below target
  Normal  SuccessfulRescale  7m31s (x330 over 18d)  horizontal-pod-autoscaler  New size: 4; reason: cpu resource utilization (percentage of request) above target
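
To see a scale-up happen in real time, you can generate artificial load. A throwaway sketch, assuming the Deployment is exposed through a Service named image-processor in the same namespace:

$ kubectl -n image-processor run load-generator --rm -it --image=busybox:1.36 --restart=Never -- \
    /bin/sh -c 'while sleep 0.01; do wget -q -O- http://image-processor; done'

and, in another terminal:

$ kubectl get hpa image-processor-hpa -n image-processor --watch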

Grafana

As an extra step, you can visualize your HPA behavior and scaling history using Grafana, assuming you already have it connected to your Kubernetes cluster through Prometheus.

The Grafana dashboard library provides a ready-to-use dashboard for Horizontal Pod Autoscaler (HPA) metrics.

To use it:

  1. Open your Grafana interface.
  2. Go to Dashboards → Import.
  3. Enter the dashboard ID: 22128.
  4. Select your Prometheus data source.
  5. Click Import.

This dashboard provides visibility into replica count over time, metric thresholds, and scaling events for each autoscaled deployment.
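
If you would rather build your own panels, the replica counts come from kube-state-metrics (assuming Prometheus scrapes it). Two example queries:

kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="image-processor-hpa"}
kube_horizontalpodautoscaler_status_desired_replicas{horizontalpodautoscaler="image-processor-hpa"}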

Here’s an example of the Grafana dashboard in action:

[Image: grafana-k8s-hpa-example]

The names in the screenshot were redacted for consistency with the example.

Final Thoughts

Using metrics-server, you can enable basic autoscaling in Kubernetes based on CPU and memory utilization. This setup is straightforward and works well for many general workloads.

However, metrics-server only supports resource metrics (CPU and memory), as documented in the official Kubernetes documentation. If your use case requires scaling based on more advanced or custom metrics, there are alternative approaches.

These advanced options will be covered in a future article.

– dbaio
