Horizontal Pod Autoscaling

The operator does not set spec.replicas on Gateway Deployments. This leaves replica-count ownership to whatever external controller you choose: a HorizontalPodAutoscaler (HPA), KEDA, or a manual kubectl scale. The operator will not fight your autoscaler.

How it works

When the operator builds the desired Deployment it deliberately leaves Spec.Replicas unset. Kubernetes defaults a nil Replicas field to 1 on create, so a fresh Gateway starts with a single pod. On subsequent reconciles the operator only writes Spec.Template and Spec.Strategy — never Spec.Replicas — so whatever replica count the HPA has settled on is preserved across reconciles, user VCL changes, image bumps, and any other controller activity.
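
As a rough illustration of the paragraph above, the desired Deployment has approximately the following shape. This is a sketch, not the operator's actual output: the label key, container name, and image are placeholders chosen for the example.

```yaml
# Illustrative only: labels, container name, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-gateway        # same name and namespace as the Gateway
  namespace: default
spec:
  # No `replicas` key: the API server defaults it to 1 on create, and the
  # operator's later reconciles write only `template` and `strategy`.
  selector:
    matchLabels:
      app: my-gateway     # placeholder label
  template:
    metadata:
      labels:
        app: my-gateway   # placeholder label
    spec:
      containers:
        - name: varnish                                # placeholder name
          image: example.com/varnish-gateway:latest    # placeholder image
```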

This also means you can kubectl scale deployment/<gateway-name> for ad-hoc scaling without the operator reverting it.

Sample HPA manifest

The Deployment created by the operator has the same name and namespace as its Gateway. The following HPA targets a Gateway named my-gateway in the default namespace:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-gateway
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 60

minReplicas: 2 is a deliberate floor: it lets you combine the HPA with a PodDisruptionBudget without blocking node drains. A Gateway at minReplicas: 1 combined with a PDB of minAvailable: 1 can never have its only pod evicted voluntarily, so draining the node that hosts it will stall.

Choosing a metric

CPU utilization (Resource metric). The simplest option and the only one that works out of the box. Varnish is largely CPU-bound once the working set fits in cache — decompression, TLS, VCL, and ghost's route matching all consume CPU — so CPU tracks load reasonably well for most workloads. It requires metrics-server to be installed in the cluster and resource requests to be set on the pod (the operator sets a default CPU request of 100m, see resources-and-scaling.md).

CPU utilization is the ratio of usage to request, so if you raise the CPU request via GatewayClassParameters.spec.resources, adjust averageUtilization accordingly: a 70% target means something very different against a 100m request than against a request of 2 full cores.
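
To make the arithmetic concrete, the fragment below works the two cases in comments and shows one possible alternative: the autoscaling/v2 AverageValue target type, which expresses the target as absolute CPU usage per pod and so does not shift when the request changes. The 700m figure is an assumption for illustration, not a recommendation.

```yaml
# Fragment: would replace the `metrics` list in the sample HPA above.
#
# With a Utilization target of 70%:
#   100m request   -> HPA aims for about   70m average usage per pod
#   2-core request -> HPA aims for about 1400m average usage per pod
#
# With an AverageValue target the number is absolute and independent of
# the CPU request.
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: AverageValue
        averageValue: 700m   # illustrative value
```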

Memory utilization. Generally a poor fit. Varnish's cache fills to its configured storage size and stays there; memory usage is close to constant regardless of request rate. Autoscaling on memory will produce a fleet sized to the cache, not to the load.

Custom metrics. Requests per second (RPS), p95 latency, or backend connection count track real load better than CPU. These require a custom metrics adapter (Prometheus Adapter, KEDA, etc.) and are out of scope for this document — point the adapter at Varnish's stats (varnishstat) or at whatever ingress metric your monitoring stack exposes. See observability.md for available signals.
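
For orientation only, a per-pod custom metric in an HPA looks roughly like the fragment below. It assumes a metrics adapter is already installed and serving the metric through the custom metrics API; the metric name varnish_requests_per_second and the target of 500 are hypothetical and must be replaced with whatever your adapter actually exposes.

```yaml
# Fragment: would replace the `metrics` list in the sample HPA above.
# `varnish_requests_per_second` is a hypothetical metric name.
metrics:
  - type: Pods
    pods:
      metric:
        name: varnish_requests_per_second
      target:
        type: AverageValue
        averageValue: "500"   # illustrative per-pod target
```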

Interactions

PodDisruptionBudget. See pod-disruption-budgets.md. A percentage-based PDB (maxUnavailable: 25%) tracks HPA-driven scale automatically; an absolute minAvailable does not.
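
A minimal sketch of a percentage-based PDB for the my-gateway example follows. The selector label is a placeholder, since this page does not list the labels the operator puts on Gateway pods; substitute a label that is actually present on them.

```yaml
# Sketch only: the selector label below is a placeholder.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-gateway
  namespace: default
spec:
  # Kubernetes rounds percentage values up to whole pods, so this allows
  # 1 disruption at 2 replicas and 3 disruptions at 10 replicas.
  maxUnavailable: 25%
  selector:
    matchLabels:
      app: my-gateway   # placeholder label
```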

Rolling restarts. Infrastructure changes (listener edits, varnishdExtraArgs changes, image bumps) trigger a rolling restart via the varnish.io/infra-hash annotation. The restart replaces pods one-by-one against the current HPA-chosen replica count — the HPA is not consulted mid-restart, and replicas are not reset to a default.

Cache warmth on scale-up

Newly added pods start with an empty cache and will generate a burst of backend traffic as they fill. Size your scaleUp stabilizationWindowSeconds and your backend capacity accordingly. The Gateway Service load-balances uniformly across ready pods, so a cold pod is just as likely to receive any given request as a warm one.
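
One further knob, beyond the stabilization window, is a scaleUp rate policy, which caps how many cold pods can join in a given period. The fragment below is a sketch under that assumption; the limit of 2 pods per 120 seconds is illustrative and should be sized against what your backends can absorb.

```yaml
# Fragment: would replace the `behavior` block in the sample HPA above.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Pods
        value: 2             # add at most 2 pods...
        periodSeconds: 120   # ...per 120-second window (illustrative)
  scaleDown:
    stabilizationWindowSeconds: 300
```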

See also