Horizontal Pod Autoscaling¶
The operator does not set spec.replicas on Gateway Deployments. This leaves replica
count ownership to whatever external controller you choose: a
HorizontalPodAutoscaler (HPA), KEDA, or manual kubectl scale. The operator will not
fight your autoscaler.
How it works¶
When the operator builds the desired Deployment it deliberately leaves Spec.Replicas
unset. Kubernetes defaults a nil Replicas field to 1 on create, so a fresh Gateway
starts with a single pod. On subsequent reconciles the operator only writes
Spec.Template and Spec.Strategy — never Spec.Replicas — so whatever replica count
the HPA has settled on is preserved across reconciles, user VCL changes, image bumps,
and any other controller activity.
This also means you can kubectl scale deployment/<gateway-name> for ad-hoc scaling
without the operator reverting it.
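The reconcile semantics above can be sketched with plain structs (a simplified model for illustration, not the operator's actual types): the desired state leaves Replicas nil, and the apply step copies only the template and strategy, so an externally chosen replica count survives every reconcile.

```go
package main

import "fmt"

// Simplified stand-in for the appsv1.Deployment fields involved; the real
// operator works with the Kubernetes API types.
type Deployment struct {
	Replicas *int32 // nil means "not managed by the operator"
	Template string
	Strategy string
}

// buildDesired deliberately leaves Replicas nil, mirroring the operator's
// behavior: Kubernetes defaults a nil value to 1 on create.
func buildDesired(vcl string) Deployment {
	return Deployment{Template: vcl, Strategy: "RollingUpdate"}
}

// reconcile writes only Template and Strategy, never Replicas, so the
// replica count on the live object (HPA-set or kubectl-set) is preserved.
func reconcile(live *Deployment, desired Deployment) {
	live.Template = desired.Template
	live.Strategy = desired.Strategy
}

func main() {
	five := int32(5)
	live := Deployment{Replicas: &five, Template: "old-vcl", Strategy: "RollingUpdate"}

	// A VCL change triggers a reconcile...
	reconcile(&live, buildDesired("new-vcl"))

	// ...but the externally owned replica count is untouched.
	fmt.Println(*live.Replicas, live.Template) // 5 new-vcl
}
```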
Sample HPA manifest¶
The Deployment created by the operator has the same name and namespace as its Gateway.
The following HPA targets a Gateway named my-gateway in the default namespace:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-gateway
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 60
```
minReplicas: 2 is a deliberate floor: it lets you combine the HPA with a
PodDisruptionBudget without blocking node drains. A Gateway running at
minReplicas: 1 with a PDB of minAvailable: 1 will stall every voluntary eviction
of its pod, and with it the drain of whatever node hosts it.
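For example, a PDB sized against that two-replica floor permits one eviction at a time while a second pod keeps serving (a sketch only; the name matches the sample Gateway, and the label selector is an assumption you must adjust to the labels the operator actually sets):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-gateway
  namespace: default
spec:
  maxUnavailable: 1        # always permits one eviction, so node drains proceed
  selector:
    matchLabels:
      app: my-gateway      # assumption: replace with the operator's pod labels
```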
Choosing a metric¶
CPU utilization (Resource metric). The simplest option and the only one that works
out of the box. Varnish is largely CPU-bound once the working set fits in cache —
decompression, TLS, VCL, and ghost's route matching all consume CPU — so CPU tracks
load reasonably well for most workloads. It requires metrics-server to be installed
in the cluster and resource requests to be set on the pod (the operator sets a
default CPU request of 100m, see resources-and-scaling.md).
CPU utilization is the ratio of usage to request, so if you raise the CPU request
via GatewayClassParameters.spec.resources, adjust averageUtilization accordingly:
a 70% target means 70m of actual usage against a 100m request, but 1400m against
a 2-CPU request.
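The HPA's scaling formula, as documented by Kubernetes, makes this request-dependence concrete: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), where utilization is average usage divided by the pod's CPU request. A small illustration:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas implements the HPA formula from the Kubernetes docs:
// ceil(currentReplicas * currentMetric / targetMetric), with the metric
// expressed as percent utilization of the CPU request.
func desiredReplicas(current int32, usageMilliCPU, requestMilliCPU, targetPercent float64) int32 {
	utilization := 100 * usageMilliCPU / requestMilliCPU
	return int32(math.Ceil(float64(current) * utilization / targetPercent))
}

func main() {
	// 3 pods averaging 140m of CPU each against a 100m request is 140%
	// utilization, so a 70% target scales 3 -> 6.
	fmt.Println(desiredReplicas(3, 140, 100, 70)) // 6

	// The same absolute usage against a 2-CPU (2000m) request is only 7%
	// utilization, so the HPA would scale down instead.
	fmt.Println(desiredReplicas(3, 140, 2000, 70)) // 1
}
```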
Memory utilization. Generally a poor fit. Varnish's cache fills to its configured storage size and stays there; memory usage is close to constant regardless of request rate. Autoscaling on memory will produce a fleet sized to the cache, not to the load.
Custom metrics. Requests per second (RPS), p95 latency, or backend connection
count track real load better than CPU. These require a custom metrics adapter
(Prometheus Adapter, KEDA, etc.) and are out of scope for this document — point the
adapter at Varnish's stats (varnishstat) or at whatever ingress metric your
monitoring stack exposes. See observability.md for available
signals.
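As a hedged sketch of the custom-metrics route, a KEDA ScaledObject scaling the same Deployment on an RPS query might look like the following. The Prometheus address, metric name, and query are placeholders for whatever your monitoring stack exposes, not values this operator provides:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-gateway
  namespace: default
spec:
  scaleTargetRef:
    name: my-gateway                  # the Gateway's Deployment
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090      # placeholder
        query: sum(rate(varnish_main_client_req[2m]))         # placeholder metric
        threshold: "500"              # target RPS per replica
```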
Interactions¶
PodDisruptionBudget. See pod-disruption-budgets.md.
A percentage-based PDB (maxUnavailable: 25%) tracks HPA-driven scale automatically;
an absolute minAvailable does not.
Rolling restarts. Infrastructure changes (listener edits, varnishdExtraArgs
changes, image bumps) trigger a rolling restart via the varnish.io/infra-hash
annotation. The restart replaces pods one-by-one against the current HPA-chosen
replica count — the HPA is not consulted mid-restart, and replicas are not reset to
a default.
Cache warmth on scale-up¶
Newly added pods start with an empty cache and will
generate a burst of backend traffic as they fill. Size your scaleUp
stabilizationWindowSeconds and your backend capacity accordingly. The Gateway
Service load-balances uniformly across ready pods, so a cold pod is just as likely
to receive any given request as a warm one.
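Because the Service spreads requests uniformly, the share of traffic landing on cold caches right after a scale-up is simply newPods / totalPods. A back-of-the-envelope calculation with illustrative numbers:

```go
package main

import "fmt"

// coldFraction returns the share of requests that hit an empty cache
// immediately after a scale-up, assuming uniform load-balancing across
// all ready pods.
func coldFraction(warm, added int32) float64 {
	return float64(added) / float64(warm+added)
}

func main() {
	// Scaling 4 warm pods up to 6: a third of all requests initially miss
	// cache and fall through to the backends until the new pods fill.
	fmt.Printf("%.0f%%\n", 100*coldFraction(4, 2)) // 33%
}
```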