posted 2021-08-07 by Thomas Kooi
One of the many useful features of Kubernetes is the concept of horizontally autoscaling your deployments. In this post, we will take a closer look at how to configure this, and at some things to watch out for. This post about auto scaling has been split into two parts. The first is about auto scaling using CPU and memory metrics. Part 2 focuses on auto scaling with ingress-nginx and linkerd.
Metrics API support within your cluster is necessary. Most managed Kubernetes vendors support this out of the box. The most common implementation is the metrics-server. To verify that your cluster supports this, run `kubectl top node`:

```
$ kubectl top node
NAME                                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-0-7.eu-west-1.compute.internal    100m         5%     1781Mi          51%
ip-10-0-0-92.eu-west-1.compute.internal   162m         8%     1927Mi          56%
```
There are also options for more advanced set-ups, that support custom metrics. We will come back to this in a future post.
The HPA is supported for Deployments and StatefulSets. We will first try it out with a Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: default
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: nginx
          resources:
            limits:
              memory: "128Mi"
              cpu: "100m"
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: default
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 80
```
We apply `example.yaml` to the cluster:
```
$ kubectl apply -f example.yaml
deployment.apps/myapp created
service/myapp created

$ kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
myapp-58fd9b8cb7-4pf4d   1/1     Running   0          4s
```
Next, we deploy a load generator. For this, we make use of the slow_cooker project from Buoyant, the people behind linkerd.
```
kubectl run load-generator --image=buoyantio/slow_cooker -- -qps 100 -concurrency 10 http://myapp
```
This will generate 100 rps of traffic against the deployed nginx. You can follow the logs of the `load-generator` pod to view various metrics, such as latency.
If you wait a minute or so and run `kubectl top pod`, you will notice that the CPU usage of the nginx pod has risen.

```
$ kubectl top pod
NAME                    CPU(cores)   MEMORY(bytes)
load-generator          128m         5Mi
myapp-5664749b7-bblqk   79m          2Mi
```
We will now start configuring a HPA policy:
```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```
In `scaleTargetRef`, we reference the deployment we created earlier.
We configure a min and max count for the replicas. This is the lower and upper boundary between which the HPA controller will size the deployment.
Finally, we configure the `metrics` resource with a rule targeting average CPU utilization. We start auto scaling once we run above 60% utilization. Note that utilization is measured as a percentage of the pod's CPU *request*; since our deployment only sets a limit of 100m, the request defaults to that same value, meaning anything above 60m of average usage will scale the deployment.
Note that this is the average utilization across all replica pods. If you have one pod running at 90m and another at 10m, the average is 50m and the autoscaler will not trigger, while pod A at 90m and pod B at 55m average out to 72.5m, which will.
Also, there is a tolerance for this. By default, anything within 10% of the target utilization will not trigger an autoscale (either down or up). You can read more about the algorithm behind the HPA in the Kubernetes HPA docs.
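The core scaling rule is simple enough to sketch. The following is a simplified illustration of `desired = ceil(current * currentMetric / targetMetric)` with the tolerance band, not the controller's actual code (it ignores readiness, missing metrics, and stabilization windows):

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, tolerance: float = 0.1) -> int:
    """Simplified sketch of the HPA scaling rule."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band (10% by default), leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# One pod at 90% against a 60% target: ratio 1.5, so scale to 2 replicas.
print(desired_replicas(1, 90, 60))   # 2
# Two pods averaging 65% against a 60% target: within tolerance, stay at 2.
print(desired_replicas(2, 65, 60))   # 2
```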
Upon applying the HPA policy, it will take a minute before it starts to take effect.
```
$ kubectl get hpa
NAME    REFERENCE          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
myapp   Deployment/myapp   75%/60%   1         10        1          51s
```
Shortly after, a new pod will be deployed.
```
$ kubectl get hpa
NAME    REFERENCE          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
myapp   Deployment/myapp   40%/70%   1         10        2          8m42s

$ kubectl describe hpa myapp
Name:                                                  myapp
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Sat, 07 Aug 2021 12:30:10 +0200
Reference:                                             Deployment/myapp
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  40% (40m) / 70%
Min replicas:                                          1
Max replicas:                                          10
Deployment pods:                                       2 current / 2 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Type    Reason             Age    From                       Message
  ----    ------             ----   ----                       -------
  Normal  SuccessfulRescale  5m26s  horizontal-pod-autoscaler  New size: 2; reason: cpu resource utilization (percentage of request) above target
```
You will notice that the CPU utilization of both pods is now below the target utilization:
```
NAME                    CPU(cores)   MEMORY(bytes)
load-generator          139m         5Mi
myapp-5664749b7-bblqk   41m          2Mi
myapp-5664749b7-lrj8g   54m          2Mi
```
You can also configure auto scaling to work based on memory:
```yaml
metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 50
```
Note that since `metrics` is an array, you can use multiple rules:
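For example, combining the memory rule above with the earlier CPU rule, the `metrics` section would look like this (this combined set-up is what the `kubectl get hpa` output below reflects):

```yaml
metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```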
```
$ kubectl get hpa
NAME    REFERENCE          TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
myapp   Deployment/myapp   2%/50%, 48%/60%   1         10        2          19m
```
There are a few things you will want to take into consideration when making use of auto scaling.
Your cluster needs either enough spare capacity to handle the increase in workloads, or it needs to support node auto scaling. Be aware that with node auto scaling, it will take a while before new nodes are ready: starting new pods is much quicker than provisioning new nodes within a cluster. To deal with this, you need to have some capacity available to handle an initial pod auto scaling burst, while you (or your cloud provider / service provider) provision new machines to join the cluster as nodes.
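One common way to keep that headroom available is to run low-priority placeholder pods that the scheduler evicts as soon as real workloads need the room. A sketch of that pattern (the class name, replica count, and resource sizes here are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Placeholder pods, evicted whenever regular workloads need capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-placeholder
spec:
  replicas: 2
  selector:
    matchLabels:
      app: capacity-placeholder
  template:
    metadata:
      labels:
        app: capacity-placeholder
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          # The pause container does nothing; it only reserves the requested resources.
          image: k8s.gcr.io/pause
          resources:
            requests:
              cpu: "500m"
              memory: "256Mi"
```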
Configure reliable deployments
When using auto scaling, you will want to make sure a few properties are configured on your deployment, in order to avoid (connection) errors during scaling operations. These are mostly the same properties you will want to configure to support zero-downtime deployments.
Configure probes: a `startupProbe` gives your container time to boot before the other probes kick in, and a `readinessProbe` makes sure a pod is only added to the service endpoints once it is actually able to serve traffic.

```yaml
startupProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
```
Your application needs to be able to perform a graceful shutdown, and should wait a few seconds before exiting, so that traffic has stopped being sent to the terminating pod first.
Since Kubernetes is a distributed system, it takes a little while before all components know that a pod on another node is shutting down. Requests may be in flight just as the pod enters the terminating state. In order to handle this gracefully, your application either needs to keep serving traffic at this stage, or you could consider using a `preStop` lifecycle hook if your application shuts down very quickly.
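For an application that exits immediately on SIGTERM, a `preStop` hook that simply sleeps for a few seconds keeps the container alive while the endpoints are being updated. A minimal sketch (the 10 seconds is an arbitrary example; it must stay below `terminationGracePeriodSeconds`, which defaults to 30):

```yaml
containers:
  - name: myapp
    image: nginx
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "10"]
```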
A good thing to configure for upstream components is a retry mechanism for failed connections. Retrying is not feasible for every type of request or transaction, though, and it also slightly increases latency for those requests.
Don’t configure replicas
Do not configure the `replicas` field in your deployment when making use of the HPA. This will conflict: every time you apply the deployment, the number of pods will be scaled down/up to the value in your `Deployment`, after which the HPA controller will adjust it again. This results in a lot of unnecessary pod terminations and creations. Use the `HorizontalPodAutoscaler` resource to configure the minimum desired number of replicas instead, via `minReplicas`:
```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  # This will make sure there are always at least 3 pods in the myapp deployment
  minReplicas: 3
  maxReplicas: 10
```
Horizontal pod auto scaling also supports custom metrics. A common implementation for this is the Prometheus metrics adapter. With it, you can configure your HPA to scale up/down based on requests per second, latency, or any other custom metric, provided the metric is available.
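Assuming the adapter exposes a per-pod metric such as `http_requests_per_second` (the metric name and target value here are illustrative), a rule for it could look like this:

```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
```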