posted 2021-08-07 by Thomas Kooi
One of the many useful features of Kubernetes is horizontal autoscaling of your deployments. In this post, we will take a closer look at how to configure it, and at some things to watch out for. This post about auto scaling has been split into 2 parts. The first is about auto scaling using CPU and memory metrics. Part 2 focuses on auto scaling with ingress-nginx and linkerd.
Metrics API support within your cluster is necessary. Most managed Kubernetes vendors support this out of the box. The most common implementation is the metrics-server. To validate that your cluster supports this, run kubectl top node.
Example:
$ kubectl top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-0-0-7.eu-west-1.compute.internal 100m 5% 1781Mi 51%
ip-10-0-0-92.eu-west-1.compute.internal 162m 8% 1927Mi 56%
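If kubectl top node returns an error instead, the metrics-server is likely not installed. On clusters where it is missing, it can typically be installed from the upstream manifest (check the metrics-server releases page for a version matching your cluster):

```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```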
There are also options for more advanced set-ups that support custom metrics. We will come back to this in a future post.
HPA is supported for StatefulSets and Deployments. We will first try it out with a Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: default
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: nginx
          resources:
            limits:
              memory: "128Mi"
              cpu: "100m"
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: default
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 80
We apply example.yaml to the cluster:
$ kubectl apply -f example.yaml
deployment.apps/myapp created
service/myapp created
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
myapp-58fd9b8cb7-4pf4d 1/1 Running 0 4s
Next, we deploy a load generator. For this, we make use of the slow_cooker project from Buoyant, the people behind linkerd.
kubectl run load-generator --image=buoyantio/slow_cooker -- -qps 100 -concurrency 10 http://myapp
This will generate 100 rps of traffic against the deployed nginx. You can follow the logs of the load-generator pod to view various metrics, such as latency.
If you wait a minute or so and run kubectl top pod, you will notice that the CPU usage of the nginx pod has risen.
$ kubectl top pod
NAME CPU(cores) MEMORY(bytes)
load-generator 128m 5Mi
myapp-5664749b7-bblqk 79m 2Mi
We will now configure an HPA policy:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
Within the scaleTargetRef, we reference the deployment we created earlier.
We configure a min and max count for the replicas. These are the lower and upper boundaries within which the HPA controller will size the deployment.
Finally, we configure the metrics resource with a rule targeting average CPU utilization. We start auto scaling once we run above 60% utilization of the CPU request. Our deployment only sets a 100m limit; since no request is set explicitly, the request defaults to the limit, so anything above 60m will scale the deployment.
Note that this is the average utilization across all replica pods: if you have one pod running at 90m and another at 10m, the auto scaler will not trigger, while pod a at 90m and pod b at 55m will.
Also, there is a tolerance: by default, anything within 10% of the target utilization will not trigger a scaling action (either down or up).
You can read more about the algorithm behind HPA in the Kubernetes HPA docs.
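As a rough sketch (not the controller's exact implementation, which also accounts for pod readiness and stabilization windows), the core calculation looks like this:

```python
import math

def desired_replicas(current_replicas, current_value, target_value, tolerance=0.10):
    """Simplified version of the HPA scaling rule: scale proportionally to
    how far the current metric is from the target, skipping any change that
    falls within the default 10% tolerance."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# One pod at 90m against a 60m target: ratio 1.5, so scale up to 2 replicas.
print(desired_replicas(1, 90, 60))  # 2
# Two pods averaging 55m against 60m: within the tolerance, so stay at 2.
print(desired_replicas(2, 55, 60))  # 2
```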
Upon applying the HPA policy, it will take a minute before it starts to have effect.
$ kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
myapp Deployment/myapp 75%/60% 1 10 1 51s
Shortly after, a new pod will be deployed.
$ kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
myapp Deployment/myapp 40%/70% 1 10 2 8m42s
$ kubectl describe hpa myapp
Name: myapp
Namespace: default
Labels: <none>
Annotations: <none>
CreationTimestamp: Sat, 07 Aug 2021 12:30:10 +0200
Reference: Deployment/myapp
Metrics: ( current / target )
resource cpu on pods (as a percentage of request): 40% (40m) / 70%
Min replicas: 1
Max replicas: 10
Deployment pods: 2 current / 2 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale recommended size matches current size
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulRescale 5m26s horizontal-pod-autoscaler New size: 2; reason: cpu resource utilization (percentage of request) above target
You will notice that the CPU utilization of both pods is now below the target utilization:
NAME CPU(cores) MEMORY(bytes)
load-generator 139m 5Mi
myapp-5664749b7-bblqk 41m 2Mi
myapp-5664749b7-lrj8g 54m 2Mi
You can configure auto scaling to work based on memory:
metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 50
Note that since metrics is an array, you can use multiple rules:
$ kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
myapp Deployment/myapp 2%/50%, 48%/60% 1 10 2 19m
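The two targets shown above (2%/50%, 48%/60%) come from an HPA with both a memory and a CPU rule. Combined, the metrics block looks like this; when multiple rules are set, the HPA scales to the highest replica count any single rule proposes:

```yaml
metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```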
A few things you will want to take into consideration when making use of auto scaling:
Cluster capacity
Your cluster needs either enough capacity to handle the increase in workloads, or support for node auto scaling. Be aware that with node auto scaling, it will take a while before new nodes are ready.
Starting up new pods is much quicker than provisioning new nodes within a cluster.
To deal with this, you need to have some capacity available to handle an initial pod auto scaling burst, while you (or your cloud or service provider) provision new machines to join your cluster as nodes.
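One common way to keep that burst capacity available is the "overprovisioning" pattern: run a deployment of low-priority placeholder pods that reserve room on your nodes and get preempted as soon as real workloads need the space. A sketch, where the names and resource sizes are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods that real workloads may preempt."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-headroom
spec:
  replicas: 2
  selector:
    matchLabels:
      app: capacity-headroom
  template:
    metadata:
      labels:
        app: capacity-headroom
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          # The pause image does nothing; its requests simply hold capacity.
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "500m"
              memory: "256Mi"
```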
Configure reliable deployments
When using auto scaling, you will want to make sure you have a few properties configured on your deployment in order to avoid (connection) errors during scaling operations. These are mostly the same properties you would configure to support zero-downtime deployments.
Probes
Configure a readinessProbe and a startupProbe; make sure a pod is fully able to serve traffic before it is marked ready.
For example:
startupProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
Graceful shutdowns
Your application needs to be able to perform a graceful shutdown, waiting a certain number of seconds before exiting. This gives all traffic to the terminating pod time to stop.
Since Kubernetes is a distributed system, it takes a little while before all components know that a pod on another node is shutting down. Requests may be in flight just as the pod enters the terminating state. In order to handle this gracefully, your application either needs to keep handling traffic at this stage, or, if your application shuts down really quickly, you could consider using a preStop lifecycle hook.
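For example, a short sleep in a preStop hook delays the SIGTERM so that the pod can be removed from endpoint lists before shutdown begins. The 5 seconds here is an illustrative value; tune it to your environment:

```yaml
containers:
  - name: myapp
    image: nginx
    lifecycle:
      preStop:
        exec:
          # Delay shutdown so in-flight requests can finish and endpoint
          # lists across the cluster can be updated first.
          command: ["sleep", "5"]
```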
A good thing to configure for upstream components is a retry mechanism for failed connections. Retrying is not feasible for every type of request or transaction, though, and it also slightly increases latency for those requests.
Don’t configure replicas
Do not configure the replicas field in your deployment when making use of HPA. This will conflict: any time you perform a new deployment, the number of pods will scale down/up to the value in your Deployment, after which the HPA controller will reset it again. This results in a lot of unnecessary pod terminations and creations.
Use the HorizontalPodAutoscaler resource to configure the minimum desired number of replicas instead, by setting spec.minReplicas.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  # This will make sure there are always at least 3 pods in the myapp deployment
  minReplicas: 3
  maxReplicas: 10
Horizontal pod auto scaling also supports custom metrics. A common implementation for this is the Prometheus metrics adapter.
With it, you can configure your HPA to scale up/down based on requests per second, latency, or any other custom metric, provided the metric is available.
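As a preview, a custom-metric rule might look like this. The metric name http_requests_per_second is an assumption; it has to match whatever your metrics adapter actually exposes:

```yaml
metrics:
  - type: Pods
    pods:
      metric:
        # Assumed metric name; must match what your metrics adapter serves.
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
```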