Using auto scaling in Kubernetes (Part 1)

Taking a look at autoscaling on Kubernetes in practice

posted 2021-08-07 by Thomas Kooi

Kubernetes Autoscaling hpa

One of the many useful features within Kubernetes is horizontal autoscaling of your deployments. In this post, we will take a closer look at how to configure this, and some things to watch out for. This post about auto scaling has been split into two parts. The first is about auto scaling using CPU and memory metrics. Part 2 focuses on auto scaling with ingress-nginx and linkerd.

Requirements

Metrics API support within your cluster is necessary. Most managed Kubernetes vendors support this out of the box. The most common implementation is the metrics-server. To validate whether your cluster supports this, run kubectl top node.

Example:

$ kubectl top node
NAME                                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-0-7.eu-west-1.compute.internal    100m         5%     1781Mi          51%
ip-10-0-0-92.eu-west-1.compute.internal   162m         8%     1927Mi          56%
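If the command errors because no metrics API is available, you will need to install a metrics provider yourself. One common way is to deploy metrics-server using the manifest from its releases page (check the metrics-server documentation for the recommended approach for your distribution):

$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml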

There are also options for more advanced set-ups that support custom metrics. We will come back to this in a future post.

Example

HPA is supported for StatefulSets and Deployments. We will first try it out with a Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: default
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: nginx
        resources:
          limits:
            memory: "128Mi"
            cpu: "100m"
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: default
spec:
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: 80

Save both manifests as example.yaml and apply them to the cluster:

$ kubectl apply -f example.yaml 
deployment.apps/myapp created
service/myapp created

$ kubectl get pod
NAME                     READY   STATUS        RESTARTS   AGE
myapp-58fd9b8cb7-4pf4d   1/1     Running       0          4s

Next, we deploy a load generator. For this, we make use of the slow_cooker project from Buoyant, the people behind linkerd.

kubectl run load-generator --image=buoyantio/slow_cooker -- -qps 100 -concurrency 10 http://myapp

Since slow_cooker's -qps value applies per connection and we run 10 concurrent connections, this generates roughly 1,000 requests per second against the deployed nginx. You can follow the logs of the load-generator pod to view various metrics, such as latency.
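For example:

$ kubectl logs -f load-generator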

If you wait a minute or so, and run kubectl top pod, you will notice that the CPU usage of the nginx pod has risen.

$ kubectl top pod
NAME                    CPU(cores)   MEMORY(bytes)   
load-generator          128m         5Mi             
myapp-5664749b7-bblqk   79m          2Mi         

We will now configure an HPA policy:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

Within the scaleTargetRef, we reference the deployment we created earlier. We also configure a minimum and maximum replica count; these are the lower and upper boundaries between which the HPA controller will size the deployment.

Finally, we configure the metrics section with a rule targeting average CPU utilization: we start auto scaling once average utilization runs above 60%. Utilization is measured as a percentage of the container's CPU request; since our deployment only specifies limits, Kubernetes sets the requests equal to those limits.

With our deployment, that means a 100m request, so anything above an average of 60m CPU usage will scale the deployment up.
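As an aside, a similar policy can also be created imperatively with kubectl autoscale, which only supports a CPU utilization target:

$ kubectl autoscale deployment myapp --min=1 --max=10 --cpu-percent=60

Writing the manifest yourself is preferable if you keep your configuration in version control.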

When does the HPA scale?

Note that this is the average utilization across all replica pods: if you have one pod running at 90m and another at 10m (an average of 50m, or 50%), the auto scaler will not trigger, while one pod at 90m and another at 55m (an average of roughly 72%) will.

There is also a tolerance: by default, anything within 10% of the target utilization will not trigger a scaling action (either down or up). With a 60% target, that means an average utilization between roughly 54% and 66% leaves the replica count unchanged.

You can read more about the algorithm behind the HPA in the Kubernetes HPA docs.
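In short, the desired replica count is derived from the ratio between the current and target metric value:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

For example, two replicas averaging 90% CPU against a 60% target results in ceil(2 * 90 / 60) = 3 replicas.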

Seeing it in action

Upon applying the HPA policy, it will take a minute or so before it takes effect.
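Assuming the policy above is saved as hpa.yaml (the file name does not matter):

$ kubectl apply -f hpa.yaml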

$ kubectl get hpa
NAME    REFERENCE          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
myapp   Deployment/myapp   75%/60%   1         10        1          51s

Shortly after, a new pod will be deployed.

$ kubectl get hpa
NAME    REFERENCE          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
myapp   Deployment/myapp   40%/70%   1         10        2          8m42s

$ kubectl describe hpa myapp
Name:                                                  myapp
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Sat, 07 Aug 2021 12:30:10 +0200
Reference:                                             Deployment/myapp
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  40% (40m) / 70%
Min replicas:                                          1
Max replicas:                                          10
Deployment pods:                                       2 current / 2 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Type    Reason             Age    From                       Message
  ----    ------             ----   ----                       -------
  Normal  SuccessfulRescale  5m26s  horizontal-pod-autoscaler  New size: 2; reason: cpu resource utilization (percentage of request) above target

You will notice that the CPU utilization of both pods is now below the target utilization:

NAME                    CPU(cores)   MEMORY(bytes)   
load-generator          139m         5Mi             
myapp-5664749b7-bblqk   41m          2Mi             
myapp-5664749b7-lrj8g   54m          2Mi  

You can also configure auto scaling based on memory usage:

  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 50

Note that since metrics is an array, you can combine multiple rules in a single HPA. When more than one metric is configured, the controller calculates a desired replica count for each metric and scales to the highest of those.
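For example, combining the memory rule with the CPU rule from earlier:

  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

With both rules in place, kubectl get hpa shows both targets: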

$ kubectl get hpa
NAME    REFERENCE          TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
myapp   Deployment/myapp   2%/50%, 48%/60%   1         10        2          19m

What to watch out for

A few things you will want to take into consideration when making use of auto scaling:

Cluster capacity

Your cluster either needs enough capacity to handle the increase in workloads, or needs to support node auto scaling. Be aware that with node auto scaling, it will take a while before new nodes are ready.

Starting new pods is much quicker than provisioning new nodes within a cluster.

To deal with this, you need to have some spare capacity available to absorb the initial burst of pod auto scaling, while you (or your cloud provider / service provider) provision new machines to join the cluster as nodes.
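One common pattern for reserving such headroom is to run a low-priority placeholder deployment; when real workloads need the room, the scheduler preempts the placeholder pods and the node autoscaler adds capacity to reschedule them. A minimal sketch (the names, sizes and pause image tag below are just an example):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods that reserve spare cluster capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-reservation
spec:
  replicas: 2
  selector:
    matchLabels:
      app: capacity-reservation
  template:
    metadata:
      labels:
        app: capacity-reservation
    spec:
      # pods with this class are preempted first when higher-priority pods need room
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.5
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"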

Configure reliable deployments

When using auto scaling, you will want to make sure a few properties are configured on your deployment in order to avoid (connection) errors during scaling operations. These are mostly the same settings you would configure to support zero-downtime deployments.

Probes

Configure a readinessProbe and a startupProbe; make sure a pod is fully able to serve traffic before it is marked ready. A livenessProbe, shown below as well, helps restart containers that stop responding.

For example:

startupProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
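Note that port: http refers to a named container port. For that reference to resolve, the port in the container spec needs a name, for example:

        ports:
        - name: http
          containerPort: 80

Alternatively, you can reference the port number directly (port: 80).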

Graceful shutdowns

Your application needs to be able to perform a graceful shutdown and wait a certain number of seconds before exiting, so that all traffic to the terminating pod has stopped before it goes away.

Since Kubernetes is a distributed system, it takes a little while before all components know that a pod on another node is shutting down. Requests may still be in flight just as the pod enters the terminating state. To handle this gracefully, your application either needs to keep serving traffic during this stage, or, if your application shuts down very quickly, you can consider using a preStop lifecycle hook to delay the shutdown.
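As a minimal sketch of that second option (the sleep duration is an example and should be tuned to how long it takes for traffic to drain):

    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: myapp
        image: nginx
        lifecycle:
          preStop:
            exec:
              # keep the container alive briefly before SIGTERM is sent
              command: ["/bin/sh", "-c", "sleep 5"]

The hook keeps the container running for a few seconds after the pod is marked as terminating, giving endpoints and load balancers time to stop routing traffic to it. Make sure terminationGracePeriodSeconds covers both the hook and your application's own shutdown time.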

It is also a good idea to configure a retry mechanism for failed connections in upstream components. Retrying is not always feasible for every type of request or transaction, though, and it slightly increases latency for the retried requests.

Don’t configure replicas

Do not configure the replicas field in your deployment when making use of HPA. The two will conflict: any time you apply the Deployment, the number of pods will scale down or up to the value of replicas, after which the HPA controller will adjust it again. This results in a lot of unnecessary pod terminations and creations.

Use the HorizontalPodAutoscaler resource to configure the minimum desired number of replicas instead, by setting spec.minReplicas.

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  # This will make sure there are always at least 3 pods in the myapp deployment
  minReplicas: 3
  maxReplicas: 10

Custom metrics

Horizontal pod auto scaling also supports custom metrics. A common implementation for this is the Prometheus metrics adapter.

With this, you can configure your HPA to scale up or down based on requests per second, latency, or any other custom metric, provided it is available through the metrics API.
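As a preview, a per-pod custom metric rule could look roughly like this; the metric name http_requests_per_second is hypothetical and needs to actually be exposed through the custom metrics API by your adapter:

  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second  # hypothetical metric, must be exposed by your metrics adapter
      target:
        type: AverageValue
        averageValue: "100"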