Max Lobur on Engineering, December 4, 2018

Kubernetes Multi-AZ Deployments Using Pod Anti-Affinity

Very Good Security (VGS) uses Kubernetes, hosted on AWS, to speed up application delivery and optimize hosting costs. A common challenge is ensuring that replicas are evenly distributed across availability zones (AZs), which makes applications resilient and highly available (HA).

By default, the Kubernetes scheduler uses a bin-packing algorithm to fit as many pods as possible into a cluster. It prioritizes an even overall load across nodes over a precise spread of an application's replicas. As a result, running multiple replicas does not, by default, guarantee a multi-AZ deployment.

For production services, we use explicit pod anti-affinity to ensure replicas are distributed between AZs.

In this article, we use an AWS-based cluster with 3 nodes, one in each of 3 availability zones of the us-west-2 region.

Example of hard AZ-based anti-affinity

spec:
  replicas: 4
  selector:
    ...
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: redash
        tier: backend
      name: redash
    spec:
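      affinity:
        podAntiAffinity:
          # Hard rule: a pod is scheduled only if the rule can be satisfied.
          requiredDuringSchedulingIgnoredDuringExecution:
          # Do not co-locate with pods labeled app=redash...
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - redash
            # ...within the same availability zone.
            topologyKey: failure-domain.beta.kubernetes.io/zone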
      containers:
      - args:
        - server
...

Results in


redash-1842984337-92cft           4/4       Running   2          2m        100.92.22.65    ip-10-20-35-160.us-west-2.compute.internal   app=redash,pod-template-hash=1842984337,tier=backend
redash-1842984337-l3ljk           4/4       Running   2          2m        100.70.66.129   ip-10-20-80-215.us-west-2.compute.internal   app=redash,pod-template-hash=1842984337,tier=backend
redash-1842984337-qlrkj           4/4       Running   0          19m       100.88.238.7    ip-10-20-115-58.us-west-2.compute.internal   app=redash,pod-template-hash=1842984337,tier=backend
redash-1842984337-s93ls           0/4       Pending   0          16s       <none>                                                       app=redash,pod-template-hash=1842984337,tier=backend

Here we run 4 replicas on a 3-node cluster. The 4th pod cannot be scheduled because this is hard anti-affinity (requiredDuringSchedulingIgnoredDuringExecution) and we have only 3 AZs, each of which already hosts a redash pod.
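The manifest above uses the failure-domain.beta.kubernetes.io/zone node label, which was current when this article was written. Kubernetes has since deprecated it (as of v1.17) in favor of topology.kubernetes.io/zone. A minimal sketch of the same hard rule, assuming a cluster whose nodes carry the newer label:

```yaml
# Sketch only: assumes nodes are labeled with topology.kubernetes.io/zone.
# This fragment belongs under the pod template's spec, like the example above.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - redash
      # Newer zone label; replaces failure-domain.beta.kubernetes.io/zone.
      topologyKey: topology.kubernetes.io/zone
```

The behavior is the same: with 3 zones, a 4th replica stays Pending until a zone without a redash pod becomes available.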

This policy is less useful in real deployments because users usually prefer capacity and uptime over strict HA. In other words, if one of the nodes goes down, it is better to schedule its replica on the remaining nodes (temporarily ignoring anti-affinity) than to have partial downtime. To achieve this behavior, we use soft AZ anti-affinity.

Example of soft AZ-based anti-affinity

spec:
  replicas: 3
  selector:
    ...
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: redash
        tier: backend
      name: redash
    spec:
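      affinity:
        podAntiAffinity:
          # Soft rule: the scheduler tries to satisfy it but schedules the pod either way.
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              # Prefer not to co-locate with pods labeled app=redash...
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - redash
              # ...within the same availability zone.
              topologyKey: failure-domain.beta.kubernetes.io/zone
            # Weight relative to other scoring factors (allowed range: 1-100).
            weight: 100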
      containers:
      - args:
        - server
...

Results in


redash-3474901287-4k90r           4/4       Running   0          2m        100.92.22.66    ip-10-20-35-160.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-nhb9f           4/4       Running   0          8s        100.70.66.130   ip-10-20-80-215.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-qxq9f           4/4       Running   0          8s        100.88.238.9    ip-10-20-115-58.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend

This time we provisioned 3 replicas with soft AZ anti-affinity (preferredDuringSchedulingIgnoredDuringExecution), and although the rule is soft, the pods are still spread across AZs. This is partly due to the weight of 100, which makes anti-affinity more important to the scheduler than the node-load policy, i.e. k8s will prefer AZ spreading over loading the nodes equally and over other factors.
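To illustrate how weights express relative priority, here is a sketch that combines the zone term with a second, lower-weighted term that discourages co-location on the same node. The second term and its weight of 50 are assumptions for illustration, not part of the deployment shown in this article. The weight field accepts values from 1 to 100, and the scheduler combines the weights of all matching terms when it scores a node.

```yaml
# Sketch only: the second term and its weight of 50 are illustrative assumptions.
# This fragment belongs under the pod template's spec.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    # Strongest preference: avoid zones that already run an app=redash pod.
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - redash
        topologyKey: failure-domain.beta.kubernetes.io/zone
    # Weaker preference: avoid nodes that already run an app=redash pod.
    - weight: 50
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - redash
        topologyKey: kubernetes.io/hostname
```

On the 3-node, 3-AZ cluster used here, both terms point to the same placement, so the second term would only matter on clusters with several nodes per zone.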

Then we scale the deployment to 4 replicas, which results in:


redash-3474901287-16ksc           4/4       Running   0          7s        100.70.66.131   ip-10-20-80-215.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-4k90r           4/4       Running   0          6m        100.92.22.66    ip-10-20-35-160.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-nhb9f           4/4       Running   0          3m        100.70.66.130   ip-10-20-80-215.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-qxq9f           4/4       Running   0          3m        100.88.238.9    ip-10-20-115-58.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend 

We now have 4 replicas but only 3 AZs, so one pod is co-located with another in the same AZ (two pods run on ip-10-20-80-215). Scheduling did not fail because this is a soft policy, not a hard one.

Suppose you change replicas to 9:


redash-3474901287-4k90r           4/4       Running   0          10m       100.92.22.66    ip-10-20-35-160.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-bb71n           4/4       Running   0          38s       100.70.66.133   ip-10-20-80-215.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-bncsm           4/4       Running   0          14s       100.92.22.71    ip-10-20-35-160.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-d4tmz           4/4       Running   0          14s       100.92.22.70    ip-10-20-35-160.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-gd51w           4/4       Running   0          14s       100.88.238.12   ip-10-20-115-58.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-rtzss           4/4       Running   0          14s       100.92.22.72    ip-10-20-35-160.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-vjlnr           4/4       Running   0          14s       100.70.66.134   ip-10-20-80-215.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-w1k5f           4/4       Running   0          38s       100.88.238.11   ip-10-20-115-58.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-xp98s           4/4       Running   0          14s       100.70.66.135   ip-10-20-80-215.us-west-2.compute.internal   app=redash,pod-template-hash=3474901287,tier=backend

In this case pods are not spread equally:

  • The 10-20-35-160 node is running 4 replicas
  • The 10-20-115-58 node is running 2 replicas
  • The 10-20-80-215 node is running 3 replicas

This happens because once each AZ has one pod with app=redash, the soft anti-affinity no longer has any power. For the scheduler, the rule is equally impossible to satisfy on every node, so it is guided by its other policies, e.g. an equal load split between the nodes.
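Kubernetes releases newer than the one used in this article offer pod topology spread constraints (stable since v1.19), which target this case directly by limiting how far the per-zone pod counts may differ. The fragment below is a hedged sketch, not something tested on the cluster described here; it assumes a cluster version that supports the feature and nodes labeled with topology.kubernetes.io/zone.

```yaml
# Sketch only: requires a Kubernetes version that supports topologySpreadConstraints.
# This fragment is the pod template's spec (spec.template.spec in a Deployment).
spec:
  topologySpreadConstraints:
  # Keep the per-zone counts of app=redash pods within 1 of each other.
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    # ScheduleAnyway keeps the rule soft; DoNotSchedule would make it hard.
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: redash
```

With ScheduleAnyway the constraint remains a preference, so, as with soft anti-affinity, pods are still scheduled when an even spread is not possible.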

Soft anti-affinity is obeyed during down-scaling as well. For example, when you scale down the replicas from 9 to 3, Kubernetes selects the pods to terminate in such a way that the app ends up with 1 pod per AZ.

Conclusion:

  • Explicit anti-affinity policy and multiple replicas are required for production deployments.
  • Soft anti-affinity is preferred over hard, unless the specifics of your project dictate otherwise.
  • When soft anti-affinity is used, the number of replicas can exceed the number of AZs if your deployment workload requires it. In this case, an even AZ distribution is not guaranteed for the replicas beyond the number of AZs.

Further reading

For more information on the Kubernetes Scheduling Policy, including the anti-affinity policies:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
