Precautions for Pod Termination with Container-Native Load Balancing on GKE

Maciej
Sep 27, 2021

Introduction

GKE recommends container-native load balancing. It lets the GCP load balancer route directly to a pod's IP address using alias IPs and network endpoint groups (NEGs). However, if the pods are not configured properly, downtime will occur when they are evicted from a node, for example during cluster maintenance. In this article, I will explain how container-native load balancing works and how to configure pods properly.

Container Native load balancing mechanism

As described in Container Native Load Balancing, the GKE control plane runs a custom controller called the NEG controller. When a Service with a specific annotation is registered, the controller creates a NEG resource in GCP and attaches the pods associated with that Service to the NEG. As the name "zonal network endpoint group" suggests, one NEG is created per zone, and each pod belongs to the NEG of the zone it runs in.
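As a rough illustration of the "specific annotation" mentioned above, the sketch below shows a Service carrying the cloud.google.com/neg annotation that GKE uses for container-native load balancing through Ingress. The Service name, selector label, and ports are hypothetical placeholders, not values from this article.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-service          # hypothetical name
  annotations:
    # Asks the NEG controller to create NEGs for this Service when it is used by an Ingress
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: ClusterIP
  selector:
    app: example-app             # hypothetical pod label
  ports:
    - port: 80
      targetPort: 8080           # hypothetical container port
```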

Source: https://cloud.google.com/

Precautions when a pod is evicted

GKE upgrades clusters automatically, which means the nodes go through a rolling update. Any pod scheduled on a node being updated is evicted and recreated on another node, so if you do not pay attention to the pod lifecycle, downtime will occur.

Lifecycle when a pod is evicted

When a pod is evicted, the flow until it is removed from the NEG and no longer receives traffic is as follows.

  1. The pod goes into the Terminating state
  2. The following two things run at the same time:
     a. The pod is removed from the Service endpoints
     b. The pod's preStop hook runs, followed by SIGTERM handling
  3. The NEG controller detects that the pod has been removed from the Service and removes it from the NEG
  4. GCLB stops routing to the evicted pod

In other words, there are two points to be aware of here.

  • Prevent the pod from stopping before it has left the NEG
  • Even after the pod has left the NEG, let requests that are already in flight finish processing

Concrete measures

For the first point, implement a preStop hook in the pod.
This is the same as when routing through a regular Service.

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 20"]
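For context, here is a minimal sketch of where that hook sits in a Deployment. The time spent in preStop counts against the pod's termination grace period (30 seconds by default), so the sketch sets terminationGracePeriodSeconds explicitly to leave room for both the sleep and the application's own shutdown. The names, image, port, and the value of 60 are assumptions for illustration, and the container image is assumed to include /bin/sh.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                      # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      # Must cover the preStop sleep plus the app's SIGTERM shutdown time (assumed value)
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: example.com/example-app:1.0   # hypothetical image
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                # Keep serving while the pod is being removed from the NEG
                command: ["/bin/sh", "-c", "sleep 20"]
```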

For the second point, set connection draining in the backend settings of GCLB. Also, be aware that the preStop sleep from the first point must be longer than the connection draining timeout from the second point; otherwise, in the worst case, the pod will die before draining finishes.

If you are using Ingress, you can also configure it through the BackendConfig CRD.
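A minimal sketch of that CRD-based setup, assuming GKE Ingress with a BackendConfig resource: the BackendConfig sets the connection draining timeout, and the Service points to it through the cloud.google.com/backend-config annotation. The resource names, ports, and the 15-second timeout are hypothetical; the timeout is chosen only so that it stays shorter than the 20-second preStop sleep used earlier.

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: example-backendconfig            # hypothetical name
spec:
  connectionDraining:
    # Shorter than the preStop sleep so the pod outlives the drain (assumed value)
    drainingTimeoutSec: 15
---
apiVersion: v1
kind: Service
metadata:
  name: example-service                  # hypothetical name
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    # Attaches the BackendConfig above to every port of this Service
    cloud.google.com/backend-config: '{"default": "example-backendconfig"}'
spec:
  type: ClusterIP
  selector:
    app: example-app                     # hypothetical pod label
  ports:
    - port: 80
      targetPort: 8080
```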

How to verify

To verify, apply a constant load with Locust or Apache Bench and evict the pods from a node with the drain command. You should see the pods move to another node without any downtime.

kubectl drain gke-service-1-service-1-nodes-375a6482-7p7p --ignore-daemonsets --delete-local-data

Conclusion

Container-native load balancing is a very powerful mechanism, but you still need to follow the basics of Service routing and load balancer configuration and set them up carefully.
