August 24, 2021 | Joe Kuo

Prevent Kubernetes IP Exhaustion with Shoreline’s Argo Op Pack


Many infrastructure groups deploy Argo for workflow orchestration on Kubernetes. While Argo makes declaratively managing workflows easy, it can leave behind many stale pods after workflow execution. In Kubernetes, each of these dangling pods consumes one IP address, whether it is running or not. If you’re not careful, your node can run out of allocatable IP addresses, leading to out-of-capacity issues even with inactive pods!  

Shoreline’s Argo Op Pack is purpose-built to automatically remediate IP exhaustion related to Argo workflows. It continuously monitors the local node, comparing the number of allocated IPs against a configurable maximum threshold. If the number of assigned IPs exceeds that threshold, Shoreline automatically cleans up old Argo garbage pods.
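To make the comparison concrete, here is a minimal Python sketch of that kind of check. It is only an illustration: the function names and parameters (ips_over_threshold, cleanup_needed, allocated_ips, ip_threshold) are hypothetical, and the Op Pack’s real Alarm is defined in Shoreline’s own configuration rather than in code like this.

```python
def ips_over_threshold(allocated_ips: int, ip_threshold: int) -> int:
    """Return how many IPs the node is above the threshold (0 if at or below it).

    Both parameters are hypothetical names used for illustration only.
    """
    if allocated_ips < 0 or ip_threshold < 0:
        raise ValueError("IP counts must be non-negative")
    return max(0, allocated_ips - ip_threshold)


def cleanup_needed(allocated_ips: int, ip_threshold: int) -> bool:
    """The alarm condition: clean up only when usage exceeds the threshold."""
    return ips_over_threshold(allocated_ips, ip_threshold) > 0
```

The excess count gives a cleaner a natural lower bound on how many stale pods it would need to remove to bring the node back under the threshold.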

The Argo Op Pack comes with several additional features, including:

  • Configurable job and workflow state rules
  • Configurable job and workflow age rules
  • Automatic capacity provisioning
  • Plus, extra Argo management functions

The Argo Op Pack dramatically reduces the operational burden of administering Argo, decreasing wasteful overcapacity and lowering operating costs. 

How Argo Generates Garbage Pods 

Each Argo workflow is a collection of one or more jobs. Each job runs as a Kubernetes pod, and every pod consumes an IP address in Kubernetes.

An Argo workflow can rapidly deplete the available IPs on an EC2 node during complex, multi-step jobs such as machine learning or data processing. Many AWS instance types run out of IPs at just 8 to 16 pods. While your subnet may have thousands of free IP addresses, the instance can’t allocate any more of them, so IP exhaustion kicks in much earlier than anticipated.

Since every Argo pod claims an IP address, each one must eventually be deleted. When IPs are exhausted on a node, Kubernetes cannot schedule new pods there, even if the node still has free CPU and memory. In an autoscaled environment, this means that Argo IP exhaustion can trigger the provisioning of new capacity prematurely. Most customers overcome this hurdle either by over-provisioning hardware, which leads to higher costs, or by implementing custom logic for cluster auto-scaling and workflow clean-up. The Shoreline Argo Op Pack eliminates the need for this extra development.

Shoreline’s Argo Op Pack Automatically Cleans Up 

An Op Pack consists of a pre-packaged set of Alarms, Actions, and Bots distributed as a Terraform module. Shoreline’s Argo Op Pack contains the Alarms, Actions, and Bots necessary to continuously detect potential IP exhaustion caused by Argo, and kick off the appropriate remediation tasks to clean up old Argo pods. 

Argo Control Loop

The Argo Op Pack contains Alarms for both static and dynamic IP exhaustion thresholds. It also has configurable cleaner Actions that remove old Argo pods. Finally, it contains a set of informational Actions to help debug Argo workflows.

The Argo Op Pack handles the corner cases that arise in the delicate process of deleting jobs. We only want to delete jobs that have completed or failed. When workflows fail, operators may need to diagnose the failure. Therefore, we should take care not to delete every failed job right away, since that would remove the evidence needed for root cause analysis.
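One way to picture these state and age rules is the short Python sketch below. It is an assumption-laden illustration, not the Op Pack’s implementation: the ArgoPod class, the is_deletable function, and the retention periods are all hypothetical, and the article does not state the Op Pack’s actual defaults. The phase names are the standard Kubernetes pod phases.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class ArgoPod:
    """Hypothetical, simplified view of an Argo job pod."""
    name: str
    phase: str                       # Kubernetes pod phase, e.g. "Running", "Succeeded", "Failed"
    finished_at: Optional[datetime]  # None while the pod has not finished


# Assumed retention periods, for illustration only.
RETENTION: Dict[str, timedelta] = {
    "Succeeded": timedelta(hours=1),
    "Failed": timedelta(days=3),  # keep failed pods longer so they can be diagnosed
}


def is_deletable(pod: ArgoPod, now: datetime,
                 retention: Dict[str, timedelta] = RETENTION) -> bool:
    """Delete only finished pods that have outlived the retention for their state."""
    keep_for = retention.get(pod.phase)
    if keep_for is None or pod.finished_at is None:
        return False  # running, pending, or unknown pods are never deleted
    return now - pod.finished_at >= keep_for
```

Giving failed pods a longer retention than succeeded ones keeps recent failures available for diagnosis while still reclaiming their IPs eventually.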

Shoreline’s Op Pack makes all of this complexity configurable while including sensible defaults. Additionally, in a cluster with heterogeneous instance types, the maximum number of IPs per node cannot be set statically, because the elastic network interface (ENI) limits differ by instance type. For example, an m5.16xlarge instance supports up to 737 IPs, whereas an m4.2xlarge instance supports only 58. To manage this, the Argo Op Pack dynamically detects these limits and adjusts the thresholds while accounting for the configured safety margin.
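The 737 and 58 figures match the formula commonly used for EKS nodes running the AWS VPC CNI with default settings: ENIs × (IPv4 addresses per ENI − 1) + 2. The Python sketch below reproduces both numbers and then applies a safety margin. How the Op Pack actually detects instance limits is not described here, so the lookup table, the function names, and the 10% default margin are assumptions for illustration.

```python
# (max ENIs, IPv4 addresses per ENI), taken from AWS's published per-instance limits.
ENI_LIMITS = {
    "m5.16xlarge": (15, 50),
    "m4.2xlarge": (4, 15),
}


def max_pod_ips(instance_type: str) -> int:
    """Per-node pod ceiling: ENIs * (IPs per ENI - 1) + 2."""
    enis, ips_per_eni = ENI_LIMITS[instance_type]
    # One address on each ENI is its primary address and is not assigned to pods.
    return enis * (ips_per_eni - 1) + 2


def ip_threshold(instance_type: str, safety_margin: float = 0.10) -> int:
    """Alarm threshold after reserving a safety margin (hypothetical default of 10%)."""
    if not 0.0 <= safety_margin < 1.0:
        raise ValueError("safety_margin must be in [0, 1)")
    return int(max_pod_ips(instance_type) * (1.0 - safety_margin))
```

With these assumptions, a 10% margin puts the threshold at 663 IPs on an m5.16xlarge and 52 on an m4.2xlarge, which shows why a single static threshold cannot suit both node types.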

The Op Pack also includes informational Actions which provide the following data:

  • count_ips - count the number of IPs consumed by pods on the node
  • get_running_pods - get running Argo job pods
  • get_completed_pods - get completed Argo job pods
  • get_failed_pods - get failed Argo job pods

All of the above information can be queried using the Shoreline command-line interface (CLI).
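For readers who want a feel for what these four Actions report, the Python sketch below computes similar values from a list of pod objects shaped like the items in `kubectl get pods -o json` output. The functions are stand-ins that borrow the Action names; they are not Shoreline’s implementation or CLI syntax. Identifying Argo pods by the workflows.argoproj.io/workflow label is an assumption about how the filtering could be done.

```python
from typing import Any, Dict, List

Pod = Dict[str, Any]

# Label that Argo Workflows attaches to the pods it creates (filtering on it is an assumption).
ARGO_LABEL = "workflows.argoproj.io/workflow"


def count_ips(pods: List[Pod]) -> int:
    """Count pods on the node that currently hold a pod IP."""
    return sum(1 for p in pods if p.get("status", {}).get("podIP"))


def _argo_pods_in_phase(pods: List[Pod], phase: str) -> List[str]:
    """Names of Argo pods whose Kubernetes phase matches `phase`."""
    return [
        p["metadata"]["name"]
        for p in pods
        if ARGO_LABEL in (p["metadata"].get("labels") or {})
        and p.get("status", {}).get("phase") == phase
    ]


def get_running_pods(pods: List[Pod]) -> List[str]:
    return _argo_pods_in_phase(pods, "Running")


def get_completed_pods(pods: List[Pod]) -> List[str]:
    return _argo_pods_in_phase(pods, "Succeeded")


def get_failed_pods(pods: List[Pod]) -> List[str]:
    return _argo_pods_in_phase(pods, "Failed")
```

Separating the IP count from the per-state listings mirrors the list above: the count drives the alarm, while the state listings help decide which pods are safe to clean up.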

Conclusion

For a demo of Shoreline’s Argo Op Pack or incident automation in general, please reach out to joe@shoreline.io. We would love to schedule a demo session and discuss how to automate away your incidents.
