JobSet: New API for distributed ML and HPC applications on Kubernetes

The new open source JobSet API is designed to provide more flexible and diverse configuration options for large-scale distributed HPC and ML use cases.




Kubernetes and its batch-processing subsystem are generally well suited to distributed machine learning model training and high-performance computing applications. Large compute tasks in particular, such as training large language models (LLMs), which must be spread across many hosts due to the limited memory of individual GPUs and TPUs, benefit from containerized deployment on Kubernetes. However, existing implementations such as the Job API or the Kubeflow training operators still lack some configuration options in practice, for example regarding communication between pods, differing pod templates, and groups of jobs. The new open source JobSet API is intended to provide a unified approach to representing distributed jobs.

Building on the Job API, JobSet models a distributed batch workload as a group of Kubernetes Jobs. This allows developers to assign different pod templates to different groups of pods, such as a leader and workers. To declaratively create identical child Jobs that can be distributed across dedicated accelerator islands (GPUs or TPUs of the same type connected via high-speed interconnects), JobSet uses the ReplicatedJob – a template that includes a desired number of Job replicas. For communication between the pods in the individual islands, JobSet provides a headless service whose configuration and lifecycle it manages automatically.
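As a sketch of what such a manifest might look like: the following illustrative example defines a leader group and a group of identical workers via `replicatedJobs`. The names, container images, and replica counts are placeholders, and the `jobset.x-k8s.io/v1alpha2` API version reflects the alpha release current at the time of writing.

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-jobset            # illustrative name
spec:
  replicatedJobs:
  - name: leader
    replicas: 1                    # a single coordinating Job
    template:                      # an ordinary Job template
      spec:
        template:
          spec:
            containers:
            - name: leader
              image: example.com/trainer:latest   # placeholder image
  - name: workers
    replicas: 3                    # three identical child Jobs
    template:
      spec:
        parallelism: 4             # pods per child Job
        completions: 4
        template:
          spec:
            containers:
            - name: worker
              image: example.com/trainer:latest   # placeholder image
```

Each entry under `replicatedJobs` carries its own pod template, which is what allows leaders and workers to run different configurations within one workload.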

Concept of the new open source JobSet API for Kubernetes. (Image: kubernetes.io)

JobSet also makes it possible to assign child Jobs exclusively to a topology domain – to one of the dedicated accelerator islands, for example. This enables certain training strategies for ML models, such as Distributed Data Parallel (DDP), in which exactly one model replica runs per high-speed accelerator island and the replicas are synchronized only over the slower cross-island network.
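A hedged sketch of how this exclusive placement is expressed: at the time of writing, it is requested via an alpha-stage annotation whose value names the node label that defines a topology domain. The node-pool label shown here is cluster-specific and purely illustrative.

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: ddp-training               # illustrative name
  annotations:
    # Alpha annotation: place each child Job exclusively into one
    # topology domain, keyed by a node label (cluster-specific).
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  replicatedJobs:
  - name: workers
    replicas: 4                    # one model replica per accelerator island
    template:
      spec:
        template:
          spec:
            containers:
            - name: worker
              image: example.com/trainer:latest   # placeholder image
```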

JobSet also offers configurable success and failure policies. For example, developers can define a failure policy that specifies the maximum number of times a JobSet should be restarted after an error. If a child Job is marked as failed and that limit is not yet exhausted, the entire JobSet is recreated so that the workload can resume from its last checkpoint.
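A sketch of such policies as a spec fragment (field names follow the v1alpha2 API as documented at the time of writing; the replicated-job name is a placeholder):

```yaml
spec:
  failurePolicy:
    maxRestarts: 3        # recreate the whole JobSet up to three times
  successPolicy:
    operator: All         # JobSet succeeds once all target jobs succeed
    targetReplicatedJobs:
    - workers             # placeholder replicated-job name
```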


The range of applications for JobSet and the most important features available to date are summarized in a post on the Kubernetes blog. Using an example of distributed ML training with the JAX framework, the authors also demonstrate how JobSet can be configured for a TPU Multislice workload. The API development team plans to add further features, which are listed in the JobSet roadmap.

(map)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.