Vertex AI Persistent Resources and capacity assurance

Customers often look for capacity assurance for their critical machine learning (ML) workloads: they want to be sure compute capacity is available when they need it, especially during peak seasonal events like Black Friday/Cyber Monday (BFCM), the Super Bowl, and tax season, when running freshly trained models is essential. In response to these customer requests, we announced persistent resources at Google Cloud Next ’23 as a capacity assurance option for model training.

In this blog, we explain how you can use persistent resources to ensure compute resource availability and achieve faster startup times to run your critical model training applications.

What are persistent resources?

Vertex AI provides a managed training service that enables you to operationalize large-scale model training. You can run training applications built on any ML framework and handle more advanced workflows such as distributed training jobs. These training jobs are ephemeral: when a job completes, the provisioned virtual machines (VMs) are deleted.

A Vertex AI persistent resource is a long-running cluster to which you can submit multiple Vertex AI Training jobs, letting the cluster allocate resources to each job.
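
To make this concrete, here is a minimal sketch of creating a persistent resource with the google-cloud-aiplatform Python client (v1 API). The project, region, resource ID, and machine configuration are placeholder assumptions, not values from this post, and exact field names can vary across client library versions.

from google.cloud import aiplatform_v1

# Persistent resources are managed through a regional Vertex AI endpoint.
client = aiplatform_v1.PersistentResourceServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
)

# One worker resource pool of four A100 VMs (illustrative sizing).
resource_pool = aiplatform_v1.ResourcePool(
    machine_spec=aiplatform_v1.MachineSpec(
        machine_type="a2-highgpu-1g",
        accelerator_type=aiplatform_v1.AcceleratorType.NVIDIA_TESLA_A100,
        accelerator_count=1,
    ),
    replica_count=4,
)

operation = client.create_persistent_resource(
    parent="projects/my-project/locations/us-central1",  # placeholder project
    persistent_resource_id="my-training-cluster",        # placeholder ID
    persistent_resource=aiplatform_v1.PersistentResource(
        display_name="my-training-cluster",
        resource_pools=[resource_pool],
    ),
)
operation.result()  # block until the cluster is provisioned

Once the operation completes, the VMs stay up until you delete the persistent resource, which is what provides the capacity assurance.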

When should you use persistent resources?

An ideal use case is a persistent resource with accelerator worker resource pools to which you submit similar training jobs, letting the cluster allocate the resources. Persistent resources are recommended in the following scenarios:

  • You are already leveraging Vertex AI Training for your custom training jobs.

  • You want to ensure capacity availability for scarce resources like GPUs (A100s or H100s).

  • You are submitting similar jobs multiple times and can benefit from data and image caching by running the jobs on the same persistent resource.

  • Your CPU- or GPU-based jobs are short-lived, such that the actual training takes less time than the job startup.

If you have a critical ML workload, you may need to guarantee GPU accelerator availability during peak seasonal events, whether that means the highly demanded A100s or the newly released H100 GPUs. Unlike with ephemeral training jobs submitted to Vertex AI Training, the VMs in a persistent resource remain available, so you can submit your training jobs to the cluster and use its available worker resource pools.
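
As a sketch of that workflow, the snippet below submits a custom job onto the cluster created earlier by setting persistent_resource_id on the job spec. It assumes the same v1 Python client and that your client library version exposes the persistent_resource_id field on CustomJobSpec; the image URI and IDs are placeholders.

from google.cloud import aiplatform_v1

job_client = aiplatform_v1.JobServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
)

custom_job = aiplatform_v1.CustomJob(
    display_name="train-on-cluster",
    job_spec=aiplatform_v1.CustomJobSpec(
        # Route the job to the long-running cluster instead of
        # provisioning ephemeral VMs for it.
        persistent_resource_id="my-training-cluster",  # placeholder ID
        worker_pool_specs=[
            aiplatform_v1.WorkerPoolSpec(
                machine_spec=aiplatform_v1.MachineSpec(
                    machine_type="a2-highgpu-1g",
                    accelerator_type=aiplatform_v1.AcceleratorType.NVIDIA_TESLA_A100,
                    accelerator_count=1,
                ),
                replica_count=1,
                container_spec=aiplatform_v1.ContainerSpec(
                    image_uri="us-docker.pkg.dev/my-project/training/trainer:latest",  # placeholder image
                ),
            )
        ],
    ),
)

response = job_client.create_custom_job(
    parent="projects/my-project/locations/us-central1",  # placeholder project
    custom_job=custom_job,
)
print(response.name)

Because the requested machine type already exists in the cluster's worker pools, the job skips VM provisioning and can start much faster than an ephemeral job.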

Why should I use persistent resources?