
Announcing the preview of Amazon SageMaker Profiler: Track and visualize detailed hardware performance data for your model training workloads

Today, we're pleased to announce the preview of Amazon SageMaker Profiler, a capability of Amazon SageMaker that provides a detailed view into the AWS compute resources provisioned while training deep learning models on SageMaker. With SageMaker Profiler, you can track all activities on CPUs and GPUs, such as CPU and GPU utilization, kernel runs on GPUs, kernel launches on CPUs, sync operations, memory operations across GPUs, latencies between kernel launches and corresponding runs, and data transfer between CPUs and GPUs. In this post, we walk you through the capabilities of SageMaker Profiler.

SageMaker Profiler provides Python modules for annotating PyTorch or TensorFlow training scripts and activating SageMaker Profiler. It also offers a user interface (UI) that visualizes the profile, a statistical summary of profiled events, and the timeline of a training job for tracking and understanding the time relationship of the events between GPUs and CPUs.

The need for profiling training jobs

With the rise of deep learning (DL), machine learning (ML) has become compute and data intensive, typically requiring multi-node, multi-GPU clusters. As state-of-the-art models grow to the order of trillions of parameters, their computational complexity and cost also increase rapidly. ML practitioners have to cope with common challenges of efficient resource utilization when training such large models. This is particularly evident in large language models (LLMs), which typically have billions of parameters and therefore require large multi-node GPU clusters to train efficiently.

When training these models on large compute clusters, we can encounter compute resource optimization challenges such as I/O bottlenecks, kernel launch latencies, memory limits, and low resource utilization. If the training job configuration is not optimized, these challenges can result in inefficient hardware utilization and longer training times or incomplete training runs, which increase the overall costs and timelines for the project.

Prerequisites

The following are the prerequisites to start using SageMaker Profiler:

  • A SageMaker domain in your AWS account – For instructions on setting up a domain, see Onboard to Amazon SageMaker Domain using quick setup. You also need to add domain user profiles for individual users to access the SageMaker Profiler UI application. For more information, see Add and remove SageMaker Domain user profiles.

  • Permissions – The following list is the minimum set of permissions that should be assigned to the execution role for using the SageMaker Profiler UI application:

    • sagemaker:CreateApp

    • sagemaker:DeleteApp

    • sagemaker:DescribeTrainingJob

    • sagemaker:SearchTrainingJobs

    • s3:GetObject

    • s3:ListBucket
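As a hedged sketch, the permissions above could be attached to the execution role as an inline IAM policy like the following. The broad "Resource": "*" is a placeholder; in practice, scope the S3 actions down to the bucket that stores your profiling output:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateApp",
        "sagemaker:DeleteApp",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:SearchTrainingJobs",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": "*"
    }
  ]
}
```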

Prepare and run a training job with SageMaker Profiler

To start capturing kernel runs on GPUs while the training job is running, modify your training script using the SageMaker Profiler Python modules. Import the library and add the start_profiling() and stop_profiling() methods to define the beginning and the end of profiling. You can also use optional custom annotations to add markers in the training script that visualize hardware activities during particular operations in each step.

There are two approaches you can take to profile your training scripts with SageMaker Profiler. The first approach is based on profiling full functions; the second approach is based on profiling specific code lines in functions.

To profile by functions, use the context manager smppy.annotate to annotate full functions. The following example script shows how to implement the context manager to wrap the training loop and full functions in each iteration:

import smppy

sm_prof = smppy.SMProfiler.instance()
config = smppy.Config()
config.profiler = {
    "EnableCuda": "1",
}
sm_prof.configure(config)
sm_prof.start_profiling()

for epoch in range(args.epochs):
    if world_size > 1:
        sampler.set_epoch(epoch)
    tstart = time.perf_counter()
    for i, data in enumerate(trainloader, 0):
        with smppy.annotate("step_" + str(i)):
            inputs, labels = data
            inputs = inputs.to("cuda", non_blocking=True)
            labels = labels.to("cuda", non_blocking=True)

            optimizer.zero_grad()

            with smppy.annotate("Forward"):
                outputs = net(inputs)
            with smppy.annotate("Loss"):
                loss = criterion(outputs, labels)
            with smppy.annotate("Backward"):
                loss.backward()
            with smppy.annotate("Optimizer"):
                optimizer.step()

sm_prof.stop_profiling()

You can also use smppy.annotation_begin() and smppy.annotation_end() to annotate specific lines of code in functions. For more information, refer to the documentation.
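As a minimal sketch of this line-level approach: smppy is only available inside the SageMaker Profiler training containers, so the snippet below falls back to a no-op stub when the import fails, and the begin/end signatures shown (a label for annotation_begin, no argument for annotation_end) are an assumption to verify against the documentation:

```python
# Line-level annotation sketch. The smppy module ships only in SageMaker
# Profiler training containers; outside them we fall back to a no-op stub.
# The exact begin/end signatures are an assumption -- verify against the
# SageMaker Profiler documentation.
try:
    import smppy
except ImportError:
    class smppy:  # no-op stand-in so the sketch runs anywhere
        @staticmethod
        def annotation_begin(name):
            pass

        @staticmethod
        def annotation_end():
            pass


def normalize(batch):
    # Mark only the normalization lines, not the whole function
    smppy.annotation_begin("normalize")
    mean = sum(batch) / len(batch)
    centered = [x - mean for x in batch]
    smppy.annotation_end()
    return centered
```

Because the stub makes the annotation calls free, the same script runs unmodified outside SageMaker.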

Configure the SageMaker training job launcher

After you're done annotating and setting up the profiler initiation modules, save the training script and prepare the SageMaker framework estimator for training using the SageMaker Python SDK.

  1. Set up a profiler_config object using the ProfilerConfig and Profiler modules as follows:

from sagemaker import ProfilerConfig, Profiler

profiler_config = ProfilerConfig(
    profiler_params=Profiler(cpu_profiling_duration=3600)
)

  2. Create a SageMaker estimator with the profiler_config object created in the previous step. The following code shows an example of creating a PyTorch estimator:

import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    framework_version="2.0.0",
    image_uri="763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker",
    role=sagemaker.get_execution_role(),
    entry_point="train_with_profiler_demo.py",  # your training job entry point
    source_dir=source_dir,  # source dir for your training script
    output_path=output_path,
    base_job_name="sagemaker-profiler-demo",
    hyperparameters=hyperparameters,  # if any
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    profiler_config=profiler_config
)

If you want to create a TensorFlow estimator, import sagemaker.tensorflow.TensorFlow instead, and specify one of the TensorFlow versions supported by SageMaker Profiler. For more information about supported frameworks and instance types, see Supported frameworks.
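As a hedged sketch, the TensorFlow variant changes only the import and the framework version; the profiler_config is passed in unchanged. The entry point and instance settings below are illustrative placeholders, and 2.12.0 is one of the TensorFlow versions the feature supports:

```python
# Illustrative keyword arguments for a TensorFlow estimator with SageMaker
# Profiler enabled. The entry_point and instance settings are placeholders;
# 2.12.0 is one of the supported TensorFlow versions.
tf_estimator_kwargs = dict(
    framework_version="2.12.0",
    entry_point="train_with_profiler_demo.py",  # your training entry point
    base_job_name="sagemaker-profiler-tf-demo",
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
)

# Inside a SageMaker environment, you would then build the estimator with:
#   from sagemaker.tensorflow import TensorFlow
#   estimator = TensorFlow(role=sagemaker.get_execution_role(),
#                          profiler_config=profiler_config,
#                          **tf_estimator_kwargs)
```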

  3. Start the training job by running the fit method:

estimator.fit(wait=False)

Launch the SageMaker Profiler UI

When the training job is complete, you can launch the SageMaker Profiler UI to visualize and explore the profile of the training job. You can access the SageMaker Profiler UI application through the SageMaker Profiler landing page on the SageMaker console or through the SageMaker domain.

To launch the SageMaker Profiler UI application from the SageMaker console, complete the following steps:

  1. On the SageMaker console, choose Profiler in the navigation pane.

  2. Under Get started, select the domain in which you want to launch the SageMaker Profiler UI application.

If your user profile only belongs to one domain, you will not see the option for selecting a domain.

  3. Select the user profile for which you want to launch the SageMaker Profiler UI application.

If there is no user profile in the domain, choose Create user profile. For more information about creating a new user profile, see Add and Remove User Profiles.

  4. Choose Open Profiler.

Gain insights from the SageMaker Profiler

When you open the SageMaker Profiler UI, the Select and load a profile page opens, as shown in the following screenshot.

You can view a list of all the training jobs that have been submitted to SageMaker Profiler and search for a particular training job by its name, creation time, and run status (In Progress, Completed, Failed, Stopped, or Stopping). To load a profile, select the training job you want to view and choose Load. The job name should appear in the Loaded profile section at the top.

Choose the job name to generate the dashboard and timeline. Note that when you choose the job, the UI automatically opens the dashboard. You can load and visualize only one profile at a time. To load another profile, you must first unload the previously loaded profile by choosing the trash bin icon in the Loaded profile section.

For this post, we view the profile of an ALBEF training job on two ml.p4d.24xlarge instances.

After you finish loading and selecting the training job, the UI opens the Dashboard page, as shown in the following screenshot.

You can see the plots for key metrics, namely the GPU active time, GPU utilization over time, CPU active time, and CPU utilization over time. The GPU active time pie chart shows the percentage of GPU active time vs. GPU idle time, which enables you to check whether the GPUs are more active than idle throughout the entire training job. The GPU utilization over time timeline graph shows the average GPU utilization rate over time per node, aggregating all the nodes in a single chart. You can check whether the GPUs have an unbalanced workload, under-utilization issues, bottlenecks, or idle periods during certain time intervals. For more details on interpreting these metrics, refer to the documentation.

The dashboard provides additional plots, including time spent by all GPU kernels, time spent by the top 15 GPU kernels, launch counts of all GPU kernels, and launch counts of the top 15 GPU kernels, as shown in the following screenshot.

Finally, the dashboard enables you to visualize additional metrics, such as the step time distribution, a histogram that shows the distribution of step durations on GPUs, and the kernel precision distribution pie chart, which shows the percentage of time spent running kernels in different data types such as FP32, FP16, INT32, and INT8.

You can also obtain a pie chart of the GPU activity distribution, which shows the percentage of time spent on GPU activities such as running kernels, memory operations (memcpy and memset), and synchronization (sync). The GPU memory operations distribution pie chart visualizes the percentage of time spent on each type of GPU memory operation.

You can also create your own histograms based on a custom metric that you annotated manually, as described earlier in this post. When adding a custom annotation to a new histogram, select or enter the name of the annotation you added in the training script.

Timeline interface

The SageMaker Profiler UI also includes a timeline interface, which provides a detailed view into the compute resources at the level of operations and kernels scheduled on the CPUs and run on the GPUs. The timeline is organized in a tree structure, giving you information from the host level down to the device level, as shown in the following screenshot.

For each CPU, you can track the CPU performance counters, such as clk_unhalted_ref.tsc and itlb_misses.miss_causes_a_walk. For each GPU on the 2x p4d.24xlarge instance, you can see a host timeline and a device timeline. Kernel launches are on the host timeline and kernel runs are on the device timeline.

You can also zoom in to individual steps. In the following screenshot, we have zoomed in to step_41. The timeline strip selected in the screenshot is the AllReduce operation, an essential communication and synchronization step in distributed training, run on GPU-0. Note that the kernel launch on the GPU-0 host timeline connects to the kernel run in GPU-0 device stream 1, indicated with the arrow in cyan.

Availability and considerations

SageMaker Profiler is available for PyTorch (versions 2.0.0 and 1.13.1) and TensorFlow (versions 2.12.0 and 2.11.1). The following table provides the links to the supported AWS Deep Learning Containers for SageMaker.

SageMaker Profiler is currently available in the following Regions: US East (Ohio, N. Virginia), US West (Oregon), and Europe (Frankfurt, Ireland).

SageMaker Profiler is available for the training instance types ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.g4dn.12xlarge.

For the full list of supported frameworks and versions, refer to the documentation.

SageMaker Profiler incurs charges after the SageMaker Free Tier or the free trial period of the feature ends. For more information, see Amazon SageMaker Pricing.

Performance of SageMaker Profiler

We compared the overhead of SageMaker Profiler against various open-source profilers. The baseline used for the comparison was obtained from running the training job without a profiler.

Our key finding was that SageMaker Profiler generally resulted in a shorter billable training duration because it added less overhead time to the end-to-end training runs. It also generated less profiling data (up to 10 times less) compared with the open-source alternatives. The smaller profiling artifacts generated by SageMaker Profiler require less storage, thereby also saving on costs.

Conclusion

SageMaker Profiler enables you to get detailed insights into the utilization of compute resources when training your deep learning models. This can help you resolve performance hotspots and bottlenecks to ensure efficient resource utilization, which can ultimately drive down training costs and reduce the overall training duration.

To get started with SageMaker Profiler, refer to the documentation.

About the Authors

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS based in Munich, Germany. Roy helps AWS customers, from small startups to large enterprises, train and deploy large language models efficiently on AWS. Roy is passionate about computational optimization problems and improving the performance of AI workloads.

Sushant Moon is a Data Scientist at AWS, India, specializing in guiding customers through their AI/ML endeavors. With a diverse background spanning the retail, finance, and insurance domains, he delivers innovative and tailored solutions. Beyond his professional life, Sushant finds rejuvenation in swimming and seeks inspiration from his travels to diverse locales.

Diksha Sharma is an AI/ML Specialist Solutions Architect in the Worldwide Specialist Organization. She works with public sector customers to help them architect efficient, secure, and scalable machine learning applications, including generative AI solutions on AWS. In her spare time, Diksha likes to read, paint, and spend time with her family.