Monitoring hardware utilization when training on GPUs in distributed machine learning

Master Thesis (2022)
Author(s)

M.C. Provó Kluit (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Jan Rellermeyer – Mentor (TU Delft - Data-Intensive Systems)

Lydia Y. Chen – Graduation committee member (TU Delft - Data-Intensive Systems)

Burcu Külahçıoğlu Özkan – Graduation committee member (TU Delft - Software Engineering)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2022 Mark Provó Kluit
Publication Year
2022
Language
English
Graduation Date
29-08-2022
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large-scale machine learning frameworks can accelerate the training of a neural network by performing distributed training on a cluster, using multiple GPUs per node and multiple nodes. Because distributed training on a cluster involves many nodes that need to communicate and to load and exchange data, a machine learning framework may at certain times during training not fully utilize the available hardware of the system. We assess various techniques for their ability to measure the performance of specific parts of the hardware of a cluster. We present ML Board, a tool that measures and visualizes the utilization of the system while training a neural network model using a selection of the assessed techniques, and does so without requiring any changes to the machine learning framework in use. ML Board can be used to identify straggling nodes; by subsequently letting the user select different nodes through the Slurm job scheduler, it can help decrease the training time of a ResNet model by 15 to 45% on the ImageNet and CIFAR-10 datasets. Furthermore, the energy used by the GPUs can be measured and used to identify and replace GPUs, reducing the total energy consumption by 5 to 16%.
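The out-of-band measurement approach described above, sampling GPU utilization and power draw without modifying the training framework, can be sketched by polling `nvidia-smi` and integrating power over the sampling interval. This is a minimal illustration of the idea, not ML Board's actual implementation; the helper names and the fixed-interval energy estimate are assumptions:

```python
import subprocess

# nvidia-smi query for per-GPU utilization (%) and power draw (W),
# emitted as CSV without headers or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,power.draw",
    "--format=csv,noheader,nounits",
]

def parse_sample(line):
    """Parse one CSV line from nvidia-smi into (utilization %, power W)."""
    util, power = (field.strip() for field in line.split(","))
    return int(util), float(power)

def estimate_energy(power_samples_w, interval_s):
    """Approximate energy in joules by summing power over fixed intervals."""
    return sum(power_samples_w) * interval_s

def poll_once():
    """Read one sample per GPU; requires an NVIDIA driver on the node."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_sample(line) for line in out.stdout.splitlines() if line]
```

For example, `parse_sample("87, 215.3")` yields `(87, 215.3)`, and ten one-second samples averaging 200 W estimate roughly 2000 J for that window. Polling from a separate process is what keeps the training framework itself unmodified.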

Files

MSc_thesis_mprovokluit.pdf
(pdf | 3.31 MB)
License info not available