Split Inference of Transformer
L. Hu (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Qing Wang – Mentor (TU Delft - Embedded Systems)
J. Yang – Graduation committee member (TU Delft - Web Information Systems)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
With the increasing demand for artificial intelligence (AI), intelligent systems have become deeply integrated into various aspects of modern life, including autonomous driving, smart assistants on mobile devices, and powerful online language models such as ChatGPT. In addition, the emergence of generative models for text and image synthesis has significantly reduced the cost of accessing and interacting with information. However, the advancement of these AI applications comes at the cost of ever-growing computational and memory requirements. These requirements pose substantial challenges even for high-end computing systems, and become prohibitive when deploying AI models on resource-constrained platforms such as embedded devices and Internet of Things (IoT) nodes.

This thesis presents a distributed inference framework for Transformer models in which we design a fine-grained, channel-wise parameter partitioning scheme. Importantly, the implementation of our framework is independent of conventional AI frameworks such as PyTorch, making it efficient, portable, and adaptable to virtually any compute-capable device.

We begin by analyzing the computational and memory limitations of mainstream hardware platforms, highlighting the motivation to aggregate multiple low-power devices to collectively execute AI workloads. Through software-based simulation, we validate the correctness of the partitioned inference scheme and demonstrate that it introduces no functional deviation from unpartitioned single-device execution. The simulation also enables precise estimation of data flow and compute demand across multiple collaborating devices. Furthermore, we introduce a full-stack load-balancing algorithm that adaptively allocates tasks based on heterogeneous hardware specifications, taking into account factors such as bandwidth, memory capacity, and communication latency.

In summary, this thesis proposes a split, practical, and high-granularity Transformer inference framework that is compatible with heterogeneous hardware configurations, offering a promising step toward enabling distributed AI inference on a network of resource-constrained embedded platforms.
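As a rough illustration of the channel-wise partitioning idea summarized above (not the thesis implementation, which remains under embargo), a Transformer linear layer's weight matrix can be split along its output-channel dimension so that each device computes only a slice of the output, after which the slices are concatenated. The function names and the use of NumPy in this sketch are illustrative assumptions.

```python
import numpy as np

def split_linear_channelwise(W, b, num_devices):
    """Split a linear layer's parameters along the output-channel axis.

    W: (out_features, in_features) weight matrix
    b: (out_features,) bias vector
    Returns one (W_i, b_i) shard per device.
    """
    W_shards = np.array_split(W, num_devices, axis=0)
    b_shards = np.array_split(b, num_devices, axis=0)
    return list(zip(W_shards, b_shards))

def distributed_linear(x, shards):
    """Each device applies its shard to the same input x; the partial
    outputs are concatenated to recover the full layer output."""
    partial_outputs = [x @ W_i.T + b_i for W_i, b_i in shards]
    return np.concatenate(partial_outputs, axis=-1)

# Sanity check: the partitioned result matches single-device execution.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
b = rng.standard_normal(8)
x = rng.standard_normal((2, 4))
assert np.allclose(x @ W.T + b,
                   distributed_linear(x, split_linear_channelwise(W, b, 3)))
```

The equivalence asserted at the end mirrors the abstract's claim that partitioned inference introduces no functional deviation from unpartitioned single-device execution.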
Files
File under embargo until 22-06-2027