Split Inference of Transformer

Master Thesis (2025)
Author(s)

L. Hu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Qing Wang – Mentor (TU Delft - Embedded Systems)

J. Yang – Graduation committee member (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
07-07-2025
Awarding Institution
Delft University of Technology
Programme
Electrical Engineering | Embedded Systems
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

With the increasing demand for artificial intelligence (AI), intelligent systems have become deeply integrated into many aspects of modern life, including autonomous driving, smart assistants on mobile devices, and powerful online language models such as ChatGPT. In addition, the emergence of generative models for text and image synthesis has significantly reduced the cost of accessing and interacting with information. However, the advancement of these AI applications comes at the cost of ever-growing computational and memory requirements. These requirements pose substantial challenges even for high-end computing systems and become prohibitive when deploying AI models on resource-constrained platforms such as embedded devices and Internet of Things (IoT) nodes.

This thesis presents a distributed inference framework for Transformer models built around a fine-grained, channel-wise parameter partitioning scheme. Importantly, the implementation of our framework is independent of conventional AI frameworks such as PyTorch, making it efficient, portable, and adaptable to virtually any compute-capable device.

We begin by analyzing the computational and memory limitations of mainstream hardware platforms, motivating the aggregation of multiple low-power devices to collectively execute AI workloads. Through software-based simulation, we validate the correctness of the partitioned inference scheme and demonstrate that it introduces no functional deviation from unpartitioned, single-device execution. The simulation also enables precise estimation of data flow and compute demand across multiple collaborating devices. Furthermore, we introduce a full-stack load-balancing algorithm that adaptively allocates tasks according to heterogeneous hardware specifications, taking into account factors such as bandwidth, memory capacity, and communication latency.

In summary, this thesis proposes a practical, fine-grained, split Transformer inference framework that is compatible with heterogeneous hardware configurations, offering a promising step toward distributed AI inference on networks of resource-constrained embedded platforms.
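To make the channel-wise partitioning idea concrete, the sketch below splits the weight matrix of a single Transformer projection column-wise across several devices, with each device's share sized in proportion to a relative capability score. This is only a minimal illustration under assumed names (split_channels, partitioned_forward, shares) and an assumed proportional heuristic; it is not the thesis implementation, which is described as framework-independent and includes a full load-balancing algorithm over bandwidth, memory, and latency.

```python
# Minimal sketch (assumption, not the thesis code): channel-wise, i.e.
# column-wise, partitioning of one Transformer projection across devices.
import numpy as np

def split_channels(weight, shares):
    """Split a (d_in, d_out) weight matrix column-wise.

    `shares` holds one relative capability score per device (e.g. derived
    from memory capacity or bandwidth); each device receives a contiguous
    block of output channels roughly proportional to its score.
    """
    d_out = weight.shape[1]
    counts = np.floor(np.asarray(shares) / np.sum(shares) * d_out).astype(int)
    counts[-1] = d_out - counts[:-1].sum()          # absorb rounding remainder
    return np.split(weight, np.cumsum(counts)[:-1], axis=1)

def partitioned_forward(x, weight_slices):
    """Each device multiplies the input by its own channel slice;
    concatenating the partial outputs recovers the full projection."""
    partial = [x @ w for w in weight_slices]        # one term per device
    return np.concatenate(partial, axis=-1)

# Tiny check that partitioned execution matches single-device execution.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))                    # (tokens, d_model)
W = rng.standard_normal((64, 256))                  # projection weights
slices = split_channels(W, shares=[1.0, 2.0, 1.5])  # three unequal devices
assert np.allclose(partitioned_forward(x, slices), x @ W)
```

Because column-wise partitioning only reorders independent dot products, the concatenated result is bit-for-bit equivalent (up to floating-point summation order) to the unpartitioned computation, which is the property the simulation in the thesis is said to verify.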

Files

Lejun_Hu_Thesis.pdf
(pdf | 0 Mb)
License info not available

File under embargo until 22-06-2027