Split Inference of Transformer
L. Hu (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Qing Wang – Mentor (TU Delft - Embedded Systems)
J. Yang – Graduation committee member (TU Delft - Web Information Systems)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
With the increasing demand for artificial intelligence (AI), intelligent systems have become deeply integrated into various aspects of modern life, including autonomous driving, smart assistants on mobile devices, and powerful online language models such as ChatGPT. In addition, the emergence of generative models for text and image synthesis has significantly reduced the cost of accessing and interacting with information. However, the advancement of these AI applications comes at the cost of ever-growing computational and memory requirements. These requirements pose substantial challenges even for high-end computing systems, and become prohibitive when deploying AI models on resource-constrained platforms such as embedded devices and Internet of Things (IoT) nodes.

This thesis presents a distributed inference framework for Transformer models in which we design a fine-grained, channel-wise parameter partitioning scheme. Importantly, the implementation of our framework is independent of conventional AI frameworks such as PyTorch, making it efficient, portable, and adaptable to virtually any compute-capable device.

We begin by analyzing the computational and memory limitations of mainstream hardware platforms, highlighting the motivation to aggregate multiple low-power devices to collectively execute AI workloads. Through software-based simulation, we validate the correctness of the partitioned inference scheme and demonstrate that it introduces no functional deviation from unpartitioned single-device execution. The simulation also enables precise estimation of data flow and compute demand across multiple collaborating devices. Furthermore, we introduce a full-stack load-balancing algorithm that adaptively allocates tasks based on heterogeneous hardware specifications, taking into account factors such as bandwidth, memory capacity, and communication latency.

In summary, this thesis proposes a split, practical, and high-granularity Transformer inference framework that is compatible with heterogeneous hardware configurations, offering a promising step toward enabling distributed AI inference on a network of resource-constrained embedded platforms.
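As a rough illustration of the channel-wise partitioning idea summarized above (not the thesis implementation, which remains under embargo), a Transformer linear layer's weight matrix can be split along its output-channel dimension so that each device computes only a slice of the output, after which the slices are concatenated. The function names and the use of NumPy in this sketch are illustrative assumptions.

```python
import numpy as np

def split_linear_channelwise(W, b, num_devices):
    """Split a linear layer's parameters along the output-channel axis.

    W: (out_features, in_features) weight matrix
    b: (out_features,) bias vector
    Returns one (W_i, b_i) shard per device.
    """
    W_shards = np.array_split(W, num_devices, axis=0)
    b_shards = np.array_split(b, num_devices, axis=0)
    return list(zip(W_shards, b_shards))

def distributed_linear(x, shards):
    """Each device applies its shard to the same input x; the partial
    outputs are concatenated to recover the full layer output."""
    partial_outputs = [x @ W_i.T + b_i for W_i, b_i in shards]
    return np.concatenate(partial_outputs, axis=-1)

# Sanity check: the partitioned result matches single-device execution.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
b = rng.standard_normal(8)
x = rng.standard_normal((2, 4))
assert np.allclose(x @ W.T + b,
                   distributed_linear(x, split_linear_channelwise(W, b, 3)))
```

The equivalence asserted at the end mirrors the abstract's claim that partitioned inference introduces no functional deviation from unpartitioned single-device execution.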
Files
File under embargo until 22-06-2027