Transformer Inference using MAD vs LUT Kernels

A Comparative Benchmark of MAD and LUT Kernels for Binary and Ternary Dot Products on CPU and Edge Platforms

Bachelor Thesis (2026)
Author(s)

M.B. Eren (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

B. Refalo – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Q. Wang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

I.M. Olkhovskaia – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
26-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Quantizing Transformer weights to binary or ternary values reduces the inner product to sign manipulation and zero masking, prompting two competing CPU kernel strategies: multiply-add (MAD) and table lookup (LUT). Prior work reports end-to-end speedups but confounds the comparison by varying data layout, quantization format, and table depth simultaneously.
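The contrast between the two strategies can be illustrated with a minimal sketch for the binary case. This is not the thesis's kernel: it is scalar Python with hypothetical helper names (`mad_dot`, `build_tables`, `pack_row`, `lut_dot`), and it assumes weights in {-1, +1} and an activation length divisible by the table depth. The idea it shows is that a LUT kernel precomputes, for each group of `depth` activations, every possible signed partial sum, so that each weight group costs one lookup instead of `depth` multiply-adds.

```python
from typing import List


def mad_dot(weights: List[int], acts: List[int]) -> int:
    """MAD baseline: one multiply and one add per element (weights are +1 or -1)."""
    return sum(w * x for w, x in zip(weights, acts))


def build_tables(acts: List[int], depth: int) -> List[List[int]]:
    """For every group of `depth` activations, precompute all 2**depth signed sums.

    In a table index, bit j set means weight j of the group is +1, clear means -1.
    """
    assert len(acts) % depth == 0, "sketch assumes length divisible by depth"
    tables = []
    for start in range(0, len(acts), depth):
        group = acts[start:start + depth]
        table = [
            sum(x if (idx >> j) & 1 else -x for j, x in enumerate(group))
            for idx in range(1 << depth)
        ]
        tables.append(table)
    return tables


def pack_row(weights: List[int], depth: int) -> List[int]:
    """Pack a row of +1/-1 weights into one table index per group."""
    indices = []
    for start in range(0, len(weights), depth):
        idx = 0
        for j, w in enumerate(weights[start:start + depth]):
            if w == 1:
                idx |= 1 << j
        indices.append(idx)
    return indices


def lut_dot(indices: List[int], tables: List[List[int]]) -> int:
    """LUT kernel: one lookup and one add per group of `depth` weights."""
    return sum(table[idx] for idx, table in zip(indices, tables))


acts = [3, -1, 4, 1, -5, 9, 2, -6]
row = [1, -1, 1, 1, -1, -1, 1, -1]
for depth in (1, 2, 4):
    tables = build_tables(acts, depth)  # built once, reusable across weight rows
    assert lut_dot(pack_row(row, depth), tables) == mad_dot(row, acts)
```

The tables depend only on the activations, so their construction cost is amortized over all weight rows of a matrix-vector product. Table size grows as 2^depth per group for binary weights; a ternary variant would need up to 3^depth entries per group, which is one way a deeper table can outgrow fast cache.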

This thesis isolates the trade-off by sweeping LUT depth as the single controlled variable, across matrix sizes that span the cache hierarchy. Results are attributed through roofline analysis on an x86 platform with AVX2, and through roofline plus hardware-counter analysis on an ARM edge platform with NEON. The LUT advantage proves conditional: binary throughput rises monotonically with depth to 104.4 GOPS, roughly 2.6 times the strongest MAD baseline, while ternary gains are narrower and erode once the table outgrows fast cache or forces a gather. Throughout, instruction throughput, rather than memory bandwidth, is the binding limit.
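For readers unfamiliar with roofline analysis, the bound it applies can be sketched in a few lines. The function name and all numbers below are hypothetical placeholders, not measurements from the thesis: attainable throughput is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth.

```python
def roofline_gops(peak_gops: float, bandwidth_gbs: float, intensity_ops_per_byte: float) -> float:
    """Classic roofline bound: min(compute peak, intensity * memory bandwidth)."""
    return min(peak_gops, intensity_ops_per_byte * bandwidth_gbs)


def is_compute_bound(peak_gops: float, bandwidth_gbs: float, intensity_ops_per_byte: float) -> bool:
    """True when the compute peak, not the memory roof, is the binding limit."""
    return intensity_ops_per_byte * bandwidth_gbs >= peak_gops


# Hypothetical machine: 100 GOPS peak, 20 GB/s memory bandwidth.
low = roofline_gops(100.0, 20.0, 2.0)    # memory roof applies: 2 * 20 = 40
high = roofline_gops(100.0, 20.0, 10.0)  # compute roof applies: capped at 100
assert low == 40.0 and high == 100.0
```

A kernel whose operating point lies under the flat part of the roof is limited by how fast instructions can issue, which is the regime the abstract reports for these kernels.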

Files

Main.pdf
(pdf | 0.583 Mb)
License info not available