B. Refalo | TU Delft Repository

Efficient Embedded Intelligence

Exploring the Width-Precision Trade-Off in Binary-Quantized Vision Transformers

Bachelor thesis (2026) - I.S. van Loon, B. Refalo, Q. Wang, I.M. Olkhovskaia

Vision Transformers perform strongly across computer vision tasks but often require too much compute and memory for embedded deployment. Binary quantization cuts these costs by constraining weights and activations to a single bit, at the expense of accuracy. We investigate whether the budget freed by binarization can be reinvested into additional model width to recover that lost accuracy. Using the BHViT-Tiny architecture on the Oxford-IIIT Pet dataset, we first isolate the accuracy gap caused by quantization alone by comparing a full-precision reference against its binarized counterpart at identical width, and then scale the width within the freed budget to measure how much of this gap wider models recover. We find that binarization at the base width costs 7.1 points of Top-1 accuracy, and that tripling the width recovers 4.9 of these points while retaining theoretical reductions of 3.5× in memory and 6.7× in compute relative to the full-precision reference. The wider binary model thus approaches full-precision accuracy at a fraction of its cost. Additionally, keeping the downsampling layers in full precision recovers a further 1.1 points at a cost still well within budget, narrowing the gap to 1.1 points and indicating that part of the residual loss stems from a precision bottleneck rather than from a global lack of capacity. Our results establish width scaling as an effective strategy for reducing the binarization accuracy gap, offering a promising path toward the resource-constrained deployment of Vision Transformers. ...
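The gap accounting in this abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the three accuracy figures come from the abstract, while `remaining_gap`, `weight_memory_reduction`, and the quadratic-in-width parameter model are hypothetical and are not the thesis's own cost model, which the abstract does not describe.

```python
# Accuracy figures are taken from the abstract; everything else is an
# illustrative assumption.
GAP_AT_BASE_WIDTH = 7.1             # Top-1 points lost by binarizing at base width
RECOVERED_BY_WIDTH = 4.9            # points recovered by tripling the width
RECOVERED_BY_FP_DOWNSAMPLING = 1.1  # further points from full-precision downsampling


def remaining_gap(initial_gap, *recoveries):
    """Return the accuracy gap left after subtracting each recovery step."""
    return round(initial_gap - sum(recoveries), 1)


def weight_memory_reduction(width_multiplier, full_bits=32, quant_bits=1):
    """Rough weight-memory reduction versus the full-precision base model.

    Assumption (hypothetical): the parameter count grows quadratically with
    width and every weight is quantized. Real models keep some layers in
    higher precision, so this is only a back-of-the-envelope estimate.
    """
    return full_bits / (quant_bits * width_multiplier ** 2)


print(remaining_gap(GAP_AT_BASE_WIDTH, RECOVERED_BY_WIDTH))
print(remaining_gap(GAP_AT_BASE_WIDTH, RECOVERED_BY_WIDTH,
                    RECOVERED_BY_FP_DOWNSAMPLING))
print(round(weight_memory_reduction(3), 2))
```

Under this simplified model a tripled-width, fully binarized network still stores fewer weight bits than the 32-bit base model, which is the sense in which the freed budget can be reinvested in width.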

Through-Screen Finger Localization and Tracking using Reflected Light

Bachelor thesis (2026) - A. Croitoru, Q. Wang, B. Refalo, I.M. Olkhovskaia

Visible light positioning systems conventionally fix the light sources on a ceiling and let a receiver move through the scene. We invert this geometry by tracking a finger hovering over a transparent segmented OLED display mounted above four photodiodes. The finger's position is estimated from the four photodiode signals, which vary with the light reflected from the finger. The motivating application is pre-touch sensing on mobile devices, where anticipating the user's next touch during the hover-to-touch window lets the system pre-load content. The central question is whether four under-screen photodiodes can localize and track a hovering finger in real time using only the microcontroller already driving the screen, avoiding the deep neural networks that prior through-screen sensing required. We collected a reflected-light dataset of 199 finger captures across a 10×4 calibration grid and evaluated localization on a 5×2 cell grid. After subtracting a temporally interpolated no-finger baseline, we build an 18-dimensional feature vector and classify the cell with a two-stage logistic-regression head that predicts column and row independently. This reaches 77.2% cell accuracy under a random split and 66.8% under a leave-one-calibration-dot-out protocol. The latter is more representative of deployment because every recording of the tested position is withheld from training. The complete pipeline runs with sub-millisecond inference on an Arduino Due that also drives the screen. We conclude that through-screen reflected light carries enough spatial information for cell-level finger localization without deep learning, on the same embedded hardware that runs the display. ...
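A minimal sketch of a column/row classification head of the kind described above, assuming two independent multinomial logistic regressions over the same 18-dimensional feature vector. The function names, the row-major cell numbering, and the toy weights are hypothetical; the abstract does not specify the thesis's feature construction or trained parameters.

```python
import math

N_FEATURES = 18         # feature-vector length reported in the abstract
N_COLS, N_ROWS = 5, 2   # evaluation grid reported in the abstract


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_scores(weights, biases, x):
    """One score per class: dot(w_k, x) + b_k."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def predict_cell(x, col_w, col_b, row_w, row_b):
    """Predict column and row independently, then combine into a cell index.

    The row-major numbering (row * N_COLS + col) is an assumption.
    """
    col_probs = softmax(linear_scores(col_w, col_b, x))
    row_probs = softmax(linear_scores(row_w, row_b, x))
    col = max(range(N_COLS), key=col_probs.__getitem__)
    row = max(range(N_ROWS), key=row_probs.__getitem__)
    return row * N_COLS + col, col, row


# Toy (hypothetical) parameters: column k responds to feature k,
# row r responds to feature 5 + r.
col_w = [[1.0 if i == k else 0.0 for i in range(N_FEATURES)] for k in range(N_COLS)]
col_b = [0.0] * N_COLS
row_w = [[1.0 if i == 5 + r else 0.0 for i in range(N_FEATURES)] for r in range(N_ROWS)]
row_b = [0.0] * N_ROWS

x = [0.0] * N_FEATURES
x[3], x[6] = 2.0, 1.0
print(predict_cell(x, col_w, col_b, row_w, row_b))
```

Inference here costs 7 dot products of length 18 plus two small softmaxes, which gives a sense of why a head of this shape can fit on a microcontroller.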

Training Strategies for Binary/Ternary Neural Networks

Bachelor thesis (2026) - R.B. Kiemes, Q. Wang, B. Refalo, I.M. Olkhovskaia

Binary and ternary neural networks offer substantial reductions in memory and computational cost, making them attractive for deployment on resource-constrained devices. Training these networks remains challenging because quantization functions are non-differentiable, requiring gradient approximations such as the Straight-Through Estimator (STE).
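As an illustration of the idea, here is a minimal scalar sketch of the clipped identity STE, one common variant. It is not claimed to be one of the eleven variants the thesis evaluates, and the function names and the clipping threshold are hypothetical.

```python
def binarize(w):
    """Forward pass: map a latent real-valued weight to -1.0 or +1.0."""
    return 1.0 if w >= 0 else -1.0


def ste_grad(w, upstream_grad, clip=1.0):
    """Backward pass of the clipped identity STE.

    sign() has zero gradient almost everywhere, so the estimator treats the
    forward pass as the identity inside [-clip, clip] and blocks the gradient
    outside that range.
    """
    return upstream_grad if abs(w) <= clip else 0.0


def sgd_step(weights, grads_wrt_binary, lr=0.1):
    """Update the latent weights using gradients taken w.r.t. binarized weights."""
    return [w - lr * ste_grad(w, g) for w, g in zip(weights, grads_wrt_binary)]


latent = [0.5, -0.2, 1.5]
print([binarize(w) for w in latent])
print(sgd_step(latent, [1.0, 1.0, 1.0], lr=0.25))
```

The latent weight 1.5 lies outside the clipping range, so its gradient is blocked and it is left unchanged by the update.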

This work presents a systematic ablation study of training configurations for ResNet-20 on CIFAR-10. We evaluated eleven STE variants and independently examined the effects of weight clipping and batch normalization. All ternary variants perform within 0.73 percentage points of the 91.61% full-precision baseline, with the polynomial STE achieving the best result of 91.23%. All binary variants fall within 1.66 percentage points of the baseline, with the tanh STE performing best (90.35%). We find that the choice of STE has only a minor impact on final accuracy; however, STEs differ in training stability, with smoother estimators providing more consistent convergence.

Batch normalization had the greatest effect on performance: removing it reduced accuracy by up to 8.66 percentage points. Weight clipping yielded a smaller but consistent benefit, with an optimal clipping factor of f = 4.0, improving accuracy by 0.26 and 0.5 percentage points, respectively. Combining these findings, we identified effective training configurations for both ternary and binary networks: the optimal ternary setup (using Trained Ternary Quantization) achieved 91.52% accuracy on ResNet-20/CIFAR-10, while the optimal binary configuration (using XNOR-Net quantization) reached 90.78% accuracy, an improvement over prior baselines in both cases.

...

Transformer Inference using MAD vs LUT Kernels

A Comparative Benchmark of MAD and LUT Kernels for Binary and Ternary Dot Products on CPU and Edge Platforms

Bachelor thesis (2026) - M.B. Eren, B. Refalo, Q. Wang, I.M. Olkhovskaia

Quantizing Transformer weights to binary or ternary values reduces the inner product to sign manipulation and zero masking, prompting two competing CPU kernel strategies: multiply-add (MAD) and table lookup (LUT). Prior work reports end-to-end speedups but confounds the comparison across data layout, quantization format, and table depth simultaneously.
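The two strategies can be contrasted on a single binary dot product. The sketch below is a plain-Python illustration, not the thesis's AVX2 or NEON kernels; it assumes that "depth" means the number of activations grouped into one table index, and all function names are hypothetical.

```python
def dot_mad(weights, activations):
    """Multiply-add baseline: weights are +1/-1, so each product is a sign flip."""
    return sum(w * a for w, a in zip(weights, activations))


def build_tables(activations, depth):
    """For each group of `depth` activations, tabulate the partial sum for
    every possible sign pattern (bit 1 means +1, bit 0 means -1)."""
    tables = []
    for start in range(0, len(activations), depth):
        group = activations[start:start + depth]
        table = [
            sum(a if (pattern >> j) & 1 else -a for j, a in enumerate(group))
            for pattern in range(1 << len(group))
        ]
        tables.append(table)
    return tables


def pack_weights(weights, depth):
    """Pack each group of +1/-1 weights into an integer table index."""
    indices = []
    for start in range(0, len(weights), depth):
        group = weights[start:start + depth]
        indices.append(sum(1 << j for j, w in enumerate(group) if w > 0))
    return indices


def dot_lut(indices, tables):
    """Table-lookup dot product: one lookup and one add per group."""
    return sum(table[i] for i, table in zip(indices, tables))


weights = [1, -1, -1, 1, 1, 1, -1, 1]
activations = [3, 5, -2, 7, 1, -4, 6, 2]
for depth in (1, 2, 4):
    tables = build_tables(activations, depth)
    assert dot_lut(pack_weights(weights, depth), tables) == dot_mad(weights, activations)
```

In a matrix-vector product the tables depend only on the activations, so they are built once and reused for every weight row. Each extra unit of depth halves the number of lookups per row but doubles the size of each table, which is the trade-off a depth sweep exposes.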

This thesis isolates the trade-off by sweeping LUT depth as a single controlled variable, spanning matrix sizes across the cache hierarchy and attributing results through roofline analysis on an x86 platform with AVX2 and roofline plus hardware-counter analysis on an ARM edge platform with NEON. The LUT advantage proves conditional: binary throughput rises monotonically with depth to 104.4 GOPS, roughly 2.6 times the strongest MAD baseline, while ternary gains are narrower and erode once the table outgrows fast cache or forces a gather. Throughout, instruction throughput, not bandwidth, is the binding limit. ...
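For readers unfamiliar with roofline analysis, the basic model bounds attainable throughput by the lower of the compute peak and the product of memory bandwidth and arithmetic intensity. The sketch below uses hypothetical machine numbers chosen purely for illustration; they are not measurements from the thesis, and the function names are assumptions.

```python
def attainable_gops(peak_gops, bandwidth_gbytes_per_s, ops_per_byte):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_gops, bandwidth_gbytes_per_s * ops_per_byte)


def binding_limit(peak_gops, bandwidth_gbytes_per_s, ops_per_byte):
    """Report which roof caps the kernel at this arithmetic intensity."""
    if bandwidth_gbytes_per_s * ops_per_byte >= peak_gops:
        return "compute"
    return "bandwidth"


# Hypothetical machine: 100 GOPS peak, 20 GB/s memory bandwidth.
for intensity in (2.0, 8.0):
    print(intensity,
          attainable_gops(100.0, 20.0, intensity),
          binding_limit(100.0, 20.0, intensity))
```

A kernel whose measured throughput sits under the compute roof rather than the bandwidth slope is limited by how fast instructions issue, which is the situation the abstract reports.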