Dynamic Patch Focus for Transformer-based Gaze Estimation

Master Thesis (2025)
Author(s)

D. Sochirca (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

X. Zhang – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Nergis Tömen – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

H. Wang – Graduation committee member (TU Delft - Multimedia Computing)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
08-07-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Appearance-based 3D gaze estimation must accommodate two conflicting needs: fine ocular detail and global facial context. Vanilla Vision Transformers (ViTs) struggle with both needs due to their fixed 16 × 16 patch grid, which (i) fragments critical features such as the eyes across multiple patches, and (ii) floods the self-attention mechanism with redundant information from the forehead, cheeks and background. We introduce FocusViT, a lightweight, end-to-end differentiable framework that enhances ViTs by first using a Patch Translation Module, based on Spatial Transformer Networks, to dynamically translate patches so they center on content, and then employing a Perturbed Top-K operator to select only the most informative tokens for processing.
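The sketch below illustrates, in PyTorch, the two mechanisms named in the abstract: a translation-only spatial transformer that re-samples the image so patches center on local content, and a perturbed top-k operator that makes token selection differentiable by averaging hard top-k masks over noisy scores. All module names, shapes, and hyperparameters here are illustrative assumptions, not the thesis implementation.

```python
# Hypothetical sketch of the two FocusViT mechanisms described in the abstract.
# Names, shapes, and the Monte-Carlo top-k approximation are assumptions for
# illustration only, not the author's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchTranslationModule(nn.Module):
    """Predicts a per-patch (dx, dy) offset and re-samples the image so each
    patch centers on local content (a Spatial Transformer restricted to translation)."""

    def __init__(self, patch_size=16, hidden_dim=64):
        super().__init__()
        self.patch_size = patch_size
        self.offset_head = nn.Sequential(
            nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, 2, kernel_size=1),  # one (dx, dy) per patch
        )

    def forward(self, img):
        B, _, H, W = img.shape
        # Per-patch offsets in [-1, 1], later scaled to a fraction of a patch.
        offsets = torch.tanh(self.offset_head(img))  # (B, 2, Hp, Wp)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=img.device),
            torch.linspace(-1, 1, W, device=img.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2)
        # Broadcast each patch's offset to its pixels and shift the sampling grid.
        shift = F.interpolate(offsets, size=(H, W), mode="nearest").permute(0, 2, 3, 1)
        shift = shift * (self.patch_size / H)  # keep the shift within roughly a patch
        return F.grid_sample(img, base + shift, align_corners=False)


def perturbed_topk(scores, k, sigma=0.05, n_samples=100):
    """Differentiable top-k in expectation: average hard top-k indicator masks
    computed on Gaussian-perturbed scores (a simplified stand-in for the
    Perturbed Top-K operator used for token selection)."""
    noise = sigma * torch.randn(n_samples, *scores.shape, device=scores.device)
    perturbed = scores.unsqueeze(0) + noise                    # (S, B, N)
    topk_idx = perturbed.topk(k, dim=-1).indices
    hard = torch.zeros_like(perturbed).scatter_(-1, topk_idx, 1.0)
    return hard.mean(dim=0)                                    # soft selection mask (B, N)
```

In such a setup, the soft mask would weight tokens during training (keeping gradients for the scoring network), while at inference a plain hard top-k over the scores would drop the discarded tokens entirely, which is where the token-count savings come from.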

Our experiments show that combining translation and selection reduces the mean angular error (MAE) of a ViT-S baseline on ETH-XGaze from 4.98° to 4.61° while using 75% fewer tokens. Furthermore, by leveraging this token reduction to enable a finer-grained, lossless 8 × 8 patch grid, we address a key information bottleneck in the ViT-S architecture, achieving a final MAE of 4.42°. The framework also demonstrates consistent improvements on the MPIIFaceGaze dataset, reducing the baseline error from 5.72° to 5.36°. Extensive ablation studies confirm our central finding: patch translation and token selection are complementary mechanisms that work in synergy to improve model performance.

Files

Paper_sochirca.pdf
(PDF | 2.09 MB)
License info not available