Dynamic Patch Focus for Transformer-based Gaze Estimation
D. Sochirca (TU Delft - Electrical Engineering, Mathematics and Computer Science)
X. Zhang – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Nergis Tömen – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
H. Wang – Graduation committee member (TU Delft - Multimedia Computing)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Appearance-based 3D gaze estimation must accommodate two conflicting needs: fine ocular detail and global facial context. Vanilla Vision Transformers (ViTs) struggle with both because their fixed 16 × 16 patch grid (i) fragments critical features such as the eyes across multiple patches and (ii) floods the self-attention mechanism with redundant information from the forehead, cheeks, and background. We introduce FocusViT, a lightweight, end-to-end differentiable framework that enhances ViTs in two steps: a Patch Translation Module, based on Spatial Transformer Networks, first dynamically translates patches so that they center on content, and a Perturbed Top-K operator then selects only the most informative tokens for processing.
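As an illustration of the token selection step, the following is a minimal NumPy sketch of the forward pass of a perturbed top-k operator. It is not the FocusViT implementation: the function name `perturbed_top_k`, the parameters `num_samples` and `sigma`, and the example scores are all assumptions made for this sketch, and the gradient estimator that makes the operator differentiable in training is omitted.

```python
import numpy as np

def perturbed_top_k(scores, k, num_samples=500, sigma=0.05, rng=None):
    """Forward pass of a perturbed top-k selection (illustrative sketch).

    scores: (n,) array with one relevance score per token.
    Returns a (k, n) array of soft selection indicators; each row sums to 1.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n = scores.shape[0]
    # Add Gaussian noise to the scores, independently for every sample.
    noise = rng.standard_normal((num_samples, n))
    perturbed = scores[None, :] + sigma * noise
    # Hard top-k per sample; sort the chosen indices to keep token order.
    top_idx = np.argsort(-perturbed, axis=1)[:, :k]
    top_idx = np.sort(top_idx, axis=1)
    # One-hot encode each selection and average over the noise samples.
    one_hot = np.eye(n)[top_idx]          # (num_samples, k, n)
    return one_hot.mean(axis=0)           # (k, n)

# Hypothetical scores for six tokens; keep the three most informative.
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7])
indicators = perturbed_top_k(scores, k=3)
tokens = np.arange(6 * 4, dtype=float).reshape(6, 4)   # dummy embeddings
selected = indicators @ tokens                          # (3, 4)
```

Multiplying the indicator matrix by the token embeddings gives a weighted combination per selected slot. When the scores are well separated relative to `sigma`, this approaches a hard selection of the top-scoring tokens.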
Our experiments show that combining translation and selection reduces the mean angular error (MAE) of a ViT-S baseline on ETH-XGaze from 4.98° to 4.61° while using 75% fewer tokens. Furthermore, by leveraging this token reduction to enable a finer-grained, lossless 8 × 8 patch grid, we address a key information bottleneck in the ViT-S architecture, achieving a final MAE of 4.42°. The framework also demonstrates consistent improvements on the MPIIFaceGaze dataset, reducing the baseline error from 5.72° to 5.36°. Extensive ablation studies confirm our central finding: patch translation and token selection are complementary mechanisms that work in synergy to improve model performance.
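For readers unfamiliar with the metric, the angular error between a predicted and a ground-truth 3D gaze direction is commonly computed as the angle between the two vectors, then averaged over the test set. The sketch below assumes gaze is represented as 3D direction vectors; the function name `angular_error_deg` and the sample vectors are hypothetical and do not come from the thesis.

```python
import numpy as np

def angular_error_deg(pred, target):
    """Angle in degrees between predicted and true 3D gaze vectors.

    pred, target: arrays of shape (..., 3); they need not be unit length.
    """
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    # Clip guards against values slightly outside [-1, 1] from rounding.
    cos = np.clip(np.sum(pred * target, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

# Two made-up samples: a perfect prediction and one that is 90 degrees off.
pred = np.array([[0.0, 0.0, -1.0], [1.0, 0.0, 0.0]])
target = np.array([[0.0, 0.0, -2.0], [0.0, 1.0, 0.0]])
errors = angular_error_deg(pred, target)
mean_angular_error = errors.mean()
```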