Dynamic Patch Focus for Transformer-based Gaze Estimation

Master Thesis (2025)
Author(s)

D. Sochirca (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

X. Zhang – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Nergis Tömen – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

H. Wang – Graduation committee member (TU Delft - Multimedia Computing)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
08-07-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Appearance-based 3D gaze estimation must accommodate two conflicting needs: fine ocular detail and global facial context. Vanilla Vision Transformers (ViTs) struggle with both needs due to their fixed 16 × 16 patch grid, which (i) fragments critical features such as the eyes across multiple patches, and (ii) floods the self-attention mechanism with redundant information from the forehead, cheeks and background. We introduce FocusViT, a lightweight, end-to-end differentiable framework that enhances ViTs by first using a Patch Translation Module, based on Spatial Transformer Networks, to dynamically translate patches so they center on content, and then employing a Perturbed Top-K operator to select only the most informative tokens for processing.
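The sketch below illustrates, in PyTorch, the two mechanisms named in the abstract: a translation-only spatial transformer that re-samples the image so patches center on local content, and a perturbed top-k operator that makes token selection differentiable by averaging hard top-k masks over noisy scores. All module names, shapes, and hyperparameters here are illustrative assumptions, not the thesis implementation.

```python
# Hypothetical sketch of the two FocusViT mechanisms described in the abstract.
# Names, shapes, and the Monte-Carlo top-k approximation are assumptions for
# illustration only, not the author's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchTranslationModule(nn.Module):
    """Predicts a per-patch (dx, dy) offset and re-samples the image so each
    patch centers on local content (a Spatial Transformer restricted to translation)."""

    def __init__(self, patch_size=16, hidden_dim=64):
        super().__init__()
        self.patch_size = patch_size
        self.offset_head = nn.Sequential(
            nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, 2, kernel_size=1),  # one (dx, dy) per patch
        )

    def forward(self, img):
        B, _, H, W = img.shape
        # Per-patch offsets in [-1, 1], later scaled to a fraction of a patch.
        offsets = torch.tanh(self.offset_head(img))  # (B, 2, Hp, Wp)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=img.device),
            torch.linspace(-1, 1, W, device=img.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2)
        # Broadcast each patch's offset to its pixels and shift the sampling grid.
        shift = F.interpolate(offsets, size=(H, W), mode="nearest").permute(0, 2, 3, 1)
        shift = shift * (self.patch_size / H)  # keep the shift within roughly a patch
        return F.grid_sample(img, base + shift, align_corners=False)


def perturbed_topk(scores, k, sigma=0.05, n_samples=100):
    """Differentiable top-k in expectation: average hard top-k indicator masks
    computed on Gaussian-perturbed scores (a simplified stand-in for the
    Perturbed Top-K operator used for token selection)."""
    noise = sigma * torch.randn(n_samples, *scores.shape, device=scores.device)
    perturbed = scores.unsqueeze(0) + noise                    # (S, B, N)
    topk_idx = perturbed.topk(k, dim=-1).indices
    hard = torch.zeros_like(perturbed).scatter_(-1, topk_idx, 1.0)
    return hard.mean(dim=0)                                    # soft selection mask (B, N)
```

In such a setup, the soft mask would weight tokens during training (keeping gradients for the scoring network), while at inference a plain hard top-k over the scores would drop the discarded tokens entirely, which is where the token-count savings come from.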

Our experiments show that combining translation and selection reduces the mean angular error (MAE) of a ViT-S baseline on ETH-XGaze from 4.98° to 4.61° while using 75% fewer tokens. Furthermore, by leveraging this token reduction to enable a finer-grained, lossless 8 × 8 patch grid, we address a key information bottleneck in the ViT-S architecture, achieving a final MAE of 4.42°. The framework also demonstrates consistent improvements on the MPIIFaceGaze dataset, reducing the baseline error from 5.72° to 5.36°. Extensive ablation studies confirm our central finding: patch translation and token selection are complementary mechanisms that work in synergy to improve model performance.

Files

Paper_sochirca.pdf
(PDF | 2.09 MB)
License info not available