FocusViT
Dynamic Patch Focus for Transformer-Based Gaze Estimation
Dan Sochirca (Student TU Delft)
Jouh Yeong Chew (Honda Research Institute (HRI))
Xucong Zhang (TU Delft - Pattern Recognition and Bioinformatics)
Abstract
Eye gaze is an important signal for a robot to understand the attention of its human user. Accordingly, many advanced model architectures have been developed for the gaze estimation task, including the recent Vision Transformer (ViT). However, because of the rigid patch-grid input, vanilla ViTs break fine ocular details across patch boundaries and flood the model with redundant information from the forehead, cheeks, and background. In this paper, we introduce FocusViT, a lightweight, end-to-end differentiable framework that adapts the ViT to the gaze estimation task. It uses a Patch Translation Module to dynamically shift patches onto informative content, and then employs a Perturbed Top-K operator to select only the most informative patches for further processing. In this way, the proposed method efficiently uses the most informative patches of the full-face image for gaze estimation. Our experiments show that combining patch translation and selection reduces the gaze angular error of the ViT model on both the ETH-XGaze and MPIIFaceGaze datasets. Extensive ablation studies confirm that patch translation and token selection are complementary mechanisms that work in synergy to improve model performance.
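To make the selection step concrete, below is a minimal PyTorch sketch of a perturbed Top-K operator in the spirit of perturbed-optimizer estimators (Berthet et al., 2020; Cordonnier et al., 2021). The class name, noise scale, and sample count are illustrative assumptions, not the paper's implementation: hard top-k indicator matrices are averaged over Gaussian-perturbed copies of the patch scores, and the backward pass uses the perturbed-optimizer Jacobian estimate so gradients reach the scoring network.

```python
import torch
import torch.nn.functional as F

class PerturbedTopK(torch.autograd.Function):
    """Differentiable top-k patch selection via Gaussian perturbations.

    Forward: average hard top-k indicator matrices over noisy copies of
    the scores. Backward: Monte Carlo estimate E[indicator * noise^T]/sigma
    of the Jacobian, so gradients flow to the patch-scoring head.
    Hyperparameters below are illustrative, not the authors' settings.
    """

    @staticmethod
    def forward(ctx, scores, k, num_samples=100, sigma=0.05):
        b, n = scores.shape                              # (batch, num_patches)
        noise = torch.randn(num_samples, b, n, device=scores.device)
        perturbed = scores.unsqueeze(0) + sigma * noise  # (s, b, n)
        topk_idx = perturbed.topk(k, dim=-1).indices     # (s, b, k)
        indicators = F.one_hot(topk_idx, n).float()      # (s, b, k, n)
        ctx.save_for_backward(indicators, noise)
        ctx.sigma = sigma
        return indicators.mean(dim=0)                    # (b, k, n) soft selection

    @staticmethod
    def backward(ctx, grad_output):
        indicators, noise = ctx.saved_tensors
        s = noise.shape[0]
        # Jacobian estimate of shape (b, k, n, n), then chain rule to scores.
        jac = torch.einsum('sbkn,sbm->bknm', indicators, noise) / (s * ctx.sigma)
        grad_scores = torch.einsum('bkn,bknm->bm', grad_output, jac)
        return grad_scores, None, None, None

# Usage sketch: score each ViT patch token, then gather a soft subset.
b, n, d, k = 2, 196, 768, 32
tokens = torch.randn(b, n, d)                   # patch embeddings
scores = torch.randn(b, n, requires_grad=True)  # from a small scoring head
selection = PerturbedTopK.apply(scores, k)      # (b, k, n)
selected = selection @ tokens                   # (b, k, d) selected patches
selected.sum().backward()                       # gradients reach `scores`
```

Estimators of this kind typically replace the Monte Carlo average with a plain hard top-k at inference time, since no gradient is needed there.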