FocusViT

Dynamic Patch Focus for Transformer-Based Gaze Estimation

Journal Article (2026)
Author(s)

Dan Sochirca (Student TU Delft)

Jouh Yeong Chew (Honda Research Institute (HRI))

Xucong Zhang (TU Delft - Pattern Recognition and Bioinformatics)

Research Group
Pattern Recognition and Bioinformatics
DOI
https://doi.org/10.1080/01691864.2026.2642636 (final published version)
Publication Year
2026
Language
English
Journal title
Advanced Robotics
Article number
2642636
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Eye gaze is an important signal for a robot to understand the attention of its human user. Accordingly, multiple advanced model architectures have been developed for the gaze estimation task, including the recent vision transformer (ViT). However, due to their fixed patch-grid input, vanilla ViTs break fine ocular details across different patches and flood the model with redundant information from the forehead, cheeks, and background. In this paper, we introduce FocusViT, a lightweight, end-to-end differentiable framework that adapts the ViT to the gaze estimation task. It uses a Patch Translation Module to dynamically translate patches onto informative content, and then employs a Perturbed Top-K operator to select only the most informative patches for processing. In this way, the proposed method can efficiently use the most informative patches from the full-face image for gaze estimation. Our experiments show that combining patch translation and selection reduces the gaze angular error of the ViT model on both the ETH-XGaze and MPIIFaceGaze datasets. Extensive ablation studies confirm that patch translation and token selection are complementary mechanisms that work in synergy to improve model performance.
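The Perturbed Top-K operator mentioned in the abstract refers to a general technique for making discrete top-k selection differentiable: hard top-k masks are averaged over noise-perturbed scores, yielding a smooth relaxation whose gradient can flow back to the score network. The abstract gives no implementation details, so the NumPy sketch below is an illustrative assumption of that general idea, not the authors' code; the function and parameter names (`perturbed_topk`, `sigma`, `num_samples`) are hypothetical.

```python
import numpy as np

def perturbed_topk(scores, k, num_samples=100, sigma=0.05, rng=None):
    """Monte-Carlo estimate of a smoothed top-k indicator.

    Adds Gaussian noise to the patch scores, takes a hard top-k mask for
    each noisy sample, and averages the masks. The result is a vector of
    soft selection weights in [0, 1] that sums to k.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n = scores.shape[-1]
    # (num_samples, n): each row is one noisy copy of the scores.
    perturbed = scores[None, :] + rng.normal(scale=sigma, size=(num_samples, n))
    # Indices of the k largest entries in each noisy sample.
    topk_idx = np.argsort(-perturbed, axis=-1)[:, :k]
    # Hard 0/1 masks for each sample, then average into soft weights.
    masks = np.zeros((num_samples, n))
    np.put_along_axis(masks, topk_idx, 1.0, axis=-1)
    return masks.mean(axis=0)
```

At inference time one would simply take the hard top-k of the scores; the smoothing only matters during training, where the averaged mask lets patches near the selection boundary receive partial weight.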