L. Du | TU Delft Repository

Efficient and trustworthy gaze estimation

Doctoral thesis (2026) - L. Du, K.G. Langendoen, G. Lan

Eye gaze contains rich information about human attention and cognitive processes. This makes gaze estimation, the underlying technology, a critical enabler for many applications, ranging from human-computer interaction to cognitive sensing systems. With the development of deep learning, appearance-based gaze estimation has emerged as a promising solution because it needs only general-purpose cameras, which makes it non-intrusive and cost-effective.

To build applications based on appearance-based gaze estimation, developers can choose among three paradigms. The first is to train gaze estimation models themselves, which allows developers to customize models to meet various application requirements. The second is to adopt pre-trained gaze estimation models, which avoids the resource-intensive process of model training. The third is to call gaze estimation services running in the cloud, which suits developers who wish to reduce the resource consumption of model deployment. In this case, the full-face images of users are sent to the service provider, which returns estimated gaze directions.

Although these paradigms give developers flexible options for building applications, each comes with distinct challenges that hinder widespread adoption. Training an accurate gaze estimation model requires large-scale gaze datasets and complex neural networks. The former are scarce and difficult to collect, while the latter demand substantial computational resources. Adopting pre-trained models removes the resource burden of model training, but exposes gaze estimation systems to backdoor attacks, in which an adversary injects a backdoor into the pre-trained model and manipulates its output with a visual trigger after deployment. This compromises the security of many gaze-based applications, e.g., by causing a driving assistant system to fail to track the driver’s attention. Lastly, calling gaze estimation services raises severe privacy concerns, because these services often operate as black boxes, leaving users unaware of how their face images, which contain sensitive attributes, are processed or used.

Taking these paradigms together, we observe that they either require substantial resources for model training or raise trustworthiness concerns due to the involvement of third parties. This motivates the main research question of this dissertation: “How can we make gaze estimation systems both resource-efficient and trustworthy?” This dissertation answers this question by addressing the challenges associated with each paradigm.

To reduce the resource burden of self-trained models, we present a resource-efficient framework that includes frequency-domain gaze estimation and gaze-aware contrastive learning. Frequency-domain gaze estimation exploits the feature extraction capability and the spectral compaction property of the discrete cosine transform to substantially reduce the computational cost of gaze estimation models. Meanwhile, gaze-aware contrastive learning learns gaze representations in an unsupervised manner to overcome the data labeling hurdle. We show that the proposed framework achieves gaze estimation performance comparable to existing approaches that rely on a large-scale, well-labeled dataset, while reducing inference latency by up to 1.67 times.
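The spectral compaction property mentioned above can be illustrated with a minimal sketch. It is not the framework from the dissertation, whose architecture is not described here; it only shows, on a synthetic smooth image, that most of the signal energy of a 2D discrete cosine transform sits in a small block of low-frequency coefficients. All function names and the test image are assumptions made for illustration.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] = np.sqrt(1.0 / n)  # DC row gets its own normalisation
    return basis


def dct2(image: np.ndarray) -> np.ndarray:
    """2D DCT-II: transform along rows and columns."""
    return dct_matrix(image.shape[0]) @ image @ dct_matrix(image.shape[1]).T


def idct2(coeffs: np.ndarray) -> np.ndarray:
    """Inverse of dct2 (the basis is orthonormal, so the inverse is the transpose)."""
    return dct_matrix(coeffs.shape[0]).T @ coeffs @ dct_matrix(coeffs.shape[1])


def low_frequency_energy_ratio(image: np.ndarray, keep: int) -> float:
    """Fraction of total energy held by the top-left keep x keep coefficients."""
    coeffs = dct2(image)
    return float(np.sum(coeffs[:keep, :keep] ** 2) / np.sum(coeffs ** 2))


def make_smooth_image(size: int = 64) -> np.ndarray:
    """Hypothetical stand-in for a face crop: a smooth Gaussian blob."""
    x = np.linspace(0.0, 1.0, size)
    return np.exp(-((x[:, None] - 0.5) ** 2 + (x[None, :] - 0.4) ** 2) / 0.05)


if __name__ == "__main__":
    img = make_smooth_image()
    # An 8 x 8 block is under 2% of the 64 x 64 coefficients.
    print("energy kept by 8x8 block:", low_frequency_energy_ratio(img, keep=8))
```

A model that consumes only such a low-frequency block sees a much smaller input than the raw pixels, which is one plausible route to the lower computational cost described above. Real face images are less smooth than this synthetic blob, so the retained energy would differ in practice.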

For pre-trained gaze estimation models, we explore solutions to defend against backdoor attacks. We identify the key characteristics that distinguish backdoored gaze estimation models from benign ones, and on that basis propose a novel approach to reverse-engineer the backdoor trigger that leads to the identified characteristics. Given a pre-trained model, we use the reverse-engineered trigger to determine whether the model is backdoored. If it is identified as compromised, we further use the reverse-engineered trigger to mitigate its backdoor behavior. We show that the proposed method can defend against various backdoor attacks.

To address privacy concerns in gaze estimation services, we develop a privacy preserver that converts privacy-sensitive full-face images into obfuscated images. The obfuscated versions are then shared with the service provider for gaze estimation. The privacy preserver is designed to generate obfuscated images that exhibit the same facial appearance for different users, which protects user privacy, while preserving the gaze features of the raw images so that gaze estimation remains accurate. Our experiments show that obfuscated images effectively protect user privacy while yielding gaze estimation performance comparable to that of the original images.

Overall, this dissertation contributes to the development of resource-efficient and trustworthy gaze estimation systems. We enhance the resource efficiency of self-trained models, which typically demand substantial resources, and improve the trustworthiness of the other two paradigms, where the resource burden is offloaded to external parties through pre-trained models or vendor-provided services.

Through the Eyes of Emotion

A Multi-faceted Eye Tracking Dataset for Emotion Recognition in Virtual Reality

Journal article (2025) - Tongyun Yang, Bishwas Regmi, Lingyu Du, Andreas Bulling, Xucong Zhang, Guohao Lan

Virtual Reality (VR) is transforming cognitive and psychological research by enabling immersive simulations that elicit authentic emotional responses. The high demand for VR-based emotion recognition is also evident in fields such as mental healthcare, education, and entertainment, where understanding users' emotional states can enhance user experience and system effectiveness. However, the lack of comprehensive datasets hinders progress in VR-based emotion recognition. In this paper, we present a comprehensive, multi-faceted eye-tracking dataset collected from 26 participants using 28 emotional video stimuli rendered in a custom virtual environment. Our dataset is the first to incorporate high-frame-rate periocular videos, capturing subtle motions, such as micro-expressions and eyebrow shifts, which are critical for emotion analysis. Additionally, it includes high-frequency eye-tracking data, offering gaze direction and pupil dynamics at four times the frequency of existing datasets. Our dataset is also unique in providing emotion annotations according to Ekman's emotion model, which enables experiments that are impossible with existing datasets. Our benchmark evaluations show that fusing the multi-faceted eye-tracking signals in our dataset significantly improves emotion recognition accuracy. As such, our work has the potential to significantly accelerate existing research on emotion-aware VR applications and to enable entirely new research.

SecureGaze: Defending Gaze Estimation Against Backdoor Attacks

Conference paper (2025) - L. Du, Yupei Liu, Jinyuan Jia, G. Lan

Gaze estimation models are widely used in applications such as driver attention monitoring and human-computer interaction. While many methods for gaze estimation exist, they rely heavily on data-hungry deep learning to achieve high performance. This reliance often forces practitioners to harvest training data from unverified public datasets, outsource model training, or rely on pre-trained models. However, such practices expose gaze estimation models to backdoor attacks. In such attacks, adversaries inject backdoor triggers by poisoning the training data, creating a backdoor vulnerability: the model performs normally with benign inputs, but produces manipulated gaze directions when a specific trigger is present. This compromises the security of many gaze-based applications, for example by causing the model to fail to track the driver's attention. To date, there is no defense that addresses backdoor attacks on gaze estimation models. In response, we introduce SecureGaze, the first solution designed to protect gaze estimation models from such attacks. Unlike defending classification models, defending gaze estimation models poses unique challenges due to their continuous output space and globally activated backdoor behavior. By identifying distinctive characteristics of backdoored gaze estimation models, we develop a novel and effective approach to reverse-engineer the trigger function for reliable backdoor detection. Extensive evaluations in both digital and physical worlds demonstrate that SecureGaze effectively counters a range of backdoor attacks and outperforms seven state-of-the-art defenses adapted from classification models.

PrivateGaze

Preserving User Privacy in Black-box Mobile Gaze Tracking Services

Journal article (2024) - Lingyu Du, Jinyuan Jia, Xucong Zhang, Guohao Lan

Eye gaze contains rich information about human attention and cognitive processes. This makes gaze tracking, the underlying technology, a critical enabler for many ubiquitous applications and has triggered the development of easy-to-use gaze estimation services. Indeed, by utilizing the ubiquitous cameras on tablets and smartphones, users can readily access many gaze estimation services. In using these services, users must provide their full-face images to the gaze estimator, which is often a black box. This poses significant privacy threats to the users, especially when a malicious service provider gathers a large collection of face images to classify sensitive user attributes. In this work, we present PrivateGaze, the first approach that can effectively preserve users’ privacy in black-box gaze tracking services without compromising gaze estimation performance. Specifically, we propose a novel framework to train a privacy preserver that converts full-face images into obfuscated counterparts, which are effective for gaze estimation while containing no privacy information. Evaluation on four datasets shows that the obfuscated images protect users’ private information, such as identity and gender, against unauthorized attribute classification. Meanwhile, when used directly as inputs by the black-box gaze estimator, the obfuscated images lead to tracking performance comparable to that of the conventional, unprotected full-face images.
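Gaze estimation performance of the kind compared above is commonly reported as the angular error between predicted and ground-truth gaze directions. The abstract does not state which metric was used, so the following is only a sketch of that common metric. The conversion from (pitch, yaw) to a 3D unit vector follows one widely used convention and is an assumption here; the function names are hypothetical.

```python
import numpy as np


def pitch_yaw_to_vector(pitch: float, yaw: float) -> np.ndarray:
    """Convert gaze angles in radians to a 3D unit vector (assumed convention)."""
    return np.array([
        -np.cos(pitch) * np.sin(yaw),
        -np.sin(pitch),
        -np.cos(pitch) * np.cos(yaw),
    ])


def angular_error_deg(pred, target) -> float:
    """Angle in degrees between two gaze directions given as (pitch, yaw) pairs."""
    a = pitch_yaw_to_vector(*pred)
    b = pitch_yaw_to_vector(*target)
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))


if __name__ == "__main__":
    # Compare a prediction on an original image with one on an obfuscated image
    # against the same ground truth (all values are made up for illustration).
    truth = (0.10, -0.05)
    print("original:  ", angular_error_deg((0.12, -0.04), truth))
    print("obfuscated:", angular_error_deg((0.13, -0.07), truth))
```

Under this metric, "comparable performance" would mean that the mean angular error on obfuscated images stays close to the mean angular error on the original images.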