X. Liu | TU Delft Repository

Efficiency in Deep Learning

Image and Video Deep Model Efficiency

Doctoral thesis (2024) - X. Liu, M.J.T. Reinders, J.C. van Gemert, S. Pintea

Deep learning is the core algorithmic tool for automatically processing large amounts of data. Deep learning models are defined as a stack of functions (called layers) with millions of parameters that are updated during training by fitting them to data. Deep learning models have shown remarkable accuracy gains on visual problems in video and images. Yet this comes at a considerable computational cost, which raises concerns about energy consumption. The escalation in the number of parameters and the surging demand for extensive data exacerbate these concerns. This thesis delves into the core of these concerns, proposing innovative techniques to enhance the efficiency of deep learning models. This thesis starts with exploring efficient deep learning models for video data, followed by efficient models for image data. ...

Video BagNet

Short temporal receptive fields increase robustness in long-term action recognition

Conference paper (2023) - Ombretta Strafforello, Xin Liu, Klamer Schutte, Jan van Gemert

Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model to encode the exact sub-action order of a video, which causes a performance decrease when test videos have a different sub-action order. In this work, we investigate whether we can improve model robustness to the sub-action order by shrinking the temporal receptive field of action recognition models. For this, we design Video BagNet, a variant of the 3D ResNet-50 model with the temporal receptive field size limited to 1, 9, 17 or 33 frames. We analyze Video BagNet on synthetic and real-world video datasets and experimentally compare models with varying temporal receptive fields. We find that short receptive fields are robust to sub-action order changes, while larger temporal receptive fields are sensitive to the sub-action order. ...
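To make the notion of a temporal receptive field concrete, the sketch below computes how many input frames influence one output unit of a stack of temporal convolutions, using the standard receptive-field recurrence. The function name and the layer lists are illustrative assumptions; they are not the Video BagNet layer configuration, which the abstract does not specify.

```python
def temporal_receptive_field(layers):
    """Return the temporal receptive field (in frames) of stacked convolutions.

    `layers` is a list of (temporal_kernel_size, temporal_stride) pairs,
    ordered from the input to the output. This format is a hypothetical
    convention for this sketch, not taken from the paper.
    """
    rf = 1    # a single output unit initially sees one frame
    jump = 1  # distance, in input frames, between adjacent units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field
        jump *= stride             # striding spreads later layers further
    return rf


# Hypothetical stacks: layers with temporal kernel 1 do not widen the field,
# so only layers with temporal kernel 3 are listed here.
for depth in (0, 4, 8, 16):
    print(depth, temporal_receptive_field([(3, 1)] * depth))
```

Under these assumptions, 0, 4, 8 and 16 stride-1 layers with temporal kernel 3 give receptive fields of 1, 9, 17 and 33 frames, which matches the sizes listed in the abstract. Other combinations of kernels and strides can reach the same values.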

Objects do not disappear

Video object detection by single-frame object location anticipation

Conference paper (2023) - Xin Liu, Jan C. van Gemert, Fatemeh Karimi Nejadasl, Olaf Booij, Silvia L. Pintea

Objects in videos are typically characterized by continuous smooth motion. We exploit continuous smooth motion in three ways. 1) Improved accuracy by using object motion as an additional source of supervision, which we obtain by anticipating object locations from a static keyframe. 2) Improved efficiency by only doing the expensive feature computations on a small subset of all frames. Because neighboring video frames are often redundant, we only compute features for a single static keyframe and predict object locations in subsequent frames. 3) Reduced annotation cost, where we only annotate the keyframe and use smooth pseudo-motion between keyframes. We demonstrate computational efficiency, annotation efficiency, and improved mean average precision compared to the state-of-the-art on four datasets: ImageNet VID, EPIC KITCHENS-55, YouTube-BoundingBoxes and Waymo Open dataset. Our source code is available at https://github.com/L-KID/Video-object-detection-by-location-anticipation. ...
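The third point, smooth pseudo-motion between annotated keyframes, can be illustrated with a minimal sketch. It assumes the simplest possible form of smooth motion, linear interpolation of box coordinates between two keyframes; the abstract does not state which motion model the authors use, and the function name and the (x1, y1, x2, y2) box format are hypothetical.

```python
def interpolate_boxes(box_start, box_end, num_frames):
    """Linearly interpolate a bounding box across `num_frames` frames.

    `box_start` and `box_end` are (x1, y1, x2, y2) boxes on two keyframes.
    The returned list includes both keyframes as its first and last entry.
    Linear interpolation is an assumption made for this sketch.
    """
    if num_frames < 2:
        raise ValueError("num_frames must cover both keyframes (>= 2)")
    boxes = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first keyframe, 1.0 at the last
        boxes.append(tuple((1 - t) * a + t * b
                           for a, b in zip(box_start, box_end)))
    return boxes


# Two annotated keyframes, five frames in total (three are filled in).
for box in interpolate_boxes((0, 0, 10, 10), (4, 8, 14, 18), 5):
    print(box)
```

Only the two keyframe boxes need a human annotation here; the intermediate boxes serve as pseudo-labels.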

No frame left behind

Full Video Action Recognition

Conference paper (2021) - Xin Liu, Silvia L. Pintea, Fatemeh Karimi Nejadasl, Olaf Booij, Jan C. van Gemert

Not all video frames are equally informative for recognizing an action. It is computationally infeasible to train deep networks on all video frames when actions develop over hundreds of frames. A common heuristic is uniformly sampling a small number of video frames and using these to recognize the action. Instead, here we propose full video action recognition and consider all video frames. To make this computationally tractable, we first cluster all frame activations along the temporal dimension based on their similarity with respect to the classification task, and then temporally aggregate the frames in the clusters into a smaller number of representations. Our method is end-to-end trainable and computationally efficient as it relies on temporally localized clustering in combination with fast Hamming distances in feature space. We evaluate on UCF101, HMDB51, Breakfast, and Something-Something V1 and V2, where we compare favorably to existing heuristic frame sampling methods. ...
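The idea of temporally localized clustering with Hamming distances can be sketched in a simplified form. The code below is not the authors' method: it swaps in a greedy rule that binarizes each frame feature by sign, merges consecutive frames whose binary codes are close in Hamming distance, and averages each cluster. All function names and the threshold parameter are assumptions for illustration, and the sketch omits the task-dependent similarity and end-to-end training described in the abstract.

```python
def binarize(feature):
    """Map a real-valued feature vector to a binary code by sign."""
    return [1 if value > 0 else 0 for value in feature]


def hamming(code_a, code_b):
    """Count the positions where two equal-length binary codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def cluster_and_aggregate(features, max_distance):
    """Greedily group consecutive frames and average each group.

    A frame joins the current cluster if its code is within `max_distance`
    of the code of the cluster's first frame; otherwise it starts a new one.
    """
    clusters = []
    for feature in features:
        code = binarize(feature)
        if clusters and hamming(clusters[-1]["code"], code) <= max_distance:
            clusters[-1]["members"].append(feature)
        else:
            clusters.append({"code": code, "members": [feature]})
    # Aggregate each cluster into a single representation (mean per dimension).
    return [[sum(column) / len(column) for column in zip(*c["members"])]
            for c in clusters]


frames = [[1.0, -1.0], [3.0, -3.0], [-1.0, 1.0], [-2.0, 2.0]]
print(cluster_and_aggregate(frames, max_distance=0))
```

In this toy input, four frame features collapse into two representations, so later layers process fewer items while every frame still contributes.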

Cross Domain Image Matching in Presence of Outliers

Conference paper (2019) - Xin Liu, Seyran Khademi, Jan C. van Gemert

Cross domain image matching between image collections from different source and target domains is challenging for deep learning methods due to i) limited variation of image conditions in the training set, ii) a lack of paired-image labels during training, and iii) the existence of outliers, which means the domains to be matched do not fully overlap. To this end, we propose an end-to-end architecture that can match cross domain images without labels in the target domain and handle non-overlapping domains by outlier detection. We leverage domain adaptation and triplet constraints to train a network capable of learning domain-invariant and identity-distinguishable representations, and of iteratively detecting the outliers with an entropy loss and our proposed weighted MK-MMD. Extensive experimental evidence on the Office [17] dataset and our proposed datasets Shape and Pitts-CycleGAN shows that the proposed approach yields state-of-the-art cross domain image matching and outlier detection performance on different benchmarks. The code will be made publicly available. ...
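For readers unfamiliar with MK-MMD (multi-kernel maximum mean discrepancy), the sketch below computes a plain, unweighted, biased estimate of the squared MMD between two feature sets, averaged over several Gaussian kernels. It does not implement the weighted variant proposed in the paper, whose weighting scheme the abstract does not describe; the function names and bandwidth values are assumptions.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * bandwidth ** 2))


def mk_mmd_squared(source, target, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of squared MMD with an equal-weight kernel mixture.

    `source` and `target` are non-empty lists of feature vectors.
    The bandwidths are arbitrary illustrative values.
    """
    def mean_kernel(xs, ys):
        total = 0.0
        for x in xs:
            for y in ys:
                total += sum(gaussian_kernel(x, y, b)
                             for b in bandwidths) / len(bandwidths)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2 * mean_kernel(source, target))


source = [[0.0, 0.0], [0.1, 0.0]]
near = [[0.0, 0.1], [0.1, 0.1]]
far = [[3.0, 3.0], [3.1, 3.0]]
print(mk_mmd_squared(source, near), mk_mmd_squared(source, far))
```

A small value indicates that the two feature distributions are similar, which is why a discrepancy of this kind can serve as a domain adaptation loss.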