S. Pintea | TU Delft Repository

A step towards understanding why classification helps regression

Conference paper (2023) - Silvia L. Pintea, Yancong Lin, Jouke Dijkstra, Jan C. van Gemert

A number of computer vision deep regression approaches report improved results when adding a classification loss to the regression loss. Here, we explore why this is useful in practice and when it is beneficial. To do so, we start from precisely controlled dataset variations and data samplings and find that the effect of adding a classification loss is the most pronounced for regression with imbalanced data. We explain these empirical findings by formalizing the relation between the balanced and imbalanced regression losses. Finally, we show that our findings hold on two real imbalanced image datasets for depth estimation (NYUD2-DIR), and age estimation (IMDB-WIKI-DIR), and on the problem of imbalanced video progress prediction (Breakfast). Our main takeaway is: for a regression task, if the data sampling is imbalanced, then add a classification loss. ...
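
The takeaway above can be sketched as a combined objective. This is a minimal illustration, not the paper's formulation: it assumes the continuous target is discretized into equal-width bins, and the names `bin_index`, `cross_entropy`, `combined_loss` and the parameter `weight` are hypothetical.

```python
import math

def bin_index(y, lo, hi, num_bins):
    """Map a continuous target in [lo, hi] to one of num_bins equal-width classes (assumed binning)."""
    idx = int((y - lo) / (hi - lo) * num_bins)
    return min(max(idx, 0), num_bins - 1)

def cross_entropy(logits, target_idx):
    """Softmax cross-entropy, computed with the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_idx]

def combined_loss(y_pred, logits, y_true, lo=0.0, hi=1.0, weight=1.0):
    """Squared regression error plus a weighted classification loss on the binned target."""
    mse = (y_pred - y_true) ** 2
    ce = cross_entropy(logits, bin_index(y_true, lo, hi, len(logits)))
    return mse + weight * ce
```

Setting `weight=0.0` recovers the plain regression loss, which makes it easy to compare the two objectives on balanced and imbalanced samplings.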

Is there progress in activity progress prediction?

Conference paper (2023) - Frans de Boer, Jan C. van Gemert, Jouke Dijkstra, Silvia L. Pintea

Activity progress prediction aims to estimate what percentage of an activity has been completed. Currently this is done with machine learning approaches, trained and evaluated on complicated and realistic video datasets. The videos in these datasets vary drastically in length and appearance, and some of the activities have unanticipated developments, making activity progression difficult to estimate. In this work, we examine the results obtained by existing progress prediction methods on these datasets. We find that current progress prediction methods seem not to extract useful visual information for the progress prediction task. Therefore, these methods fail to exceed simple frame-counting baselines. We design a precisely controlled dataset for activity progress prediction and on this synthetic dataset we show that the considered methods can make use of the visual information, when this directly relates to the progress prediction. We conclude that the progress prediction task is ill-posed on the currently used real-world datasets. Moreover, to fairly measure activity progression, we advise considering a simple but effective frame-counting baseline. ...
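
A frame-counting baseline uses no visual input at all. The sketch below is one plausible variant, assumed for illustration: it divides the current frame index by the mean training-video length and clips at 1. The abstract does not specify the exact baseline, and the function name is hypothetical.

```python
def frame_counting_progress(frame_index, train_lengths):
    """Predict progress from the frame index alone, ignoring the video content.

    Assumption: progress is the frame index divided by the average length of
    the training videos, clipped to at most 1.0.
    """
    mean_length = sum(train_lengths) / len(train_lengths)
    return min(1.0, frame_index / mean_length)
```

For example, with training videos of 100 frames each, frame 50 is predicted to be at 50% progress regardless of what the frame shows.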

Objects do not disappear

Video object detection by single-frame object location anticipation

Conference paper (2023) - Xin Liu, Jan C. van Gemert, Fatemeh Karimi Nejadasl, Olaf Booij, Silvia L. Pintea

Objects in videos are typically characterized by continuous smooth motion. We exploit continuous smooth motion in three ways. 1) Improved accuracy by using object motion as an additional source of supervision, which we obtain by anticipating object locations from a static keyframe. 2) Improved efficiency by only doing the expensive feature computations on a small subset of all frames. Because neighboring video frames are often redundant, we only compute features for a single static keyframe and predict object locations in subsequent frames. 3) Reduced annotation cost, where we only annotate the keyframe and use smooth pseudo-motion between keyframes. We demonstrate computational efficiency, annotation efficiency, and improved mean average precision compared to the state-of-the-art on four datasets: ImageNet VID, EPIC KITCHENS-55, YouTube-BoundingBoxes and Waymo Open dataset. Our source code is available at https://github.com/L-KID/Video-object-detection-by-location-anticipation. ...

Deep Vanishing Point Detection

Geometric priors make dataset variations vanish

Conference paper (2022) - Yancong Lin, Ruben Wiersma, Silvia L. Pintea, Klaus Hildebrandt, Elmar Eisemann, Jan C. van Gemert

Deep learning has improved vanishing point detection in images. Yet, deep networks require expensive annotated datasets and costly hardware for training, and do not generalize to even slightly different domains or minor problem variants. Here, we address these issues by injecting deep vanishing point detection networks with prior knowledge. This prior knowledge no longer needs to be learned from data, saving valuable annotation efforts and compute, unlocking realistic few-sample scenarios, and reducing the impact of domain changes. Moreover, the interpretability of the priors makes it possible to adapt deep networks to minor problem variations such as switching between Manhattan and non-Manhattan worlds. We seamlessly incorporate two geometric priors: (i) Hough Transform -- mapping image pixels to straight lines, and (ii) Gaussian sphere -- mapping lines to great circles whose intersections denote vanishing points. Experimentally, we ablate our choices and show comparable accuracy to existing models in the large-data setting. We validate our model's improved data efficiency, robustness to domain changes, and adaptability to non-Manhattan settings. ...
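
The Gaussian-sphere prior (ii) can be illustrated with plain vector algebra. The sketch assumes image coordinates measured relative to the principal point and a known focal length `focal`. Each image line then spans a plane through the camera centre, whose normal defines a great circle, and two great circles intersect along the cross product of their normals. All function names are hypothetical, and this is the classical geometry only, not the paper's network.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def great_circle_normal(p, q, focal=1.0):
    """Normal of the plane through the camera centre and the image line through p and q."""
    return cross((p[0], p[1], focal), (q[0], q[1], focal))

def vanishing_direction(line_a, line_b, focal=1.0):
    """Unit direction where the great circles of two image lines intersect."""
    n1 = great_circle_normal(line_a[0], line_a[1], focal)
    n2 = great_circle_normal(line_b[0], line_b[1], focal)
    v = cross(n1, n2)
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("the two lines map to the same great circle")
    return tuple(c / norm for c in v)
```

Dividing the first two components of the result by the third (times `focal`) gives the image-plane intersection of the two lines; a near-zero third component corresponds to a vanishing point at infinity in the image.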

Seismic inversion with deep learning

A proposal for litho-type classification

Journal article (2021) - Silvia L. Pintea, Siddharth Sharma, Femke C. Vossepoel, Jan C. van Gemert, Marco Loog, Dirk J. Verschuur

This article investigates bypassing the inversion steps involved in a standard litho-type classification pipeline and performing the litho-type classification directly from imaged seismic data. We consider a set of deep learning methods that map the seismic data directly into litho-type classes, trained on two variants of synthetic seismic data: (i) one in which we image the seismic data using a local Radon transform to obtain angle gathers, and (ii) another in which we start from the subsurface-offset gathers, based on correlations over the seismic data. Our results indicate that this single-step approach provides a faster alternative to the established pipeline while being convincingly accurate. We observe that adding the background model as input to the deep network optimization is essential in correctly categorizing litho-types. Also, starting from the angle gathers obtained by imaging in the Radon domain is more informative than using the subsurface-offset gathers as input. ...

No frame left behind

Full Video Action Recognition

Conference paper (2021) - Xin Liu, Silvia L. Pintea, Fatemeh Karimi Nejadasl, Olaf Booij, Jan C. van Gemert

Not all video frames are equally informative for recognizing an action. It is computationally infeasible to train deep networks on all video frames when actions develop over hundreds of frames. A common heuristic is uniformly sampling a small number of video frames and using these to recognize the action. Instead, here we propose full video action recognition and consider all video frames. To make this computationally tractable, we first cluster all frame activations along the temporal dimension based on their similarity with respect to the classification task, and then temporally aggregate the frames in the clusters into a smaller number of representations. Our method is end-to-end trainable and computationally efficient as it relies on temporally localized clustering in combination with fast Hamming distances in feature space. We evaluate on UCF101, HMDB51, Breakfast, and Something-Something V1 and V2, where we compare favorably to existing heuristic frame sampling methods. ...
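
The cluster-then-aggregate idea can be sketched on toy per-frame feature vectors. This is a simplified stand-in, not the paper's method: it assumes features are binarized by sign, that consecutive frames are merged while their Hamming distance stays within `max_dist`, and that each cluster is aggregated by averaging. All names are hypothetical.

```python
def binarize(vector):
    """Sign-binarize a feature vector (assumed hashing scheme)."""
    return [1 if x > 0 else 0 for x in vector]

def hamming(a, b):
    """Number of positions at which two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

def aggregate_frames(features, max_dist):
    """Merge temporally adjacent frames with similar codes, then average each cluster."""
    clusters = [[features[0]]]
    for prev, cur in zip(features, features[1:]):
        if hamming(binarize(prev), binarize(cur)) <= max_dist:
            clusters[-1].append(cur)   # same temporal cluster
        else:
            clusters.append([cur])     # start a new cluster
    return [[sum(col) / len(cluster) for col in zip(*cluster)]
            for cluster in clusters]
```

Because only neighbouring frames are compared, the cost grows linearly with the number of frames, which is the property that makes processing every frame affordable in this toy version.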

Semi-Supervised Lane Detection With Deep Hough Transform

Conference paper (2021) - Yancong Lin, Silvia-Laura Pintea, Jan van Gemert

Current work on lane detection relies on large manually annotated datasets. We reduce the dependency on annotations by leveraging massive cheaply available unlabelled data. We propose a novel loss function exploiting geometric knowledge of lanes in Hough space, where a lane can be identified as a local maximum. By splitting lanes into separate channels, we can localize each lane via simple global max-pooling. The location of the maximum encodes the layout of a lane, while the intensity indicates the probability of a lane being present. Maximizing the log-probability of the maximal bins helps neural networks find lanes without labels. On the CULane and TuSimple datasets, we show that the proposed Hough Transform loss improves performance significantly by learning from large amounts of unlabelled images. ...
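
The unsupervised objective described above can be sketched as follows, assuming each lane channel is a 2D Hough map whose entries are already probabilities in (0, 1]. The function names and the `eps` guard are illustrative assumptions rather than the paper's implementation.

```python
import math

def global_max(hough_map):
    """Global max-pooling over one channel: returns (value, (row, col))."""
    best_value, best_pos = -1.0, None
    for r, row in enumerate(hough_map):
        for c, value in enumerate(row):
            if value > best_value:
                best_value, best_pos = value, (r, c)
    return best_value, best_pos

def hough_max_loss(lane_maps, eps=1e-12):
    """Negative log-probability of the maximal bin, summed over lane channels."""
    return -sum(math.log(global_max(m)[0] + eps) for m in lane_maps)
```

Minimizing this loss pushes the strongest bin of each channel towards probability 1, and the position returned by `global_max` is read off as the lane's line parameters.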

Resolution learning in deep convolutional networks using scale-space theory

Journal article (2021) - Silvia L. Pintea, Nergis Tömen, Stanley F. Goes, Marco Loog, Jan van Gemert

Resolution in deep convolutional neural networks (CNNs) is typically bounded by the receptive field size through filter sizes, and subsampling layers or strided convolutions on feature maps. The optimal resolution may vary significantly depending on the dataset. Modern CNNs hard-code their resolution hyper-parameters in the network architecture which makes tuning such hyper-parameters cumbersome. We propose to do away with hard-coded resolution hyper-parameters and aim to learn the appropriate resolution from data. We use scale-space theory to obtain a self-similar parametrization of filters and make use of the N-Jet: a truncated Taylor series to approximate a filter by a learned combination of Gaussian derivative filters. The parameter σ of the Gaussian basis controls both the amount of detail the filter encodes and the spatial extent of the filter. Since σ is a continuous parameter, we can optimize it with respect to the loss. The proposed N-Jet layer achieves comparable performance when used in state-of-the-art architectures, while learning the correct resolution in each layer automatically. We evaluate our N-Jet layer on both classification and segmentation, and we show that learning σ is especially beneficial when dealing with inputs at multiple sizes. ...
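
A 1D sketch of the filter parametrization: a filter is a weighted sum of a Gaussian and its first two derivatives, all sharing one σ. The truncation radius of 3σ and the restriction to order 2 are assumptions made for brevity, and the function names are hypothetical.

```python
import math

def gaussian_derivative_basis(sigma, max_order=2):
    """Sampled Gaussian and its derivatives up to max_order (at most 2) on [-3σ, 3σ]."""
    radius = math.ceil(3 * sigma)          # assumed truncation of the support
    xs = range(-radius, radius + 1)
    g = [math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
         for x in xs]
    basis = [g]
    if max_order >= 1:                      # G'(x) = -x / σ² · G(x)
        basis.append([-x / sigma ** 2 * gx for x, gx in zip(xs, g)])
    if max_order >= 2:                      # G''(x) = (x² - σ²) / σ⁴ · G(x)
        basis.append([(x * x - sigma ** 2) / sigma ** 4 * gx
                      for x, gx in zip(xs, g)])
    return basis

def njet_filter(sigma, weights):
    """Filter as a weighted combination of the Gaussian derivative basis."""
    basis = gaussian_derivative_basis(sigma, max_order=len(weights) - 1)
    return [sum(w * b[i] for w, b in zip(weights, basis))
            for i in range(len(basis[0]))]
```

Here `njet_filter(1.0, [1.0, 0.0, 0.0])` has 7 taps while σ = 2 gives 13, showing how a single continuous parameter sets both the smoothing and the spatial extent of the filter.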

Top-down networks

A coarse-to-fine reimagination of CNNs

Conference paper (2020) - Ioannis Lelekas, Nergis Tömen, Silvia L. Pintea, Jan C. van Gemert

Biological vision adopts a coarse-to-fine information processing pathway, from initial visual detection and binding of salient features of a visual scene, to the enhanced and preferential processing given relevant stimuli. On the contrary, CNNs employ a fine-to-coarse processing, moving from local, edge-detecting filters to more global ones extracting abstract representations of the input. In this paper we reverse the feature extraction part of standard bottom-up architectures and turn them upside-down: We propose top-down networks. Our proposed coarse-to-fine pathway, by blurring higher-frequency information and restoring it only at later stages, offers a line of defence against adversarial attacks that introduce high-frequency noise. Moreover, since we increase image resolution with depth, the high resolution of the feature map in the final convolutional layer contributes to the explainability of the network's decision making process. This favors object-driven decisions over context-driven ones, and thus provides better localized class activation maps. This paper offers empirical evidence for the applicability of the top-down resolution processing to various existing architectures on multiple visual tasks. ...

Deep Hough-Transform Line Priors

Conference paper (2020) - Y. Lin, S. Pintea, J.C. van Gemert

Classical work on line segment detection is knowledge-based; it uses carefully designed geometric priors using either image gradients, pixel groupings, or Hough transform variants. Instead, current deep learning methods do away with all prior knowledge and replace priors by training deep networks on large manually annotated datasets. Here, we reduce the dependency on labeled data by building on the classic knowledge-based priors while using deep networks to learn features. We add line priors through a trainable Hough transform block into a deep network. Hough transform provides the prior knowledge about global line parameterizations, while the convolutional layers can learn the local gradient-like line features. On the Wireframe (ShanghaiTech) and York Urban datasets we show that adding prior knowledge improves data efficiency as line priors no longer need to be learned from data. ...
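
The global line parameterization that the Hough transform contributes can be illustrated with the classical voting scheme. The sketch below is the textbook, non-trainable version with an assumed discretization (integer ρ, `num_angles` evenly spaced angles), not the paper's trainable block.

```python
import math
from collections import Counter

def hough_votes(points, num_angles=4):
    """Accumulate votes in (rho, angle_index) bins, with rho = x·cos(θ) + y·sin(θ)."""
    votes = Counter()
    for x, y in points:
        for k in range(num_angles):
            theta = math.pi * k / num_angles
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(rho, k)] += 1
    return votes
```

Four points on the vertical line x = 2 all vote for the same bin (ρ = 2, θ = 0), so the line shows up as a single peak in the accumulator even though no individual pixel knows about the others.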

Divide and Count

Generic Object Counting by Image Divisions

Journal article (2019) - Tobias Stahl, Silvia L. Pintea, Jan C. Van Gemert

We propose a general object counting method that does not use any prior category information. We learn from local image divisions to predict global image-level counts without using any form of local annotations. Our method separates the input image into a set of image divisions - each fully covering the image. Each image division is composed of a set of region proposals or uniform grid cells. Our approach learns in an end-to-end deep learning architecture to predict global image-level counts from local image divisions. The method incorporates a counting layer which predicts object counts in the complete image, by enforcing consistency in counts when dealing with overlapping image regions. Our counting layer is based on the inclusion-exclusion principle from set theory. We analyze the individual building blocks of our proposed approach on Pascal-VOC2007 and evaluate our method on the MS-COCO large scale generic object data set as well as on three class-specific counting data sets: the UCSD pedestrian data set, and the CARPK and PUCPR+ car data sets. ...
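
The inclusion-exclusion principle behind the counting layer can be shown on a toy example. In the sketch, regions are half-open boxes `(x0, y0, x1, y1)` and objects are points whose positions are known. In the paper the per-region counts are predicted by a network, so the ground-truth points here only stand in for those predictions. All names are hypothetical.

```python
from itertools import combinations

def intersect(boxes):
    """Intersection of half-open boxes (x0, y0, x1, y1), or None if it is empty."""
    x0 = max(b[0] for b in boxes)
    y0 = max(b[1] for b in boxes)
    x1 = min(b[2] for b in boxes)
    y1 = min(b[3] for b in boxes)
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None

def count_in(box, points):
    """Number of points inside a box (stand-in for a predicted region count)."""
    if box is None:
        return 0
    return sum(box[0] <= x < box[2] and box[1] <= y < box[3] for x, y in points)

def inclusion_exclusion_count(boxes, points):
    """Count objects in the union of overlapping boxes without double counting."""
    total = 0
    for k in range(1, len(boxes) + 1):
        sign = 1 if k % 2 == 1 else -1
        for subset in combinations(boxes, k):
            total += sign * count_in(intersect(subset), points)
    return total
```

With two overlapping boxes holding 2 objects each and 1 object in the overlap, the image-level count is 2 + 2 - 1 = 3.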

Using Phase Instead of Optical Flow for Action Recognition

Conference paper (2019) - Omar Hommos, Silvia L. Pintea, Pascal S.M. Mettes, Jan C. van Gemert

Currently, the most common motion representation for action recognition is optical flow. Optical flow is based on particle tracking which adheres to a Lagrangian perspective on dynamics. In contrast to the Lagrangian perspective, the Eulerian model of dynamics does not track, but describes local changes. For video, an Eulerian phase-based motion representation, using complex steerable filters, has been successfully employed recently for motion magnification and video frame interpolation. Inspired by these previous works, here, we propose learning Eulerian motion representations in a deep architecture for action recognition. We learn filters in the complex domain in an end-to-end manner. We design these complex filters to resemble complex Gabor filters, typically employed for phase-information extraction. We propose a phase-information extraction module, based on these complex filters, that can be used in any network architecture for extracting Eulerian representations. We experimentally analyze the added value of Eulerian motion representations, as extracted by our proposed phase extraction module, and compare with existing motion representations based on optical flow, on the UCF101 dataset. ...
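
Why phase encodes motion without tracking can be seen in a 1D toy case. The sketch uses a fixed, hand-set complex Gabor filter rather than a learned one, and the values of `omega` and `sigma` are chosen purely for illustration. Translating the signal by d changes the phase of the filter response by about -ω·d.

```python
import cmath
import math

def gabor_response(signal, omega, sigma):
    """Response of a complex Gabor filter centred on the middle of the signal."""
    centre = len(signal) // 2
    total = 0j
    for i, s in enumerate(signal):
        x = i - centre
        envelope = math.exp(-x * x / (2 * sigma ** 2))
        total += s * envelope * cmath.exp(-1j * omega * x)
    return total

def local_phase(signal, omega, sigma):
    """Phase (in radians) of the complex Gabor response."""
    return cmath.phase(gabor_response(signal, omega, sigma))
```

For a cosine of frequency ω = π/4 sampled at 33 points with σ = 4, shifting the signal by one sample moves the local phase by roughly -π/4, so the displacement can be read from the phase difference at a fixed location.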

Hand-tremor frequency estimation in videos

Conference paper (2019) - Silvia L. Pintea, Jian Zheng, Xilin Li, Paulina J.M. Bank, Jacobus J. van Hilten, Jan C. van Gemert

We focus on the problem of estimating human hand-tremor frequency from input RGB video data. Estimating tremors from video is important for non-invasive monitoring, analyzing and diagnosing patients suffering from motor-disorders such as Parkinson’s disease. We consider two approaches for hand-tremor frequency estimation: (a) a Lagrangian approach where we detect the hand at every frame in the video, and estimate the tremor frequency along the trajectory; and (b) an Eulerian approach where we first localize the hand, we subsequently remove the large motion along the movement trajectory of the hand, and we use the video information over time encoded as intensity values or phase information to estimate the tremor frequency. We estimate hand tremors on a new human tremor dataset, TIM-Tremor, containing static tasks as well as a multitude of more dynamic tasks, involving larger motion of the hands. The dataset has 55 tremor patient recordings together with: associated ground truth accelerometer data from the most affected hand, RGB video data, and aligned depth data. ...
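
Once a 1D signal over time is available (a hand-trajectory coordinate in the Lagrangian case, or an intensity or phase value in the Eulerian case), one common way to read off a frequency is the peak of its Fourier spectrum. The abstract does not state which estimator the authors use, so the naive DFT below is an assumption, as are the names `dominant_frequency` and `fps`.

```python
import cmath
import math

def dominant_frequency(samples, fps):
    """Frequency in Hz of the strongest non-DC component, via a naive DFT."""
    n = len(samples)
    mean = sum(samples) / n
    centred = [s - mean for s in samples]          # remove the DC offset
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):                 # positive frequencies only
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centred))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * fps / n
```

The frequency resolution is `fps / n`, so two seconds of video at 30 frames per second resolve frequencies in steps of 0.5 Hz.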

Recurrent Knowledge Distillation

Conference paper (2018) - Silvia L. Pintea, Yue Liu, Jan van Gemert

Knowledge distillation compacts deep networks by letting a small student network learn from a large teacher network. The accuracy of knowledge distillation recently benefited from adding residual layers. We propose to reduce the size of the student network even further by recasting multiple residual layers in the teacher network into a single recurrent student layer. We propose three variants of adding recurrent connections into the student network, and show experimentally on CIFAR-10, Scenes and MiniPlaces, that we can reduce the number of parameters at little loss in accuracy. ...

Asymmetric kernel in Gaussian Processes for learning target variance

Journal article (2018) - S.L. Pintea, J.C. van Gemert, A.W.M. Smeulders

This work incorporates the multi-modality of the data distribution into a Gaussian Process regression model. We approach the problem from a discriminative perspective by learning, jointly over the training data, the target space variance in the neighborhood of a certain sample through metric learning. We start by using data centers rather than all training samples. Subsequently, each center selects an individualized kernel metric. This enables each center to adjust the kernel space in its vicinity in correspondence with the topology of the targets — a multi-modal approach. We additionally add descriptiveness by allowing each center to learn a precision matrix. We demonstrate empirically the reliability of the model. ...

One-step time-dependent future video frame prediction with a convolutional encoder-decoder neural network

Conference paper (2017) - Vedran Vukotic, Silvia Pintea, Christian Raymond, Guillaume Gravier, Jan van Gemert

There is an inherent need for autonomous cars, drones, and other robots to have a notion of how their environment behaves and to anticipate changes in the near future. In this work, we focus on anticipating future appearance given the current frame of a video. Existing work focuses on either predicting the future appearance as the next frame of a video, or predicting future motion as optical flow or motion trajectories starting from a single video frame. This work stretches the ability of CNNs (Convolutional Neural Networks) to predict an anticipation of appearance at an arbitrarily given future time, not necessarily the next video frame. We condition our predicted future appearance on a continuous time variable that allows us to anticipate future frames at a given temporal distance, directly from the input video frame. We show that CNNs can learn an intrinsic representation of typical appearance changes over time and successfully generate realistic predictions at a deliberate time difference in the near future. ...

Video Acceleration Magnification

Conference paper (2017) - Yichao Zhang, Silvia Pintea, Jan van Gemert

The ability to amplify or reduce subtle image changes over time is useful in contexts such as video editing, medical video analysis, product quality control and sports. In these contexts there is often large motion present which severely distorts current video amplification methods that magnify change linearly. In this work we propose a method to cope with large motions while still magnifying small changes. We make the following two observations: i) large motions are linear on the temporal scale of the small changes, ii) small changes deviate from this linearity. We ignore linear motion and propose to magnify acceleration. Our method is purely Eulerian and does not require any optical flow, temporal alignment or region annotations. We link temporal second-order derivative filtering to spatial acceleration magnification. We apply our method to moving objects where we show motion magnification and color magnification. We provide quantitative as well as qualitative evidence for our method while comparing to the state-of-the-art. ...
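
The two observations can be checked on a per-pixel time series. The sketch swaps in the simplest second-order temporal filter, the finite difference [1, -2, 1], as an assumption. The abstract only says that second-order derivative filtering is used, and the function names and `alpha` are hypothetical.

```python
def temporal_acceleration(signal):
    """Second-order temporal difference s[t-1] - 2·s[t] + s[t+1] at interior time steps."""
    return [signal[t - 1] - 2 * signal[t] + signal[t + 1]
            for t in range(1, len(signal) - 1)]

def magnified_acceleration(signal, alpha):
    """Acceleration response scaled by the magnification factor alpha."""
    return [alpha * a for a in temporal_acceleration(signal)]
```

A linearly changing signal such as `[0, 1, 2, 3, 4]` gives an all-zero response, so large linear motion is ignored, while a small deviation from that line produces a non-zero response that the factor `alpha` amplifies.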

Making a Case for Learning Motion Representations with Phase

Conference paper (2016) - Silvia Pintea, Jan van Gemert

This work advocates Eulerian motion representation learning over the current standard Lagrangian optical flow model. Eulerian motion is well captured by using phase, as obtained by decomposing the image through a complex-steerable pyramid. We discuss the gain of Eulerian motion in a set of practical use cases: (i) action recognition, (ii) motion prediction in static images, (iii) motion transfer in static images and, (iv) motion transfer in video. For each task we motivate the phase-based direction and provide a possible approach. ...

Large scale Gaussian Process for overlap-based object proposal scoring

Journal article (2016) - Silvia Pintea, S. Karaoğlu, Jan van Gemert, AWM Smeulders

This work considers the task of object proposal scoring by integrating the consistency between state-of-the-art object proposal algorithms. It represents a novel way of thinking about proposals, as it starts with the assumption that consistent proposals are most likely centered on objects in the image. We pose the box-consistency problem as a large-scale regression task. The approach starts from existing popular object proposal algorithms and assigns scores to these proposals based on the consistency within and between algorithms. Rather than generating new proposals, we focus on the consistency of state-of-the-art ones and score them on the assumption that mutually agreeing proposals usually indicate the location of objects. This work performs large-scale regression by starting from the strong Gaussian Process model, renowned for its power as a regressor. We extend the model in a natural manner to make effective use of the large number of training samples. We achieve this through metric learning for reshaping the kernel space, while maintaining the kernel-matrix size fixed. We validated the new Gaussian Process models on a standard regression dataset (Airfoil Self-Noise) to prove the generality of the method. Furthermore, we test the suitability of the proposed approach for the undertaken box scoring task on Pascal-VOC2007. We conclude that box scoring is possible by employing overlap statistics in a new Gaussian Process model, fine-tuned to handle large amounts of data. ...

Featureless

Bypassing feature extraction in action categorization

Conference paper (2016) - Silvia Pintea, Pascal Mettes, Jan van Gemert, AWM Smeulders

This method introduces an efficient manner of learning action categories without the need of feature estimation. The approach starts from low-level values, in a similar style to the successful CNN methods. However, rather than extracting general image features, we learn to predict specific video representations from raw video data. The benefit of such an approach is that at the same computational expense it can predict 2D video representations as well as 3D ones, based on motion. The proposed model relies on discriminative Wald-boost, which we enhance to a multiclass formulation for the purpose of learning video representations. The suitability of the proposed approach as well as its time efficiency are tested on the UCF11 action recognition dataset. ...