S. Pintea
Please Note
20 records found
1
Objects do not disappear
Video object detection by single-frame object location anticipation
Deep Vanishing Point Detection
Geometric priors make dataset variations vanish
Seismic inversion with deep learning
A proposal for litho-type classification
No frame left behind
Full Video Action Recognition
Top-down networks
A coarse-to-fine reimagination of CNNs
Biological vision adopts a coarse-to-fine information processing pathway, from initial visual detection and binding of salient features of a visual scene, to the enhanced and preferential processing given relevant stimuli. On the contrary, CNNs employ a fine-to-coarse processing, moving from local, edge-detecting filters to more global ones extracting abstract representations of the input. In this paper we reverse the feature extraction part of standard bottom-up architectures and turn them upside-down: We propose top-down networks. Our proposed coarse-to-fine pathway, by blurring higher frequency information and restoring it only at later stages, offers a line of defence against adversarial attacks that introduce high frequency noise. Moreover, since we increase image resolution with depth, the high resolution of the feature map in the final convolutional layer contributes to the explainability of the network's decision making process. This favors object-driven decisions over context driven ones, and thus provides better localized class activation maps. This paper offers empirical evidence for the applicability of the top-down resolution processing to various existing architectures on multiple visual tasks.
Classical work on line segment detection is knowledge-based; it uses carefully designed geometric priors using either image gradients, pixel groupings, or Hough transform variants. Instead, current deep learning methods do away with all prior knowledge and replace priors by training deep networks on large manually annotated datasets. Here, we reduce the dependency on labeled data by building on the classic knowledge-based priors while using deep networks to learn features. We add line priors through a trainable Hough transform block into a deep network. Hough transform provides the prior knowledge about global line parameterizations, while the convolutional layers can learn the local gradient-like line features. On the Wireframe (ShanghaiTech) and York Urban datasets we show that adding prior knowledge improves data efficiency as line priors no longer need to be learned from data.
Divide and Count
Generic Object Counting by Image Divisions
We propose a general object counting method that does not use any prior category information. We learn from local image divisions to predict global image-level counts without using any form of local annotations. Our method separates the input image into a set of image divisions - each fully covering the image. Each image division is composed of a set of region proposals or uniform grid cells. Our approach learns in an end-to-end deep learning architecture to predict global image-level counts from local image divisions. The method incorporates a counting layer which predicts object counts in the complete image, by enforcing consistency in counts when dealing with overlapping image regions. Our counting layer is based on the inclusion-exclusion principle from set theory. We analyze the individual building blocks of our proposed approach on Pascal-VOC2007 and evaluate our method on the MS-COCO large scale generic object data set as well as on three class-specific counting data sets: UCSD pedestrian data set, and CARPK, and PUCPR+ car data sets.
Currently, the most common motion representation for action recognition is optical flow. Optical flow is based on particle tracking which adheres to a Lagrangian perspective on dynamics. In contrast to the Lagrangian perspective, the Eulerian model of dynamics does not track, but describes local changes. For video, an Eulerian phase-based motion representation, using complex steerable filters, has been successfully employed recently for motion magnification and video frame interpolation. Inspired by these previous works, here, we proposes learning Eulerian motion representations in a deep architecture for action recognition. We learn filters in the complex domain in an end-to-end manner. We design these complex filters to resemble complex Gabor filters, typically employed for phase-information extraction. We propose a phase-information extraction module, based on these complex filters, that can be used in any network architecture for extracting Eulerian representations. We experimentally analyze the added value of Eulerian motion representations, as extracted by our proposed phase extraction module, and compare with existing motion representations based on optical flow, on the UCF101 dataset.
We focus on the problem of estimating human hand-tremor frequency from input RGB video data. Estimating tremors from video is important for non-invasive monitoring, analyzing and diagnosing patients suffering from motor-disorders such as Parkinson’s disease. We consider two approaches for hand-tremor frequency estimation: (a) a Lagrangian approach where we detect the hand at every frame in the video, and estimate the tremor frequency along the trajectory; and (b) an Eulerian approach where we first localize the hand, we subsequently remove the large motion along the movement trajectory of the hand, and we use the video information over time encoded as intensity values or phase information to estimate the tremor frequency. We estimate hand tremors on a new human tremor dataset, TIM-Tremor, containing static tasks as well as a multitude of more dynamic tasks, involving larger motion of the hands. The dataset has 55 tremor patient recordings together with: associated ground truth accelerometer data from the most affected hand, RGB video data, and aligned depth data.
There is an inherent need for autonomous cars, drones, and other robots to have a notion of how their environment behaves and to anticipate changes in the near future. In this work, we focus on anticipating future appearance given the current frame of a video. Existing work focuses on either predicting the future appearance as the next frame of a video, or predicting future motion as optical flow or motion trajectories starting from a single video frame. This work stretches the ability of CNNs (Convolutional Neural Networks) to predict an anticipation of appearance at an arbitrarily given future time, not necessarily the next video frame. We condition our predicted future appearance on a continuous time variable that allows us to anticipate future frames at a given temporal distance, directly from the input video frame. We show that CNNs can learn an intrinsic representation of typical appearance changes over time and successfully generate realistic predictions at a deliberate time difference in the near future.
Featureless
Bypassing feature extraction in action categorization