
Authored

Efficiency in Deep Learning

Image and Video Deep Model Efficiency

Deep learning is the core algorithmic tool for automatically processing large amounts of data. Deep learning models are defined as a stack of functions (called layers) with millions of parameters that are updated during training by fitting them to data. Deep learning models have ...
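The "stack of functions" view above can be sketched in a few lines (a minimal illustration with hypothetical weights, not any specific model from this group):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two trainable parameter matrices for a toy two-layer model.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 1))

def model(x):
    """A stack of functions: each layer is a parameterized map applied in turn."""
    h = np.maximum(x @ W1, 0.0)  # layer 1: linear map followed by ReLU
    return h @ W2                # layer 2: linear readout

x = rng.normal(size=(2, 4))      # a batch of 2 inputs with 4 features each
print(model(x).shape)            # one output per input in the batch
```

Training then consists of adjusting `W1` and `W2` so the outputs fit the data.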

Objects do not disappear

Video object detection by single-frame object location anticipation

Objects in videos are typically characterized by continuous smooth motion. We exploit continuous smooth motion in three ways. 1) Improved accuracy by using object motion as an additional source of supervision, which we obtain by anticipating object locations from a static keyfram ...

Video BagNet

Short temporal receptive fields increase robustness in long-term action recognition

Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model ...

No frame left behind

Full Video Action Recognition

Not all video frames are equally informative for recognizing an action. It is computationally infeasible to train deep networks on all video frames when actions develop over hundreds of frames. A common heuristic is uniformly sampling a small number of video frames and using thes ...
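The uniform-sampling heuristic this abstract refers to can be sketched as follows (a minimal illustration, not the paper's code; the function name is ours):

```python
import numpy as np

def sample_frames_uniform(num_frames: int, num_samples: int) -> np.ndarray:
    """Pick `num_samples` frame indices spread evenly across a video of
    `num_frames` frames -- the common heuristic the abstract mentions."""
    # Evenly spaced positions from the first to the last frame,
    # rounded to valid integer frame indices.
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

# For an action spanning 300 frames, sample only 8 of them.
print(sample_frames_uniform(300, 8))
```

The paper's point is that such sparse sampling can discard informative frames, motivating methods that use the full video.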

Cross-domain image matching between image collections from different source and target domains is challenging even in the era of deep learning due to i) limited variation of image conditions in a training set, ii) a lack of paired-image labels during training, and iii) the existence of outlier ...

Contributed

Object detectors have come a long way and are used for various applications. In pictures and videos, an object detector must deal with the background. In some settings, this background is indicative of the object; in others, it’s not and can even be disruptive. For models trained ...

Explaining overthinking in Multi-Scale Dense networks

Why more computation does not always lead to better results

Traditional convolutional neural networks exhibit an inherent limitation: they cannot adapt their computation to the input, even though some inputs require less computation than others to arrive at an accurate prediction. Early-exiting setups exploit this fact by only spending as much ...
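An early-exit cascade of the kind described above can be sketched generically (a hedged illustration of the idea, not MSDNet itself; the stages, threshold, and function names are our assumptions):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_predict(x, stages, threshold=0.9):
    """Run classifier stages in order; stop at the first stage whose top
    softmax probability exceeds `threshold`, otherwise fall through to the
    last stage. Returns (predicted_class, index_of_exit_stage)."""
    for i, stage in enumerate(stages):
        probs = softmax(stage(x))
        if probs.max() >= threshold or i == len(stages) - 1:
            return int(probs.argmax()), i

# Toy stages: each maps an input to a logit vector. An "easy" input yields
# confident logits at stage 0 and exits immediately; a "hard" input falls
# through to the final stage.
easy = [lambda x: np.array([6.0, 0.0]), lambda x: np.array([0.0, 6.0])]
hard = [lambda x: np.array([0.1, 0.2]), lambda x: np.array([0.0, 6.0])]
print(early_exit_predict(None, easy))  # exits at the first stage
print(early_exit_predict(None, hard))  # exits at the last stage
```

The "overthinking" question the title raises is why running later, more expensive stages can sometimes overturn an already-correct early prediction.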

Event-based cameras do not capture frames like an RGB camera; they record data only from pixels that detect a change in light intensity, which can make them a better alternative for processing videos. The sparse data acquired from event-based video captures movement only in an asynchronous way. In thi ...

Event-based cameras represent a new alternative to traditional frame-based sensors, with advantages in lower output bandwidth, lower latency, and higher dynamic range, thanks to their independent, asynchronous pixels. These advantages prompted the development of computer vision me ...

Instance segmentation on data from Dynamic Vision Sensors (DVS) is an important computer vision task that must be tackled to push research forward on these types of inputs. This paper aims to show that deep-learning-based techniques can be used to solve the task ...

The event-based camera represents a revolutionary concept: its output is asynchronous. The pixels of dynamic vision sensors react to brightness changes, producing streams of events at very small intervals of time. This paper provides a model to track objects in neuromorp ...