Multimodal Deep Learning for the Classification of Human Activity

Radar and video data fusion for the classification of human activity

Master Thesis (2019)
Author(s)

R.J. de Jong (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Faruk Uysal – Mentor

Olexander Yarovoy – Mentor

Jacco de Wit – Mentor

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2019 Richard de Jong
Publication Year
2019
Language
English
Graduation Date
14-01-2019
Awarding Institution
Delft University of Technology
Sponsors
TNO
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Persistent surveillance is an urgently needed capability. For security, surveillance cameras are a strong asset, as they support the automatic tracking of people and their output is directly interpretable by a human operator. Radar, on the other hand, can be used under a broad range of circumstances: it penetrates media such as clouds, fog, mist and snow, and it remains usable in darkness.
Radar data, however, are not as easily interpretable by a human operator as the output of an optical sensor such as a video camera. This thesis explores the potential of multimodal deep learning with a radar and a video sensor to improve the classification accuracy of human activity. A recorded and labelled dataset is created that contains three different human activities: walking, walking with a metal pole, and walking with a backpack (10 kg). A Single Shot Detector is used to process the video data. The cropped frames are then associated with the start of a radar micro-Doppler signature with a duration of 1.28 seconds. The dataset is split into a training set (80 %) and a validation set (20 %) such that no data from a person in the training set appears in the validation set. Implementations of convolutional neural networks for the video frames and the micro-Doppler signatures obtain classification accuracies of 85.78 % and 63.12 %, respectively, for the aforementioned activities. It was not possible to distinguish a person walking from a person walking with a backpack on the basis of the micro-Doppler signatures alone. The synchronised dataset is used to investigate different fusion methods. Both early and late fusion methods show an improvement in classification accuracy. The best early fusion model achieves a classification accuracy of 90.60 %. Omitting the radar data, however, lowers the classification accuracy by just 0.9 %, identifying the video data as the dominant modality in this particular setup.
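
The person-wise 80/20 split described above can be illustrated with scikit-learn's GroupShuffleSplit, which guarantees that no subject contributes samples to both sets. This is a minimal sketch; the sample and subject arrays are placeholders, not the thesis data.

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    X = np.arange(10)                                        # sample indices (placeholder)
    person_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])    # subject id per sample (placeholder)

    # 80/20 split on subjects, so each person ends up entirely in one set.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_idx, val_idx = next(splitter.split(X, groups=person_ids))

The early and late fusion approaches compared in the abstract can be sketched in TensorFlow/Keras as follows. All input shapes, layer sizes and names are illustrative assumptions, not the architecture used in the thesis: early fusion concatenates the feature vectors of the two modalities before classification, while late fusion averages the class probabilities of two per-modality classifiers.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    NUM_CLASSES = 3  # walking, walking with a metal pole, walking with a backpack

    def branch(input_shape, name):
        """Small CNN feature extractor for one modality (sizes assumed)."""
        inp = layers.Input(shape=input_shape, name=f"{name}_input")
        x = layers.Conv2D(16, 3, activation="relu")(inp)
        x = layers.MaxPooling2D()(x)
        x = layers.Conv2D(32, 3, activation="relu")(x)
        x = layers.GlobalAveragePooling2D()(x)
        return inp, x

    # Cropped video frame and micro-Doppler spectrogram (shapes assumed).
    video_in, video_feat = branch((128, 128, 3), "video")
    radar_in, radar_feat = branch((64, 128, 1), "radar")

    # Early fusion: concatenate per-modality features, classify jointly.
    fused = layers.Concatenate()([video_feat, radar_feat])
    early_out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)
    early_model = Model([video_in, radar_in], early_out, name="early_fusion")

    # Late fusion: separate per-modality classifiers, average their probabilities.
    video_prob = layers.Dense(NUM_CLASSES, activation="softmax")(video_feat)
    radar_prob = layers.Dense(NUM_CLASSES, activation="softmax")(radar_feat)
    late_out = layers.Average()([video_prob, radar_prob])
    late_model = Model([video_in, radar_in], late_out, name="late_fusion")

    early_model.compile(optimizer="adam", loss="categorical_crossentropy",
                        metrics=["accuracy"])

Averaging softmax outputs is only one simple late fusion rule; weighted averaging or a small classifier on the stacked probabilities are common alternatives.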

Files

Multimodal_Deep_Learning_for_t... (pdf)
(pdf | 5.86 MB)
- Embargo expired on 30-11-2019
License info not available