This thesis addresses the topic of visual person detection and pose estimation. While these tasks are relevant for a broad range of applications, this thesis focuses on the domain of intelligent vehicles in urban traffic scenes. This domain is particularly interesting due to specific challenges related to visual perception from a moving vehicle. Accident statistics show that a large proportion of traffic fatalities involve vulnerable road users such as pedestrians and riders. This motivates the interest in reproducing or even surpassing the capabilities of an attentive human driver in driver assistance systems and fully automated driving to improve safety. Deep learning has contributed to narrowing the performance gap between computer vision methods and human visual perception. In particular, the capability of convolutional neural networks to learn powerful features is helpful for person detection and pose estimation. Throughout this thesis, new deep learning methods for these tasks are presented. The thesis focuses not only on methodological extensions but also on the creation of new datasets for training, evaluation, and benchmarking in the intelligent vehicles domain.
First, a novel approach for joint object detection and orientation estimation with a single deep convolutional neural network is presented. The orientation estimation is implemented by extending an existing convolutional network architecture with several carefully designed layers and an appropriate loss function. The network depends on external proposals for object candidate regions, whose accuracy is crucial for the overall performance. Therefore, two proposal methods are introduced that make use of 3D sensor data, namely stereo and lidar data. The KITTI dataset, which is commonly used for object detection benchmarking in the automotive domain, is used for training and evaluation. The experiments on the KITTI dataset show that by combining the proposals of both sensor modalities, high recall can be achieved while keeping the number of proposals low. Furthermore, the method for joint detection and orientation estimation is competitive with other state-of-the-art approaches. It outperforms the state of the art for a test scenario of the bicycle class.
Large datasets have contributed substantially to the success of deep learning in computer vision. Still, the number of pedestrians and riders in the KITTI dataset is rather limited, and previous work suggests that there is significant further potential to increase object detection performance by using larger datasets. Regarding benchmarking, small datasets are prone to dataset bias and overfitting.
Therefore, the second part of this thesis introduces the EuroCity Persons dataset, which provides a large number of highly diverse, accurate, and detailed annotations of pedestrians, cyclists, and other riders in urban traffic scenes. The images for this dataset were collected on board a moving vehicle in 31 cities of 12 European countries. With over 238,200 person instances manually labeled in over 47,300 images, EuroCity Persons is nearly one order of magnitude larger than datasets used previously for person detection in traffic scenes. The dataset furthermore contains a large number of person orientation annotations (over 211,200). Four state-of-the-art deep learning approaches are thoroughly optimized to serve as baselines for the new object detection benchmark. In experiments with previous datasets, the generalization capabilities of these detectors when trained with the new dataset are analyzed. Furthermore, this thesis studies the effect of the training set size, the dataset diversity (day- vs. night-time, geographical region), the dataset detail (i.e., availability of object orientation information), and the annotation quality on the detector performance.
The qualitative and quantitative analysis of error sources for the best-performing detector reveals methodological weaknesses in dense traffic scenes. In such scenes, the commonly used (greedy) implementation of non-maximum suppression, which is needed in the post-processing of the analyzed deep learning methods, imposes a trade-off between recall and precision.
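The trade-off can be illustrated with a minimal sketch of greedy non-maximum suppression. This is the generic textbook procedure, not the implementation used in the thesis; the function names, the (x1, y1, x2, y2) box format, the example boxes, and the threshold values are assumptions chosen for illustration only.

```python
from typing import List, Sequence, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, in order of decreasing score.

    A box is kept only if its IoU with every already kept
    (higher-scoring) box does not exceed iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two distinct but overlapping pedestrians
# (indices 0 and 1) and a near-duplicate detection of the first (index 2).
boxes = [(0.0, 0.0, 10.0, 20.0), (4.0, 0.0, 14.0, 20.0), (0.5, 0.0, 10.5, 20.0)]
scores = [0.9, 0.8, 0.7]

kept_default = greedy_nms(boxes, scores, iou_threshold=0.5)
kept_strict = greedy_nms(boxes, scores, iou_threshold=0.4)
kept_loose = greedy_nms(boxes, scores, iou_threshold=0.95)
```

In this toy example, a threshold of 0.5 keeps both pedestrians and removes the duplicate. Lowering it to 0.4 also suppresses the second pedestrian (lower recall), while raising it to 0.95 lets the duplicate through (lower precision). In dense scenes, where distinct persons overlap as strongly as duplicates do, no single threshold separates the two cases.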
As the robustness of detection and pose estimation is also important in dense groups of persons, the third part of the thesis focuses on improving both tasks for such scenarios. Learning the task of non-maximum suppression with a neural network architecture that incorporates the head boxes of pedestrians as further attributes to discriminate persons in groups does not improve performance. Yet, the experiments reveal issues with ambiguities in detection and attribute estimation (e.g., head box estimation) for pedestrians that overlap each other strongly. To resolve this ambiguity for pairwise constellations of persons, a new pose estimation method is proposed that relies on pairwise detections as input and jointly estimates the two poses of such pairs in a single forward pass within a deep convolutional neural network. As the availability of automotive datasets providing poses and a fair amount of crowded scenes is limited, the EuroCity Persons dataset is extended by additional images and pose annotations, which are made publicly available as the EuroCity Persons Dense Pose dataset. This dataset is the largest pose dataset recorded from a moving vehicle. The experiments on this dataset with the new method show improved performance for poses of pedestrian pairs in comparison with a state-of-the-art method for human pose estimation in crowds.
The final chapter of the thesis draws conclusions from the preceding chapters and discusses the required performance for automated driving. Furthermore, it reasons about efficiency aspects regarding the collection, annotation, and usage of data for deep learning and presents potential future work regarding methodological improvements and end-to-end training of the functional chain for automated driving, including the integration of multiple sensors.