Pose regression of 3D objects in monocular framework using a Convolutional Neural Network

Tracking multiple objects in real-time

Abstract

In computer vision, pose estimation of objects in everyday scenes is a basic requirement for a clear understanding of the surrounding environment; fields of interest include augmented reality, surveillance, navigation, manipulation, and robotics in general. Pose estimation is a well-studied topic; however, fast and robust solutions are still hard to obtain. The goal of this research is to robustly and efficiently perform 3D pose estimation of multiple objects within a single RGB image in real-time (> 24 frames per second (fps)).

To achieve this goal an existing CNN is utilized, more specifically the YOLO network. This provides a stable platform for object detection and classification, and the network is only slightly modified to include pose regression. The YOLOv2 network was originally designed to generate bounding boxes around objects and classify the objects within those bounding boxes. This research shows that using a single confidence value rather than four bounding box parameters is sufficient to determine the relative location of objects within the image, limiting the number of parameters that need to be trained. This makes the network more efficient and lets it focus more on training the pose parameters (azimuth, elevation and distance) rather than the bounding box parameters.
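
As a rough illustration of this modification, the sketch below shows how a YOLO-style output head could emit one confidence value, class scores, and three pose parameters per anchor instead of four bounding box parameters. The class name, channel layout, and default sizes are assumptions made for illustration only, not the thesis' actual implementation.

```python
import torch
import torch.nn as nn

class PoseRegressionHead(nn.Module):
    """Illustrative YOLO-style output head: per anchor, one confidence value,
    class scores, and three pose parameters (azimuth, elevation, distance)
    replace the original four bounding-box parameters.
    Names and layout are hypothetical, for illustration only."""

    def __init__(self, in_channels=1024, num_anchors=5, num_classes=12):
        super().__init__()
        self.num_anchors = num_anchors
        # Per anchor: 1 confidence + num_classes class scores + 3 pose parameters
        self.outputs_per_anchor = 1 + num_classes + 3
        self.head = nn.Conv2d(in_channels,
                              num_anchors * self.outputs_per_anchor,
                              kernel_size=1)

    def forward(self, features):
        # features: (batch, in_channels, grid_h, grid_w)
        out = self.head(features)
        batch, _, grid_h, grid_w = out.shape
        # Reshape so each anchor's outputs are grouped together
        return out.view(batch, self.num_anchors, self.outputs_per_anchor,
                        grid_h, grid_w)
```

Compared with the original YOLOv2 head, which predicts 5 + num_classes values per anchor (4 box parameters plus confidence and class scores), this layout trades the box regression for the three pose angles, reducing the number of trainable output parameters per anchor.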

Using several techniques such as data augmentation, data clustering and data selection, a state-of-the-art AVP of 50.1% was achieved on the azimuth estimation problem. For the full 3D pose (azimuth, elevation and distance) problem the AVP is limited to 30.1%; although there is no direct comparison, this is still considered state-of-the-art. Normalizing the confidence output of each image in the post-processing step increased these accuracies further, improving beyond the state-of-the-art results: with a normalization step the AVP reached 63.0% and 40.4%, respectively. These results show that pose estimation accuracy has improved significantly, moving closer to a viable solution for real-world applications.
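
As an illustration of this normalization idea, the sketch below rescales the raw confidence scores of all detections within a single image so that the strongest detection receives a confidence of 1.0. The function name and the exact scaling rule are assumptions; the thesis' actual normalization scheme may differ.

```python
import numpy as np

def normalize_confidences(confidences):
    """Per-image confidence normalization (hypothetical sketch):
    divide every detection's confidence by the image's maximum
    confidence, so relative ordering is kept but the scale is
    comparable across images."""
    confidences = np.asarray(confidences, dtype=np.float64)
    max_conf = confidences.max()
    if max_conf <= 0:
        # No positive confidence in this image; leave scores unchanged
        return confidences
    return confidences / max_conf

# Example: raw scores from one image
print(normalize_confidences([0.2, 0.5, 0.35]))  # -> [0.4, 1.0, 0.7]
```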

However, further improvements can be made in the post-processing step, the data augmentation step and the data selection step. The research conducted has shown that accuracy gains are not only achieved through a better network architecture but are also highly dependent on the training and processing techniques used. This is evident from the accuracy increase of 38 percentage points, from an AVP of 25% to an AVP of 63%. Optimizing these techniques specifically for the YOLOpose architecture might result in a solution that can be used in real-world applications.