Semantic Segmentation of RGB-Z Aerial Imagery Using Convolutional Neural Networks

Abstract

Semantic segmentation (pixel-level classification) of remotely sensed imagery has proven useful for applications such as land-cover mapping, object detection, change detection and land-use analysis. Deep learning algorithms known as convolutional neural networks (CNNs) have been shown to outperform traditional computer vision and machine learning approaches on semantic segmentation tasks. Furthermore, adding height information (Z) to aerial imagery (RGB) is believed to improve segmentation results. However, open questions remain: to what extent does height information add value, how are RGB and height information best combined, and what type of height information is most useful? This study aims to answer these questions.

The CNN architectures FCN-8s, SegNet, U-Net and FuseNet-SF5 are trained to semantically segment 10 cm resolution true ortho imagery of Haarlem, with and without added height information. The resulting topographic maps contain the classes building, road, water and other. The experiments compare 1) models trained on RGB versus RGB-Z, 2) models combining RGB and height information through data fusion versus data stacking, and 3) models trained on different types of absolute and relative height. Performance is compared using the (mean) intersection over union (IoU) and through visual assessment of the predicted maps.

The results indicated that segmentation performance improves by approximately 1 percent on average when absolute height information is added, with the building class benefiting most. Furthermore, extracting features from height information in a separate encoder and fusing these into the RGB feature maps led to higher overall segmentation quality than providing height as a stacked extra band processed in the same encoder as the RGB information. Finally, models using relative height delivered higher-quality segmentations than those using absolute height, especially for large objects. The best-performing model, FuseNet-SF5 trained on RGB imagery and pixel-level relative height, achieved a mean IoU of 0.8427 and class IoUs of 0.8744, 0.7865, 0.9131 and 0.7966 for building, road, water and other, respectively; it correctly classified over 90% of the pixels for 67% of all objects present in the ground truth. Overall, this study shows that, for semantic segmentation of aerial RGB imagery, 1) height information can improve segmentation results, 2) adding height information through data fusion can yield higher segmentation quality than data stacking, and 3) providing relative rather than absolute height to a network can further improve segmentation quality.
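For reference, the IoU scores reported above follow the standard per-class definition (equivalent to the set-overlap form |P ∩ G| / |P ∪ G| between a predicted region P and the ground truth G), with the mean IoU averaging over the C = 4 classes:

```latex
\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c},
\qquad
\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{IoU}_c
```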
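The contrast between data stacking and data fusion can be made concrete with a minimal PyTorch sketch. This is not the thesis code: the block sizes are illustrative, and FuseNet-SF5 itself fuses at five stages of a VGG-16-based encoder rather than the two stages shown here.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class StackedEncoder(nn.Module):
    """Data stacking: height is appended to RGB as a fourth input band
    and processed by a single shared encoder."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(conv_block(4, 64), conv_block(64, 128))

    def forward(self, rgb, z):
        x = torch.cat([rgb, z], dim=1)  # (N, 3+1, H, W)
        return self.enc(x)

class FusedEncoder(nn.Module):
    """Data fusion (FuseNet-style): height features are extracted in a
    separate encoder and added element-wise into the RGB feature maps."""
    def __init__(self):
        super().__init__()
        self.rgb1, self.rgb2 = conv_block(3, 64), conv_block(64, 128)
        self.z1, self.z2 = conv_block(1, 64), conv_block(64, 128)

    def forward(self, rgb, z):
        r, d = self.rgb1(rgb), self.z1(z)
        r = r + d                       # fuse after the first block
        r, d = self.rgb2(r), self.z2(d)
        return r + d                    # fuse after the second block

rgb = torch.randn(1, 3, 64, 64)  # RGB bands
z = torch.randn(1, 1, 64, 64)    # height band
print(StackedEncoder()(rgb, z).shape, FusedEncoder()(rgb, z).shape)
```

The design point is that the fused variant keeps a dedicated height branch, so height-specific features are learned before being merged into the RGB stream, whereas stacking forces both modalities through the same filters from the first layer onward.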
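The abstract does not specify how pixel-level relative height is derived. A common approach in the literature, assumed here purely for illustration, is the normalised DSM (nDSM), i.e. height above local ground level obtained by subtracting a terrain model from a surface model:

```python
import numpy as np

rng = np.random.default_rng(0)
dtm = rng.uniform(0.0, 2.0, size=(256, 256))         # bare-earth terrain heights (m)
dsm = dtm + rng.uniform(0.0, 15.0, size=(256, 256))  # surface incl. buildings/vegetation (m)
ndsm = np.clip(dsm - dtm, 0.0, None)                 # relative height above ground, >= 0
```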