Semantic segmentation of roof superstructures

Abstract

Automated reconstruction of detailed semantic 3D city models is challenging because it requires high-resolution (HR), large-scale input datasets, the definition of the resulting model is ambiguous, and the processing pipeline is intricate and costly. Furthermore, existing methods mainly focus on geometry rather than semantics. Detailed semantic models may include roof installations that vary in size and function: dormers, windows, chimneys, etc. All elements visible on the roof from an aerial view are called “superstructures”. Deep learning techniques can facilitate their modeling. This work is part of a project developed at the Technical University of Munich. The existing pipeline applies a convolutional neural network (CNN) to aerial images to segment roof superstructures. The results can then be vectorized, extruded in 3D with their semantic description, and added to a simple 3D model.

This thesis demonstrates that fusing building height data with RGB aerial images in a CNN improves the semantic segmentation of roof superstructures for classes with relief. Absolute and relative height data, derived from LiDAR point clouds with different interpolation methods, are fused with the imagery through a state-of-the-art fusion network (FuseNet). First, experiments show that detection accuracies increase by 11% on average for dormers and 12% for chimneys compared to U-Net output on the same dataset. The best performance is reached by fusing absolute (rather than normalized) height interpolated with inverse distance weighting (IDW) or nearest-neighbor (NN) interpolation (rather than no interpolation). However, although superstructure types are better recognized, their boundaries are fuzzier due to mismatches between the input data sources, and more background pixels are classified. Second, the predictions and modeling of both a Bavarian and a Dutch test set show that the technique is scalable. However, a model trained on a set annotated for Bavaria and applied to a test set in the Netherlands yields inaccurate results due to local architectural typologies and different input data characteristics.
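The abstract does not describe how the LiDAR points are interpolated into a height raster, so the following is only a minimal sketch of generic inverse distance weighting (IDW), not the thesis implementation. The function name `idw_height_raster`, its parameters, the choice of `k` nearest points, and the exponent `power` are all assumptions made for illustration.

```python
import numpy as np


def idw_height_raster(points, x_min, y_min, cell_size, n_rows, n_cols,
                      k=4, power=2.0):
    """Interpolate LiDAR heights onto a regular grid with IDW.

    Hypothetical helper, not taken from the thesis.
    points: (N, 3) array-like of x, y, z coordinates.
    Returns an (n_rows, n_cols) array of interpolated heights,
    with row 0 at y_min.
    """
    pts = np.asarray(points, dtype=float)

    # Coordinates of the cell centers.
    xs = x_min + (np.arange(n_cols) + 0.5) * cell_size
    ys = y_min + (np.arange(n_rows) + 0.5) * cell_size
    gx, gy = np.meshgrid(xs, ys)
    centers = np.column_stack([gx.ravel(), gy.ravel()])

    # Brute-force planar distances, shape (cells, points).
    # Fine for a small tile; a spatial index would be needed at scale.
    dist = np.linalg.norm(centers[:, None, :] - pts[None, :, :2], axis=2)

    # Keep the k nearest points for each cell.
    k = min(k, len(pts))
    idx = np.argpartition(dist, k - 1, axis=1)[:, :k]
    d_k = np.take_along_axis(dist, idx, axis=1)
    z_k = pts[idx, 2]

    # Inverse-distance weights; the clamp avoids division by zero
    # when a point coincides with a cell center.
    weights = 1.0 / np.maximum(d_k, 1e-12) ** power
    heights = (weights * z_k).sum(axis=1) / weights.sum(axis=1)
    return heights.reshape(n_rows, n_cols)
```

Setting `k=1` reduces the sketch to nearest-neighbor interpolation. The returned raster holds absolute heights; a normalized variant would require a ground reference, which the abstract does not specify.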