We present a novel 3-D instance segmentation framework for multiview stereo (MVS) buildings in urban scenes. Unlike existing works that focus on semantic segmentation of urban scenes, this work emphasizes detecting and segmenting 3-D building instances even when they are attached to one another and embedded in a large, imprecise 3-D surface model. Multiview red-green-blue (RGB) images are first enhanced to RGB-height (RGBH) images by adding a heightmap and are then segmented with a fine-tuned 2-D instance segmentation neural network to obtain all roof instances. Instance masks from the different multiview images are then clustered into global masks. Our mask clustering accounts for spatial occlusion and overlap, which can eliminate segmentation ambiguities among multiview images. Based on these global masks, 3-D roof instances are segmented out by mask back-projection and extended to entire building instances through a Markov random field optimization. A new dataset that contains instance-level annotations for both 3-D urban scenes (roofs and buildings) and drone images (roofs) is provided. To the best of our knowledge, it is the first outdoor dataset dedicated to 3-D instance segmentation, with many more annotations of attached 3-D buildings than existing datasets. Quantitative evaluations and ablation studies demonstrate the effectiveness of all major steps and the advantages of our multiview framework over the orthophoto-based method.
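As an illustration of the RGBH enhancement step, the following minimal sketch stacks a per-pixel heightmap onto an RGB image as a fourth channel. The abstract does not specify how the heightmap is rendered or normalized, so the function name `make_rgbh`, the 8-bit RGB input, and the min-max scaling of height are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def make_rgbh(rgb: np.ndarray, height: np.ndarray) -> np.ndarray:
    """Stack an RGB image (H, W, 3) and a heightmap (H, W) into an (H, W, 4) array.

    Hypothetical helper: assumes 8-bit RGB and min-max normalization of height.
    """
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("rgb must have shape (H, W, 3)")
    if rgb.shape[:2] != height.shape:
        raise ValueError("rgb and height must share the same spatial size")

    # Scale colors to [0, 1] (assumes uint8 input in [0, 255]).
    rgb_f = rgb.astype(np.float32) / 255.0

    # Min-max normalize height to [0, 1]; a flat heightmap maps to all zeros.
    h = height.astype(np.float32)
    span = h.max() - h.min()
    h_norm = (h - h.min()) / span if span > 0 else np.zeros_like(h)

    # Append height as the fourth channel.
    return np.concatenate([rgb_f, h_norm[..., None]], axis=-1)


if __name__ == "__main__":
    rgb = np.full((2, 2, 3), 255, dtype=np.uint8)
    height = np.array([[0.0, 5.0], [10.0, 10.0]])
    rgbh = make_rgbh(rgb, height)
    print(rgbh.shape)
```

A four-channel array of this kind could then be passed to a 2-D instance segmentation network whose first convolution accepts four input channels; how the network in this work is adapted to the extra channel is not stated in the abstract.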