Interpretable Sewer Defect Detection with Large Multimodal Models

Journal Article (2024)
Author(s)

Riccardo Taormina (TU Delft - Sanitary Engineering)

Job Augustijn van der Werf (TU Delft - Sanitary Engineering)

Research Group
Sanitary Engineering
DOI
https://doi.org/10.3390/engproc2024069158 (final published version)
Publication Year
2024
Language
English
Journal title
Engineering Proceedings
Issue number
1
Volume number
69
Article number
158
Event
3rd International Joint Conference on Water Distribution Systems Analysis & Computing and Control for the Water Industry (2024-07-01 - 2024-07-04), Ferrara, Italy
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Multimodal Models (LMMs) are emerging general-purpose AI models capable of processing and analyzing diverse data streams, including text, imagery, and sequential data. This paper explores the possibility of exploiting multimodality to develop more interpretable AI-based predictive tools for the water sector, with a first application to sewer defect detection from CCTV imagery. To this end, we test the zero-shot generalization performance of three generalist large vision-language models for binary sewer defect detection on a subset of the SewerML dataset. We compare the LMMs against a state-of-the-art unimodal Deep Learning approach trained and validated on more than one million SewerML images. Unsurprisingly, the chosen benchmark achieves the best performance, with an overall F1 score of 0.80. Nonetheless, OpenAI GPT4-V performs relatively well, with an overall F1 score of 0.61, matching or exceeding the benchmark on some defect classes. Furthermore, GPT4-V often provides text descriptions aligned with its prediction, accurately describing the rationale behind a given decision. GPT4-V also displays interesting emergent behaviors relevant to trustworthiness, such as refusing to classify images that are too blurred or unclear. Despite the markedly lower performance of the open-source models CogVLM and LLaVA, some preliminary successes suggest good potential for improvement through fine-tuning, agentic workflows, or retrieval-augmented generation.
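The abstract scores all models with the F1 metric for binary defect detection (defect vs. no defect). As a minimal illustration of how such a score is computed, the sketch below uses made-up labels and predictions for eight hypothetical CCTV frames, not data from the paper:

```python
def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall (positive class = defect)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Hypothetical ground truth and model output for 8 frames (1 = defect present)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(round(f1_score(y_true, y_pred), 2))  # → 0.75
```

An overall F1 of 0.80 (the benchmark) versus 0.61 (GPT4-V) thus summarizes the trade-off between missed defects and false alarms in a single number.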