Interpretable Sewer Defect Detection with Large Multimodal Models

Journal Article (2024)
Author(s)

Riccardo Taormina (TU Delft - Sanitary Engineering)

Job Augustijn van der Werf (TU Delft - Sanitary Engineering)

Research Group
Sanitary Engineering
DOI
https://doi.org/10.3390/engproc2024069158 (final published version)
Publication Year
2024
Language
English
Journal title
Engineering Proceedings
Issue number
1
Volume number
69
Article number
158
Event
3rd International Joint Conference on Water Distribution Systems Analysis & Computing and Control for the Water Industry (2024-07-01 - 2024-07-04), Ferrara, Italy
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Multimodal Models (LMMs) are emerging general-purpose AI models capable of processing and analyzing diverse data streams, including text, imagery, and sequential data. This paper explores the possibility of exploiting multimodality to develop more interpretable AI-based predictive tools for the water sector, with a first application to sewer defect detection from CCTV imagery. To this end, we test the zero-shot generalization performance of three generalist large vision-language models for binary sewer defect detection on a subset of the SewerML dataset. We compare the LMMs against a state-of-the-art unimodal Deep Learning approach trained and validated on more than one million SewerML images. Unsurprisingly, the chosen benchmark achieves the best performance, with an overall F1 score of 0.80. Nonetheless, OpenAI GPT4-V performs relatively well, with an overall F1 score of 0.61, matching or exceeding the benchmark on some defect classes. Furthermore, GPT4-V often provides text descriptions aligned with its prediction, accurately describing the rationale behind a given decision. GPT4-V also displays interesting emergent behaviors relevant to trustworthiness, such as refusing to classify images that are too blurred or unclear. Despite the markedly lower performance of the open-source models CogVLM and LLaVA, some preliminary successes suggest good potential for improvement through fine-tuning, agentic workflows, or retrieval-augmented generation.
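The abstract scores all models with the F1 metric for binary defect detection (defect vs. no defect). As a minimal illustration of how such a score is computed, the sketch below uses made-up labels and predictions for eight hypothetical CCTV frames, not data from the paper:

```python
def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall (positive class = defect)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Hypothetical ground truth and model output for 8 frames (1 = defect present)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(round(f1_score(y_true, y_pred), 2))  # → 0.75
```

An overall F1 of 0.80 (the benchmark) versus 0.61 (GPT4-V) thus summarizes the trade-off between missed defects and false alarms in a single number.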