Blind Reverberation Time Estimation using A Convolutional Neural Network with Encoder

Bachelor Thesis (2024)

Author(s)

X. Han (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Jorge Martinez – Mentor (TU Delft - Multimedia Computing)

Dimme de Groot – Mentor (TU Delft - Multimedia Computing)

Maria Soledad Pera – Graduation committee member (TU Delft - Web Information Systems)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Transformer, Convolutional Neural Networks (CNNs), Acoustic Environment, Blind Estimation, Reverberation Time Estimation, Signal-to-Noise Ratio

To reference this document use:

https://resolver.tudelft.nl/uuid:6957b936-be7a-4581-beea-4166bdff557c

More Info

Publication Year

2024

Language

English

Graduation Date

27-06-2024

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Accurate estimation of the reverberation time (RT60) is crucial for improving the acoustic quality of an environment, because RT60 determines how listeners perceive the decay of sound. Traditional methods, such as Sabine's equation, require extensive prior knowledge of the room and assume ideal conditions, which limits their practicality. To address these limitations, this paper explores convolutional neural networks (CNNs) augmented with a transformer-based encoder for blind RT60 estimation. The proposed model uses simulated and real-world datasets and incorporates environmental noise to improve robustness. Results indicate that the CNN-Encoder model achieves a mean squared error (MSE) as low as 0.0006 seconds for pure room impulse responses (RIRs) and 0.0011 seconds at a signal-to-noise ratio (SNR) of +30 dB. It also shows potential for practical use, achieving an MSE of 0.0282 seconds on audio recordings. The approach yields a significant reduction in estimation error compared with the CNN-only architecture, suggesting that acoustic parameters can be estimated more accurately across varied environments. Future work will focus on optimizing the model for real-world applications and on reducing computational complexity while maintaining high accuracy.
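To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the thesis implementation. It shows the classical Sabine estimate (RT60 ≈ 0.161 · V / A in metric units, which needs the room volume and surface absorption up front), one common way to mix noise into a signal at a target SNR, and the MSE metric. The function names, the Gaussian noise model, and the example room are assumptions made for illustration only.

```python
import math
import random


def sabine_rt60(volume_m3, surfaces):
    """Sabine estimate of RT60 in seconds (metric units).

    surfaces: iterable of (area_m2, absorption_coefficient) pairs.
    This is the prior room knowledge that a blind estimator avoids.
    """
    total_absorption = sum(area * alpha for area, alpha in surfaces)
    if total_absorption <= 0:
        raise ValueError("total absorption must be positive")
    return 0.161 * volume_m3 / total_absorption


def add_noise_at_snr(signal, snr_db, rng=None):
    """Return signal plus white Gaussian noise scaled to the target SNR (dB).

    Assumption: Gaussian noise stands in for environmental noise here.
    """
    rng = rng or random.Random(0)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for the noise power.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def mean_squared_error(predicted, reference):
    """Mean of squared differences between two equal-length sequences."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have equal length")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


if __name__ == "__main__":
    # Hypothetical 5 m x 4 m x 3 m room with uniform absorption 0.2.
    room_surfaces = [(94.0, 0.2)]
    print("Sabine RT60 [s]:", sabine_rt60(60.0, room_surfaces))

    # Hypothetical decaying tone as a stand-in for a recorded signal.
    tone = [math.exp(-0.001 * n) * math.sin(0.05 * n) for n in range(4000)]
    noisy = add_noise_at_snr(tone, snr_db=30.0)
    print("MSE between clean and noisy:", mean_squared_error(noisy, tone))
```

The Sabine function makes the limitation named in the abstract visible: it cannot be evaluated without the volume and absorption data, whereas a blind estimator works from the recorded signal alone.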

Files

Thesis_final_version.pdf

(pdf | 0.356 Mb)

License info not available