Gunshot Sound Onset Detection on MCUs with Tiny Conv-GRU
T.Y. Huang (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Qing Wang – Graduation committee member (TU Delft - Embedded Systems)
G. Gaydadjiev – Graduation committee member (TU Delft - Computer Engineering)
D. Danaei – Mentor (Alten)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Gunshot detection plays a critical role in protecting African wildlife and reducing illegal poaching activities. The decline of keystone species disrupts ecosystems and inhibits forest CO₂ absorption, contributing to climate change. To support conservation efforts, this thesis contributes to the development of an embedded acoustic surveillance system that detects gunshot sounds and locates their sources, enabling rangers to respond in real time. However, realizing such a system is challenging due to limitations in data, resources, and infrastructure.
This thesis proposes a lightweight convolutional recurrent neural network, combining depthwise separable convolutions (DSConv) and a gated recurrent unit (GRU), designed to detect the onset of gunshot sounds for trilaterating the shooter's position. The model was trained on a real-world gunshot dataset and optimized for deployment on the ultra-low-power STM32U5 microcontroller. A series of incremental experiments on architectural design, data manipulation, and feature selection aimed to improve performance and efficiency.
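As a rough illustration of why depthwise separable convolutions are attractive on a microcontroller, the sketch below counts multiply-accumulate operations (MACs) for a standard convolution and for its depthwise separable counterpart. The layer shape is hypothetical and chosen only for illustration; it is not taken from the thesis, and the function names are assumptions.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, k):
    """MACs for a standard k x k convolution producing an h_out x w_out x c_out map."""
    return h_out * w_out * c_in * c_out * k * k


def dsconv2d_macs(h_out, w_out, c_in, c_out, k):
    """MACs for a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h_out * w_out * c_in * k * k      # one k x k filter per input channel
    pointwise = h_out * w_out * c_in * c_out      # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


# Hypothetical layer shape (assumption, not from the thesis):
# 32 x 20 output map, 16 input channels, 32 output channels, 3 x 3 kernel.
std = conv2d_macs(32, 20, 16, 32, 3)
ds = dsconv2d_macs(32, 20, 16, 32, 3)

# The ratio simplifies to 1/c_out + 1/k**2, independent of the map size.
ratio = ds / std
print(f"standard: {std} MACs, depthwise separable: {ds} MACs, ratio: {ratio:.3f}")
```

For this assumed shape the separable layer needs roughly one seventh of the MACs of the standard layer. The saving for a full model depends on its actual layer shapes.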
To evaluate detection performance under class imbalance, independent of confidence thresholds, a novel onset-based area under the precision–recall curve (AUPRC) metric was proposed. Computational cost was evaluated through STM32U5 inference benchmarks. Experimental results showed that time-shift augmentation provided the largest performance gain, followed by modest improvements from regularization techniques. Class rebalancing and background noise augmentation had minor effects. Replacing standard convolutions with DSConv substantially improved efficiency. Finally, using delta mel-frequency cepstral coefficients (∆MFCCs) as input features further improved both performance and efficiency.
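Delta features describe how each cepstral coefficient changes over time. The abstract does not state how the deltas were computed, so the following is only a minimal sketch of the commonly used regression formula, d_t = Σ n·(c_{t+n} − c_{t−n}) / (2·Σ n²), applied to a list of feature frames. The function name `delta`, the window size `n=2`, and the edge handling (repeating the first and last frame) are all assumptions.

```python
def delta(frames, n=2):
    """Regression-based delta over time for a list of feature frames.

    frames: list of equal-length lists of floats, one list per time step.
    n: half-width of the regression window (assumed value, not from the thesis).
    Edges are handled by repeating the first/last frame (an assumption).
    """
    num_frames = len(frames)
    denom = 2 * sum(i * i for i in range(1, n + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for i in range(1, n + 1):
                nxt = frames[min(t + i, num_frames - 1)][d]  # clamp at the last frame
                prv = frames[max(t - i, 0)][d]               # clamp at the first frame
                acc += i * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


# A coefficient that rises by 1 per frame has a delta of 1 away from the edges.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
ramp_delta = delta(ramp)

# A constant coefficient has a delta of 0 everywhere.
flat_delta = delta([[5.0], [5.0], [5.0]])
```

Because a constant offset in a coefficient yields a zero delta, features of this kind emphasize abrupt changes such as a sound onset. Whether that explains the gains reported here is not stated in the abstract.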
Overall, the final ∆MFCC DSConv-GRU model outperformed the quasi-DenseNet baseline in F1-score (+2.6%) while reducing multiply-accumulate operations (-96.9%), RAM usage (-95.8%), and runtime (-97.3%). These improvements enabled real-time inference on microcontrollers, demonstrating that a lightweight deep learning model can perform effectively under strict resource constraints. Hence, this work provides a foundational step toward future embedded gunshot detection systems for wildlife monitoring and anti-poaching applications.
Files
File under embargo until 10-10-2027