Q. Wang
60 records found
Adapting Mamba Models for Deployment on Microcontrollers
Enabling Linear-Time Sequence Modeling on Ultra-Low-Power Tiny Devices
The Mamba architecture, built around a state-space model (SSM), is a promising candidate due to its compact parameterization and strong performance on long-context tasks. Nevertheless, Mamba was originally designed for highly parallelized GPUs, making its adaptation for TinyML non-trivial. This paper evaluates Mamba deployment strategies on microcontrollers using TensorFlow Lite Micro.
We propose architecture modifications and optimization techniques tailored specifically to microcontroller constraints. Our deployment of a quantized Mamba model achieves a 60.4 KB peak RAM footprint on a Keyword Spotting task, a 74% memory reduction compared to the state-of-the-art work (MambaLite-Micro). Furthermore, we analyze the trade-offs of quantization, demonstrating that while it substantially reduces memory, it can introduce latency overhead on hardware that lacks acceleration for INT8 operations.
To mitigate code size and loop-unrolling overheads, we introduce a model-splitting technique that enables the execution of larger models. Our findings demonstrate that while Mamba is a viable architecture for TinyML, further research is required to fully optimize SSM implementations for edge hardware.
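For context, the linear-time behaviour named in the subtitle comes from the state-space recurrence, which consumes a sequence one step at a time while keeping only a fixed-size state. The following is a minimal sketch, assuming a diagonal, time-invariant state matrix and a scalar input; Mamba itself uses input-dependent parameters, and the function name `ssm_scan` and its arguments are illustrative, not taken from the paper.

```python
def ssm_scan(a, b, c, xs):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = sum(c * h_t) over xs.

    a, b, c are per-state coefficients (diagonal SSM, an assumption here).
    Work per step is proportional to the state size, so the total cost
    grows linearly with the sequence length.
    """
    h = [0.0] * len(a)  # fixed-size hidden state
    ys = []
    for x in xs:
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


# Impulse response of a two-state toy system.
ys = ssm_scan(a=[0.5, 0.25], b=[1.0, 1.0], c=[1.0, 2.0], xs=[1.0, 0.0, 0.0])
```

Because only `h` is carried between steps, the working memory does not grow with the context length, which is one reason the recurrence is attractive on RAM-limited devices.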
Efficient Embedded Intelligence
Exploring the Width-Precision Trade-Off in Binary-Quantized Vision Transformers
Multi-Object State Estimation using Probabilistic Belief-Based Trackers
Connecting Low-Frequency Detection and High-Rate Prediction on Embedded Devices
Transformer Inference using MAD vs LUT Kernels
A Comparative Benchmark of MAD and LUT Kernels for Binary and Ternary Dot Products on CPU and Edge Platforms
This thesis isolates the trade-off by sweeping LUT depth as a single controlled variable, spanning matrix sizes across the cache hierarchy and attributing results through roofline analysis on an x86 platform with AVX2 and roofline plus hardware-counter analysis on an ARM edge platform with NEON. The LUT advantage proves conditional: binary throughput rises monotonically with depth to 104.4 GOPS, roughly 2.6 times the strongest MAD baseline, while ternary gains are narrower and erode once the table outgrows fast cache or forces a gather. Throughout, instruction throughput, not bandwidth, is the binding limit.
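The difference between the two kernel families can be sketched for the binary case. A MAD kernel multiplies and accumulates every weight-activation pair, while a LUT kernel precomputes, for each group of `depth` activations, the partial sum for every sign pattern and then indexes that table with the packed weight bits. This is an illustrative pure-Python sketch; the function names and the bit-packing order are assumptions, not the thesis's implementation.

```python
def mad_matvec(rows, acts):
    """Multiply-add baseline: each weight is +1 or -1."""
    return [sum(w * a for w, a in zip(row, acts)) for row in rows]


def build_tables(acts, depth):
    """One table of 2**depth partial sums per group of `depth` activations.

    Bit i of the table index set means weight i of the group is +1.
    """
    tables = []
    for start in range(0, len(acts), depth):
        group = acts[start:start + depth]
        tables.append([
            sum(a if (idx >> i) & 1 else -a for i, a in enumerate(group))
            for idx in range(1 << depth)
        ])
    return tables


def pack_row(row, depth):
    """Pack each group of `depth` +1/-1 weights into a table index."""
    return [
        sum(1 << i for i, w in enumerate(row[start:start + depth]) if w == 1)
        for start in range(0, len(row), depth)
    ]


def lut_matvec(packed_rows, tables):
    """One table lookup per group replaces `depth` multiply-adds."""
    return [sum(t[i] for t, i in zip(tables, packed)) for packed in packed_rows]


rows = [[1, -1, 1, 1], [-1, -1, 1, -1]]
acts = [3, 5, -2, 7]
depth = 2
tables = build_tables(acts, depth)          # built once per activation vector
packed = [pack_row(r, depth) for r in rows]  # done offline for fixed weights
mad_result = mad_matvec(rows, acts)
lut_result = lut_matvec(packed, tables)
```

The tables are shared by all weight rows, so their construction cost is amortized, but each table has `2**depth` entries. This growth is consistent with the abstract's observation that gains erode once the table outgrows fast cache.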
Structured Degradation in Visible Light Positioning
Modeling and Compensation of Long-Term Degradation in RSS-Based VLP System
The proposed method combines scaling-based compensation for gradual degradation with anomaly detection for sudden degradation events such as broken LEDs. This method is tested through a long-term deployment simulation using the DenseVLC dataset and is also implemented on a Raspberry Pi Pico to assess embedded feasibility. The results show that VLP systems suffer increasing errors over time, while degradation-aware compensation improves long-term robustness. However, embedded deployment introduces accuracy trade-offs due to quantization and memory constraints.
These results show that modeling and compensating for degradation mechanisms is important for reliable long-term VLP deployment, and that compensation methods need to account for both gradual and sudden changes in received signal strength.
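One way to picture the combination of scaling-based compensation and anomaly detection is sketched below. It assumes that a per-LED gain can be estimated by comparing current RSS readings at a calibration point with readings taken when the LEDs were new, and that an LED is treated as failed when its gain drops below a fixed ratio. The function names, the calibration procedure, and the `failure_ratio` value are all hypothetical and not taken from the thesis.

```python
def estimate_gains(current_cal, reference_cal):
    """Per-LED ratio of current to original RSS at a calibration point."""
    return [c / r for c, r in zip(current_cal, reference_cal)]


def compensate(measurement, gains, failure_ratio=0.2):
    """Rescale readings by the estimated gain; flag LEDs that look broken.

    Returns (corrected readings, indices of flagged LEDs). A flagged LED
    gets None because its reading carries no usable signal.
    """
    corrected, failed = [], []
    for i, (m, g) in enumerate(zip(measurement, gains)):
        if g < failure_ratio:      # sudden degradation: treat as anomaly
            failed.append(i)
            corrected.append(None)
        else:                      # gradual degradation: undo the scaling
            corrected.append(m / g)
    return corrected, failed


reference_cal = [100.0, 80.0, 50.0]  # hypothetical RSS when LEDs were new
current_cal = [50.0, 10.0, 50.0]     # hypothetical RSS after aging
gains = estimate_gains(current_cal, reference_cal)
corrected, failed = compensate([40.0, 2.0, 30.0], gains)
```

In this toy run the first LED has dimmed to half its output and is rescaled, the second is flagged as failed, and the third is unchanged.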
Embedded Trustworthy AI for Healthcare
A Multi-Objective Study of Fairness, Privacy, and Efficiency under TinyML Constraints
Exploring the feasibility of short-range VLC schemes in MIMO systems
Otsu Thresholding and Sliding Window Protocols
With radio communication bandwidth becoming increasingly scarce and expensive, researchers have turned toward the light medium, namely the field of Visible Light Communication (VLC). Although VLC was pioneered in the late 1800s, it faced criticism from scientists of that era, who preferred radio communication instead. VLC has since regained attention by complementing existing radio communication methods.
This research paper focuses on exploring different short-range multiple-input multiple-output (MIMO) screen-to-camera VLC schemes operating solely on the red optical channel. The transmitting screen is a 4×6 LED grid on a prototype board, while the receiver is an off-the-shelf smartphone back camera. The chosen modulation technique is on-off keying (OOK) with Manchester encoding (ME), while demodulation is performed using three different strategies, the first two using Otsu thresholding and the last using a sliding window approach.
Our experiments show that, while the modulation scheme achieves a transmission rate of 6 symbols per LED per frame (up to 144 symbols per frame) and a bit error rate (BER) of less than 10⁻¹, the limited resolution and frame rate make it difficult to reliably include important data frame header fields such as the sequence number.
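The modulation and thresholding steps can be sketched as follows. The sketch assumes the Manchester convention 1 -> (on, off) and 0 -> (off, on), 8-bit intensity samples, and one global Otsu threshold per set of samples; the abstract does not state which convention or which thresholding scope the three demodulation strategies use, and all function names are illustrative.

```python
def manchester_encode(bits):
    """Map each bit to two OOK symbols: 1 -> (1, 0), 0 -> (0, 1)."""
    out = []
    for b in bits:
        out.extend((1, 0) if b else (0, 1))
    return out


def otsu_threshold(samples, levels=256):
    """Otsu's method: choose the threshold maximizing between-class variance.

    Samples <= threshold are classified as off, samples > threshold as on.
    """
    hist = [0] * levels
    for s in samples:
        hist[s] += 1
    total = len(samples)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def manchester_decode(symbols):
    """Decode symbol pairs; an invalid pair yields None (symbol error)."""
    table = {(1, 0): 1, (0, 1): 0}
    return [table.get((symbols[i], symbols[i + 1]))
            for i in range(0, len(symbols) - 1, 2)]


sent = manchester_encode([1, 0, 1])
intensities = [210, 35, 28, 190, 205, 40]  # hypothetical red-channel samples
threshold = otsu_threshold(intensities)
received = [1 if s > threshold else 0 for s in intensities]
decoded = manchester_decode(received)
symbols_per_frame = 4 * 6 * 6  # 24 LEDs, 6 symbols per LED per frame
```

The last line reproduces the figure from the abstract: a 4×6 grid at 6 symbols per LED gives 144 symbols per frame, which corresponds to 72 data bits under Manchester encoding because each bit occupies two symbols.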
Embedded Spacecraft Fault Detection
A Hitchhiker's Guide to Explainable Thermal Anomaly Alerts for Downlink-Constrained Space Missions
This thesis addresses these challenges by proposing a novel, purely event-based eye tracking pipeline designed for high-frequency performance and robust accuracy within a strict computational budget. The pipeline accepts only event streams and estimates the pupil region in the field of view. The core contribution is a dual-state framework that combines a deep learning-based pupil detector with a lightweight, rapid template updater. For robust detection, a lightweight, attention-augmented segmentation network, named PupilUNet, is developed. It leverages a truncated MobileNetV3 Small encoder and a parameter-free attention mechanism to accurately segment the pupil boundary from Speed-Invariant Time Surface (SITS) representations, which provide a stable input by normalizing for motion speed. To overcome the scarcity of annotated data, a comprehensive framework is introduced to augment a large-scale training dataset from limited initial labels. Once a high-confidence pupil template is detected, the system transitions to a rapid updating mode, employing an optimized, vectorized point-to-edge matching algorithm to track the pupil at kilohertz frequencies with millisecond latency. A dynamic control logic monitors tracking quality and reverts to the robust detection mode when necessary, ensuring both speed and resilience.
Experimental results on the EV-Eye dataset validate the pipeline’s effectiveness. The PupilUNet detector achieves a P5 accuracy of 96.3% (pupil center error < 5 pixels), while the rapid updater operates with an average latency of approximately 1 ms. The lightweight PupilUNet model contains only 0.177 M parameters and performs inference within 0.553 GFLOPs. The fully integrated system sustains a P5 accuracy of 85.2% while achieving a peak tracking frequency of over 960 Hz. This work demonstrates a practical and efficient solution that navigates the trade-offs between accuracy and latency, establishing a new baseline for high-performance, event-based eye tracking on mobile and embedded systems.
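The P5 metric quoted above can be computed as the fraction of frames whose predicted pupil center lies less than 5 pixels from the ground truth. The sketch below follows that definition; the function name and the sample coordinates are illustrative and do not come from the EV-Eye dataset.

```python
import math


def p_accuracy(pred, truth, tol=5.0):
    """Fraction of frames with Euclidean center error strictly below tol."""
    hits = sum(
        1
        for (px, py), (tx, ty) in zip(pred, truth)
        if math.hypot(px - tx, py - ty) < tol
    )
    return hits / len(truth)


pred = [(10, 10), (20, 20), (30, 30), (40, 40)]    # hypothetical predictions
truth = [(10, 12), (26, 20), (30, 30), (43, 43)]   # hypothetical labels
p5 = p_accuracy(pred, truth)          # errors: 2.0, 6.0, 0.0, about 4.24
frame_period_ms = 1000.0 / 960.0      # time budget per update at 960 Hz
```

At 960 Hz each update has a budget of roughly 1.04 ms, which is in line with the reported average updater latency of approximately 1 ms.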
between the root-mean-square error and maximum error of the single-antenna baseline and the proposed multi-antenna solution for both spatial and sequential consistency in a complex multipath office environment shows that there is, on average, a 58% reduction in error metrics when the optimal multi-antenna setup is used. The performance of the optimal multi-antenna channel sounding setup in the complex environment approaches the single-antenna baseline performance in an ideal outdoor environment. This shows that the added antenna diversity successfully overcomes the negative effects due to multipath propagation.
TinyML-Empowered Indoor Positioning with Light
Model Optimization using Neural Architecture Search
The accuracy of a received signal strength (RSS) based visible light positioning (VLP) system depends heavily on the density of the collected fingerprints, and collecting them is a very labor-intensive process.
In this study, we focus on RSS fingerprints to achieve centimetre-level positioning accuracy, while addressing the challenges of labor-intensive fingerprint collection and deployment on resource-constrained devices like the Raspberry Pi Pico microcontroller.
We use Neural Architecture Search (NAS) to find neural network architectures that optimize the VLP system; on average, they achieve a positioning error of 12 mm with a low inference latency of around 50 ms on the Raspberry Pi Pico.
TinyML-Based Adaptive Speed Control for Car Robot
A Comparative Approach
TinyML-Empowered Line Following for a Car Robot
Evaluating the Capabilities of Various Lane Detection Models on Microcontrollers
TinyML-Empowered Indoor Positioning with Light
A Study on the Impact of LED Aging and Failure
In our simulations, this approach maintains the original level of accuracy despite aging effects. In some cases, it yields up to a 95% improvement when evaluated over longer timespans. Furthermore, our preprocessing contributions have led to a 30% improvement to baseline performance without aging. Our results demonstrate a path toward scalable, self-sustaining VLP systems suitable for real-world deployment.