C. Gao
Radar-based human activity recognition (RadHAR) is an attractive alternative to wearables and cameras because it preserves privacy, is contactless, and is robust to occlusions. However, dominant convolutional neural network (CNN)- and recurrent neural network (RNN)-based solutions are computationally intensive at deployment, and recent lightweight vision transformer (ViT) and state-space model (SSM) variants still exhibit substantial complexity. In this article, we present RadMamba, a parameter-efficient, micro-Doppler-oriented Mamba SSM tailored to radar HAR under on-sensor compute, latency, and energy constraints typical of distributed radar systems. RadMamba combines 1) channel fusion with downsampling; 2) Doppler-aligned segmentation that preserves the physical continuity of Doppler over time; and 3) convolutional token projections that better capture Doppler-span variations, thereby retaining temporal–Doppler structure while reducing the number of floating-point operations per inference (#FLOP/Inf.). Evaluated across three datasets with different radars and types of activities, RadMamba matches the prior best 99.8% accuracy of a recent SSM-based model on the continuous wave (CW) radar dataset, while requiring only 1/400 of its parameters. On a dataset of non-continuous activities with frequency-modulated continuous wave (FMCW) radar, RadMamba remains competitive with leading 92.0% results using about 1/10 of the parameters, and on a continuous FMCW radar dataset it surpasses methods with far more parameters by at least 3%, using only 6.7k parameters.
FACET
Fast and Accurate Event-Based Eye Tracking Using Ellipse Modeling for Extended Reality
Eye tracking is a key technology for gaze-based interactions in Extended Reality (XR), but traditional frame-based systems struggle to meet XR's demands for high accuracy, low latency, and power efficiency. Event cameras offer a promising alternative due to their high temporal resolution and low power consumption. In this paper, we present FACET (Fast and Accurate Event-based Eye Tracking), an end-to-end neural network that directly outputs pupil ellipse parameters from event data, optimized for real-time XR applications. The ellipse output can be directly used in subsequent ellipse-based pupil trackers. To train the model, we enhance the EV-Eye dataset by expanding the annotated data and converting the original mask labels to ellipse-based annotations. In addition, we adopt a novel trigonometric loss to address angle discontinuities and propose a fast causal event volume as the event representation. On the enhanced EV-Eye test set, FACET achieves an average pupil center error of 0.20 pixels and an inference time of 0.53 ms, reducing pixel error and inference time by 1.6× and 1.8×, respectively, compared to the prior art, EV-Eye, with 4.4× fewer parameters and 11.7× fewer arithmetic operations. The code is available at https://github.com/DeanJY/FACET.
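The abstract does not spell out the trigonometric loss, so the following is only an illustrative sketch of why such a loss helps, not FACET's actual formulation. It assumes the ellipse orientation is an angle in radians with period π (an ellipse rotated by π is the same ellipse) and compares sine and cosine of the doubled angle, so that predictions on either side of the wrap-around are treated as close. The function name orientation_loss is hypothetical.

```python
import math


def orientation_loss(pred_angle, true_angle):
    """Squared distance between (sin 2a, cos 2a) encodings of two angles.

    Assumption: angles are ellipse orientations in radians, periodic in pi.
    Doubling the angle maps that period onto the full circle, so the loss
    is continuous across the wrap-around point.
    """
    dp = 2.0 * pred_angle
    dt = 2.0 * true_angle
    return (math.sin(dp) - math.sin(dt)) ** 2 + (math.cos(dp) - math.cos(dt)) ** 2


# Orientations 0 and pi describe the same ellipse: the loss is ~0,
# whereas a plain squared error on the raw angle would be pi**2.
same_ellipse = orientation_loss(0.0, math.pi)
# Orientations 0 and pi/2 are maximally different: the loss peaks at 4.
perpendicular = orientation_loss(0.0, math.pi / 2)
```

A plain squared error on the raw angle would penalise the first case heavily even though the two ellipses coincide, which is the kind of discontinuity a trigonometric loss is meant to avoid.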
SlimSeiz
Efficient Channel-Adaptive Seizure Prediction Using a Mamba-Enhanced Network
Epileptic seizures cause abnormal brain activity, and their unpredictability can lead to accidents, underscoring the need for long-term seizure prediction. Although seizures can be predicted by analyzing electroencephalogram (EEG) signals, existing methods often require many channels or large models, limiting mobile usability. This paper introduces SlimSeiz, a framework that combines adaptive channel selection with a lightweight neural network model. SlimSeiz operates in two stages: the first stage selects the optimal channel set for seizure prediction using machine learning algorithms, and the second stage employs a lightweight neural network based on convolution and Mamba for prediction. On the Children's Hospital Boston-MIT (CHB-MIT) EEG dataset, SlimSeiz reduces the number of channels from 22 to 8 while achieving 94.8% accuracy, 95.5% sensitivity, and 94.0% specificity with only 21.2 K model parameters, matching or outperforming larger models. We also validate SlimSeiz on a new EEG dataset, SRH-LEI, collected from Shanghai Renji Hospital, demonstrating its effectiveness across different patients. The code and SRH-LEI dataset are available at https://github.com/guoruilu/SlimSeiz.
CleanUMamba
A Compact Mamba Network for Speech Denoising using Channel Pruning
This paper presents CleanUMamba, a time-domain neural network architecture designed for real-time causal audio denoising directly applied to raw waveforms. CleanUMamba leverages a U-Net encoder-decoder structure, incorporating the Mamba state-space model in the bottleneck layer. By replacing conventional self-attention and LSTM mechanisms with Mamba, our architecture offers superior denoising performance while maintaining a constant memory footprint, enabling streaming operation. To enhance efficiency, we applied structured channel pruning, achieving an 8X reduction in model size without compromising audio quality. Our model demonstrates strong results in the Interspeech 2020 Deep Noise Suppression challenge. Specifically, CleanUMamba achieves a PESQ score of 2.42 and STOI of 95.1% with only 442K parameters and 468M MACs, matching or outperforming larger models in real-time performance. Code will be available at: https://github.com/lab-emi/CleanUMamba
TCN-DPD
Parameter-Efficient Temporal Convolutional Networks for Wideband Digital Predistortion
Digital predistortion (DPD) is essential for mitigating nonlinearity in RF power amplifiers, particularly for wideband applications. This paper presents TCN-DPD, a parameter-efficient architecture based on temporal convolutional networks, integrating noncausal dilated convolutions with optimized activation functions. Evaluated on the OpenDPD framework with the DPA_200 MHz dataset, TCN-DPD achieves simulated ACPRs of -51.58/-49.26 dBc (L/R), an EVM of -47.52 dB, and an NMSE of -44.61 dB with 500 parameters, and maintains better linearization than prior models down to 200 parameters, making it promising for efficient wideband PA linearization.
DeltaDPD
Exploiting Dynamic Temporal Sparsity in Recurrent Neural Networks for Energy-Efficient Wideband Digital Predistortion
Digital predistortion (DPD) is a popular technique to enhance signal quality in wideband radio frequency (RF) power amplifiers (PAs). With increasing bandwidth and data rates, DPD faces significant energy consumption challenges during deployment, contrasting with its efficiency goals. State-of-the-art DPD models rely on recurrent neural networks (RNNs), whose computational complexity hinders system efficiency. This letter introduces DeltaDPD, exploring the dynamic temporal sparsity of input signals and neuronal hidden states in RNNs for energy-efficient DPD, reducing arithmetic operations and memory accesses while preserving satisfactory linearization performance. Applying a TM3.1a 200 MHz-BW 256-QAM OFDM signal to a 3.5-GHz GaN Doherty RF PA, DeltaDPD achieves −50.03 dBc in adjacent channel power ratio (ACPR), −37.22 dB in normalized mean square error (NMSE), and −38.52 dB in error vector magnitude (EVM) with 52% temporal sparsity, leading to a 1.8× reduction in estimated inference power.
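For readers less familiar with the NMSE figures quoted in the DPD abstracts, the sketch below computes NMSE in dB using the standard definition: error energy divided by reference energy, expressed as 10·log10. The function name nmse_db and the toy I/Q samples are illustrative only; the exact signals and alignment steps used in the papers are not given in the abstracts.

```python
import math


def nmse_db(reference, measured):
    """Normalized mean square error in dB for two equal-length sequences.

    Samples may be real or complex (e.g. baseband I/Q). More negative
    values mean the measured signal is closer to the reference.
    """
    if len(reference) != len(measured):
        raise ValueError("sequences must have the same length")
    error_energy = sum(abs(r - m) ** 2 for r, m in zip(reference, measured))
    reference_energy = sum(abs(r) ** 2 for r in reference)
    if reference_energy == 0:
        raise ValueError("reference signal has zero energy")
    if error_energy == 0:
        return float("-inf")  # identical signals
    return 10.0 * math.log10(error_energy / reference_energy)


# Toy example: a 10% amplitude error on one of two unit-energy samples.
ideal = [1 + 0j, 0 + 1j]
distorted = [1.1 + 0j, 0 + 1j]
toy_nmse = nmse_db(ideal, distorted)  # about -23.0 dB
```

On this scale, the −37.22 dB reported above corresponds to an error energy roughly 5000 times smaller than the reference energy.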
DPD-NeuralEngine
A 22-nm 6.6-TOPS/W/mm2 Recurrent Neural Network Accelerator for Wideband Power Amplifier Digital Pre-Distortion
The increasing adoption of Deep Neural Network (DNN)-based Digital Pre-distortion (DPD) in modern communication systems necessitates efficient hardware implementations. This paper presents DPD-NeuralEngine, an ultra-fast, tiny-area, and power-efficient DPD accelerator based on a Gated Recurrent Unit (GRU) neural network (NN). Leveraging a co-designed software and hardware approach, our 22 nm CMOS implementation operates at 2 GHz, capable of processing I/Q signals up to 250 MSps. Experimental results demonstrate a throughput of 256.5 GOPS and power efficiency of 1.32 TOPS/W with DPD linearization performance measured in Adjacent Channel Power Ratio (ACPR) of -45.3 dBc and Error Vector Magnitude (EVM) of -39.8 dB. To our knowledge, this work represents the first AI-based DPD application-specific integrated circuit (ASIC) accelerator, achieving a power-area efficiency (PAE) of 6.6
Artificial intelligence (AI) has made significant strides towards efficient online processing of sensory signals at the edge through the use of deep neural networks with ever-expanding size. However, this trend has brought with it escalating computational costs and energy consumption, which have become major obstacles to the deployment and further upscaling of these models. In this Perspective, we present a neuro-inspired vision to boost the energy efficiency of AI for perception by leveraging brain-like dynamic sparsity. We categorize various forms of dynamic sparsity rooted in data redundancy and discuss potential strategies to enhance and exploit it through algorithm-hardware co-design. Additionally, we explore the technological, architectural, and algorithmic challenges that need to be addressed to fully unlock the potential of dynamic-sparsity-aware neuro-inspired AI for energy-efficient perception.
This article introduces a 4 × 2-way Doherty power amplifier (PA) tailored for millimeter-wave (mm-wave) 5G applications. It incorporates an advanced output combiner that consists of four differential 2-way Doherty networks, two quadrature hybrid couplers (QHCs), and a balun to enhance the output power (Pout) and improve power back-off (PBO) efficiency. Realized in 40 nm bulk CMOS technology with a core area of 1.54 mm², the prototype delivers a saturated power/peak gain surpassing 25.2 dBm/25.5 dB, and it demonstrates a drain efficiency (DE) exceeding 17.5%/10% at 0 dB/6 dB PBO across a 26–32 GHz band. The proposed mm-wave PA achieves error vector magnitude (EVM)/adjacent channel leakage ratio (ACLR) values of −25 dB/−33 dBc for a 2 GHz 64-quadrature amplitude modulation (QAM) orthogonal frequency-division multiplexing (OFDM) signal with 9.6 dB PAPR, operating at an average output power (Pavg) of 11.3 dBm with an average drain efficiency (DEavg) of 4% without using digital predistortion (DPD). For a 50 MHz 1024-QAM OFDM signal with 10 dB PAPR, it achieves a Pavg/DEavg of 7.2 dBm/2% with EVM/ACLR of −35 dB/−42 dBc without DPD.
This study presents a novel image-based machine learning (ML) method for automating I–V parameter extraction in gallium nitride (GaN) devices. Using Ampleon’s GEAR model, a dataset of 100,000 simulated I–V curves is converted into I–V images through specifically designed transfer functions to train a convolutional neural network. The proposed method outperforms the existing ML method based on a fully connected neural network, particularly for I–V curves in the subthreshold region. Validation with measured pulse I–V data shows its superior accuracy, achieving a normalized mean square error (NMSE) of −30 dB compared with −24 dB for the existing ML method. The proposed method demonstrates strong potential to accelerate the extraction and enhance the accuracy of GaN device modeling.
HengNet
An Ultra-lightweight Model with Two-level Reuse Algorithm for Seizure Detection and Prediction
Traditional models based on electroencephalographic (EEG) signals for seizure monitoring encounter difficulties in simultaneously optimizing accuracy, response latency, and computational load. These challenges hinder their deployment in edge computing environments, where real-time local inference is critical. To address these issues, we introduce a novel network architecture, designated as HengNet. This architecture integrates a Two-level Reuse Algorithm (TRA), which strategically reuses outputs from intermediate layers, considerably reducing the average computational load per inference, which is vital for scenarios requiring frequent inferences. When tested on the CHB-MIT dataset, this patient-specific model attains classification accuracies of 95.67% and 99.60% for seizure prediction and detection, respectively. Notably, it maintains an average computational load of merely 0.05 million multiply-accumulate operations (MACs) per inference and has a compact model size of 6.87 K parameters. These results represent a significant advancement compared with existing methods. Operating at a rate of 32 inferences per second, the computational load of the model is reduced by more than 19.4 times for seizure prediction and by more than 6.4 times for seizure detection.
We present a sub-10-µW fully integrated SoC for on-device spoken language understanding (SLU). Its analog feature extractor (FEx) applies global and per-channel automatic gain control (AGC) to extend the system’s dynamic range (DR), a critical requirement for real-world scenarios, including far-field operations. The on-chip streaming-mode recurrent neural network (RNN) accelerator exploits temporal sparsity and pooling, reducing its power by 2.3×. By combining hardware-aware training with a behavioral model of the FEx that captures circuit nonidealities, the network is trained to maintain SLU accuracy despite chip-to-chip variation. Fabricated in a 65-nm CMOS process, the SoC occupies 2.23 mm² and consumes 8.62 µW for end-to-end SLU. The 16-channel FEx achieves 93-dB DR while dissipating 1.85 µW at 100-Hz feature frame rate. The SoC is evaluated on the 32-class Fluent Speech Commands dataset (FSCD), achieving 92.9% accuracy for 2.8-mV rms inputs while maintaining >85% accuracy over a 75-dB input range.
Recurrent Neural Networks (RNNs) are useful in temporal sequence tasks. However, training RNNs involves dense matrix multiplications which require hardware that can support a large number of arithmetic operations and memory accesses. Implementing online training of RNNs on the edge calls for optimized algorithms for an efficient deployment on hardware. Inspired by the spiking neuron model, the Delta RNN exploits temporal sparsity during inference by skipping over the update of hidden states from those inactivated neurons whose change of activation across two timesteps is below a defined threshold. This work describes a training algorithm for Delta RNNs that exploits temporal sparsity in the backward propagation phase to reduce computational requirements for training on the edge. Due to the symmetric computation graphs of forward and backward propagation during training, the gradient computation of inactivated neurons can be skipped. Results show a reduction of ∼80% in matrix operations for training a 56k parameter Delta LSTM on the Fluent Speech Commands dataset with negligible accuracy loss. Logic simulations of a hardware accelerator designed for the training algorithm show 2-10X speedup in matrix computations for an activation sparsity range of 50%-90%. Additionally, we show that the proposed Delta RNN training will be useful for online incremental learning on edge devices with limited computing resources.
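The thresholding idea described above can be sketched in a few lines. This is a minimal illustration of delta encoding for inference only, not the training algorithm of the paper. It assumes each neuron keeps a memory of the last value it actually propagated and emits a change only when the new value differs from that memory by at least the threshold. The helper names delta_encode and delta_matvec are hypothetical.

```python
def delta_encode(sequence, threshold):
    """Turn a sequence of activation vectors into thresholded delta vectors.

    Assumption: each neuron remembers the last value it propagated; a new
    value is propagated (and remembered) only if it differs from that
    memory by at least `threshold`. Otherwise the delta is zero.
    """
    memory = [0.0] * len(sequence[0])
    deltas = []
    for x in sequence:
        step = []
        for i, value in enumerate(x):
            change = value - memory[i]
            if abs(change) >= threshold:
                memory[i] = value
                step.append(change)
            else:
                step.append(0.0)  # neuron stays inactive this timestep
        deltas.append(step)
    return deltas


def delta_matvec(weights, delta, accumulator):
    """Add weights @ delta into accumulator, skipping zero deltas.

    Returns the number of multiply-accumulates actually performed.
    """
    macs = 0
    for j, d in enumerate(delta):
        if d == 0.0:
            continue  # whole weight column skipped: no compute, no fetch
        for i, row in enumerate(weights):
            accumulator[i] += row[j] * d
            macs += 1
    return macs


activations = [[1.0, 0.0], [1.05, 0.5], [1.3, 0.52]]
deltas = delta_encode(activations, threshold=0.2)
zero_fraction = sum(d == 0.0 for step in deltas for d in step) / 6  # 0.5
```

Each zero delta lets the accelerator skip an entire weight column, which is where the savings in arithmetic operations and memory accesses come from.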
OpenDPD
An Open-Source End-to-End Learning & Benchmarking Framework for Wideband Power Amplifier Modeling and Digital Pre-Distortion
With the rise in communication capacity, deep neural networks (DNN) for digital pre-distortion (DPD) to correct non-linearity in wideband power amplifiers (PAs) have become prominent. Yet, there is a void in open-source and measurement-setup-independent platforms for fast DPD exploration and objective DPD model comparison. This paper presents an open-source framework, OpenDPD, crafted in PyTorch, with an associated dataset for PA modeling and DPD learning. We introduce a Dense Gated Recurrent Unit (DGRU)-DPD, trained via a novel end-to-end learning architecture, outperforming previous DPD models on a digital PA (DPA) in the new digital transmitter (DTX) architecture with unconventional transfer characteristics compared to analog PAs. Measurements show our DGRU-DPD achieves an ACPR of -44.69/-44.47 dBc and an EVM of -35.22 dB for 200 MHz OFDM signals. OpenDPD code, datasets and documentation are publicly available at https://github.com/lab-emi/OpenDPD
Epilepsy is a common disease of the nervous system. Timely seizure prediction and intervention can significantly reduce accidental injuries and protect patients' lives and health. This paper presents a tiny neuromorphic Spiking Convolutional Transformer, named Spiking Conformer, to detect and predict epileptic seizure segments from long-term scalp electroencephalogram (EEG) recordings. We report evaluation results from the Spiking Conformer model using the Boston Children's Hospital-MIT (CHB-MIT) EEG dataset. By leveraging spike-based addition operations, the Spiking Conformer significantly reduces the classification computational cost compared to the non-spiking model. Additionally, we introduce an approximate spiking neuron layer to further reduce spike-triggered neuron updates by nearly 38% without sacrificing accuracy. Using raw EEG data as input, the proposed Spiking Conformer achieves an average sensitivity of 94.9% and a specificity of 99.3% for the seizure detection task, and 96.8% and 89.5%, respectively, for the seizure prediction task, while needing >10x fewer operations than the non-spiking equivalent model.
Digital predistortion (DPD) enhances signal quality in wideband radio frequency (RF) power amplifiers (PAs). As signal bandwidths expand in modern radio systems, DPD's energy consumption increasingly impacts overall system efficiency. Deep neural networks (DNNs) offer promising advancements in DPD, yet their high complexity hinders their practical deployment. This article introduces open-source mixed-precision (MP) neural networks that employ quantized low-precision fixed-point parameters for energy-efficient DPD. This approach reduces computational complexity and memory footprint, thereby lowering power consumption without compromising linearization efficacy. Applied to a 160-MHz-BW 1024-QAM OFDM signal from a digital RF PA, MP-DPD gives no performance loss against 32-bit floating-point precision DPDs, while achieving -43.75 (L)/-45.27 (R) dBc in the adjacent channel power ratio (ACPR) and -38.72 dB in error vector magnitude (EVM). A 16-bit fixed-point-precision MP-DPD enables a 2.8× reduction in estimated inference power. The DPD code in PyTorch is publicly available on GitHub.
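To make the idea of low-precision fixed-point parameters concrete, here is a minimal sketch of uniform signed fixed-point quantization with saturation. It is an assumption-laden illustration, not the MP-DPD quantization scheme: the abstract does not specify the rounding mode, the bit allocation, or how quantization is handled during training. The function name quantize_fixed_point and the 8-bit/6-fractional-bit format are chosen only for the example.

```python
import math


def quantize_fixed_point(value, total_bits, frac_bits):
    """Round a float to signed fixed point and return the dequantized float.

    Assumptions (not from the source): two's-complement range,
    round-half-up, and saturation instead of wrap-around on overflow.
    """
    scale = 2 ** frac_bits
    q_min = -(2 ** (total_bits - 1))
    q_max = 2 ** (total_bits - 1) - 1
    q = math.floor(value * scale + 0.5)  # nearest integer code
    q = max(q_min, min(q_max, q))        # saturate to representable range
    return q / scale


# 8-bit words with 6 fractional bits: step 1/64, range [-2.0, 1.984375].
w_small = quantize_fixed_point(0.3, 8, 6)     # 0.296875
w_clipped = quantize_fixed_point(5.0, 8, 6)   # 1.984375 (saturated)
```

Fewer bits shrink both the multiplier width and the memory needed per parameter, which is the source of the power savings, at the cost of the rounding and clipping error visible in the two example values.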