Lipeng Gu | TU Delft Repository

CrossTracker

Robust Multi-Modal 3D Multi-Object Tracking via Cross Correction

Journal article (2026) - Lipeng Gu, Xuefeng Yan, Weiming Wang, Honghua Chen, Dingkun Zhu, Liangliang Nan, Mingqiang Wei

Inaccurate detections remain a critical bottleneck in 3D multi-object tracking (MOT). Recent detection fusion-based methods use camera detections as a supplementary cue to reduce false detections and compensate for missed ones in LiDAR. However, their unidirectional camera-LiDAR correction lacks a feedback mechanism, precluding iterative mutual refinement between modalities for more robust LiDAR-based tracking. Inspired by the coarse-to-fine strategy in two-stage object detection, we introduce CrossTracker, a novel two-stage framework for online multi-modal 3D MOT. CrossTracker first constructs coarse camera and LiDAR trajectories independently, then performs trajectory fusion using both current and historical frames, without requiring future data. This ensures more robust mutual refinement between modalities. Specifically, CrossTracker comprises three core modules: i) the multi-modal modeling (M3) module, which fuses data from images, point clouds, and even planar geometry derived from images to establish a robust tracking constraint; ii) the coarse trajectory generation (C-TG) module, which independently generates coarse trajectories for both modalities using the M3 constraint; and iii) the trajectory fusion (TF) module, which applies mutual refinement between coarse LiDAR and camera trajectories through cross correction to ensure robust LiDAR trajectories. Extensive experiments show that CrossTracker outperforms 19 state-of-the-art methods, highlighting its effectiveness in leveraging the synergistic strengths of camera and LiDAR sensors for robust multi-modal 3D MOT. The code is available at https://github.com/lipeng-gu/CrossTracker. ...

SimLOG

Simultaneous Local-Global Feature Learning for 3D Object Detection in Indoor Point Clouds

Journal article (2024) - Mingqiang Wei, Baian Chen, Liangliang Nan, Haoran Xie, Lipeng Gu, Dening Lu, Fu Lee Wang, Qing Li

The acquisition of both local and global features from irregular point clouds is crucial for 3D object detection (3DOD). Current mainstream 3D detectors neglect significant local features during pooling operations or disregard many global features of the overall scene context. This paper proposes new techniques for simultaneously learning local-global features of scene point clouds to enhance 3DOD. Specifically, we propose an efficient 3DOD network for indoor point clouds, named SimLOG, which utilizes simultaneous local-global feature learning. SimLOG has two main contributions: a Dynamic Points Interaction (DPI) module to recover local features lost during pooling, and a Global Context Aggregation (GCA) module to aggregate multi-scale features from various layers of the encoder to improve scene context awareness. Unlike traditional local-global feature learning methods, our DPI and GCA modules are integrated into a single feature learning module, so it can be detached easily and incorporated into existing 3DOD networks to enhance their performance. SimLOG demonstrates superior performance over twenty competitors in terms of detection accuracy and robustness on both the SUN RGB-D and ScanNet V2 datasets. Specifically, SimLOG boosts the baseline VoteNet by 8.1% mAP@0.25 on ScanNet V2 and by 3.9% mAP@0.25 on SUN RGB-D. ...
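
For readers unfamiliar with the mAP@0.25 metric quoted above: a predicted box counts as a correct detection when its 3D intersection-over-union (IoU) with a ground-truth box is at least 0.25. The following is only a minimal sketch of that matching test, assuming axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax). The function names are hypothetical and are not taken from SimLOG or VoteNet, and the full mAP computation (ranking by confidence, averaging over classes) is not shown.

```python
import numpy as np


def axis_aligned_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a = np.asarray(box_a, dtype=float)
    b = np.asarray(box_b, dtype=float)
    # Overlap extent along each axis, clipped at zero for disjoint boxes.
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = float(np.prod(np.clip(hi - lo, 0.0, None)))
    vol_a = float(np.prod(a[3:] - a[:3]))
    vol_b = float(np.prod(b[3:] - b[:3]))
    return inter / (vol_a + vol_b - inter)


def is_match(pred_box, gt_box, threshold=0.25):
    """True if the prediction overlaps the ground truth enough to count as a hit."""
    return axis_aligned_iou_3d(pred_box, gt_box) >= threshold
```

Two 2x2x2 boxes shifted by one unit along x overlap with an IoU of 1/3, so they match at the 0.25 threshold but would not match at a stricter 0.5 threshold.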

PointeNet

A lightweight framework for effective and efficient point cloud analysis

Journal article (2024) - Lipeng Gu, Xuefeng Yan, Liangliang Nan, Dingkun Zhu, Honghua Chen, Weiming Wang, Mingqiang Wei

Conventional point cloud analysis predominantly explores 3D geometries, often by introducing intricate learnable geometric extractors in the encoder or by deepening networks with repeated blocks. However, these methods contain a significant number of learnable parameters, resulting in substantial computational costs and imposing memory burdens on CPU/GPU. Moreover, they are primarily tailored for object-level point cloud classification and segmentation tasks, with limited extensions to crucial scene-level applications, such as autonomous driving. To this end, we introduce PointeNet, an efficient network designed specifically for point cloud analysis. PointeNet distinguishes itself with its lightweight architecture, low training cost, and plug-and-play capability, while also effectively capturing representative features. The network consists of a Multivariate Geometric Encoding (MGE) module and an optional Distance-aware Semantic Enhancement (DSE) module. MGE uses sampling, grouping, pooling, and multivariate geometric aggregation to capture and adaptively aggregate multivariate geometric features at low cost, providing a comprehensive depiction of 3D geometries. DSE, designed for real-world autonomous driving scenarios, enhances the semantic perception of point clouds, particularly for distant points. Our method demonstrates flexibility by seamlessly integrating with a classification/segmentation head or embedding into off-the-shelf 3D object detection networks, achieving notable performance improvements at a minimal cost. Extensive experiments on object-level datasets, including ModelNet40, ScanObjectNN, ShapeNetPart, and the scene-level dataset KITTI, demonstrate the superior performance of PointeNet over state-of-the-art methods in point cloud analysis. Notably, PointeNet outperforms PointMLP with significantly fewer parameters on ModelNet40, ScanObjectNN, and ShapeNetPart, and achieves a substantial improvement of over 2% in 3D AP_R40 for PointRCNN on KITTI with a minimal parameter cost of 1.4 million. Code is publicly available at https://github.com/lipeng-gu/PointeNet
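
The sampling, grouping, and pooling steps named in the abstract can be illustrated with a generic, parameter-free sketch. This is not the MGE module from the PointeNet repository. It assumes farthest point sampling for the sampling step, k-nearest-neighbor grouping, and max pooling over hand-picked features (relative offsets and distances), and every function name below is hypothetical.

```python
import numpy as np


def farthest_point_sampling(points, num_samples):
    """Return indices of num_samples points chosen greedily to be far apart."""
    n = points.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    # Squared distance from each point to its nearest already-selected point.
    min_dist = np.full(n, np.inf)
    current = 0  # assumption: start from the first point for determinism
    for i in range(num_samples):
        selected[i] = current
        dist = np.sum((points - points[current]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, dist)
        current = int(np.argmax(min_dist))
    return selected


def knn_group(points, centroids, k):
    """Return, for each centroid, the indices of its k nearest points."""
    # Pairwise squared distances, shape (num_centroids, num_points).
    d2 = np.sum((centroids[:, None, :] - points[None, :, :]) ** 2, axis=2)
    return np.argsort(d2, axis=1)[:, :k]


def encode_local_geometry(points, num_samples, k):
    """Sample centroids, group neighbors, build simple features, max-pool."""
    centroids = points[farthest_point_sampling(points, num_samples)]  # (S, 3)
    groups = points[knn_group(points, centroids, k)]                  # (S, k, 3)
    offsets = groups - centroids[:, None, :]                # relative coordinates
    dists = np.linalg.norm(offsets, axis=2, keepdims=True)  # (S, k, 1)
    features = np.concatenate([offsets, dists], axis=2)     # (S, k, 4)
    pooled = features.max(axis=1)                           # (S, 4)
    return centroids, pooled
```

Calling `encode_local_geometry(points, num_samples=8, k=4)` on an `(N, 3)` array returns 8 centroids and one 4-dimensional pooled descriptor per centroid. The sketch has no learnable parameters, which is the general idea behind keeping such an encoder lightweight.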
