Classification of Flying Objects using Multi-Camera 4D Gaussian Splatting
A.A.F. Verdiesen (TU Delft - Electrical Engineering, Mathematics and Computer Science)
H.P. Hofstee – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Wim Bos – Mentor (Lumiad BV)
M. Weinmann – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Z. Al-Ars – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Monitoring the lower airspace for small drones, and distinguishing them from birds, helicopters, and airplanes, is a growing security need that radar, radio-frequency, and acoustic sensors meet only at considerable cost. This thesis asks whether a ground-based network of synchronized, overlapping RGB cameras can instead reconstruct and classify flying objects directly in 3D, recovering range through multi-view geometry rather than a long-range sensor. The central hypothesis is that the temporal evolution of a 3D Gaussian Splatting representation carries motion cues more discriminative than per-frame 2D or static 3D appearance.
Four contributions support this investigation, which, to our knowledge, is the first to classify flying objects using temporal 4D Gaussian features. AeroSplat-4D is a synthetic multi-camera dataset and NVIDIA Isaac Sim pipeline emitting synchronized RGB, instance masks, depth, 3D trajectories, and exact calibration across the four classes, with class-balanced, identity-disjoint splits. DepthSplat-OC adapts feed-forward Gaussian splatting to thin, distant targets against a texture-less sky via a mask-gated photometric loss. MambaSplat-4D, the main contribution, classifies the temporal Gaussian sequences by pairing a rotation-equivariant Vector-Neuron Transformer with a linear-time Mamba temporal encoder, enforcing SO(3) invariance architecturally rather than through augmentation.
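To illustrate what enforcing SO(3) invariance architecturally can mean, the following is a minimal NumPy sketch, not the MambaSplat-4D implementation. It assumes vector-valued features of shape (C, 3), as used in Vector-Neuron-style networks, and reduces them to inner products between channels, which do not change when every vector is rotated by the same matrix. The function names `random_rotation` and `invariant_features`, and the choice of 8 channels, are hypothetical and chosen only for this example.

```python
import numpy as np


def random_rotation(rng):
    """Draw a random proper rotation matrix (det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs so the factorization is unique
    if np.linalg.det(q) < 0:         # flip one axis to land in SO(3) rather than O(3)
        q[:, 0] = -q[:, 0]
    return q


def invariant_features(v):
    """Map vector features v of shape (C, 3) to rotation-invariant scalars.

    The Gram matrix v @ v.T holds only inner products between channels,
    and inner products are unchanged when all vectors are rotated together.
    """
    return (v @ v.T).ravel()


rng = np.random.default_rng(0)
v = rng.standard_normal((8, 3))      # 8 hypothetical vector-valued channels
R = random_rotation(rng)
v_rot = v @ R.T                      # rotate every channel by the same R

# Largest deviation between features of the original and rotated input;
# it should sit at floating-point round-off level.
max_diff = np.max(np.abs(invariant_features(v) - invariant_features(v_rot)))
print(max_diff)
```

Because the invariance follows from the algebra (the rotation cancels inside the inner product) rather than from training data, a classifier built on such features cannot change its prediction under rotation. This is consistent with the abstract's contrast between architectural invariance and augmentation.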
In an augmentation-free ablation, aggregating a 24-frame clip rather than classifying a single frame raises accuracy from 59.1 % to 78.8 %, confirming that motion, not single-frame appearance, drives discrimination. Because SO(3) invariance is enforced architecturally, the full-attribute model attains the same 70.2 % four-class accuracy on clean and arbitrarily rotated data, about eight percentage points above a position-only baseline; it trails the strongest temporal baseline by roughly five points on clean data but is uniquely robust under rotation, with zero classification changes across 9600 rotated forward passes. DepthSplat-OC surpasses the closest-protocol baseline (24.65 versus 21.44 PSNR) despite roughly two orders of magnitude less training compute, and the compact 1.9 M-parameter classifier runs in under a millisecond per frame. On the out-of-distribution probe the pipeline does not yet surpass the 2D baselines, a gap that likely reflects their ImageNet-pretrained (∼1.2M-image) backbones rather than a limit of the 3D representation. Real-camera transfer remains open. The core of the pipeline is released as open-source software at github.com/lumiad-bv/MambaSplat-4D.
This work thereby points toward multi-view 3D reconstruction and temporal reasoning as an effective alternative to the per-frame 2D detection that currently dominates aerial object classification.
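To put the reported PSNR comparison (24.65 versus 21.44) in perspective, the sketch below uses the standard definition PSNR = 10 · log10(MAX² / MSE) and converts a PSNR gap into a ratio of mean squared errors. It assumes both figures are in decibels and computed with the same peak value, which the abstract does not state. The helper names `psnr` and `mse_ratio_from_psnr_gap` are hypothetical.

```python
import numpy as np


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0:
        return float("inf")          # identical images have unbounded PSNR
    return 10.0 * np.log10(max_value ** 2 / mse)


def mse_ratio_from_psnr_gap(psnr_a, psnr_b):
    """Return MSE_b / MSE_a implied by two PSNR values with the same peak."""
    return 10.0 ** ((psnr_a - psnr_b) / 10.0)


# Gap between the two PSNR figures quoted in the abstract.
ratio = mse_ratio_from_psnr_gap(24.65, 21.44)
print(f"implied MSE ratio: {ratio:.2f}")
```

Under these assumptions, the 3.21 dB gap corresponds to the baseline having roughly twice the mean squared reconstruction error.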