Recent progress in visual imitation learning has shown that diffusion models are a powerful tool for training robots to perform complex manipulation tasks. While 3D Diffusion Policy uses a point cloud representation to improve spatial reasoning and sample efficiency, it still struggles to generalize to novel objects and environments because it learns spurious correlations from irrelevant visual features. In this work, we introduce Affordance-guided 3D Diffusion Policy (ADP3), which integrates task-relevant affordance cues into the policy's point cloud input. By conditioning the policy on 3D affordance heatmaps instead of raw point clouds, ADP3 biases the policy toward task-relevant object regions. On four Meta-World tasks, affordance heatmaps limit the success-rate drop on unseen objects to 3%, compared to a 35% drop with raw point clouds. ADP3 also performs well in our real-world experiments, remaining robust to cluttered scenes and novel object orientations.
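As a rough illustration of what conditioning a policy on 3D affordance heatmaps might look like, the sketch below appends a per-point affordance score as a fourth input channel to a PointNet-style observation encoder whose pooled output would condition the diffusion policy's denoiser. The class name, layer sizes, and architecture here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (not the authors' code): encode an affordance-augmented
# point cloud into a conditioning vector for a diffusion policy.
import torch
import torch.nn as nn


class AffordancePointEncoder(nn.Module):
    """Shared-MLP + max-pool encoder over points carrying an affordance channel."""

    def __init__(self, out_dim: int = 256):
        super().__init__()
        # Each point is (x, y, z, affordance); the affordance value comes from
        # a task-relevant 3D heatmap and lies in [0, 1].
        self.mlp = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, xyz: torch.Tensor, affordance: torch.Tensor) -> torch.Tensor:
        # xyz: (B, N, 3) point coordinates; affordance: (B, N) per-point heatmap values.
        points = torch.cat([xyz, affordance.unsqueeze(-1)], dim=-1)  # (B, N, 4)
        feats = self.mlp(points)                                     # (B, N, out_dim)
        # Permutation-invariant pooling yields the observation embedding
        # that would condition the policy's action denoiser.
        return feats.max(dim=1).values                               # (B, out_dim)


# Example usage with random data.
encoder = AffordancePointEncoder()
xyz = torch.rand(2, 1024, 3)
heatmap = torch.rand(2, 1024)
cond = encoder(xyz, heatmap)
print(cond.shape)  # torch.Size([2, 256])
```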