Dexterous grasping in cluttered scenes presents significant challenges due to diverse object geometries, occlusions, and potential collisions. Existing methods primarily focus on single-object grasping or grasp-pose prediction without interaction, which are insufficient for complex, cluttered scenes. Recent vision-language-action models offer a potential solution but require extensive real-world demonstrations, making them costly and difficult to scale. To address these limitations, we revisit the sim-to-real transfer pipeline and develop key techniques that enable zero-shot deployment in reality while maintaining robust generalization.
We propose ClutterDexGrasp, a two-stage teacher-student framework for closed-loop, target-oriented dexterous grasping in cluttered scenes. The framework features a teacher policy trained in simulation with a clutter-density curriculum, incorporating both a novel geometry- and spatially-embedded scene representation and a comprehensive safety curriculum, which together enable general, dynamic, and safe grasping behaviors. Through imitation learning, we distill the teacher's knowledge into a student 3D diffusion policy (DP3) that operates on partial point-cloud observations.
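To make the distillation step concrete, below is a minimal PyTorch sketch under standard DDPM assumptions: a noise-prediction network, conditioned on a point-cloud feature, is trained to denoise teacher action chunks. The encoder, network, and all dimensions are illustrative stand-ins, not the actual ClutterDexGrasp or DP3 implementation.

```python
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Per-point MLP + max-pool: a lightweight stand-in for the
    DP3 point-cloud perception backbone."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim))

    def forward(self, pts):                       # pts: (B, N, 3)
        return self.mlp(pts).max(dim=1).values    # (B, out_dim)

class NoisePredictor(nn.Module):
    """Predicts the noise added to an action chunk, conditioned on the
    scene feature and the diffusion timestep."""
    def __init__(self, act_dim, horizon, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim * horizon + cond_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim * horizon))

    def forward(self, noisy_actions, cond, t):
        x = torch.cat([noisy_actions.flatten(1), cond,
                       t.float().unsqueeze(1)], dim=1)
        return self.net(x)

# Linear noise schedule; alphas_bar is the cumulative product of (1 - beta_t).
T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def distill_step(encoder, eps_model, optimizer, point_cloud,
                 teacher_actions, alphas_bar, T=100):
    """One imitation step: corrupt teacher action chunks with Gaussian
    noise at a random timestep, then regress the injected noise."""
    B, H, A = teacher_actions.shape
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(teacher_actions)
    a_bar = alphas_bar[t].view(B, 1, 1)
    noisy = a_bar.sqrt() * teacher_actions + (1 - a_bar).sqrt() * noise
    cond = encoder(point_cloud)
    pred = eps_model(noisy, cond, t).view(B, H, A)
    loss = nn.functional.mse_loss(pred, noise)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Illustrative usage (dimensions are placeholders, e.g. 22-DoF arm + hand).
enc, eps = PointEncoder(), NoisePredictor(act_dim=22, horizon=8)
opt = torch.optim.Adam(list(enc.parameters()) + list(eps.parameters()), lr=1e-4)
pc = torch.randn(4, 1024, 3)     # batch of partial point clouds
acts = torch.randn(4, 8, 22)     # teacher action chunks
print(distill_step(enc, eps, opt, pc, acts, alphas_bar, T=T))
```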
To the best of our knowledge, this represents the first zero-shot sim-to-real closed-loop system for target-oriented dexterous grasping in cluttered scenes, demonstrating robust performance across diverse objects and layouts.
Videos below are recorded with a single end-to-end policy trained in simulation.
41 objects with diverse shapes, sizes, and materials were tested under three clutter densities:
The entire object set is covered across all densities. The policy keeps grasping until three consecutive failures occur, with each target object randomly selected from the visible masks, which increases difficulty. This encourages interaction with occluding clutter, resembling real-world scenarios where humans retrieve buried items. A minimal sketch of this evaluation loop is given below.
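The protocol is described above only in prose; here is a short Python sketch of that evaluation loop, assuming a hypothetical scene/policy interface (`visible_object_masks`, `objects_remaining`, and `grasp` are illustrative names, not from the ClutterDexGrasp codebase):

```python
import random

def evaluate_scene(policy, scene, max_consecutive_failures=3):
    """Grasp randomly chosen visible targets until three consecutive
    failures, mirroring the protocol described above (hypothetical API)."""
    successes, failures = 0, 0
    while failures < max_consecutive_failures and scene.objects_remaining():
        # The target is sampled from currently visible instance masks, so it
        # may still be heavily occluded or buried under other objects.
        target = random.choice(scene.visible_object_masks())
        if policy.grasp(scene, target):   # closed-loop rollout to grasp/timeout
            successes += 1
            failures = 0                  # failure counter resets on success
        else:
            failures += 1
    return successes
```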
Below are three UNCUT videos demonstrating the policy's performance in three novel cluttered scenes.
Our system demonstrates robust generalization across cluttered environments of varying densities. From sparse to ultra-dense scenes, ClutterDexGrasp efficiently navigates occlusions, adapts grasp strategies, and performs safe, reliable grasps in real-world conditions.
Prior works would fail in these scenes, where grasp-pose estimation and direct grasping are infeasible due to heavy occlusions.
(1) Robust against occlusion caused by a hand waving in front of the camera.
(2) Grasps heavily occluded objects buried beneath the clutter.
We visualize human-like behavior by comparing the same policy in two scenarios with the target at the same location:
As shown in the video, our policy demonstrates:
Note that these behaviors emerge automatically and compose seamlessly, without heuristic mode identification or switching, enabling effective, adaptive, and collision-minimized grasping.
Our system demonstrates diverse strategies based on the cluttered scene.
Our policy remains robust against:
(0:01, 0:52) Tiny objects (grasping such small objects in the real world has not been demonstrated in prior work).
(0:08, 0:17) Occlusion caused by a hand waving in front of the camera.
(0:11) Perturbation during grasping.
(0:30) Irregular objects.
(1:20) Moving objects.