Dexterous grasping in cluttered scenes presents significant challenges due to diverse object geometries, occlusions, and potential collisions. Existing methods primarily focus on single-object grasping or grasp-pose prediction without interaction, which are insufficient for complex, cluttered scenes. Recent vision-language-action models offer a potential solution but require extensive real-world demonstrations, making them costly and difficult to scale. To address these limitations, we revisit the sim-to-real transfer pipeline and develop key techniques that enable zero-shot deployment in reality while maintaining robust generalization.
We propose ClutterDexGrasp, a two-stage teacher-student framework for closed-loop, target-oriented dexterous grasping in cluttered scenes. The framework features a teacher policy trained in simulation with a clutter-density curriculum, incorporating both a novel geometry- and spatially-embedded scene representation and a comprehensive safety curriculum, which together enable general, dynamic, and safe grasping behaviors. Through imitation learning, we distill the teacher's knowledge into a student 3D diffusion policy (DP3) that operates on partial point-cloud observations.
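To make the distillation step concrete, below is a minimal PyTorch sketch under standard DDPM assumptions: a noise-prediction network, conditioned on a point-cloud feature, is trained to denoise teacher action chunks. The encoder, network, and all dimensions are illustrative stand-ins, not the actual ClutterDexGrasp or DP3 implementation.

```python
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Per-point MLP + max-pool: a lightweight stand-in for the
    DP3 point-cloud perception backbone."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim))

    def forward(self, pts):                       # pts: (B, N, 3)
        return self.mlp(pts).max(dim=1).values    # (B, out_dim)

class NoisePredictor(nn.Module):
    """Predicts the noise added to an action chunk, conditioned on the
    scene feature and the diffusion timestep."""
    def __init__(self, act_dim, horizon, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim * horizon + cond_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim * horizon))

    def forward(self, noisy_actions, cond, t):
        x = torch.cat([noisy_actions.flatten(1), cond,
                       t.float().unsqueeze(1)], dim=1)
        return self.net(x)

# Linear noise schedule; alphas_bar is the cumulative product of (1 - beta_t).
T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def distill_step(encoder, eps_model, optimizer, point_cloud,
                 teacher_actions, alphas_bar, T=100):
    """One imitation step: corrupt teacher action chunks with Gaussian
    noise at a random timestep, then regress the injected noise."""
    B, H, A = teacher_actions.shape
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(teacher_actions)
    a_bar = alphas_bar[t].view(B, 1, 1)
    noisy = a_bar.sqrt() * teacher_actions + (1 - a_bar).sqrt() * noise
    cond = encoder(point_cloud)
    pred = eps_model(noisy, cond, t).view(B, H, A)
    loss = nn.functional.mse_loss(pred, noise)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Illustrative usage (dimensions are placeholders, e.g. 22-DoF arm + hand).
enc, eps = PointEncoder(), NoisePredictor(act_dim=22, horizon=8)
opt = torch.optim.Adam(list(enc.parameters()) + list(eps.parameters()), lr=1e-4)
pc = torch.randn(4, 1024, 3)     # batch of partial point clouds
acts = torch.randn(4, 8, 22)     # teacher action chunks
print(distill_step(enc, eps, opt, pc, acts, alphas_bar, T=T))
```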
To the best of our knowledge, this represents the first zero-shot sim-to-real closed-loop system for target-oriented dexterous grasping in cluttered scenes, demonstrating robust performance across diverse objects and layouts.
Videos below are recorded with a single end-to-end policy trained in simulation.
41 objects with diverse shapes, sizes, and materials were tested under three clutter densities:
The entire object set is covered across all densities. The policy keeps grasping until three consecutive failures occur, with each target object randomly selected from the visible masks, which increases difficulty. This encourages interaction with occluding clutter, resembling real-world scenarios where humans retrieve buried items. A minimal sketch of this evaluation loop is given below.
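The protocol is described above only in prose; here is a short Python sketch of that evaluation loop, assuming a hypothetical scene/policy interface (`visible_object_masks`, `objects_remaining`, and `grasp` are illustrative names, not from the ClutterDexGrasp codebase):

```python
import random

def evaluate_scene(policy, scene, max_consecutive_failures=3):
    """Grasp randomly chosen visible targets until three consecutive
    failures, mirroring the protocol described above (hypothetical API)."""
    successes, failures = 0, 0
    while failures < max_consecutive_failures and scene.objects_remaining():
        # The target is sampled from currently visible instance masks, so it
        # may still be heavily occluded or buried under other objects.
        target = random.choice(scene.visible_object_masks())
        if policy.grasp(scene, target):   # closed-loop rollout to grasp/timeout
            successes += 1
            failures = 0                  # failure counter resets on success
        else:
            failures += 1
    return successes
```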
Below are three UNCUT videos demonstrating the policy's performance in three novel cluttered scenes.
Our system demonstrates robust generalization across cluttered environments of varying densities. From sparse to ultra-dense scenes, ClutterDexGrasp efficiently navigates occlusions, adapts grasp strategies, and performs safe, reliable grasps in real-world conditions.
Prior works would fail in these scenes, where grasp-pose estimation and direct grasping are infeasible due to heavy occlusions.
(1) Robust against occlusion caused by a hand waving in front of the camera.
(2) Grasps heavily occluded objects buried beneath the clutter.
We visualize human-like behavior by comparing the same policy in two scenarios with the target at the same location:
As shown in the video, our policy demonstrates:
Note that these behaviors emerge automatically and compose seamlessly, without heuristic mode identification or switching, enabling effective, adaptive, and collision-minimized grasping.
Our system demonstrates diverse strategies based on the cluttered scene.
Our policy remains robust against:
(0:01, 0:52) Tiny objects (grasping such small objects in the real world has not been demonstrated in prior work).
(0:08, 0:17) Occlusion caused by a hand waving in front of the camera.
(0:11) Perturbation during grasping.
(0:30) Irregular objects.
(1:20) Moving objects.