ClutterDexGrasp: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes

Teaser image of the paper.

Dexterous grasping in cluttered scenes presents significant challenges due to diverse object geometries, occlusions, and potential collisions. Existing methods primarily focus on single-object grasping or grasp-pose prediction without interaction, which are insufficient for complex, cluttered scenes. Recent vision-language-action models offer a potential solution but require extensive real-world demonstrations, making them costly and difficult to scale. To address these limitations, we revisit the sim-to-real transfer pipeline and develop key techniques that enable zero-shot deployment in reality while maintaining robust generalization.

We propose ClutterDexGrasp, a two-stage teacher-student framework for closed-loop, target-oriented dexterous grasping in cluttered scenes. The teacher policy is trained in simulation with a clutter-density curriculum. It incorporates both a novel geometry- and spatially-embedded scene representation and a comprehensive safety curriculum, which together enable general, dynamic, and safe grasping behaviors. Through imitation learning, we distill the teacher's knowledge into a student 3D diffusion policy (DP3) that operates on partial point cloud observations.
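To make the idea of a clutter-density curriculum concrete, here is a minimal sketch of one common way such a schedule can be built: the number of objects in the simulated scene is raised once the recent success rate passes a threshold. This is an illustration under stated assumptions, not the schedule used for ClutterDexGrasp. The class name, the density levels, the promotion threshold, and the window size are all hypothetical.

```python
from collections import deque


class ClutterDensityCurriculum:
    """Hypothetical success-gated curriculum over scene clutter density.

    All defaults below are illustrative assumptions, not values from the paper.
    """

    def __init__(self, levels=(1, 4, 8, 12), promote_at=0.8, window=100):
        self.levels = levels            # objects per scene at each stage (assumed)
        self.promote_at = promote_at    # success rate needed to advance (assumed)
        self.results = deque(maxlen=window)  # rolling record of episode outcomes
        self.stage = 0

    @property
    def num_objects(self):
        """Number of objects to spawn in the next training scene."""
        return self.levels[self.stage]

    def record(self, success):
        """Log one episode outcome and advance a stage when the window is
        full and the rolling success rate reaches the threshold."""
        self.results.append(bool(success))
        window_full = len(self.results) == self.results.maxlen
        rate = sum(self.results) / len(self.results)
        at_top = self.stage == len(self.levels) - 1
        if window_full and rate >= self.promote_at and not at_top:
            self.stage += 1
            self.results.clear()  # start measuring afresh at the new density
```

A training loop would read `num_objects` when resetting the scene and call `record` after each episode. Clearing the window on promotion keeps results from an easier stage from triggering a second promotion straight away.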

To the best of our knowledge, this represents the first zero-shot sim-to-real closed-loop system for target-oriented dexterous grasping in cluttered scenes, demonstrating robust performance across diverse objects and layouts.


Highlights

The videos below were recorded with a single end-to-end policy trained in simulation.


Video


Generalization Across Clutter Densities, Novel Scenes with Novel Objects (Uncut Videos)

41 objects with diverse shapes, sizes, and materials were tested under three clutter densities:

  • 9 Sparse Scenes
  • 5 Dense Scenes
  • 3 Ultra-dense Scenes

The entire object set is covered across all densities. In each scene, the policy keeps grasping until it fails three times in a row. Target objects are selected at random from the visible masks, which increases the difficulty: the policy must often interact with occluding objects, much as a person does when retrieving a buried item.
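The per-scene protocol described above can be sketched as a short loop. This is a minimal illustration, not the authors' evaluation code. The `visible_targets` and `attempt_grasp` callables are hypothetical stand-ins for the segmentation and robot interfaces. Ending a scene when no target is visible, and re-sampling the target after a failure, are assumptions the text does not state.

```python
import random
from typing import Callable, Dict, Optional, Sequence


def run_scene(
    visible_targets: Callable[[], Sequence[str]],   # hypothetical: currently visible object masks
    attempt_grasp: Callable[[str], bool],           # hypothetical: True if the grasp succeeded
    max_consecutive_failures: int = 3,              # stopping rule from the text
    rng: Optional[random.Random] = None,
) -> Dict[str, int]:
    """Grasp randomly chosen visible targets until three failures in a row."""
    rng = rng or random.Random()
    attempts = successes = failure_streak = 0
    while failure_streak < max_consecutive_failures:
        candidates = list(visible_targets())
        if not candidates:          # assumption: an empty scene ends the trial
            break
        target = rng.choice(candidates)  # random target among visible masks
        attempts += 1
        if attempt_grasp(target):
            successes += 1
            failure_streak = 0      # any success resets the streak
        else:
            failure_streak += 1
    return {"attempts": attempts, "successes": successes}


# Toy usage with a simulated scene in which every grasp succeeds
# and the grasped object is removed from the table.
remaining = ["mug", "ball", "box"]  # hypothetical object names


def grasp_and_remove(target: str) -> bool:
    remaining.remove(target)
    return True


result = run_scene(lambda: remaining, grasp_and_remove, rng=random.Random(0))
```

In the toy run every grasp succeeds, so the loop ends when the scene is empty rather than on the failure rule.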

Below are three UNCUT videos demonstrating the policy's performance in three novel cluttered scenes.

Sparse
8x
Dense
8x
Ultra-Dense
8x

Our system demonstrates robust generalization across cluttered environments of varying densities. From sparse to ultra-dense scenes, ClutterDexGrasp efficiently navigates occlusions, adapts grasp strategies, and performs safe, reliable grasps in real-world conditions.

More Ultra-dense Grasping

Prior works would fail in these scenes, where grasp-pose estimation and direct grasping are infeasible due to heavy occlusion.

Highlights

(1) Robust against occlusion caused by a hand waving in front of the camera.

(2) Grasps heavily occluded objects buried beneath the clutter.


Human-Like Behavior Visualization

We visualize human-like behavior by comparing the same policy in two scenarios with the target at the same location:

  • Clutter-Free: The target can be grasped directly, without considering the surroundings.
  • Cluttered: Reaching the target requires navigating around clutter and clearing obstacles.

As shown in the video, our policy demonstrates:

  • Direct Grasping: In clutter-free scenes.
  • Gentle Clutter Clearance: Strategically nudges overlying objects instead of forcefully pushing them.
  • Clutter-Aware Grasp: Grasps the target from the side to avoid collisions.

Note that these behaviors emerge automatically and are composed seamlessly, without heuristic mode identification or switching, enabling effective, adaptive, and collision-minimized grasping.

Clutter-free Scene
1x
Cluttered Scene
1x

Our system demonstrates diverse strategies based on the cluttered scene.


More Capabilities

The videos below were recorded with a single end-to-end policy trained in simulation.

Continuous Grasping-and-Transport of 30 Unique Objects (Uncut Video)

Our policy remains robust against:

(0:01, 0:52) Tiny Object (grasping such a small object in the real world has not been shown in prior work)

(0:08, 0:17) Occlusion caused by a hand waving in front of the camera.

(0:11) Perturbation during grasping.

(0:30) Irregular Object

(1:20) Moving Object

8x


3D-Space (Uncut Video)

8x
8x


Dynamic Grasping

2x
2x
