GraspGen Explained: A Friendly Guide to 6-DOF Robot Grasping for Everyone

A Diffusion-based Framework for 6-DOF Grasping

How a new open-source framework lets robots pick up almost anything—without weeks of re-engineering.


1. Why Better Grasping Still Matters

Pick-and-place sounds simple, yet warehouse robots still drop mugs, kitchen assistants miss forks, and lunar rovers struggle with oddly shaped rocks. Three stubborn problems keep coming back:

  • Different grippers → swap the hardware and yesterday’s grasping code is useless.
  • Cluttered scenes → toys on a rug, tools in a drawer; the camera never sees the whole object.
  • Unknown objects → you can’t label every future item the robot will meet.

GraspGen, released by NVIDIA in July 2025, was built to tackle all three at once. The project ships with:

  • ready-to-use models for three common grippers (Franka Panda two-finger, Robotiq-2F-140, and a 30 mm suction cup);
  • a 53-million-grasp dataset covering 8,515 objects;
  • a new training trick called On-Generator Training that teaches the model to recognize and discard its own bad grasps.

2. What “6-DOF Grasping” Really Means

Imagine giving a friend directions to pick up a coffee cup:

Direction you give    In robot math          Symbol
Move forward 10 cm    Translation along X    +x
Slide right 5 cm      Translation along Y    +y
Lift 8 cm             Translation along Z    +z
Twist palm down       Rotation around X      roll
Turn hand left        Rotation around Y      pitch
Rotate wrist          Rotation around Z      yaw

Those six numbers together are called a 6-DOF pose (DOF = degrees of freedom). GraspGen’s job is to predict many such poses for any object point cloud the robot sees.
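To make this concrete, here is a minimal sketch (plain NumPy/SciPy, not GraspGen code) that packs those six numbers into the 4×4 homogeneous transform most robot software actually consumes:

import numpy as np
from scipy.spatial.transform import Rotation as R

# six numbers: where the gripper goes and how it is oriented
x, y, z = 0.10, 0.05, 0.08           # metres
roll, pitch, yaw = 0.0, 1.57, 0.0    # radians

# pack them into a 4x4 homogeneous transform (rotation + translation)
T = np.eye(4)
T[:3, :3] = R.from_euler("xyz", [roll, pitch, yaw]).as_matrix()
T[:3, 3] = [x, y, z]
print(T)  # this single matrix is one grasp pose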


3. How GraspGen Works—In Plain English

3.1 Start With a Diffusion Model

You may know diffusion models from image generation apps. Instead of turning noise into a picture, GraspGen turns noise into grasp poses:

  1. Training
    • We take successful grasps → add noise → teach a neural net to remove that noise.
  2. Inference
    • Feed the network a new object point cloud → start with pure noise → let the network clean it into valid poses.

Because a grasp pose has only six numbers (x, y, z, roll, pitch, yaw), the process is fast: 10 denoising steps are enough, compared with the 50-100 steps typical for images.
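Conceptually, the inference loop is tiny. The sketch below is not GraspGen's actual sampler (the real model uses a proper rotation representation and noise schedule, and runs batched on the GPU); it only illustrates the idea of starting from noise and refining poses over 10 steps:

import torch

NUM_STEPS = 10  # far fewer than the 50-100 steps typical for image diffusion

def sample_grasps(denoiser, object_pc, num_grasps=100):
    # object_pc: (N, 3) point cloud of the target object; `denoiser` is a stand-in for the trained network
    poses = torch.randn(num_grasps, 6)            # start from pure noise: x, y, z, roll, pitch, yaw
    for t in reversed(range(NUM_STEPS)):
        # the network predicts how to nudge each noisy pose toward a valid grasp
        poses = poses + denoiser(object_pc, poses, t)
    return poses                                  # num_grasps candidate 6-DOF poses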

3.2 Handle Different Grippers Without Rewriting Code

The framework keeps the object encoder (a PointTransformerV3 backbone) fixed and swaps only a small gripper-specific head. That means you can:

  • re-use the same weights for the Franka gripper and the suction cup;
  • add a new gripper by training only the lightweight head.
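A rough PyTorch sketch of that split; the class and module names here are illustrative, not the repository's actual code:

import torch.nn as nn

class GraspGenerator(nn.Module):
    def __init__(self, object_encoder, gripper_head):
        super().__init__()
        self.encoder = object_encoder    # shared backbone (PointTransformerV3 in GraspGen)
        self.head = gripper_head         # small, gripper-specific module

    def forward(self, point_cloud, noisy_pose, timestep):
        features = self.encoder(point_cloud)              # gripper-agnostic object features
        return self.head(features, noisy_pose, timestep)  # denoising prediction for this gripper

# Swapping grippers means swapping only the head:
#   franka_model  = GraspGenerator(shared_encoder, FrankaHead())
#   suction_model = GraspGenerator(shared_encoder, SuctionHead())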

4. The On-Generator Training Trick

Once the diffusion model is trained, it still occasionally invents impossible grasps (floating in mid-air or colliding with the object). Prior work trains a separate discriminator to score and filter these poses, but trains it only on offline success/failure labels.

GraspGen does something smarter:

  1. Run the diffusion model on 7,000 training objects → create about 14 million fresh candidate grasps.
  2. Re-simulate every pose in Isaac Sim → obtain new success/failure labels.
  3. Retrain the discriminator on this model-generated data.
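The same three steps in pseudocode-style Python; the helper names (sample, evaluate, fit) are hypothetical stand-ins for the Isaac Sim tooling and training scripts in the repo:

# On-Generator Training, step by step (illustrative only)
candidates = []
for obj in training_objects:                            # ~7,000 objects
    candidates += diffusion_model.sample(obj, n=2000)   # fresh grasps from the generator itself

labels = isaac_sim.evaluate(candidates)                 # re-simulate: did each grasp actually hold?

discriminator.fit(candidates, labels)                   # learn to reject the generator's own failure modes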

Because the discriminator now sees the exact mistakes the diffusion model makes, it filters them far better at test time. The numbers bear this out:

Training Data         Discriminator AUC   Memory Use
Offline labels only   0.886               100 %
On-Generator labels   0.947               4.7 % (21× smaller)

5. The Released Assets—What You Get Today

Asset                               Size          Purpose
GraspGen dataset (Franka)           17 M grasps   Train or fine-tune
GraspGen dataset (Robotiq-2F-140)   17 M grasps   Train or fine-tune
GraspGen dataset (suction)          17 M grasps   Train or fine-tune
Pre-trained checkpoints             3× models     Zero-shot inference
Docker image                        3 GB          Reproduce all results
Python demo scripts                 10 files      Real-camera examples

Total download ≈ 200 GB. One command fetches the full dataset:

git clone https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-GraspGen

6. Quick Start—Run Your First Inference in 5 Minutes

6.1 Install (Docker, Recommended)

git clone https://github.com/NVlabs/GraspGen.git && cd GraspGen
bash docker/build.sh   # builds an image with all deps

6.2 Download the trained weights

git clone git@hf.co:adithyamurali/GraspGenModels  # ~1 GB

6.3 Visualize on a sample scene

# Terminal 1: start a 3-D viewer
meshcat-server

# Terminal 2: run the demo inside docker
bash docker/run.sh <local_graspgen_path> --models <path_to_models>
cd /code && python scripts/demo_scene_pc.py \
  --sample_data_dir /models/sample_data/real_scene_pc \
  --gripper_config /models/checkpoints/graspgen_robotiq_2f_140.yml

Open http://localhost:7000 in your browser—you will see green arrows (grasps) on top of a real tabletop scene.


7. Training Your Own Model—A Step-by-Step Recipe

7.1 When Do You Need to Train?

  • Your gripper geometry is not Franka, Robotiq, or the 30 mm suction cup.
  • You want to specialize on a narrow domain (e.g., only metal tools).

7.2 What You Need

File              Description                                  Example
gripper.urdf      kinematic + collision model                  provided in assets/
gripper.yml       GraspGen config                              same folder
object_dataset/   watertight .obj meshes                       download via helper script
grasp_dataset/    JSON lines with 6-DOF pose + success label   generate via Isaac Sim
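For orientation, a grasp record might look like the hypothetical JSON line below; the exact field names are defined by the repository's data loaders, so treat this purely as an illustration of the information each line carries:

{"object_id": "mug_0042", "translation": [0.02, -0.01, 0.11], "rotation_quat": [0.0, 0.707, 0.0, 0.707], "success": 1}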

7.3 Cache the Dataset (One-Time)

bash docker/run.sh <code> --grasp_dataset <ds> --object_dataset <obj> --results <logs>
cd /code && python train_graspgen.py \
  task=robotiq_2f_140_gen \
  hydra.run.dir=/results/exp_01

The script first builds a compressed HDF5 cache (fast I/O) and then starts training automatically.

7.4 Typical Training Times

Hardware           Epochs   Wall-Clock
8×A100 80 GB       3,000    40 h (generator) + 90 h (discriminator)
1×RTX 4090 24 GB   3,000    8 days (batch size 16)

8. Benchmarks—Numbers You Can Trust

8.1 Object-Centric Test (Franka, ACRONYM split)

Model               AUC ↑   Coverage ↑
SE3-Diff baseline   0.200   25 %
DexDiffuser         0.344   48 %
M2T2                0.636   67 %
GraspGen (ours)     0.947   85 %

8.2 Cluttered-Scene Test (FetchBench)

Model      Task Success   Grasp Success
M2T2       52.6 %         60 %
AnyGrasp   63.7 %         70 %
GraspGen   81.3 %         90.5 %

9. Real-World Deployment Tips

9.1 Camera Setup

  • One RealSense D435 mounted 0.6 m above the table is enough.
  • Calibrate intrinsics and extrinsics once; store in camera_params.yaml.
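A minimal loader sketch; the file name comes from the tip above, while the key layout (fx/fy/cx/cy intrinsics plus a 4×4 camera-to-robot extrinsic matrix) is an assumption, not a schema the repo prescribes:

import numpy as np
import yaml

with open("camera_params.yaml") as f:
    cam = yaml.safe_load(f)

# assumed layout: pinhole intrinsics and a 4x4 camera-to-robot transform
fx, fy = cam["intrinsics"]["fx"], cam["intrinsics"]["fy"]
cx, cy = cam["intrinsics"]["cx"], cam["intrinsics"]["cy"]
T_robot_cam = np.array(cam["extrinsics"])   # used to express predicted grasps in the robot's frame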

9.2 Software Pipeline

RGB-D stream
     ↓ SAM2 instance segmentation
     ↓ GraspGen inference (top-100 grasps)
     ↓ cuRobo motion planning
     ↓ Robot execution
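In code, one perception-to-motion cycle is a short loop. Everything below is a hedged sketch with hypothetical wrapper functions (segment_with_sam2, infer_grasps, plan_with_curobo), not the real APIs of SAM2, GraspGen, or cuRobo:

# one pick cycle (illustrative wrappers, not real APIs)
rgb, depth = camera.read()                          # RGB-D stream
masks = segment_with_sam2(rgb)                      # one mask per object instance
for mask in masks:
    object_pc = depth_to_pointcloud(depth, mask)    # lift the masked pixels into 3-D points
    grasps = infer_grasps(object_pc, top_k=100)     # GraspGen: top-100 scored 6-DOF poses
    trajectory = plan_with_curobo(grasps)           # first reachable, collision-free grasp wins
    robot.execute(trajectory)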

9.3 Common Failure Modes

Symptom                          Quick Fix
Grasps hover 2 cm above object   Add z-offset = −0.02 m in post-processing
Small objects ignored            Increase point-cloud density (move camera closer by 10 cm)
Shelf scenes fail                Lower collision-check safety margin in cuRobo
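The z-offset fix from the first row is a one-liner once each grasp is a 4×4 transform. A minimal sketch, assuming the offset is meant in the world/table frame (lower the grasp by 2 cm); if your pipeline defines it along the gripper's approach axis instead, add the offset along the pose's local Z column:

import numpy as np

def apply_z_offset(grasp_T, offset=-0.02):
    # shift the grasp position along the world Z axis (negative = lower, toward the table)
    shifted = grasp_T.copy()
    shifted[2, 3] += offset
    return shifted

# grasps_fixed = [apply_z_offset(T) for T in grasps]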

10. Frequently Asked Questions

Q1. My gripper looks like Franka but has 5 mm more stroke. Do I retrain?
A: Probably not. Apply a fixed z-offset of −5 mm after inference. Measure 20 test grasps; if success ≥ 85 %, you’re good.

Q2. Can I run this on an edge GPU?
A: Yes. The released TensorRT engine runs at 20 Hz on a Jetson AGX Orin 64 GB (batch size 1, 10 denoising steps).

Q3. How do I add a new object category?
A:

  1. Place meshes in object_dataset/new_category/.
  2. Run python scripts/generate_grasps.py --category new_category.
  3. Append the new labels to grasp_dataset/train.jsonl.
  4. Resume training from a checkpoint with train.checkpoint=/path/to/latest.ckpt.

Q4. The training script dies with “Killed” and no traceback.
A: Increase Docker memory limit (--memory=32g) or set NUM_REDUNDANT_DATAPOINTS=3.

Q5. Why do suction grasps have large rotation error?
A: A suction cup is rotationally symmetric about its axis, so rotation error around that axis is ill-defined. Focus on translation error instead.


11. Citation & License

If you use GraspGen in your research or product, please cite:

@article{murali2025graspgen,
  title={GraspGen: A Diffusion-based Framework for 6-DOF Grasping with On-Generator Training},
  author={Murali, Adithyavairavan and Sundaralingam, Balakumar and others},
  journal={arXiv preprint arXiv:2507.13097},
  year={2025}
}

Dataset license: CC-BY 4.0.
Code license: NVIDIA Source Code License (see repo).


12. Where to Go Next

  • Project page: https://graspgen.github.io
  • Video walkthrough: https://youtu.be/gM5fgK2aZ1Y
  • Issue tracker: https://github.com/NVlabs/GraspGen/issues

Happy grasping!