# VITRA Unpacked: How 1 Million Casual Hand-Held Videos Can Teach a Robot to Grab With 6 cm Accuracy

**Keywords:** vision-language-action model, VITRA, robotic manipulation, human-hand pre-training, zero-shot action prediction, casual video dataset, diffusion transformer, PaliGemma 2, single-camera 3D, egocentric video, dexterous robot hand, real-world robot, data scaling, open source.

## What this post answers in one sentence

By treating everyday, unscripted hand-held videos as robot demonstrations, VITRA produces a 3-billion-parameter model that predicts 3-D hand actions in brand-new scenes from only a single image and a sentence of instruction; after light fine-tuning on a handful of real-robot trajectories, it doubles task success …
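
To make the interface concrete, here is a minimal sketch of what "a single image and a sentence in, a chunk of 3-D hand actions out" looks like in code. The class name `VitraPolicy`, its `predict_actions` method, and the action dimensions are illustrative assumptions, not VITRA's actual API or output format.

```python
# Hypothetical sketch of the zero-shot interface described above.
# `VitraPolicy`, `predict_actions`, and the 21-dim action layout are
# illustrative placeholders, not VITRA's real API.
import numpy as np


class VitraPolicy:
    """Stand-in for a vision-language-action model: RGB frame + instruction -> hand-action chunk."""

    def predict_actions(self, image: np.ndarray, instruction: str, horizon: int = 16) -> np.ndarray:
        # A real model would run a VLM backbone plus an action head here.
        # This stub just returns a zero chunk with an assumed layout:
        # per step, 3-D wrist translation (3) + wrist rotation (3) + finger joints (15) = 21 dims.
        assert image.ndim == 3 and image.shape[2] == 3, "expects a single RGB frame"
        return np.zeros((horizon, 21), dtype=np.float32)


if __name__ == "__main__":
    frame = np.zeros((480, 640, 3), dtype=np.uint8)   # one photo of the scene
    policy = VitraPolicy()
    actions = policy.predict_actions(frame, "pick up the red mug")
    print(actions.shape)  # (16, 21): a short chunk of predicted future hand poses
```

The point of the sketch is only the shape of the problem: one frame and one sentence go in, and a short horizon of continuous hand poses comes out, which is what lets casual egocentric video stand in for robot demonstrations.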