TraceGen: World Modeling in 3D Trace-Space Enables Learning from Cross-Embodiment Videos

¹ University of Maryland, College Park   ² New York University
* Equal contribution.

Abstract

Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments—humans and different robots—are abundant, differences in embodiment, camera, and environment hinder their direct use. We address this small-data problem by introducing a unifying, symbolic representation: a compact 3D “trace-space” of scene-level trajectories that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation–trace–language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50–600× faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen’s ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.
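For concreteness, here is a minimal sketch of what one observation–trace–language triplet could look like as a data structure; the field names and shapes are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TraceTriplet:
    """One hypothetical training example: what the model sees,
    what motion it should predict, and the language instruction."""
    observation: np.ndarray  # RGB frame, e.g. (H, W, 3) uint8
    trace: np.ndarray        # future 3D trace: (num_points, horizon, 3) xyz
    instruction: str         # e.g. "Open the drawer"

# A trace is a set of scene-level 3D point trajectories: for each of
# N tracked points, the position (x, y, z) over T future timesteps.
example = TraceTriplet(
    observation=np.zeros((480, 640, 3), dtype=np.uint8),
    trace=np.zeros((256, 16, 3), dtype=np.float32),
    instruction="Open the drawer",
)
```

Because a trace stores only 3D point trajectories, the same representation can describe a human hand or a robot gripper moving through a scene, which is what makes cross-embodiment training possible.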

TraceGen Overview

Overview figure: human demonstrations ("Open the drawer", "Drop the LEGO block in the drawer", "Close the drawer") are lifted into a unified 3D trace, which transfers from human to robot and drives robot execution with the same trace.

TraceGen is a world model that operates in 3D trace-space, enabling learning from cross-embodiment videos.
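To make this concrete, below is a rough sketch of how a trace-space world model might be used at execution time. The `predict_trace` interface and the nearest-point waypoint heuristic are our illustrative assumptions, not TraceGen's documented API:

```python
import numpy as np

def execute_with_trace_model(model, camera, robot, instruction: str,
                             num_steps: int = 16):
    """Illustrative control loop: predict a 3D trace, then follow it.

    `model`, `camera`, and `robot` are placeholder interfaces; the
    real system's APIs are not specified on this page.
    """
    obs = camera.capture_rgb()  # current observation

    # Predict scene-level 3D trajectories instead of future pixels:
    # shape (num_points, num_steps, 3) in the camera/world frame.
    trace = model.predict_trace(obs, instruction, horizon=num_steps)

    # One simple way to act on a trace: follow the trajectory of the
    # tracked point nearest the gripper as end-effector waypoints.
    ee_pos = robot.end_effector_position()                 # (3,)
    dists = np.linalg.norm(trace[:, 0, :] - ee_pos, axis=-1)
    guide = trace[np.argmin(dists)]                        # (num_steps, 3)

    for waypoint in guide:
        robot.move_to(waypoint)
```

The point of the sketch is the interface: the model outputs geometry that a controller can consume directly, rather than pixels that must be re-parsed into actions.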

Robot Manipulation Tasks

Task 1: Folding a garment (Clothes)
Task 2: Inserting a tennis ball into a box (Ball)
Task 3: Sweeping trash into a dustpan with a brush (Brush)
Task 4: Placing a LEGO block in the purple region (Block)

Training Dataset

Our dataset, TraceForge-123K, spans a diverse set of skills and embodiments: 123K human and robot videos yielding 1.8M observation–trace–language triplets.
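This page does not detail TraceForge's internals, but a standard recipe for turning videos into 3D traces is to track 2D points across frames and unproject them with per-frame depth and camera intrinsics. A minimal sketch under that assumption (pinhole intrinsics fx, fy, cx, cy):

```python
import numpy as np

def lift_tracks_to_3d(tracks_2d, depths, fx, fy, cx, cy):
    """Unproject 2D point tracks into camera-frame 3D traces.

    tracks_2d: (N, T, 2) pixel coordinates (u, v) of N points over T frames.
    depths:    (T, H, W) per-frame depth maps in meters.
    Returns:   (N, T, 3) xyz traces via the pinhole model:
               x = (u - cx) * z / fx,  y = (v - cy) * z / fy.
    """
    N, T, _ = tracks_2d.shape
    traces = np.zeros((N, T, 3), dtype=np.float32)
    for t in range(T):
        u = tracks_2d[:, t, 0]
        v = tracks_2d[:, t, 1]
        z = depths[t, v.astype(int), u.astype(int)]  # depth at each track
        traces[:, t, 0] = (u - cx) * z / fx
        traces[:, t, 1] = (v - cy) * z / fy
        traces[:, t, 2] = z
    return traces
```

Running the same tracker and depth estimator over heterogeneous human and robot videos is what would make the resulting traces consistent across sources, in the sense the abstract describes.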


Results

1. Comparison with Baselines


Success rate vs. inference efficiency (predictions per minute); up and to the right is better. TraceGen achieves the best combination of success and efficiency, outperforming both video- and trace-based baselines by a large margin.


1.1. Comparison with Video Generation Baselines (NovaFlow)


1.2. Failure Cases of Baselines

Generated video (Wan 2.2): Hallucinated embodiment.


Generated video (Veo 3.1): Not physically grounded.



Failure cases of existing embodied world models.
(a) NovaFlow (Wan2.2-I2V): Video-based models can hallucinate geometry or affordances.
(b) Gemini Robotics-ER 1.5: VLM token outputs fail to capture fine motion.
(c) 3DFlowAction: Bounding boxes miss the tool.
(d) Im2Flow2Act: Bounding boxes become overly broad.



2. Human → Robot Skill Transfer


Human-to-robot skill transfer from human demonstration videos. TraceGen, fine-tuned on five in-the-wild handheld phone videos, successfully executes four manipulation tasks with a 67.5% success rate.


3. Role of Pretraining and Warmup

Table 1: Effect of cross-embodiment pretraining under 5-video and 15-video warmup.
Pretraining significantly improves success rates compared to training from scratch.
Warm-up           Pretraining    Clothes   Ball    Brush   Block   Overall SR (%)
5 robot videos    Random init.   10/10     0/10    0/10    0/10    25.0
5 robot videos    TraceGen       10/10     6/10    8/10    8/10    80.0
15 robot videos   Random init.   10/10     0/10    0/10    2/10    30.0
15 robot videos   TraceGen       10/10     9/10    8/10    6/10    82.5
Table 2: Effect of pretraining source on 5-video warmup performance.
Cross-embodiment pretraining on the larger TraceForge-123K corpus yields substantially higher success than single-source pretraining or training from scratch.

Task              From scratch   SSV2 only   Agibot only   TraceForge-123K
Ball              0/10           3/10        4/10          6/10
Block             0/10           2/10        5/10          8/10
Overall SR (%)    0              25          45            70

BibTeX

@article{lee2025tracegen,
  title={TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos},
  author={Lee, Seungjae and Jung, Yoonkyo and Chun, Inkook and Lee, Yao-Chih and Cai, Zikui and Huang, Hongjia and Talreja, Aayush and Dao, Tan Dat and Liang, Yongyuan and Huang, Jia-Bin and Huang, Furong},
  journal={arXiv preprint arXiv:2511.21690},
  year={2025}
}