# HunyuanWorld-Voyager

Voyager is an interactive RGB-D (color + depth) video generation model that creates world-consistent 3D point-cloud sequences from a single image and a user-defined camera path. It supports real-time 3D reconstruction and lets users explore virtual worlds along custom camera trajectories.

## Key Features

- **World-Consistent Video Diffusion**: Generates aligned RGB and depth video sequences conditioned on the existing world observation, ensuring global 3D coherence.
- **Long-Range World Exploration**: Uses an efficient world cache and auto-regressive inference with smooth video sampling to iteratively extend scenes while maintaining consistency.
- **Dataset Creation**: Employs a scalable data engine that automatically estimates camera poses and metric depth for videos, enabling large-scale training data (over 100,000 clips combining real and synthetic scenes) without manual 3D annotation.
- **Multiple Applications**: Supports video reconstruction, image-to-3D generation, and video depth estimation.

## Repository Overview

- Owner/Organization: Tencent-Hunyuan
- Language: primarily Python (98.6%), some Shell (1.4%)
- Stars: 391
- Forks: 22
- Watchers: 4
- Commits: 18
- License: see the license file in the repository
- Project Page: https://3d-models.hunyuan.tencent.com/world/

## Recent News

- **Sep 2, 2025**: Code and model weights of HunyuanWorld-Voyager released and available for download.

Join the WeChat and Discord groups for support and discussion.

## Demo Videos

The provided demo videos showcase:

- Camera-controllable video generation from single images using custom camera paths.
- Video reconstruction and 3D point-cloud reconstruction.
- Video depth estimation.

## Architecture Highlights

Voyager consists of two main components:

- A video diffusion model that produces aligned RGB-D sequences conditioned on world observations.
- An efficient world cache with auto-regressive inference for smooth, consistent scene expansion.

Training relies on a large dataset generated by the video reconstruction pipeline, combining real-world footage and synthetic renders.

## Performance

Evaluated on the WorldScore benchmark, Voyager achieves the top overall world score and leads on several sub-metrics, including camera control, object control, content alignment, and subjective quality.

## Requirements

- GPU: NVIDIA GPU with CUDA support.
- Minimum GPU memory: 60 GB for 540p video generation.
- Recommended GPU memory: 80 GB for improved quality.
- Tested on Linux.

## Installation and Setup

Recommended CUDA versions are 12.4 or 11.8. On Linux:

1. Clone the repository.
2. Create and activate a conda environment.
3. Install PyTorch and the CUDA toolkit.
4. Install the required Python packages.
5. Install the additional dependencies needed for creating input conditions.

Hedged sketches of these setup steps are provided at the end of this README.

## Pretrained Models

Download the pretrained checkpoints via the HuggingFace CLI (a hedged download sketch is included in the sketches below).

## Inference

### Creating Input Conditions

Use the provided examples in `examples/`. To generate your own input conditions from an image, see the final sketch below.
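The following sketches illustrate the setup, checkpoint download, and input-condition steps described above; any names, versions, or paths not stated in this README are assumptions. First, a minimal setup sketch for the installation steps, assuming CUDA 12.4; the repository URL, environment name, Python version, and requirements file name are assumptions and should be checked against the repository:

```bash
# Hedged sketch: repository URL, environment name, Python version, and
# requirements file name are assumptions, not confirmed values.
git clone https://github.com/Tencent-Hunyuan/HunyuanWorld-Voyager.git
cd HunyuanWorld-Voyager

# Create and activate a conda environment.
conda create -n voyager python=3.11 -y
conda activate voyager

# Install PyTorch with the CUDA 12.4 toolkit (CUDA 11.8 is the other recommended option).
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia -y

# Install the remaining Python dependencies.
python -m pip install -r requirements.txt
```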
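For the pretrained checkpoints, a hedged download sketch using the HuggingFace CLI; the model repository ID and the local checkpoint directory are assumptions:

```bash
# Hedged sketch: the model repository ID and local directory are assumptions.
python -m pip install "huggingface_hub[cli]"
huggingface-cli download tencent/HunyuanWorld-Voyager --local-dir ./ckpts
```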
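Finally, a hypothetical sketch of generating input conditions from a single image. The script name, its flags, and the example paths are illustrative assumptions; the repository's own documentation defines the actual command:

```bash
# Hypothetical sketch: the script name, flags, and camera-path type below are
# illustrative assumptions, not confirmed names from the repository.
# Conceptually, this step estimates depth for the reference image and renders
# partial RGB-D condition frames along the chosen camera trajectory.
python create_input.py \
    --image_path examples/case1/ref_image.png \
    --render_output_dir examples/case1/ \
    --type forward
```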