
Tencent Releases HunyuanWorld-Voyager for World-Consistent 3D Video Generation

Tencent’s AI research team has officially released HunyuanWorld-Voyager, a video diffusion framework designed to transform single images into world-consistent 3D point-cloud sequences
Tencent’s AI research team has officially released HunyuanWorld-Voyager, a video diffusion framework that turns a single image into a world-consistent sequence of 3D point clouds. Users define a camera trajectory, and the framework generates an explorable scene video along that path while preserving spatial consistency. Alongside the RGB video, Voyager outputs aligned depth maps, which makes direct 3D reconstruction straightforward.
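
Because every generated frame comes with a pixel-aligned depth map, lifting a frame into a colored point cloud is a single back-projection step. The sketch below assumes a pinhole camera with known intrinsics (fx, fy, cx, cy), which are not specified in the release notes:

```python
import numpy as np

def backproject_rgbd(rgb, depth, fx, fy, cx, cy):
    """Lift an RGB-D frame into a colored 3D point cloud with a pinhole model.

    rgb:   (H, W, 3) uint8 image
    depth: (H, W) metric depth, aligned pixel-for-pixel with rgb
    fx, fy, cx, cy: pinhole intrinsics (assumed known for the virtual camera)
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates

    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)

    valid = points[:, 2] > 0  # drop pixels with no depth
    return points[valid], colors[valid]
```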

Core Architecture

Voyager is built on two main components. The first is its world-consistent video diffusion system, which simultaneously generates RGB and depth sequences with global scene coherence. The second is its long-range world exploration module, which relies on a caching system with point culling and an auto-regressive inference approach. This design allows for smooth video sampling and iterative scene expansion, maintaining consistent geometry and appearance over extended sequences.
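
To make the caching-and-culling idea concrete, here is a minimal point cache built on voxel hashing. It is a sketch of the general technique, not Voyager's actual implementation; the PointCache class and its voxel_size parameter are names chosen for this example:

```python
import numpy as np

class PointCache:
    """Minimal voxel-hash cache that accumulates points and culls duplicates.

    Illustrates the 'cache with point culling' idea: new points are kept only
    if their voxel cell is not already occupied, so repeated observations of
    the same surface do not grow the cache without bound.
    """

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        self.points = np.empty((0, 3), dtype=np.float32)
        self.occupied = set()  # occupied voxel indices

    def add(self, new_points):
        kept = []
        for p in np.asarray(new_points, dtype=np.float32):
            key = tuple(np.floor(p / self.voxel_size).astype(int))
            if key not in self.occupied:  # cull points landing in filled cells
                self.occupied.add(key)
                kept.append(p)
        if kept:
            self.points = np.vstack([self.points, np.array(kept)])
        return len(kept)

# Each newly generated clip would be back-projected to points and merged;
# overlapping geometry from later clips is culled instead of duplicated.
cache = PointCache(voxel_size=0.05)
cache.add(np.random.rand(1000, 3))   # points from a first clip
cache.add(np.random.rand(1000, 3))   # overlapping points from a second clip
print(len(cache.points))
```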


Data Engine and Training

To train Voyager, the team developed an automated data engine that processes raw video into 3D-compatible training material. This pipeline estimates camera poses and metric depth from arbitrary videos without requiring manual 3D annotations. By combining real-world captures with synthetic Unreal Engine renders, Tencent compiled a dataset of more than 100,000 video clips to support large-scale training.
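
A pipeline of this kind can be sketched as follows. The estimate_pose and estimate_depth callables stand in for whatever pose and metric-depth estimators the data engine uses, which the release does not pin down:

```python
import cv2  # pip install opencv-python

def build_training_clips(video_path, estimate_pose, estimate_depth, stride=2):
    """Turn a raw video into (frame, camera pose, metric depth) training samples.

    A sketch of an automated data engine: estimate_pose and estimate_depth are
    user-supplied estimators, not components named by the Voyager release.
    """
    cap = cv2.VideoCapture(video_path)
    samples, idx = [], 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if idx % stride == 0:                       # subsample frames
            frame = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
            pose = estimate_pose(frame)             # e.g. 4x4 camera-to-world matrix
            depth = estimate_depth(frame)           # per-pixel metric depth
            samples.append({"rgb": frame, "pose": pose, "depth": depth})
        idx += 1
    cap.release()
    return samples
```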

Performance and Benchmarking

Voyager has been tested against leading video generation frameworks on the WorldScore Benchmark. It achieved the highest average score of 77.62, outperforming systems such as WonderJourney, Gen-3, and CogVideoX-I2V. The model excelled in categories like content alignment, object control, and photometric consistency, while also maintaining strong results in 3D consistency and style coherence.


System Requirements

Running Voyager requires significant computational resources. A minimum of 60GB GPU memory is needed for 540p video generation, with 80GB recommended for optimal results. The system has been tested on Linux with NVIDIA GPUs supporting CUDA 11.8 or 12.4. Additional dependencies, including PyTorch 2.4.0, flash-attn v2, and xfuser for parallel inference, are required.
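
A quick way to check a machine against these requirements is to query the GPU through PyTorch before launching inference; the thresholds below simply mirror the published 60GB minimum and 80GB recommendation:

```python
import torch

def check_environment(min_gb=60, recommended_gb=80):
    """Sanity-check the local GPU against the published memory requirements."""
    if not torch.cuda.is_available():
        raise RuntimeError("A CUDA-capable NVIDIA GPU is required (CUDA 11.8 or 12.4 tested).")

    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, {total_gb:.1f} GB, CUDA {torch.version.cuda}, "
          f"PyTorch {torch.__version__}")

    if total_gb < min_gb:
        print(f"Warning: below the {min_gb} GB minimum for 540p generation.")
    elif total_gb < recommended_gb:
        print(f"Note: {recommended_gb} GB is recommended for best results.")

check_environment()
```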

Usage and Inference

Users can generate videos from a single input image by defining camera paths such as forward, backward, or rotational movements. Voyager supports both single-GPU inference and parallel inference across multiple GPUs through xDiT, a scalable engine optimized for diffusion transformers. Demonstrations highlight applications in video reconstruction, image-to-3D generation, and video depth estimation, making Voyager suitable for research and creative industries alike.
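
A camera path can be thought of as a sequence of camera-to-world poses. The snippet below builds a simple forward dolly as 4x4 matrices; it is illustrative only, since the exact trajectory format Voyager expects is defined by its own tooling, and the frame count and step size here are arbitrary:

```python
import numpy as np

def forward_trajectory(num_frames=48, step=0.05):
    """Camera-to-world poses for a simple forward dolly along the -z axis.

    num_frames and step are arbitrary example values, not Voyager defaults.
    """
    poses = []
    for i in range(num_frames):
        pose = np.eye(4)
        pose[2, 3] = -i * step        # translate the camera forward each frame
        poses.append(pose)
    return np.stack(poses)            # (num_frames, 4, 4)

trajectory = forward_trajectory()
print(trajectory.shape)
```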

Availability and Community

The code, pretrained models, and demo tools for Voyager are now available on GitHub, with support for installation via conda and pip. Tencent also provides a Gradio demo for users to interactively test the model. The research team encourages collaboration and discussion through WeChat and Discord groups.

Conclusion

HunyuanWorld-Voyager represents a step forward in 3D-aware generative AI, bridging video diffusion with spatially consistent world modeling. Its release not only provides researchers with a powerful tool for scene exploration and reconstruction but also signals Tencent’s ongoing investment in advancing large-scale 3D generation systems.