
DeepSeek's new multimodal image generation model: Janus-Pro

Created on January 28 | Last edited on January 28
The original Janus model was an ambitious attempt to unify multimodal understanding and generation. It introduced a decoupled architecture that handles the two tasks independently, avoiding conflicts in representation. While innovative, it was limited by its modest 1.5-billion-parameter scale, unstable text-to-image generation, training inefficiencies, and reliance on noisy datasets. Despite these challenges, Janus laid the groundwork for future advances in multimodal AI.
Janus-Pro builds on this foundation, incorporating a more robust architecture, improved training strategies, expanded datasets, and increased scale with a 7-billion-parameter model. Developed by DeepSeek-AI, Janus-Pro delivers state-of-the-art performance in both multimodal understanding and text-to-image generation, setting a new standard in unified AI systems.

Architecture

Janus-Pro employs a decoupled architecture, where visual encoding for multimodal understanding and generation is handled separately. This design eliminates representation conflicts that arise when using a shared encoder for both tasks. For understanding tasks, the SigLIP encoder extracts high-dimensional semantic features from images. These features are flattened into a one-dimensional sequence and mapped into the input space of the language model using an understanding adaptor.
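To make the data flow concrete, here is a minimal PyTorch sketch of the understanding path. The encoder stub, patch size, feature width, and the 4096-dimensional language-model input space are illustrative assumptions, not Janus-Pro's actual implementation.

```python
import torch
import torch.nn as nn

class SigLIPEncoderStub(nn.Module):
    """Hypothetical stand-in for the SigLIP vision encoder: maps an image
    to a 1-D sequence of high-dimensional semantic features."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        # 16x16 patches of a 384x384 image give a 24x24 feature grid.
        self.proj = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)

    def forward(self, images):                    # (B, 3, 384, 384)
        feats = self.proj(images)                 # (B, 1024, 24, 24)
        return feats.flatten(2).transpose(1, 2)   # (B, 576, 1024)

# Understanding adaptor: projects the flattened vision features into the
# language model's input space (4096 is an assumed hidden size).
understanding_adaptor = nn.Linear(1024, 4096)

images = torch.randn(2, 3, 384, 384)
vision_tokens = understanding_adaptor(SigLIPEncoderStub()(images))
print(vision_tokens.shape)  # torch.Size([2, 576, 4096])
```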
For text-to-image generation, a VQ tokenizer converts images into discrete IDs, each representing a small patch of the image. These IDs are flattened into a one-dimensional sequence, and a generation adaptor maps the codebook embeddings associated with each ID into the input space of the language model. The resulting image feature sequences, whether for understanding or generation, are concatenated with the text sequence and fed into a single unified autoregressive transformer that processes the multimodal data.
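The generation path can be sketched in the same spirit, under the same hypothetical sizes; `nn.Embedding` stands in for the VQ codebook lookup, and the final concatenation shows how image tokens join the text sequence before the transformer.

```python
import torch
import torch.nn as nn

codebook_size, code_dim, llm_dim, grid = 16384, 256, 4096, 24  # assumed sizes

# VQ codebook: each discrete ID indexes a learned patch embedding.
codebook = nn.Embedding(codebook_size, code_dim)
# Generation adaptor: maps codebook embeddings into the LLM input space.
generation_adaptor = nn.Linear(code_dim, llm_dim)

# Pretend a VQ tokenizer has already quantized a batch of images into IDs.
ids = torch.randint(0, codebook_size, (2, grid, grid))  # (B, 24, 24)
ids_seq = ids.flatten(1)                                # (B, 576) 1-D sequence
gen_tokens = generation_adaptor(codebook(ids_seq))      # (B, 576, 4096)

# Image tokens are concatenated with the embedded text prompt before
# entering the unified autoregressive transformer.
text_tokens = torch.randn(2, 32, llm_dim)
transformer_input = torch.cat([text_tokens, gen_tokens], dim=1)
```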

Once the transformer generates an output sequence for text-to-image tasks, the image is reconstructed using a dedicated image decoder. The discrete IDs are mapped back to the codebook embeddings, which are then reassembled into a latent grid. This grid represents the image in feature space. A decoder network reconstructs the final image from the latent grid, refining details and ensuring spatial coherence. While Janus-Pro is currently limited to a resolution of 384 x 384, the decoding process reliably produces semantically accurate and visually appealing outputs.
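A sketch of that decode step under the same assumed sizes: predicted IDs are looked up in the codebook, reshaped into a latent grid, and upsampled back to pixels. The two-layer transposed-convolution decoder here is a toy stand-in for the real decoder network.

```python
import torch
import torch.nn as nn

codebook_size, code_dim, grid = 16384, 256, 24  # assumed sizes
codebook = nn.Embedding(codebook_size, code_dim)

# Toy decoder: upsamples the 24x24 latent grid to a 384x384 RGB image.
decoder = nn.Sequential(
    nn.ConvTranspose2d(code_dim, 128, kernel_size=4, stride=4),  # 24 -> 96
    nn.ReLU(),
    nn.ConvTranspose2d(128, 3, kernel_size=4, stride=4),         # 96 -> 384
    nn.Tanh(),
)

# IDs as they would come out of the autoregressive transformer.
predicted_ids = torch.randint(0, codebook_size, (1, grid * grid))
latents = codebook(predicted_ids)                        # (1, 576, 256)
latent_grid = latents.transpose(1, 2).reshape(1, code_dim, grid, grid)
image = decoder(latent_grid)                             # (1, 3, 384, 384)
```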

Advancements in Janus-Pro

Janus-Pro introduces an optimized training strategy that addresses inefficiencies in the original Janus. Stage I extends training on ImageNet data, letting the model better capture pixel dependencies. Stage II trains directly on detailed text-to-image data rather than ImageNet-style prompts. In Stage III, the data mix is rebalanced, slightly reducing the proportion of text-to-image data to improve multimodal understanding without sacrificing generation quality.
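Read as a schedule, the three stages look roughly like the configuration below. The stage goals follow the text; the Stage III mix proportions are illustrative placeholders, not the paper's exact numbers.

```python
# Illustrative summary of the three-stage recipe described above.
# The Stage III proportions are placeholders, not the paper's exact mix.
TRAINING_STAGES = [
    {"stage": "I", "data": {"imagenet": 1.0},
     "goal": "longer training to model pixel dependence"},
    {"stage": "II", "data": {"dense_text_to_image": 1.0},
     "goal": "detailed text-to-image data, no ImageNet-style prompts"},
    {"stage": "III", "data": {"understanding": 0.5, "pure_text": 0.1,
                              "text_to_image": 0.4},
     "goal": "rebalanced mix that favors multimodal understanding"},
]

for stage in TRAINING_STAGES:
    print(f"Stage {stage['stage']}: {stage['data']} -- {stage['goal']}")
```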
The datasets used for Janus-Pro are significantly expanded. For multimodal understanding, the model incorporates 90 million additional samples, including datasets for image captions, table and chart recognition, and document analysis. For text-to-image generation, 72 million high-quality synthetic aesthetic samples are added, balancing the real-to-synthetic data ratio at 1:1. This expansion addresses issues with noisy data and improves both convergence and output stability.
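As a generic illustration of that 1:1 balance, and not DeepSeek's actual data pipeline, a sampler can simply draw half of every batch from each pool:

```python
import random

def balanced_batches(real, synthetic, batch_size=8, seed=0):
    """Yield batches drawing half from real and half from synthetic samples,
    approximating a 1:1 real-to-synthetic ratio."""
    rng = random.Random(seed)
    half = batch_size // 2
    while True:
        yield rng.sample(real, half) + rng.sample(synthetic, half)

real = [f"real_{i}" for i in range(1000)]
synthetic = [f"synthetic_{i}" for i in range(1000)]
print(next(balanced_batches(real, synthetic)))
```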
Janus-Pro scales up its architecture to a 7-billion-parameter model, compared to the original 1.5 billion. This increase in scale, combined with the decoupling of visual encoding, enables it to handle complex multimodal tasks with greater efficiency and accuracy.

Performance on Multimodal Understanding

Janus-Pro achieves leading results on multimodal benchmarks, outperforming previous models. On MMBench, a benchmark for multimodal understanding, Janus-Pro-7B scored 79.2, compared with 69.4 for the original Janus, 68.9 for TokenFlow-XL, and 75.2 for MetaMorph. Its decoupled architecture lets each task use an encoder suited to it, delivering competitive results even against larger models.


Performance in Text-to-Image Generation

Janus-Pro excels in text-to-image generation tasks, demonstrating superior instruction-following and output quality. On the GenEval leaderboard, Janus-Pro-7B scored 80 percent overall accuracy, surpassing DALL-E 3 at 67 percent and Stable Diffusion 3 Medium at 74 percent. On DPG-Bench, a benchmark designed to evaluate dense prompts, Janus-Pro achieved an industry-leading score of 84.19, showcasing its ability to generate intricate and semantically aligned images.


Janus-Pro's qualitative results highlight its ability to produce visually appealing outputs, even for challenging prompts. While the resolution is limited to 384 x 384, the images maintain strong semantic coherence and visual detail, addressing many of the shortcomings seen in the original model.

Conclusion

Janus-Pro represents a significant step forward in unified multimodal AI. By addressing the limitations of the original Janus model through optimized training, expanded datasets, and a scalable architecture, it has set new benchmarks for performance in both multimodal understanding and text-to-image generation.
As a product of DeepSeek-AI, Janus-Pro demonstrates the potential of scalable, efficient multimodal systems. It lays the foundation for further advancements in this rapidly evolving field, bridging the gap between understanding and generation tasks in artificial intelligence.
Tags: ML News