OmniGen2: Exploration to Advanced Multimodal Generation

💡 Quick Tips for Best Results (see our GitHub for more details)

  • Image Quality: Use high-resolution images (at least 512x512 recommended).
  • Be Specific: Instead of "Add bird to desk", try "Add the bird from image 1 to the desk in image 2".
  • Use English: English prompts currently yield better results.
  • Increase image_guidance_scale for better consistency with the reference image:
    • Image Editing: 1.3 - 2.0
    • In-context Generation: 2.0 - 3.0
  • For in-context editing (editing based on multiple images), we recommend the following prompt format: "Edit the first image: add/replace (the [object] with) the [object] from the second image. [description of your target image]." For example: "Edit the first image: add the man from the second image. The man is talking with a woman in the kitchen." (A code sketch using this format follows this list.)
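
A minimal sketch of such an in-context editing call, loosely following the inference example in the project's GitHub repository. The import path, checkpoint id, and keyword arguments (`input_images`, `text_guidance_scale`, `image_guidance_scale`) are assumptions based on that repository and may differ from the current API; check the repo before running.

```python
import torch
from PIL import Image

# Assumed import path, per the OmniGen2 GitHub inference example.
from omnigen2.pipelines.omnigen2.pipeline_omnigen2 import OmniGen2Pipeline

# Assumed checkpoint id on the Hugging Face Hub.
pipe = OmniGen2Pipeline.from_pretrained(
    "OmniGen2/OmniGen2", torch_dtype=torch.bfloat16
).to("cuda")

# A specific, English prompt that names each reference image explicitly.
prompt = (
    "Edit the first image: add the man from the second image. "
    "The man is talking with a woman in the kitchen."
)
input_images = [Image.open("scene.png"), Image.open("man.png")]

result = pipe(
    prompt=prompt,
    input_images=input_images,
    num_inference_steps=50,
    text_guidance_scale=5.0,
    image_guidance_scale=2.5,  # 2.0-3.0 recommended for in-context generation
)
result.images[0].save("output.png")
```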

Compared to OmniGen 1.0, OmniGen2 brings a number of improvements, but some issues remain, and it may take multiple attempts to achieve a satisfactory result.

Demo parameter ranges (as exposed by the interface):

  • Width: 256 to 1024
  • Height: 256 to 1024
  • Text Guidance Scale: 1 to 8
  • Image Guidance Scale: 1 to 3
  • CFG Range Start: 0 to 1
  • CFG Range End: 0 to 1
  • Inference Steps: 20 to 100
  • Number of images per prompt: 1 to 4
  • Seed: -1 to 2147483647
  • max_input_image_side_length: 256 to 2048
  • max_pixels: 65536 to 2359296
  • Scheduler: the scheduler to use for the model

In the prompt, refer to inputs as "first/second image" (or "第一张图/第二张图", Chinese for "first image/second image"); the demo accepts up to three input images and an optional negative prompt.
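
The max_pixels bound caps the total pixel count of each input image (65536 = 256 x 256; 2359296 = 1536 x 1536), while max_input_image_side_length caps the longer side. Below is a minimal sketch of how such a budget is typically enforced with aspect-ratio-preserving downscaling; it only illustrates the constraint and is not the demo's actual preprocessing code.

```python
from PIL import Image

def fit_pixel_budget(
    img: Image.Image,
    max_side: int = 2048,       # max_input_image_side_length
    max_pixels: int = 2359296,  # 1536 * 1536
) -> Image.Image:
    """Downscale so no side exceeds max_side and width * height
    stays within max_pixels, preserving the aspect ratio."""
    w, h = img.size
    scale = min(1.0, max_side / max(w, h), (max_pixels / (w * h)) ** 0.5)
    if scale < 1.0:
        img = img.resize(
            (max(1, int(w * scale)), max(1, int(h * scale))), Image.LANCZOS
        )
    return img

# Example: a 4000x3000 photo scales by ~0.443 to roughly 1773x1330,
# which satisfies both the side-length and pixel budgets.
```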
@article{wu2025omnigen2,
  title={OmniGen2: Exploration to Advanced Multimodal Generation},
  author={Chenyuan Wu and Pengfei Zheng and Ruiran Yan and Shitao Xiao and Xin Luo and Yueze Wang and Wanli Li and Xiyan Jiang and Yexin Liu and Junjie Zhou and Ze Liu and Ziyi Xia and Chaofan Li and Haoge Deng and Jiahao Wang and Kun Luo and Bo Zhang and Defu Lian and Xinlong Wang and Zhongyuan Wang and Tiejun Huang and Zheng Liu},
  journal={arXiv preprint arXiv:2506.18871},
  year={2025}
}