license: apache-2.0

🥯 BAGEL • Unified Model for Multimodal Understanding and Generation

We present BAGEL, an open-source multimodal foundation model with 7B active parameters (14B total) trained on large-scale interleaved multimodal data. BAGEL outperforms current top-tier open-source VLMs such as Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding leaderboards, and delivers text-to-image quality competitive with strong specialist generators such as SD3. In classical image-editing scenarios, BAGEL also produces better qualitative results than the leading open-source models. More importantly, it extends to free-form visual manipulation, multiview synthesis, and world navigation, "world-modeling" capabilities that are beyond the scope of previous image-editing models. Below is a showcase of BAGEL's qualitative performance.

📊 Benchmarks

1. Visual Understanding

Model (≈ 7 B class)	MMBench-C ↑	MMMU ↑	MM-Vet ↑	MathVista ↑
Janus-Pro-7B	79.2	41.0	50.0	–
Qwen2.5-VL	83.5	58.6	67.1	–
BAGEL (ours)	85.0	55.3	67.2	73.1
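
To make the per-benchmark gaps in the table above easier to read, the following is a minimal sketch that computes the score difference between two models. The scores are copied from the table; the `scores` dictionary and the `margins` helper are hypothetical names introduced only for this illustration and are not part of the BAGEL codebase.

```python
# Scores copied from the Visual Understanding table; None marks a "–" entry.
scores = {
    "Janus-Pro-7B": {"MMBench-C": 79.2, "MMMU": 41.0, "MM-Vet": 50.0, "MathVista": None},
    "Qwen2.5-VL": {"MMBench-C": 83.5, "MMMU": 58.6, "MM-Vet": 67.1, "MathVista": None},
    "BAGEL": {"MMBench-C": 85.0, "MMMU": 55.3, "MM-Vet": 67.2, "MathVista": 73.1},
}


def margins(model, baseline):
    """Return model-minus-baseline score per benchmark, skipping missing entries."""
    out = {}
    for bench, value in scores[model].items():
        other = scores[baseline][bench]
        if value is not None and other is not None:
            # Round to one decimal, matching the precision reported in the table.
            out[bench] = round(value - other, 1)
    return out


print(margins("BAGEL", "Qwen2.5-VL"))
# {'MMBench-C': 1.5, 'MMMU': -3.3, 'MM-Vet': 0.1}
```

A positive value means the first model scores higher on that benchmark. MathVista is skipped here because the table reports no score for the baseline.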

2. Text-to-Image Generation · GenEval

Model	Overall ↑
FLUX-1-dev	0.82
SD3-Medium	0.74
Janus-Pro-7B	0.80
BAGEL	0.88

3. Image Editing

Benchmark	Step1X-Edit	Gemini-2-exp.	BAGEL	BAGEL + CoT
GEdit-Bench-EN (↑)	7.09	–	7.36	–
IntelligentBench (↑)	14.9	57.6	44.0	55.3

License

BAGEL is licensed under the Apache 2.0 license. It is fine-tuned from Qwen2.5-7B-Instruct and uses the FLUX.1-schnell VAE model and the siglip-so400m-14-980-flash-attn2-navit model, all of which are released under Apache 2.0.

✍️ Citation

@article{deng2025bagel,
  title   = {Emerging Properties in Unified Multimodal Pretraining},
  author  = {Deng, Chaorui and Zhu, Deyao and Li, Kunchang and Gou, Chenhui and Li, Feng and Wang, Zeyu and Zhong, Shu and Yu, Weihao and Nie, Xiaonan and Song, Ziang and Shi, Guang and Fan, Haoqi},
  journal = {arXiv preprint arXiv:TODO},
  year    = {2025}
}