UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
Abstract
UniRL is a self-improving post-training method for unified multimodal large language models that uses generated images as training data, enhancing both generation and understanding tasks without external data.
Unified multimodal large language models such as Show-o and Janus have achieved strong performance across both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. In addition, several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only a few additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and models will be released at https://github.com/showlab/UniRL.
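The self-improvement loop the abstract describes can be sketched in a few lines: the model generates a group of images for a prompt, its own understanding head scores how well each image matches the prompt, and the scores are turned into GRPO-style group-relative advantages that would weight the policy-gradient update. The sketch below is a minimal illustration under stated assumptions: `generate_image`, `understand`, and the reward rule are hypothetical stand-ins, not the actual Show-o or Janus API, and the real method backpropagates through the model rather than returning the advantages.

```python
import random

random.seed(0)

# Hypothetical stand-in for the generation head: in UniRL the unified
# model itself renders an image from the prompt.
def generate_image(prompt):
    return {"prompt": prompt, "pixels": [random.random() for _ in range(4)]}

# Hypothetical stand-in for the understanding head: it answers a
# question about the generated image (faked here as a coin flip).
def understand(image, question):
    return random.choice(["yes", "no"])

def reward(prompt, image):
    # Simplified verifier signal: 1.0 if the understanding head confirms
    # the image matches the prompt, else 0.0.
    answer = understand(image, f"Does this image show {prompt}?")
    return 1.0 if answer == "yes" else 0.0

def grpo_advantages(rewards):
    # GRPO normalizes each sample's reward against the mean and std of
    # its group (all samples drawn for the same prompt).
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero when all rewards tie
    return [(r - mean) / std for r in rewards]

def self_improve_step(prompt, group_size=4):
    # One iteration: generate a group, score it with the model's own
    # understanding, compute group-relative advantages. A real
    # implementation would apply a policy-gradient update weighted by
    # these advantages; here we just return them.
    images = [generate_image(prompt) for _ in range(group_size)]
    rewards = [reward(prompt, img) for img in images]
    return rewards, grpo_advantages(rewards)

rewards, advs = self_improve_step("a red cube on a blue sphere")
print(rewards, advs)
```

Note how no external image data enters the loop: both the training samples and the supervision signal come from the model's own generation and understanding heads, which is the coupling the paper exploits.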
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO (2025)
- UFT: Unifying Supervised and Reinforcement Fine-Tuning (2025)
- OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation (2025)
- Co-Reinforcement Learning for Unified Multimodal Understanding and Generation (2025)
- Emerging Properties in Unified Multimodal Pretraining (2025)
- T2I-ConBench: Text-to-Image Benchmark for Continual Post-training (2025)
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset (2025)