Abstract
Text-to-image diffusion models have achieved remarkable progress in recent years. However, training models for high-resolution image generation remains challenging, particularly when training data and computational resources are limited. In this paper, we explore this practical problem from two key perspectives: data and parameter efficiency, and propose a set of key guidelines for ultra-resolution adaptation, termed URAE. For data efficiency, we demonstrate both theoretically and empirically that synthetic data generated by certain teacher models can significantly accelerate training convergence. For parameter efficiency, we find that tuning the minor components of the weight matrices outperforms widely used low-rank adapters when synthetic data are unavailable, offering substantial performance gains while maintaining efficiency. Additionally, for models leveraging guidance distillation, such as FLUX, we show that disabling classifier-free guidance, i.e., setting the guidance scale to 1 during adaptation, is crucial for satisfactory performance. Extensive experiments validate that URAE achieves 2K-generation performance comparable to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations, while setting new benchmarks for 4K-resolution generation. Code is available at https://github.com/Huage001/URAE.
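To make the parameter-efficiency idea concrete, here is a minimal sketch of how "tuning the minor components of a weight matrix" can be set up via SVD: the dominant singular components are kept frozen while the low-magnitude residual is exposed for fine-tuning. The function name and the choice of split rank are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def split_minor_components(W: np.ndarray, rank_major: int):
    """Split W into a frozen major part and a trainable minor part via SVD.

    Illustrative sketch only: URAE's exact decomposition and training
    procedure are described in the paper, not here.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Major part: top singular components, kept frozen during adaptation.
    W_major = U[:, :rank_major] @ np.diag(S[:rank_major]) @ Vt[:rank_major]
    # Minor part: the low-magnitude remainder, exposed as trainable weights.
    W_minor = U[:, rank_major:] @ np.diag(S[rank_major:]) @ Vt[rank_major:]
    return W_major, W_minor

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W_major, W_minor = split_minor_components(W, rank_major=48)
# The two parts reconstruct the original weight exactly.
assert np.allclose(W_major + W_minor, W)
```

For the guidance point in the abstract, the analogous setting in a standard inference/adaptation pipeline would be passing a guidance scale of 1.0 (e.g., `guidance_scale=1.0` in diffusers-style pipelines, assuming that API); consult the repository for the exact training configuration.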
Community
🚀Github: https://github.com/Huage001/URAE
🚀URAE-Demo: https://huggingface.co/spaces/Yuanshi/URAE
🚀URAE-dev-Demo: https://huggingface.co/spaces/Yuanshi/URAE_dev
🚀Model-Weights: https://huggingface.co/Huage001/URAE
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Augmented Conditioning Is Enough For Effective Training Image Generation (2025)
- CascadeV: An Implementation of Wurstchen Architecture for Video Generation (2025)
- Masked Autoencoders Are Effective Tokenizers for Diffusion Models (2025)
- Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation (2025)
- IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models (2025)
- CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models (2025)
- Efficient Transformer for High Resolution Image Motion Deblurring (2025)