VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning Paper • 2504.07960 • Published Apr 10 • 48
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step Paper • 2504.01956 • Published Apr 2 • 40
Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM Article • Mar 12 • 412
EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer Paper • 2503.07027 • Published Mar 10 • 29
microsoft/Phi-4-multimodal-instruct Automatic Speech Recognition • Updated 11 days ago • 277k • 1.38k
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation Paper • 2502.13128 • Published Feb 18 • 42
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation Paper • 2502.07870 • Published Feb 11 • 45
On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices Paper • 2502.04363 • Published Feb 5 • 12
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces Paper • 2501.12909 • Published Jan 22 • 70