VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning Paper • 2504.08837 • Published Apr 10 • 42
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model Paper • 2504.10068 • Published Apr 14 • 30
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations Paper • 2504.10481 • Published Apr 14 • 84
Efficient Generative Model Training via Embedded Representation Warmup Paper • 2504.10188 • Published Apr 14 • 12
A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce Paper • 2504.11343 • Published Apr 15 • 17
NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors Paper • 2504.11427 • Published Apr 15 • 19
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs Paper • 2504.11536 • Published Apr 15 • 61
D^2iT: Dynamic Diffusion Transformer for Accurate Image Generation Paper • 2504.09454 • Published Apr 13 • 12
Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning Paper • 2504.08672 • Published Apr 11 • 55
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling Paper • 2504.13169 • Published Apr 17 • 39
DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging Paper • 2504.12364 • Published Apr 16 • 21
Iterative Self-Training for Code Generation via Reinforced Re-Ranking Paper • 2504.09643 • Published Apr 13 • 34
DataDecide: How to Predict Best Pretraining Data with Small Experiments Paper • 2504.11393 • Published Apr 15 • 17
Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution Paper • 2504.09566 • Published Apr 13 • 10
AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference Paper • 2504.10326 • Published Apr 14 • 25
InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework Paper • 2504.12395 • Published Apr 16 • 17
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation Paper • 2504.13055 • Published Apr 17 • 19
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models Paper • 2504.10449 • Published Apr 14 • 12
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models Paper • 2504.05303 • Published Apr 7 • 5
IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments Paper • 2504.06827 • Published Apr 9
Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation Paper • 2504.14899 • Published Apr 21 • 21
Analyzing LLMs' Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations Paper • 2504.13816 • Published Apr 18 • 17
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents Paper • 2504.13203 • Published Apr 15 • 31
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space Paper • 2504.13835 • Published Apr 18 • 38
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs Paper • 2504.14655 • Published Apr 20 • 19
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks Paper • 2504.15521 • Published Apr 22 • 64
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities Paper • 2504.16078 • Published Apr 22 • 20
Vidi: Large Multimodal Models for Video Understanding and Editing Paper • 2504.15681 • Published Apr 22 • 15
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning Paper • 2504.16080 • Published Apr 22 • 15
Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model Paper • 2504.15843 • Published Apr 22 • 18
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models Paper • 2504.15279 • Published Apr 21 • 74
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation Paper • 2504.17207 • Published Apr 24 • 29
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation Paper • 2504.17502 • Published Apr 24 • 54
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models Paper • 2504.17789 • Published Apr 24 • 23
The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs Paper • 2504.17768 • Published Apr 24 • 13
BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs Paper • 2504.18415 • Published Apr 25 • 42
DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models Paper • 2504.15716 • Published Apr 22 • 9
Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark Paper • 2504.16427 • Published Apr 23 • 17
UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities Paper • 2504.20734 • Published Apr 29 • 61
WebThinker: Empowering Large Reasoning Models with Deep Research Capability Paper • 2504.21776 • Published Apr 30 • 54
Sadeed: Advancing Arabic Diacritization Through Small Language Model Paper • 2504.21635 • Published Apr 30 • 58
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math Paper • 2504.21233 • Published Apr 30 • 44
RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning Paper • 2504.18904 • Published Apr 26 • 9
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT Paper • 2505.00703 • Published May 1 • 41
Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks Paper • 2505.00234 • Published May 1 • 26
ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction Paper • 2504.21855 • Published Apr 30 • 12
Improving Editability in Image Generation with Layer-wise Memory Paper • 2505.01079 • Published about 1 month ago • 28
Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts Paper • 2504.21117 • Published Apr 29 • 25
Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction Paper • 2505.02471 • Published 28 days ago • 12
Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning Paper • 2505.01441 • Published Apr 28 • 36
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis Paper • 2505.02625 • Published 27 days ago • 22
A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency Paper • 2505.01658 • Published 30 days ago • 35
Absolute Zero: Reinforced Self-play Reasoning with Zero Data Paper • 2505.03335 • Published 27 days ago • 166
RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference Paper • 2505.02922 • Published 27 days ago • 26
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant Paper • 2505.05467 • Published 24 days ago • 13
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training Paper • 2505.00358 • Published May 1 • 24
OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents Paper • 2505.03570 • Published 26 days ago • 7
LLM-Independent Adaptive RAG: Let the Question Speak for Itself Paper • 2505.04253 • Published 26 days ago • 12
Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models Paper • 2505.02847 • Published May 1 • 27
X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains Paper • 2505.03981 • Published 26 days ago • 14
Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers Paper • 2505.04842 • Published 25 days ago • 12
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning Paper • 2505.04601 • Published 25 days ago • 26
HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation Paper • 2504.21650 • Published Apr 30 • 15
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder Paper • 2505.07916 • Published 20 days ago • 120
Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis Paper • 2505.09358 • Published 18 days ago • 24
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning Paper • 2505.10557 • Published 17 days ago • 44
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning Paper • 2505.10320 • Published 17 days ago • 22
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning Paper • 2505.08617 • Published 19 days ago • 40
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models Paper • 2505.10554 • Published 17 days ago • 115
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis Paper • 2505.10046 • Published 18 days ago • 9
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly Paper • 2505.10610 • Published 17 days ago • 52
Simple Semi-supervised Knowledge Distillation from Vision-Language Models via texttt{D}ual-texttt{H}ead texttt{O}ptimization Paper • 2505.07675 • Published 20 days ago • 19
MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision Paper • 2505.13427 • Published 13 days ago • 25
CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models Paper • 2505.12504 • Published 14 days ago • 23
Improving Assembly Code Performance with Large Language Models via Reinforcement Learning Paper • 2505.11480 • Published 16 days ago • 8
Emerging Properties in Unified Multimodal Pretraining Paper • 2505.14683 • Published 12 days ago • 124
Vid2World: Crafting Video Diffusion Models to Interactive World Models Paper • 2505.14357 • Published 12 days ago • 25
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Paper • 2505.13227 • Published 13 days ago • 44
Think Only When You Need with Large Hybrid-Reasoning Models Paper • 2505.14631 • Published 12 days ago • 18
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective Paper • 2505.15045 • Published 12 days ago • 52
Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen! Paper • 2505.15656 • Published 11 days ago • 13
RLVR-World: Training World Models with Reinforcement Learning Paper • 2505.13934 • Published 13 days ago • 14
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models Paper • 2505.14810 • Published 12 days ago • 58
Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding Paper • 2505.16990 • Published 10 days ago • 20
NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification Paper • 2505.16938 • Published 10 days ago • 110
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design Paper • 2505.16175 • Published 11 days ago • 38
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models Paper • 2505.17015 • Published 10 days ago • 8
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning Paper • 2505.15966 • Published 11 days ago • 48
Distilling LLM Agent into Small Models with Retrieval and Code Tools Paper • 2505.17612 • Published 10 days ago • 75
One RL to See Them All: Visual Triple Unified Reinforcement Learning Paper • 2505.18129 • Published 9 days ago • 56
QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning Paper • 2505.17667 • Published 10 days ago • 81
Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models Paper • 2505.17225 • Published 10 days ago • 62
Shifting AI Efficiency From Model-Centric to Data-Centric Compression Paper • 2505.19147 • Published 7 days ago • 139
Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles Paper • 2505.19914 • Published 6 days ago • 39
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration Paper • 2505.20256 • Published 6 days ago • 16
Interleaved Reasoning for Large Language Models via Reinforcement Learning Paper • 2505.19640 • Published 7 days ago • 12
Alchemist: Turning Public Text-to-Image Data into Generative Gold Paper • 2505.19297 • Published 7 days ago • 73
s3: You Don't Need That Much Data to Train a Search Agent via RL Paper • 2505.14146 • Published 13 days ago • 16
FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow Paper • 2505.17399 • Published 10 days ago • 14
MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems Paper • 2505.18943 • Published 8 days ago • 24
Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms Paper • 2505.20322 • Published 9 days ago • 13
DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction Paper • 2505.21473 • Published 5 days ago • 13
MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios Paper • 2505.21333 • Published 5 days ago • 37
Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning Paper • 2505.17813 • Published 9 days ago • 50
GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning Paper • 2505.20355 • Published 7 days ago • 35
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization Paper • 2505.19000 • Published 8 days ago • 39
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO Paper • 2505.22453 • Published 4 days ago • 42
VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection Paper • 2505.20289 • Published 6 days ago • 9
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows Paper • 2505.19897 • Published 6 days ago • 98
rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset Paper • 2505.21297 • Published 5 days ago • 25
ZeroGUI: Automating Online GUI Learning at Zero Human Cost Paper • 2505.23762 • Published 3 days ago • 42
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning Paper • 2505.23380 • Published 3 days ago • 20
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence Paper • 2505.23747 • Published 3 days ago • 60
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model Paper • 2505.23606 • Published 3 days ago • 13
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding Paper • 2505.22618 • Published 4 days ago • 26