First Finish Search: Efficient Test-Time Scaling in Large Language Models
Abstract
First Finish Search improves accuracy in large language models by stopping inference at the first completed sample, significantly outperforming other decoding strategies in reasoning tasks.
Test-time scaling (TTS), which involves dynamic allocation of compute during inference, offers a promising way to improve reasoning in large language models. While existing TTS methods work well, they often rely on long decoding paths or require a large number of samples to be generated, which increases token usage and inference latency. We observe the surprising fact that, for reasoning tasks, shorter traces are much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches n independent samples and returns as soon as any one completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B, and Phi-4-Reasoning-Plus) and across four datasets (AIME24, AIME25-I, AIME25-II, and GPQA Diamond). With DeepSeek-R1, FFS achieves 82.23% accuracy on the AIME datasets, a 15% improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini performance. Our theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal. The elegance and simplicity of FFS demonstrate that straightforward TTS strategies can perform remarkably well, revealing the untapped potential of simple approaches at inference time.
Community
New Paper Alert: First Finish Search: Efficient Test-Time Scaling in LLMs
We introduce First Finish Search (FFS), a simple yet surprisingly effective test-time decoding strategy for improving reasoning in large language models (LLMs). FFS launches multiple decoding paths in parallel and stops as soon as any one of them finishes, requiring no beam search or reranking.
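For readers curious about the mechanics, here is a minimal Python sketch of the FFS control flow. The `generate_sample(prompt, seed)` callable and the `n=8` default are illustrative assumptions, not the paper's implementation or any specific serving library's API; FFS only requires launching n independent decodes and returning the first to finish.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_finish_search(prompt, generate_sample, n=8):
    """Return the answer from whichever of n parallel samples finishes first."""
    pool = ThreadPoolExecutor(max_workers=n)
    futures = [pool.submit(generate_sample, prompt, seed) for seed in range(n)]
    try:
        for future in as_completed(futures):
            # The first completed sample is taken as the final answer (FFS).
            return future.result()
    finally:
        # Drop any not-yet-started work; a real serving stack would also
        # abort the still-running decodes to save tokens.
        pool.shutdown(wait=False, cancel_futures=True)
```

In practice the n samples would be issued to an inference server and the in-flight decodes aborted once the first one returns; the sketch above only approximates that with `cancel_futures`.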
Key Insights:
- Shorter reasoning traces are often more accurate than longer ones.
- FFS is training-free, parallelizable, and drastically reduces latency and token usage.
- Achieves 82.23% accuracy on the AIME datasets using DeepSeek-R1, a 15% gain over the base model, rivaling OpenAI's o4-mini.
We benchmark FFS against beam search, majority voting, and budget forcing across four reasoning models and four challenging datasets (AIME24, AIME25-I/II, and GPQA Diamond).
Our theoretical analysis explains why stopping early often works, and when it might not.
Read the paper: https://arxiv.org/abs/2505.18149
Authors: Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty
Happy to discuss or collaborate! Feel free to reach out or ask questions.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API.
- Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence (2025)
- SSR: Speculative Parallel Scaling Reasoning in Test-time (2025)
- Scaling Reasoning can Improve Factuality in Large Language Models (2025)
- Dynamic Early Exit in Reasoning Models (2025)
- Value-Guided Search for Efficient Chain-of-Thought Reasoning (2025)
- Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement (2025)
- HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization (2025)