AutoBench Run 2 Results Are Out! Surprise: Gemini 2.5 Pro Is Not the Best Affordable Thinking Model
Explore the Performance of o4 Mini, GPT-4.1 Mini, Gemini 2.5 Pro, Claude 3.7 Sonnet: Thinking, DeepSeek V3-0324, and All the Latest Models with Our New Interactive Leaderboard!
Following up on our initial introduction to AutoBench, we're thrilled to announce the completion of our second major benchmark run and, even more excitingly, the launch of the AutoBench Interactive Leaderboard! This new tool, hosted on Hugging Face Spaces, provides an accessible and dynamic way to explore the rich results from this latest evaluation. Dive straight into the results: the AutoBench Leaderboard, covering the top 25 LLMs.
This second run, completed on April 28, 2025, evaluated 25 cutting-edge Large Language Models (LLMs), including newcomers like o4 Mini, GPT-4.1 Mini, Gemini 2.5 Pro, Claude 3.7 Sonnet: Thinking, and DeepSeek V3-0324. We didn't just rank them on conversational quality using our unique "Collective-LLM-as-a-judge" method; we also incorporated crucial cost and latency metrics, offering a more holistic view of model performance.
AutoBench Run 2: Methodology & Scale
AutoBench uses a distinct evaluation process; for the details, consult the AutoBench Hugging Face page. The key features of the method are:
1. LLM-Generated Questions: High-quality, diverse questions are generated by capable LLMs across numerous domains (logic, coding, history, science, etc.) and ranked to ensure relevance.
2. LLM-as-a-Judge: The core of AutoBench involves using multiple LLMs to collectively rank the quality of responses generated by the models under test.
The new version of AutoBench, which will soon be released as open source just like version 1.0, provides a more efficient ranking process and was designed to also handle responses from "thinking" models. This enabled us to use several powerful thinking models for both answer generation and ranking, increasing the overall quality of the benchmark.
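To make the "Collective-LLM-as-a-judge" idea more concrete, here is a minimal, illustrative sketch of how pairwise preferences from several judge models could be aggregated into a per-model score. It is not the actual AutoBench ranking or weighting algorithm: the `judge_rank` placeholder and the plain win-rate average are assumptions made purely for illustration.

```python
from itertools import combinations
from statistics import mean

def judge_rank(judge: str, question: str, answer_a: str, answer_b: str) -> str:
    """Placeholder judge: a real run would call the judge LLM via an API.
    Here it simply prefers the longer answer so the sketch stays runnable."""
    return "a" if len(answer_a) >= len(answer_b) else "b"

def collective_scores(question: str, answers: dict[str, str], judges: list[str]) -> dict[str, float]:
    """Aggregate pairwise preferences from multiple judge models into a win rate
    per answering model. A plain average is used here; AutoBench's actual
    weighting scheme is more elaborate."""
    wins: dict[str, list[float]] = {model: [] for model in answers}
    for (model_a, ans_a), (model_b, ans_b) in combinations(answers.items(), 2):
        for judge in judges:
            preferred = judge_rank(judge, question, ans_a, ans_b)
            wins[model_a].append(1.0 if preferred == "a" else 0.0)
            wins[model_b].append(1.0 if preferred == "b" else 0.0)
    return {model: mean(scores) for model, scores in wins.items()}

# Example with hypothetical answers and judges.
scores = collective_scores(
    "Explain P99 latency in one paragraph.",
    {
        "model_a": "short answer",
        "model_b": "a somewhat longer answer",
        "model_c": "the longest answer of the three",
    },
    judges=["judge_1", "judge_2", "judge_3"],
)
print(scores)
```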
Run 2 Details:
- Date Completed: April 28, 2025
- Models Tested: 25 contemporary LLMs (22 used as rankers)
- Iterations: ~310 (unique ranked questions)
- Answers Generated: 7,700+
- Pairwise Ranks Collected: 180,000+
- Average Answer Length: 10k+ tokens
- New Metrics:
- Average Cost: Cost per response (in USD Cents).
- Average Latency: Average response duration (in seconds).
- P99 Latency: 99th percentile response duration (in seconds), highlighting consistency.
Please note that AutoBench is designed to generate highly challenging questions for LLMs across a wide range of domains (coding, creative writing, current news, general culture, grammar, history, logic, math, science, and technology). Answer length ranges from around 2k tokens for fast models all the way to 20k+ tokens for "heavy thinkers" such as DeepSeek R1.
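For reference, the cost and latency metrics listed above are straightforward to reproduce from per-response logs. The snippet below is a minimal sketch: the field names (`cost_cents`, `latency_s`) are illustrative, not the actual AutoBench log schema.

```python
import numpy as np

# Per-response measurements for one model. Field names are illustrative only.
responses = [
    {"cost_cents": 0.8, "latency_s": 12.4},
    {"cost_cents": 1.1, "latency_s": 9.7},
    {"cost_cents": 0.9, "latency_s": 74.2},  # an occasional slow outlier
]

costs = np.array([r["cost_cents"] for r in responses])
latencies = np.array([r["latency_s"] for r in responses])

avg_cost = costs.mean()                      # Average Cost (USD cents per response)
avg_latency = latencies.mean()               # Average Latency (seconds)
p99_latency = np.percentile(latencies, 99)   # P99 Latency: 99% of responses finish within this time

print(f"avg cost: {avg_cost:.2f}c, avg latency: {avg_latency:.1f}s, P99 latency: {p99_latency:.1f}s")
```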
Validation: How AutoBench Compares to Other Benchmarks
A crucial question for any new benchmark, especially an automated one like AutoBench, is how well it aligns with existing, trusted evaluation methods, particularly those involving human preference. To validate our "LLM-as-a-judge" approach, we compared the rankings from AutoBench Run 2 against two prominent external benchmarks:
- Chatbot Arena (CBA): A widely respected benchmark based on crowdsourced human votes comparing LLM outputs side-by-side.
- Artificial Analysis Intelligence Index (AAII): A composite index assessing LLMs across reasoning, knowledge, math, and coding tasks.
The results show a compelling alignment:
- AutoBench vs. Chatbot Arena: 82.51% correlation.
- AutoBench vs. AAII: 83.74% correlation.
This strong correlation, especially with the human-preference-driven Chatbot Arena, lends significant credibility to AutoBench's automated methodology. It suggests that our LLM-as-a-judge system effectively captures nuances in model quality and capability that resonate with human evaluation, providing a reliable and scalable alternative for assessing LLM performance.
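As a rough illustration of this kind of validation, a correlation between two benchmarks can be computed directly from their per-model scores, as sketched below. The numbers are placeholders (not the actual Run 2 values), and this sketch does not assert which correlation measure (linear vs. rank-based) produced the figures above.

```python
from scipy.stats import pearsonr, spearmanr

# Placeholder scores for the same five models under two benchmarks,
# listed in the same model order. NOT the actual Run 2 data.
autobench_scores = [4.6, 4.4, 4.3, 4.1, 3.9]
external_scores = [1410, 1390, 1385, 1340, 1310]  # e.g. Arena-style ratings (hypothetical)

r, _ = pearsonr(autobench_scores, external_scores)     # linear correlation
rho, _ = spearmanr(autobench_scores, external_scores)  # rank correlation
print(f"Pearson r = {r:.2%}, Spearman rho = {rho:.2%}")
```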
Key Findings: Overall AutoBench Rankings
Based purely on the AutoBench score derived from the LLM judges, the top-performing models in this run were:
1. o4-mini-2025-04-16: 4.57
2. gemini-2.5-pro-preview-03-25: 4.46
3. claude-3.7-sonnet:thinking: 4.39
4. gpt-4.1-mini: 4.34
5. grok-3-beta: 4.34
To our surprise, and contrary to most other benchmarks, o4-mini proves to be the top model in almost all domains. In general, OpenAI models take the top spots in domains such as "Math" and "Logic" that require strong "reasoning" skills.
The full, sortable rankings are available on the interactive leaderboard.
The Performance vs. Cost vs. Latency Trade-off
While the AutoBench score reflects judged quality, real-world deployment requires considering efficiency. Our analysis revealed significant trade-offs:
- Top Performers: As expected, models achieving the highest AutoBench scores, such as claude-3.7-sonnet:thinking, grok-3-beta, and gemini-2.5-pro-preview-03-25, incur API costs one or even two orders of magnitude higher than those of smaller, faster models.
- Value Leaders: Models like gemini-2.0-flash-001, gemma-3-27b-it, gpt-4o-mini, and several Llama variants offer compelling value propositions, delivering respectable performance at a lower cost and often with faster response times.

Figure: The trade-off between each LLM's AutoBench performance rank and its average cost per response in USD. Note that the log scale shows pricing spanning two orders of magnitude.
- Latency Insights: The P99 latency metric proved insightful. Models like gemini-2.0-flash-001 and nova-pro-v1 demonstrated consistent speed (low P99), whereas others like deepseek-r1 and deepseek-v3-0324 were prone to occasional, significant delays (high P99), which could impact user experience. These results are in line with the measured average answer duration for each model.

Figure: The relationship between AutoBench performance rank and the 99th percentile (P99) of response duration in seconds for the evaluated LLMs. It highlights how consistently fast (or slow) models are, showing the potential impact on user experience, particularly for outlier, slower responses.
These multi-dimensional results underscore the importance of choosing models based on specific application needs, balancing quality, budget, and responsiveness. The interactive leaderboard is designed specifically to help navigate these trade-offs.
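One practical way to navigate these trade-offs is to filter candidate models against an application's cost and latency budgets and then pick the highest-scoring survivor. The sketch below uses dummy model names and illustrative column names (`score`, `avg_cost_cents`, `p99_latency_s`); it is a simple selection heuristic, not part of the AutoBench tooling.

```python
import pandas as pd

# Dummy rows with illustrative column names; swap in a real leaderboard export.
models = pd.DataFrame(
    [
        {"model": "model_a", "score": 4.5, "avg_cost_cents": 12.0, "p99_latency_s": 180.0},
        {"model": "model_b", "score": 4.2, "avg_cost_cents": 1.5, "p99_latency_s": 35.0},
        {"model": "model_c", "score": 3.9, "avg_cost_cents": 0.3, "p99_latency_s": 8.0},
    ]
)

def best_within_budget(df: pd.DataFrame, max_cost_cents: float, max_p99_s: float) -> pd.DataFrame:
    """Keep models within the cost and latency budgets, best score first."""
    eligible = df[(df["avg_cost_cents"] <= max_cost_cents) & (df["p99_latency_s"] <= max_p99_s)]
    return eligible.sort_values("score", ascending=False)

# Example: a latency-sensitive, cost-constrained application.
print(best_within_budget(models, max_cost_cents=2.0, max_p99_s=60.0))
```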
Domain-Specific Strengths and Weaknesses
AutoBench evaluates performance across various domains, revealing specific model strengths:
- o4-mini-2025-04-16: Showcased broad excellence, performing exceptionally well in challenging domains like Math and Science.
- gemini-2.5-pro-preview-03-25: Displayed particular strength in Technology, General Culture, and History.
- Math Domain: Continued to be a difficult area for numerous models, highlighting its value as a differentiator in LLM capabilities.
You can filter by domain on the leaderboard to explore these granular insights further.
Explore the Results: The AutoBench Interactive Leaderboard
Built with Gradio and hosted right here on Hugging Face Spaces, the leaderboard makes exploring our comprehensive benchmark data intuitive and insightful. Access the Leaderboard Here: https://huggingface.co/spaces/AutoBench/AutoBench-Leaderboard
Key features include:
- Multi-Metric Sorting: Rank models by AutoBench score, cost, average latency, or P99 latency.
- Interactive Plots: Visualize the complex trade-offs between performance, cost, and speed.
- Domain Filtering: Analyze model performance within specific areas like Coding, Logic, or Creative Writing.
- Up-to-Date Comparisons: Easily compare the latest LLMs evaluated in our April 2025 run.
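For readers curious how such a leaderboard fits together, the snippet below is a stripped-down Gradio sketch with dummy data and hypothetical column names. It is not the actual Space's code, just an illustration of multi-metric sorting in a Gradio Dataframe.

```python
import gradio as gr
import pandas as pd

# Dummy rows and illustrative column names, not the real leaderboard data.
df = pd.DataFrame(
    [
        {"model": "model_a", "autobench_score": 4.5, "avg_cost_cents": 12.0, "p99_latency_s": 180.0},
        {"model": "model_b", "autobench_score": 4.2, "avg_cost_cents": 1.5, "p99_latency_s": 35.0},
        {"model": "model_c", "autobench_score": 3.9, "avg_cost_cents": 0.3, "p99_latency_s": 8.0},
    ]
)

def sort_table(column: str) -> pd.DataFrame:
    # Higher is better for the score; lower is better for cost and latency.
    ascending = column != "autobench_score"
    return df.sort_values(column, ascending=ascending)

with gr.Blocks() as demo:
    gr.Markdown("## Mini leaderboard sketch (illustrative data)")
    metric = gr.Dropdown(
        choices=["autobench_score", "avg_cost_cents", "p99_latency_s"],
        value="autobench_score",
        label="Sort by",
    )
    table = gr.Dataframe(value=sort_table("autobench_score"))
    metric.change(sort_table, inputs=metric, outputs=table)

demo.launch()
```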
Data Release and Future Plans
In the spirit of transparency and community collaboration, we are releasing:
- Data Samples: Representative question/answer/rank samples from the run.
- Detailed Iteration Data: Granular, iteration-level results for in-depth analysis.
Find all data, code, and related information on the main AutoBench Hugging Face page: https://huggingface.co/AutoBench
Additionally, this run was powered by a significantly improved version of the AutoBench engine, enhancing efficiency and speed. We are preparing to release this as AutoBench 1.1 (Open Source) in the near future – stay tuned!
Support & Acknowledgements
We extend our sincere gratitude to Translated (https://translated.com/) for their generous support of the AutoBench project through the provision of valuable LLM compute credits. This support was instrumental in enabling the extensive evaluations conducted in this run.
We also want to express our deep appreciation to the following individuals for their extremely valuable support and insightful feedback throughout the development and execution of AutoBench:
Their expertise and guidance have been invaluable to the AutoBench project.
Get involved
AutoBench is a step towards more robust, scalable, and future-proof LLM evaluation. We invite you to explore the code, run the benchmark, contribute to its development, and join the discussion on the future of LLM evaluation!
- Explore the code and data: Hugging Face AutoBench Repository
- Try our Demo on Spaces: AutoBench 1.0 Demo
- Contribute: Help us by suggesting new topics, refining prompts, or enhancing the weighting algorithm—submit pull requests or issues via the Hugging Face Repo.
We strongly encourage the AI community to engage with the interactive leaderboard, explore the released data, and share feedback. AutoBench aims to be a dynamic, evolving resource, and we look forward to future runs and the open-source release of AutoBench 1.1.