AutoBench Run 2 Results are Out! Surprise: Gemini 2.5 Pro is not the Best Affordable Thinking Model

Published April 29, 2025

Explore the Performance of o4 Mini, GPT-4.1 Mini, Gemini 2.5 Pro, Claude 3.7 Sonnet:thinking, DeepSeek V3-0324, and All the Latest Models with our New Interactive Leaderboard!


Following up on our initial introduction to AutoBench, we're thrilled to announce the completion of our second major benchmark run and, more excitingly, the launch of the AutoBench Interactive Leaderboard! This new tool, hosted on Hugging Face Spaces, provides an accessible and dynamic way to explore the rich results from this latest evaluation.

Dive straight into the results: AutoBench Leaderboard: the top 25 LLMs


This second run, completed on April 28, 2025, evaluated 25 cutting-edge Large Language Models (LLMs), including newcomers like o4 Mini, GPT-4.1 Mini, Gemini 2.5 Pro, Claude 3.7 Sonnet:thinking, and DeepSeek V3-0324. We didn't just rank them on conversational quality using our unique "Collective-LLM-as-a-judge" method; we also incorporated crucial cost and latency metrics, offering a more holistic view of model performance.

AutoBench Run 2: Methodology & Scale

AutoBench utilizes a distinct evaluation process; for the details, consult the AutoBench Hugging Face page. The key features of the method are:

1.  LLM-Generated Questions: High-quality, diverse questions are generated by capable LLMs across numerous domains (logic, coding, history, science, etc.) and ranked to ensure relevance.

2.  LLM-as-a-Judge: The core of AutoBench involves using multiple LLMs to collectively rank the quality of responses generated by the models under test.

The new version of AutoBench, which will soon be released as open source just like version 1.0, provides a more efficient ranking process and was designed to also handle responses from "thinking" models. This enabled us to use several powerful thinking models for both answer generation and ranking, increasing the overall quality of the benchmark.
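To make the collective judging step more concrete, here is a minimal sketch of how grades from several judge LLMs might be aggregated into a single score per answer. The data layout, the 1-5 grading scale, and the plain averaging are illustrative assumptions, not the exact AutoBench algorithm, which collects pairwise ranks and uses a weighting scheme as described on the AutoBench page.

```python
# Minimal sketch of a "Collective-LLM-as-a-judge" aggregation step.
# The data layout and plain averaging are illustrative assumptions,
# not the exact AutoBench algorithm.
from collections import defaultdict
from statistics import mean

# grades[judge][model] -> grade that judge assigned to the model's answer
# for one question, e.g. on a 1-5 scale (assumed here).
grades = {
    "judge_a": {"model_x": 4.5, "model_y": 3.8},
    "judge_b": {"model_x": 4.2, "model_y": 4.0},
    "judge_c": {"model_x": 4.7, "model_y": 3.6},
}

def collective_score(grades):
    """Average each model's grades across all judging LLMs."""
    per_model = defaultdict(list)
    for judge_grades in grades.values():
        for model, grade in judge_grades.items():
            per_model[model].append(grade)
    return {model: mean(g) for model, g in per_model.items()}

print(collective_score(grades))  # -> model_x ≈ 4.47, model_y ≈ 3.80
```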

Run 2 Details:   

  • Date Completed: April 28, 2025
  • Models Tested: 25 contemporary LLMs (22 of which also acted as rankers)
  • Iterations: ~310 unique ranked questions
  • Answers Generated: 7,700+
  • Pairwise Ranks Collected: 180,000+
  • Average Answer Length: 10k+ tokens
  • New Metrics:
    • Average Cost: Average cost per response (in USD cents).
    • Average Latency: Average response duration (in seconds).
    • P99 Latency: 99th percentile response duration (in seconds), highlighting consistency (see the sketch below).
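As a rough illustration of how these efficiency metrics can be derived from per-response measurements, here is a short sketch; the sample records and the percentile convention (NumPy's default interpolation) are assumptions, not AutoBench's internal code.

```python
# Illustrative computation of the per-model efficiency metrics listed above.
# The per-response values are invented for the example.
import numpy as np

costs_usd_cents = np.array([0.8, 1.1, 0.9, 1.3, 0.7])   # cost of each response
latencies_s = np.array([12.0, 15.5, 11.2, 48.9, 13.4])  # duration of each response

avg_cost = costs_usd_cents.mean()             # "Average Cost" (USD cents)
avg_latency = latencies_s.mean()              # "Average Latency" (seconds)
p99_latency = np.percentile(latencies_s, 99)  # "P99 Latency" (seconds)

print(f"avg cost: {avg_cost:.2f} cents, avg latency: {avg_latency:.1f} s, "
      f"P99 latency: {p99_latency:.1f} s")
```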

Please note that AutoBench is designed to generate highly challenging questions for LLMs on a wide range of domains (coding, creative writing, current news, general culture, grammar, history, logic, math, science, and technology). Answer length ranges from 2k tokens for fast models all the way to 20k+ tokens for "heavy thinkers" such as DeepSeek R1.

Validation: How AutoBench Compares to Other Benchmarks

A crucial question for any new benchmark, especially an automated one like AutoBench, is how well it aligns with existing, trusted evaluation methods, particularly those involving human preference. To validate our "LLM-as-a-judge" approach, we compared the rankings from AutoBench Run 2 against two prominent external benchmarks:   

  • Chatbot Arena (CBA): A widely respected benchmark based on crowdsourced human votes comparing LLM outputs side-by-side.
  • Artificial Analysis Intelligence Index (AAII): A composite index assessing LLMs across reasoning, knowledge, math, and coding tasks.

The results show a compelling alignment:   

  • AutoBench vs. Chatbot Arena: Strong correlation of 82.51%.
  • AutoBench vs. AAII: Good correlation of 83.74%.

This strong correlation, especially with the human-preference-driven Chatbot Arena, lends significant credibility to AutoBench's automated methodology. It suggests that our LLM-as-a-judge system effectively captures nuances in model quality and capability that resonate with human evaluation, providing a reliable and scalable alternative for assessing LLM performance.   
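For readers who want to reproduce this kind of check, comparing two benchmarks reduces to a score or rank correlation. The sketch below uses invented values; whether the figures above were computed with Pearson on scores or Spearman on ranks is not specified here, so both are shown.

```python
# Sketch of measuring agreement between two benchmark rankings.
# The values below are invented, not the published AutoBench/Arena numbers.
from scipy.stats import pearsonr, spearmanr

autobench_scores = [4.57, 4.46, 4.39, 4.34, 4.34]   # example AutoBench scores
other_benchmark = [1370, 1380, 1355, 1330, 1325]    # hypothetical external ratings

r_pearson, _ = pearsonr(autobench_scores, other_benchmark)
r_spearman, _ = spearmanr(autobench_scores, other_benchmark)
print(f"Pearson: {r_pearson:.4f}, Spearman: {r_spearman:.4f}")
```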

Key Findings: Overall AutoBench Rankings

Based purely on the AutoBench score derived from the LLM judges, the top-performing models in this run were:

1.  o4-mini-2025-04-16: 4.57
2.  gemini-2.5-pro-preview-03-25: 4.46
3.  claude-3.7-sonnet:thinking: 4.39
4.  gpt-4.1-mini: 4.34
5.  grok-3-beta: 4.34

To our surprise, and contrary to most other benchmarks, o4-mini proves to be the top model in almost all domains. More generally, OpenAI models take the top spots in domains such as "Math" and "Logic" that require strong "reasoning" skills.

The full, sortable rankings are available on the interactive leaderboard.

The Performance vs. Cost vs. Latency Trade-off

While the AutoBench score reflects judged quality, real-world deployment requires considering efficiency. Our analysis revealed significant trade-offs:   

  • Top Performers: As expected, models achieving the highest AutoBench scores, such as claude-3.7-sonnet:thinking, grok-3-beta, and gemini-2.5-pro-preview-03-25, incur API costs one or even two orders of magnitude higher than smaller and faster models.

  • Value Leaders: Models like gemini-2.0-flash-001, gemma-3-27b-it, gpt-4o-mini, and several Llama variants offer compelling value propositions, delivering respectable performance at a lower cost and often with faster response times.

    Chart: the trade-off between the performance rank of various LLMs, as determined by AutoBench, and their corresponding average cost per response in USD. Note that, as the log scale shows, pricing spans two orders of magnitude.

  • Latency Insights: The P99 latency metric proved insightful. Models like gemini-2.0-flash-001 and nova-pro-v1 demonstrated consistent speed (low P99), whereas others like deepseek-r1 and deepseek-v3-0324 were prone to occasional, significant delays (high P99), which could impact user experience. These results are in line with the measured average answer duration for each model.

    Chart: the relationship between the AutoBench performance rank and the 99th percentile (P99) of response duration in seconds for the evaluated LLMs. It highlights how consistently fast (or slow) models are, showing the potential impact on user experience, particularly for outlier, slower responses.

These multi-dimensional results underscore the importance of choosing models based on specific application needs, balancing quality, budget, and responsiveness. The interactive leaderboard is designed specifically to help navigate these trade-offs.   
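If you want to reproduce the score-versus-cost view above for your own shortlist of models, a scatter plot with a logarithmic cost axis makes the two-orders-of-magnitude price spread visible at a glance. The model names and values below are placeholders, not the published figures.

```python
# Sketch of a score-vs-cost scatter on a log-scale cost axis.
# Names and values are placeholders, not the published AutoBench data.
import matplotlib.pyplot as plt

models = ["model_a", "model_b", "model_c", "model_d"]
scores = [4.55, 4.40, 4.10, 3.90]   # AutoBench-style scores (illustrative)
costs = [12.0, 3.5, 0.4, 0.08]      # average cost per response, USD cents

fig, ax = plt.subplots()
ax.scatter(costs, scores)
for name, x, y in zip(models, costs, scores):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(5, 3))
ax.set_xscale("log")  # costs span roughly two orders of magnitude
ax.set_xlabel("Average cost per response (USD cents, log scale)")
ax.set_ylabel("AutoBench score")
plt.show()
```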

Domain-Specific Strengths and Weaknesses

AutoBench evaluates performance across various domains, revealing specific model strengths:   

  • o4-mini-2025-04-16: Showcased broad excellence, performing exceptionally well in challenging domains like Math and Science.
  • gemini-2.5-pro-preview-03-25: Displayed particular strength in Technology, General Culture, and History.
  • Math Domain: Continued to be a difficult area for numerous models, highlighting its value as a differentiator in LLM capabilities.

You can filter by domain on the leaderboard to explore these granular insights further.   

Explore the Results: The AutoBench Interactive Leaderboard

Built with Gradio and hosted right here on Hugging Face Spaces, the leaderboard makes exploring our comprehensive benchmark data intuitive and insightful.

Access the Leaderboard Here: https://huggingface.co/spaces/AutoBench/AutoBench-Leaderboard

Key features include:   

  • Multi-Metric Sorting: Rank models by AutoBench score, cost, average latency, or P99 latency (for an offline equivalent, see the sketch after this list).
  • Interactive Plots: Visualize the complex trade-offs between performance, cost, and speed.
  • Domain Filtering: Analyze model performance within specific areas like Coding, Logic, or Creative Writing.
  • Up-to-Date Comparisons: Easily compare the latest LLMs evaluated in our April 2025 run.
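If you prefer working with the results offline, the same sorting and filtering can be reproduced with pandas. The CSV file name and column names below are assumptions for illustration only; adapt them to whatever export you build from the data released on the AutoBench page.

```python
# Offline equivalent of the leaderboard's sorting and filtering.
# "autobench_run2_results.csv" and its column names are hypothetical.
import pandas as pd

df = pd.read_csv("autobench_run2_results.csv")  # hypothetical export

# Rank models by overall AutoBench score.
overall = (df[df["domain"] == "overall"]
           .sort_values("autobench_score", ascending=False))
print(overall[["model", "autobench_score"]].head(10))

# Filter to one domain and inspect the quality/cost/latency trade-off.
coding = df[df["domain"] == "coding"]
print(coding[["model", "autobench_score", "avg_cost_usd_cents",
              "avg_latency_s", "p99_latency_s"]]
      .sort_values("avg_cost_usd_cents"))
```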

  

Data Release and Future Plans

In the spirit of transparency and community collaboration, we are releasing:   

  • Data Samples: Representative question/answer/rank samples from the run.
  • Detailed Iteration Data: Granular, iteration-level results for in-depth analysis.

Find all data, code, and related information on the main AutoBench Hugging Face page: https://huggingface.co/AutoBench

Additionally, this run was powered by a significantly improved version of the AutoBench engine, enhancing efficiency and speed. We are preparing to release this as AutoBench 1.1 (Open Source) in the near future – stay tuned!   

Support & Acknowledgements

We extend our sincere gratitude to Translated (https://translated.com/) for their generous support of the AutoBench project through the provision of valuable LLM compute credits. This support was instrumental in enabling the extensive evaluations conducted in this run.

We also want to express our deep appreciation to the following individuals for their extremely valuable support and insightful feedback throughout the development and execution of AutoBench:

Their expertise and guidance have been invaluable to the AutoBench project.

Get involved

AutoBench is a step towards more robust, scalable, and future-proof LLM evaluation. We invite you to explore the code, run the benchmark, contribute to its development, and join the discussion on the future of LLM evaluation!

  • Explore the code and data: Hugging Face AutoBench Repository
  • Try our Demo on Spaces: AutoBench 1.0 Demo
  • Contribute: Help us by suggesting new topics, refining prompts, or enhancing the weighting algorithm—submit pull requests or issues via the Hugging Face Repo.

We strongly encourage the AI community to engage with the interactive leaderboard, explore the released data, and share feedback. AutoBench aims to be a dynamic, evolving resource, and we look forward to future runs and the open-source release of AutoBench 1.1.
