About AgentLeaderboards

Real AI model benchmarks from authoritative sources

Our Mission

With the explosion of AI models released every week—often with exaggerated claims—it's hard to find objective, trustworthy performance data. AgentLeaderboards aggregates real benchmark scores from authoritative sources to help you make informed decisions about which AI model to use.

Data Sources

Hugging Face Open LLM Leaderboard

The gold standard for open-source LLM evaluation. We reference their benchmarks, including:

  • IFEval - Instruction following accuracy
  • BBH (BIG-Bench Hard) - 23 challenging reasoning tasks
  • MATH Lvl 5 - Competition-level mathematics
  • GPQA - PhD-level science questions
  • MuSR - Multistep soft reasoning over long (1,000+ word) problems
  • MMLU-PRO - Professional-level general knowledge

Visit Hugging Face Open LLM Leaderboard

LM Council Benchmarks

Curated by AI Explained, featuring benchmarks from Epoch AI and Scale AI:

  • Humanity's Last Exam - 2,500 of the hardest multi-modal questions
  • SimpleBench - Common-sense reasoning without memorization
  • SWE-bench Verified - Real GitHub bug fixes
  • Terminal-Bench 2.0 - Terminal-based coding tasks
  • FrontierMath - Research-level mathematics
  • METR Time Horizons - How long a task (measured in human time) an agent can reliably complete

Visit LM Council Benchmarks

Community Sources

We also track community benchmarks and user ratings from:

  • LM Arena (human preference testing)
  • WebDev Arena (website quality ratings)
  • Chatbot Arena (chat performance)
  • Independent research papers and preprints

Methodology

How We Calculate Overall Scores

Our "Overall Score" is a weighted average of performance across multiple benchmarks, normalized to a 0-100 scale. We prioritize:

  • Benchmarks with high correlation to real-world performance
  • Tasks that are difficult to game or memorize
  • Multiple evaluation types (reasoning, knowledge, coding, instruction-following)
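For illustration, the sketch below shows one way such a weighted, normalized average could be computed. The benchmark names, weights, and scores in it are hypothetical examples, not the actual weighting used by AgentLeaderboards.

  // Minimal sketch: a weighted average of benchmark scores that are already
  // normalized to a 0-100 scale. All names, weights, and scores are hypothetical.
  interface BenchmarkResult {
    name: string;
    score: number;   // normalized score, 0-100
    weight: number;  // relative importance of this benchmark
  }

  function overallScore(results: BenchmarkResult[]): number {
    const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
    if (totalWeight === 0) return 0;
    const weightedSum = results.reduce((sum, r) => sum + r.score * r.weight, 0);
    return Math.round((weightedSum / totalWeight) * 10) / 10; // one decimal place
  }

  // Hypothetical example: three benchmark results for one model.
  const score = overallScore([
    { name: "IFEval", score: 82.4, weight: 1.0 },
    { name: "BBH", score: 67.1, weight: 1.5 },
    { name: "MATH Lvl 5", score: 41.9, weight: 1.5 },
  ]);
  // score ≈ 61.5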

Data Freshness

We update benchmarks daily from public leaderboards. When a new model is released, it typically appears on our leaderboards within 24-48 hours of being evaluated by an authoritative source.

Pricing Data

All pricing information is sourced from official provider documentation and updated regularly. Prices are shown as input/output cost per million tokens (USD).
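To show how per-million-token pricing translates into the cost of a single request, here is a minimal sketch. The prices and token counts in it are hypothetical and not taken from any provider's actual rate card.

  // Minimal sketch: cost of one request given input/output prices quoted in
  // USD per million tokens. All numbers below are hypothetical.
  function requestCostUSD(
    inputTokens: number,
    outputTokens: number,
    inputPricePerMillion: number,
    outputPricePerMillion: number,
  ): number {
    return (inputTokens / 1_000_000) * inputPricePerMillion
         + (outputTokens / 1_000_000) * outputPricePerMillion;
  }

  // Example: 12,000 input tokens and 1,500 output tokens at $3 / $15 per million.
  const cost = requestCostUSD(12_000, 1_500, 3, 15);
  // cost = 0.0585 USD, i.e. about 5.9 cents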

Transparency & Limitations

What we do: Aggregate and present real benchmark data from trusted sources.

What we don't do: Run our own benchmarks or modify scores.

Limitations: Benchmarks are proxies for real-world performance and may not reflect your specific use case. Always test models on your own data before making critical decisions.

Bias acknowledgment: Benchmark datasets may contain biases. We display multiple benchmarks to provide a more complete picture.

Questions or Feedback?

AgentLeaderboards is an EngineeredEverything project.

Found an error? Have a suggestion? Want to submit a benchmark?
Contact us at: [email protected]