by SAILResearch
A curated list of awesome leaderboard-oriented resources for the AI domain
# Add to your Claude Code skills
git clone https://github.com/SAILResearch/awesome-ai-leaderboard
Awesome AI Leaderboard is a curated list of awesome AI leaderboards, along with the development tools and evaluation organizations identified in our recent survey (cited below).
If you find this repository useful, please consider giving us a star :star: and citing our work:
@article{zhao2025workflows,
title={On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards},
author={Zhao, Zhimin and Bangash, Abdul Ali and C{\^o}go, Filipe Roseiro and Adams, Bram and Hassan, Ahmed E},
journal={IEEE Transactions on Software Engineering},
year={2025},
publisher={IEEE}
}
Additionally, we provide a search toolkit that helps you quickly navigate through the leaderboards.
If you want to contribute to this list (please do), you are welcome to open a pull request.
If you have any suggestions, critiques, or questions regarding this list, you are welcome to raise an issue.
Also, a leaderboard should be included only if:
| Name | Description |
|---|---|
| Demo Leaderboard Backend | Demo Leaderboard Backend helps users manage a leaderboard and handle submission requests. |
| Evaluation Results on the Hub | Evaluation Results on the Hub enables model authors to store and display evaluation results in model cards by embedding structured metadata, making benchmark scores publicly accessible and comparable across models. |
| Kaggle Competition Creation | Kaggle Competition Creation enables you to design and launch custom competitions, leveraging your datasets to engage the data science community. |
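
To make the "structured metadata" idea in the Evaluation Results on the Hub entry concrete, here is a minimal sketch. It assumes "the Hub" refers to the Hugging Face Hub and that results follow the `model-index` layout used in model card metadata; the feature's current format may differ. The helper `build_model_index`, the model name, the dataset, and the score are all hypothetical. Model card front matter is written in YAML; JSON is printed here only to keep the example dependency-free.

```python
import json


def build_model_index(model_name, task_type, dataset_name, dataset_type,
                      metric_type, metric_value):
    """Return one evaluation result in the (assumed) model-index layout."""
    return {
        "model-index": [
            {
                "name": model_name,
                "results": [
                    {
                        "task": {"type": task_type},
                        "dataset": {"name": dataset_name, "type": dataset_type},
                        "metrics": [
                            {"type": metric_type, "value": metric_value}
                        ],
                    }
                ],
            }
        ]
    }


# All values below are made up for illustration.
metadata = build_model_index(
    model_name="my-demo-model",
    task_type="text-generation",
    dataset_name="Demo Benchmark",
    dataset_type="demo-benchmark",
    metric_type="accuracy",
    metric_value=0.5,
)

print(json.dumps(metadata, indent=2))
```

Because each result names its task, dataset, and metric explicitly, scores stored this way can be compared across models without parsing free-form README text.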

| Name | Description |
|---|---|
| AIcrowd | AIcrowd hosts machine learning challenges and competitions across domains such as computer vision, NLP, and reinforcement learning, aimed at both researchers and practitioners. |
| AI Hub | AI Hub offers a variety of competitions to encourage AI solutions to real-world problems, with a focus on innovation and collaboration. |
| AI Studio | AI Studio offers AI competitions mainly for computer vision, NLP, and other data-driven tasks, allowing users to develop and showcase their AI skills. |
| Allen Institute for AI | The Allen Institute for AI provides leaderboards and benchmarks on tasks in natural language understanding, commonsense reasoning, and other areas in AI research. |
| Codabench | Codabench is an open-source platform for benchmarking AI models, enabling customizable, user-driven challenges across various AI domains. |
| DataFountain | DataFountain is a Chinese AI competition platform featuring challenges in finance, healthcare, and smart cities, encouraging solutions for industry-related problems. |
| DrivenData | DrivenData hosts machine learning challenges with a social impact, aiming to solve issues in areas such as public health, disaster relief, and sustainable development. |
| Dynabench | Dynabench offers dynamic benchmarks where models are evaluated continuously, often involving human interaction, to ensure robustness in evolving AI tasks. |
| EvalAI | EvalAI is a platform for hosting and participating in AI challenges, widely used by researchers for benchmarking models in tasks such as image classification, NLP, and reinforcement learning. |
| Grand Challenge | Grand Challenge provides a platform for medical imaging challenges, supporting advancements in medical AI, particularly in areas such as radiology and pathology. |
| Hilti | Hilti hosts challenges aimed at advancing AI and machine learning in the construction industry, with a focus on practical, industry-relevant applications. |
| InsightFace | InsightFace focuses on AI challenges related to face recognition, verification, and analysis, supporting advancements in identity verification and security. |
| Kaggle | Kaggle is one of the largest platforms for data science and machine learning competitions, covering a broad range of topics from image classification to NLP and predictive modeling. |
| nuScenes | nuScenes enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car, facilitating research in autonomous driving. |
| Robust Reading Competition | Robust Reading refers to the research area on interpreting written communication in unconstrained settings, with competitions focused on text recognition in real-world environments. |
| Tianchi | Tianchi, hosted by Alibaba, offers a range of AI competitions, particularly popular in Asia, with a focus on commerce, healthcare, and logistics. |

| Name | Description |
|---|---|
| AI Benchmarking Hub | AI Benchmarking Hub tracks and compares AI model performance in reasoning, coding, and knowledge tasks. |
| Arena | Arena operates a chatbot arena where various foundation models compete based on user preferences across multiple categories: text generation, web development, computer vision, text-to-image synthesis, search capabilities, and coding assistance. |
| Artificial Analysis | Artificial Analysis is a platform that helps users make informed decisions about AI model selection and hosting providers. |
| BenchGecko | BenchGecko is a comprehensive leaderboard that tracks thousands of models across 128 benchmarks, featuring cross-provider pricing comparisons, AI economy insights, an agent leaderboard, and an MCP server directory. |
| CompassRank | CompassRank is a platform that offers a comprehensive, objective, and neutral evaluation reference for foundation models, serving both industry and research. |
| EvoClaw | EvoClaw is a leaderboard for evaluating and ranking AI agents across benchmark tasks. |
| FlagEval | FlagEval is a comprehensive platform for evaluating foundation models. |
| Generative AI Leaderboards | Generative AI Leaderboard ranks the top-performing generative AI models based on various metrics. |
| Holistic Agent Leaderboard | Holistic Agent Leaderboard (HAL) is a standardized, cost-aware, third-party leaderboard for evaluating agents. |
| Holistic Evaluation of Language Models | Holistic Evaluation of Language Models (HELM) is a reproducible and transparent framework for evaluating foundation models. |
| LLM Stats | LLM Stats, the most comprehensive LLM leaderboard, benchmarks and compares API mode |