A curated list of awesome leaderboard-oriented resources for the AI domain, maintained by SAILResearch.
```bash
# Add to your Claude Code skills
git clone https://github.com/SAILResearch/awesome-foundation-model-leaderboards
```

Awesome AI Leaderboard is a curated list of awesome AI leaderboards, along with various development tools and evaluation organizations, according to our recent survey:
If you find this repository useful, please consider giving us a star :star: and a citation:
```bibtex
@article{zhao2025workflows,
  title={On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards},
  author={Zhao, Zhimin and Bangash, Abdul Ali and C{\^o}go, Filipe Roseiro and Adams, Bram and Hassan, Ahmed E},
  journal={IEEE Transactions on Software Engineering},
  year={2025},
  publisher={IEEE}
}
```
Additionally, we provide a search toolkit that helps you quickly navigate through the leaderboards.
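The toolkit itself ships with the repository; as a rough stand-in, the sketch below fetches the raw README and filters its table rows by keyword. The raw URL assumes the default branch is `main`, which is an assumption rather than something stated here, and the actual toolkit's interface may differ.

```python
# Minimal stand-in for the search toolkit: fetch the list's README and
# return the table rows that mention a keyword.
import urllib.request

# Assumes the default branch is "main" (an assumption, not confirmed here).
RAW_README = (
    "https://raw.githubusercontent.com/SAILResearch/"
    "awesome-foundation-model-leaderboards/main/README.md"
)

def search_leaderboards(keyword: str) -> list[str]:
    """Return markdown table rows whose text contains the keyword."""
    with urllib.request.urlopen(RAW_README) as resp:
        text = resp.read().decode("utf-8")
    return [
        line for line in text.splitlines()
        if line.startswith("|") and keyword.lower() in line.lower()
    ]

if __name__ == "__main__":
    for row in search_leaderboards("medical"):
        print(row)
```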
If you want to contribute to this list (please do), feel free to open a pull request.
If you have any suggestions, critiques, or questions regarding this list, feel free to raise an issue.
Also, a leaderboard should be included only if:
## Tools

| Name | Description |
| ---- | ----------- |
| Demo Leaderboard Backend | Demo Leaderboard Backend helps users manage the leaderboard and handle submission requests; check this for details. |
| Evaluation Results on the Hub | Evaluation Results on the Hub enables model authors to store and display evaluation results in model cards by embedding structured metadata, making benchmark scores publicly accessible and comparable across models (see the sketch after this table). |
| Kaggle Competition Creation | Kaggle Competition Creation enables you to design and launch custom competitions, leveraging your datasets to engage the data science community. |
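As a rough illustration of the metadata that Evaluation Results on the Hub reads, the sketch below pushes a `model-index` block into a model card with `huggingface_hub.metadata_update`. The repository name, dataset, and score are placeholders, and this is a minimal sketch of the schema rather than its full specification.

```python
# Hedged sketch: embed evaluation results in a Hugging Face model card.
# Requires `pip install huggingface_hub` and a valid write token.
from huggingface_hub import metadata_update

# All identifiers and numbers below are placeholders for illustration.
metadata = {
    "model-index": [{
        "name": "my-org/my-model",
        "results": [{
            "task": {"type": "text-classification"},
            "dataset": {"type": "glue", "name": "GLUE (SST-2)",
                        "config": "sst2", "split": "validation"},
            "metrics": [{"type": "accuracy", "value": 0.91}],
        }],
    }]
}

metadata_update("my-org/my-model", metadata, overwrite=True)
```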
## Evaluation Organizations

| Name | Description |
| ---- | ----------- |
| AIcrowd | AIcrowd hosts machine learning challenges and competitions across domains such as computer vision, NLP, and reinforcement learning, aimed at both researchers and practitioners. |
| AI Hub | AI Hub offers a variety of competitions to encourage AI solutions to real-world problems, with a focus on innovation and collaboration. |
| AI Studio | AI Studio offers AI competitions mainly for computer vision, NLP, and other data-driven tasks, allowing users to develop and showcase their AI skills. |
| Allen Institute for AI | The Allen Institute for AI provides leaderboards and benchmarks on tasks in natural language understanding, commonsense reasoning, and other areas of AI research. |
| Codabench | Codabench is an open-source platform for benchmarking AI models, enabling customizable, user-driven challenges across various AI domains. |
| DataFountain | DataFountain is a Chinese AI competition platform featuring challenges in finance, healthcare, and smart cities, encouraging solutions for industry-related problems. |
| DrivenData | DrivenData hosts machine learning challenges with a social impact, aiming to solve issues in areas such as public health, disaster relief, and sustainable development. |
| Dynabench | Dynabench offers dynamic benchmarks where models are evaluated continuously, often involving human interaction, to ensure robustness in evolving AI tasks. |
| EvalAI | EvalAI is a platform for hosting and participating in AI challenges, widely used by researchers for benchmarking models in tasks such as image classification, NLP, and reinforcement learning. |
| Grand Challenge | Grand Challenge provides a platform for medical imaging challenges, supporting advancements in medical AI, particularly in areas such as radiology and pathology. |
| Hilti | Hilti hosts challenges aimed at advancing AI and machine learning in the construction industry, with a focus on practical, industry-relevant applications. |
| InsightFace | InsightFace focuses on AI challenges related to face recognition, verification, and analysis, supporting advancements in identity verification and security. |
| Kaggle | Kaggle is one of the largest platforms for data science and machine learning competitions, covering a broad range of topics from image classification to NLP and predictive modeling (a minimal submission sketch follows this table). |
| nuScenes | nuScenes enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car, facilitating research in autonomous driving. |
| Robust Reading Competition | Robust Reading refers to the research area on interpreting written communication in unconstrained settings, with competitions focused on text recognition in real-world environments. |
| Tianchi | Tianchi, hosted by Alibaba, offers a range of AI competitions, particularly popular in Asia, with a focus on commerce, healthcare, and logistics. |
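For a concrete example of the Kaggle entry above, the sketch below submits a file through the official `kaggle` Python package. The competition slug and file name are placeholders; credentials are expected in `~/.kaggle/kaggle.json`.

```python
# Hedged sketch: submit to a Kaggle competition via the official API
# client (`pip install kaggle`). Slug and file name are placeholders.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
api.competition_submit(
    file_name="submission.csv",   # placeholder submission file
    message="baseline model",
    competition="titanic",        # placeholder competition slug
)
```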
## Leaderboards

| Name | Description |
| ---- | ----------- |
| AI Benchmarking Hub | AI Benchmarking Hub tracks and compares AI model performance in reasoning, coding, and knowledge tasks. |
| Arena | Arena operates a chatbot arena where various foundation models compete based on user preferences across multiple categories: text generation, web development, computer vision, text-to-image synthesis, search capabilities, and coding assistance. |
| Artificial Analysis | Artificial Analysis is a platform to help users make informed decisions on AI model selection and hosting providers. |
| CompassRank | CompassRank is a platform to offer a comprehensive, objective, and neutral evaluation reference of foundation models for the industry and research. |
| FlagEval | FlagEval is a comprehensive platform for evaluating foundation models. |
| Generative AI Leaderboards | Generative AI Leaderboards ranks the top-performing generative AI models based on various metrics. |
| Holistic Agent Leaderboard | HAL is a standardized, cost-aware, and third-party leaderboard for evaluating agents. |
| Holistic Evaluation of Language Models | Holistic Evaluation of Language Models (HELM) is a reproducible and transparent framework for evaluating foundation models. |
| LLM Stats | LLM Stats benchmarks and compares API models using daily-updated, open-source community data on capability, price, speed, and context length. |
| Openrouter Leaderboard | Openrouter Leaderboard offers a real-time comparison of language models based on normalized token usage for prompts and completions, updated frequently. |
| PinchBench | PinchBench is a benchmark for evaluating and comparing AI model performance across diverse tasks. |
| [Scale