A curated list of awesome leaderboard-oriented resources for the AI domain, maintained by SAILResearch.
```bash
# Add to your Claude Code skills
git clone https://github.com/SAILResearch/awesome-foundation-model-leaderboards
```

Awesome AI Leaderboard is a curated list of awesome AI leaderboards, along with various development tools and evaluation organizations, according to our recent survey:
If you find this repository useful, please consider giving us a star :star: and a citation:
```bibtex
@article{zhao2025workflows,
  title={On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards},
  author={Zhao, Zhimin and Bangash, Abdul Ali and C{\^o}go, Filipe Roseiro and Adams, Bram and Hassan, Ahmed E},
  journal={IEEE Transactions on Software Engineering},
  year={2025},
  publisher={IEEE}
}
```
Additionally, we provide a search toolkit that helps you quickly navigate through the leaderboards.
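The toolkit itself ships with the repository; as a rough stand-in, the sketch below fetches the raw README and filters its table rows by keyword. The raw URL assumes the default branch is `main`, which is an assumption rather than something stated here, and the actual toolkit's interface may differ.

```python
# Minimal stand-in for the search toolkit: fetch the list's README and
# return the table rows that mention a keyword.
import urllib.request

# Assumes the default branch is "main" (an assumption, not confirmed here).
RAW_README = (
    "https://raw.githubusercontent.com/SAILResearch/"
    "awesome-foundation-model-leaderboards/main/README.md"
)

def search_leaderboards(keyword: str) -> list[str]:
    """Return markdown table rows whose text contains the keyword."""
    with urllib.request.urlopen(RAW_README) as resp:
        text = resp.read().decode("utf-8")
    return [
        line for line in text.splitlines()
        if line.startswith("|") and keyword.lower() in line.lower()
    ]

if __name__ == "__main__":
    for row in search_leaderboards("medical"):
        print(row)
```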
If you want to contribute to this list (please do), feel free to open a pull request.
If you have any suggestions, critiques, or questions regarding this list, feel free to raise an issue.
Also, a leaderboard should be included only if:
## Tools

| Name | Description |
| ---- | ----------- |
| Demo Leaderboard Backend | Demo Leaderboard Backend helps users manage the leaderboard and handle submission requests; check this for details. |
| Evaluation Results on the Hub | Evaluation Results on the Hub enables model authors to store and display evaluation results in model cards by embedding structured metadata, making benchmark scores publicly accessible and comparable across models (see the sketch after this table). |
| Kaggle Competition Creation | Kaggle Competition Creation enables you to design and launch custom competitions, leveraging your datasets to engage the data science community. |
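As a rough illustration of the metadata that Evaluation Results on the Hub reads, the sketch below pushes a `model-index` block into a model card with `huggingface_hub.metadata_update`. The repository name, dataset, and score are placeholders, and this is a minimal sketch of the schema rather than its full specification.

```python
# Hedged sketch: embed evaluation results in a Hugging Face model card.
# Requires `pip install huggingface_hub` and a valid write token.
from huggingface_hub import metadata_update

# All identifiers and numbers below are placeholders for illustration.
metadata = {
    "model-index": [{
        "name": "my-org/my-model",
        "results": [{
            "task": {"type": "text-classification"},
            "dataset": {"type": "glue", "name": "GLUE (SST-2)",
                        "config": "sst2", "split": "validation"},
            "metrics": [{"type": "accuracy", "value": 0.91}],
        }],
    }]
}

metadata_update("my-org/my-model", metadata, overwrite=True)
```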
## Evaluation Organizations

| Name | Description |
| ---- | ----------- |
| AIcrowd | AIcrowd hosts machine learning challenges and competitions across domains such as computer vision, NLP, and reinforcement learning, aimed at both researchers and practitioners. |
| AI Hub | AI Hub offers a variety of competitions to encourage AI solutions to real-world problems, with a focus on innovation and collaboration. |
| AI Studio | AI Studio offers AI competitions mainly for computer vision, NLP, and other data-driven tasks, allowing users to develop and showcase their AI skills. |
| Allen Institute for AI | The Allen Institute for AI provides leaderboards and benchmarks on tasks in natural language understanding, commonsense reasoning, and other areas of AI research. |
| Codabench | Codabench is an open-source platform for benchmarking AI models, enabling customizable, user-driven challenges across various AI domains. |
| DataFountain | DataFountain is a Chinese AI competition platform featuring challenges in finance, healthcare, and smart cities, encouraging solutions for industry-related problems. |
| DrivenData | DrivenData hosts machine learning challenges with a social impact, aiming to solve issues in areas such as public health, disaster relief, and sustainable development. |
| Dynabench | Dynabench offers dynamic benchmarks where models are evaluated continuously, often involving human interaction, to ensure robustness in evolving AI tasks. |
| EvalAI | EvalAI is a platform for hosting and participating in AI challenges, widely used by researchers for benchmarking models in tasks such as image classification, NLP, and reinforcement learning. |
| Grand Challenge | Grand Challenge provides a platform for medical imaging challenges, supporting advancements in medical AI, particularly in areas such as radiology and pathology. |
| Hilti | Hilti hosts challenges aimed at advancing AI and machine learning in the construction industry, with a focus on practical, industry-relevant applications. |
| InsightFace | InsightFace focuses on AI challenges related to face recognition, verification, and analysis, supporting advancements in identity verification and security. |
| Kaggle | Kaggle is one of the largest platforms for data science and machine learning competitions, covering a broad range of topics from image classification to NLP and predictive modeling (a minimal submission sketch follows this table). |
| nuScenes | nuScenes enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car, facilitating research in autonomous driving. |
| Robust Reading Competition | Robust Reading refers to the research area on interpreting written communication in unconstrained settings, with competitions focused on text recognition in real-world environments. |
| Tianchi | Tianchi, hosted by Alibaba, offers a range of AI competitions, particularly popular in Asia, with a focus on commerce, healthcare, and logistics. |
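For a concrete example of the Kaggle entry above, the sketch below submits a file through the official `kaggle` Python package. The competition slug and file name are placeholders; credentials are expected in `~/.kaggle/kaggle.json`.

```python
# Hedged sketch: submit to a Kaggle competition via the official API
# client (`pip install kaggle`). Slug and file name are placeholders.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
api.competition_submit(
    file_name="submission.csv",   # placeholder submission file
    message="baseline model",
    competition="titanic",        # placeholder competition slug
)
```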
## Leaderboards

| Name | Description |
| ---- | ----------- |
| AI Benchmarking Hub | AI Benchmarking Hub tracks and compares AI model performance in reasoning, coding, and knowledge tasks. |
| Arena | Arena operates a chatbot arena where various foundation models compete based on user preferences across multiple categories: text generation, web development, computer vision, text-to-image synthesis, search capabilities, and coding assistance. |
| Artificial Analysis | Artificial Analysis is a platform to help users make informed decisions on AI model selection and hosting providers. |
| CompassRank | CompassRank is a platform to offer a comprehensive, objective, and neutral evaluation reference of foundation models for the industry and research. |
| FlagEval | FlagEval is a comprehensive platform for evaluating foundation models. |
| Generative AI Leaderboards | Generative AI Leaderboards ranks the top-performing generative AI models based on various metrics. |
| Holistic Agent Leaderboard | HAL is a standardized, cost-aware, and third-party leaderboard for evaluating agents. |
| Holistic Evaluation of Language Models | Holistic Evaluation of Language Models (HELM) is a reproducible and transparent framework for evaluating foundation models. |
| LLM Stats | LLM Stats benchmarks and compares API models using daily-updated, open-source community data on capability, price, speed, and context length. |
| Openrouter Leaderboard | Openrouter Leaderboard offers a real-time comparison of language models based on normalized token usage for prompts and completions, updated frequently. |
| PinchBench | PinchBench is a benchmark for evaluating and comparing AI model performance across diverse tasks. |
| [Scale