awesome-ai-leaderboard

@article{zhao2025workflows,
  title={On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards},
  author={Zhao, Zhimin and Bangash, Abdul Ali and C{\^o}go, Filipe Roseiro and Adams, Bram and Hassan, Ahmed E},
  journal={IEEE Transactions on Software Engineering},
  year={2025},
  publisher={IEEE}
}

Tools

| Name | Description |
| --- | --- |
| Demo Leaderboard Backend | Demo Leaderboard Backend helps users manage a leaderboard and handle submission requests. |
| Evaluation Results on the Hub | Evaluation Results on the Hub lets model authors store and display evaluation results in model cards by embedding structured metadata, making benchmark scores publicly accessible and comparable across models. |
| Kaggle Competition Creation | Kaggle Competition Creation lets you design and launch custom competitions that use your own datasets to engage the data science community. |
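
The "Evaluation Results on the Hub" entry above mentions embedding structured metadata in a model card. The following is a minimal sketch of what such metadata might look like when built programmatically. It assumes the `model-index` layout (a model name, then results grouped by task, dataset, and metrics); the helper `build_model_index`, the model name, the dataset identifiers, and the score are all hypothetical, and the exact schema should be checked against the Hub documentation. Model cards normally carry this block as YAML front matter; JSON is printed here only to keep the example within the standard library.

```python
import json


def build_model_index(model_name, task_type, dataset_name, dataset_type, metrics):
    """Return a dict shaped like a `model-index` metadata block (assumed layout)."""
    return {
        "model-index": [
            {
                "name": model_name,
                "results": [
                    {
                        "task": {"type": task_type},
                        "dataset": {"name": dataset_name, "type": dataset_type},
                        # One entry per reported metric.
                        "metrics": [
                            {"type": metric_type, "value": value}
                            for metric_type, value in metrics.items()
                        ],
                    }
                ],
            }
        ]
    }


# Hypothetical values, for illustration only.
metadata = build_model_index(
    model_name="my-demo-model",
    task_type="text-generation",
    dataset_name="Demo Benchmark",
    dataset_type="demo/benchmark",
    metrics={"accuracy": 0.5},
)
print(json.dumps(metadata, indent=2))
```

Keeping the scores in a structured block like this, rather than in free text, is what allows them to be compared across models.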

Challenges

| Name | Description |
| --- | --- |
| AIcrowd | AIcrowd hosts machine learning challenges and competitions across domains such as computer vision, NLP, and reinforcement learning, aimed at both researchers and practitioners. |
| AI Hub | AI Hub offers a variety of competitions to encourage AI solutions to real-world problems, with a focus on innovation and collaboration. |
| AI Studio | AI Studio offers AI competitions, mainly in computer vision, NLP, and other data-driven tasks, allowing users to develop and showcase their AI skills. |
| Allen Institute for AI | The Allen Institute for AI provides leaderboards and benchmarks on tasks in natural language understanding, commonsense reasoning, and other areas of AI research. |
| Codabench | Codabench is an open-source platform for benchmarking AI models, enabling customizable, user-driven challenges across various AI domains. |
| DataFountain | DataFountain is a Chinese AI competition platform featuring challenges in finance, healthcare, and smart cities, encouraging solutions to industry problems. |
| DrivenData | DrivenData hosts machine learning challenges with social impact, aiming to solve issues in areas such as public health, disaster relief, and sustainable development. |
| Dynabench | Dynabench offers dynamic benchmarks where models are evaluated continuously, often involving human interaction, to ensure robustness in evolving AI tasks. |
| EvalAI | EvalAI is a platform for hosting and participating in AI challenges, widely used by researchers for benchmarking models in tasks such as image classification, NLP, and reinforcement learning. |
| Grand Challenge | Grand Challenge provides a platform for medical imaging challenges, supporting advancements in medical AI, particularly in areas such as radiology and pathology. |
| Hilti | Hilti hosts challenges aimed at advancing AI and machine learning in the construction industry, with a focus on practical, industry-relevant applications. |
| InsightFace | InsightFace focuses on AI challenges related to face recognition, verification, and analysis, supporting advancements in identity verification and security. |
| Kaggle | Kaggle is one of the largest platforms for data science and machine learning competitions, covering a broad range of topics from image classification to NLP and predictive modeling. |
| nuScenes | nuScenes enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car, facilitating research in autonomous driving. |
| Robust Reading Competition | Robust Reading refers to the research area on interpreting written communication in unconstrained settings, with competitions focused on text recognition in real-world environments. |
| Tianchi | Tianchi, hosted by Alibaba, offers a range of AI competitions, particularly popular in Asia, with a focus on commerce, healthcare, and logistics. |

Model Ranking

Comprehensive

| Name | Description |
| --- | --- |
| AI Benchmarking Hub | AI Benchmarking Hub tracks and compares AI model performance in reasoning, coding, and knowledge tasks. |
| Arena | Arena operates a chatbot arena where foundation models compete based on user preferences across multiple categories: text generation, web development, computer vision, text-to-image synthesis, search capabilities, and coding assistance. |
| BenchGecko | BenchGecko is a comprehensive leaderboard that tracks thousands of models across 128 benchmarks, featuring cross-provider pricing comparisons, AI economy insights, an agent leaderboard, and an MCP server directory. |
| CompassRank | CompassRank is a platform that offers a comprehensive, objective, and neutral evaluation reference for foundation models, serving both industry and research. |
| EvoClaw | EvoClaw is a leaderboard for evaluating and ranking AI agents across benchmark tasks. |
| FlagEval | FlagEval is a comprehensive platform for evaluating foundation models. |
| Generative AI Leaderboards | Generative AI Leaderboards rank the top-performing generative AI models based on various metrics. |
| Holistic Agent Leaderboard | Holistic Agent Leaderboard (HAL) is a standardized, cost-aware, third-party leaderboard for evaluating agents. |
| Holistic Evaluation of Language Models | Holistic Evaluation of Language Models (HELM) is a reproducible and transparent framework for evaluating foundation models. |
| Humanlaya | Humanlaya is a comprehensive leaderboard for evaluating and comparing AI models across benchmarks. |
| InferenceBench.ai | InferenceBench.ai is a benchmark for evaluating autonomous AI agents to optimize LLM inference wor |
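
The Arena entry in the table above ranks models by user preference, that is, from pairwise votes between two models. The list does not say how those votes become a ranking, so the following is only a sketch of one common option, an Elo-style rating update. The starting rating of 1000, the `k` factor of 32, the model names, and the vote list are all assumptions made for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(ratings, winner, loser, k=32.0, start=1000.0):
    """Apply one preference vote in place: the winner gains what the loser drops."""
    rating_w = ratings.get(winner, start)
    rating_l = ratings.get(loser, start)
    # The gain shrinks when the winner was already the favourite.
    gain = k * (1.0 - expected_score(rating_w, rating_l))
    ratings[winner] = rating_w + gain
    ratings[loser] = rating_l - gain
    return ratings


# Hypothetical votes as (preferred model, other model) pairs.
votes = [
    ("model-a", "model-b"),
    ("model-a", "model-c"),
    ("model-b", "model-c"),
]

ratings = {}
for winner, loser in votes:
    update_ratings(ratings, winner, loser)

# Highest rating first.
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

An incremental update like this depends on the order in which votes arrive; a leaderboard that needs order-independent scores might instead fit a model to the full set of votes at once.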

Frequently Asked Questions

How do I install awesome-ai-leaderboard?

Clone the repository with "git clone https://github.com/SAILResearch/awesome-ai-leaderboard" and add it to your Claude Code skills directory (see the Installation section above).
