Track the top-performing AI models across multiple benchmarks. Data sourced from LMSYS Chatbot Arena, Hugging Face, Artificial Analysis, SWE-Bench, and Vectara.
Crowdsourced Elo rankings from blind A/B tests where users compare model outputs without knowing which model generated them.
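For context, a minimal sketch of the pairwise Elo update that underlies rankings like these. The K-factor of 32 and the 400-point scale are conventional defaults, not the Arena's exact parameters:

```python
# Minimal sketch of a pairwise Elo update from one blind A/B vote.
# K=32 and the 400-point scale are conventional defaults, not
# Chatbot Arena's exact parameters.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    rating_a += k * (s_a - e_a)
    rating_b += k * ((1.0 - s_a) - (1.0 - e_a))
    return rating_a, rating_b

# Example: a 1200-rated model upsets a 1300-rated one.
print(elo_update(1200, 1300, a_won=True))  # -> (~1220.5, ~1279.5)
```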
Benchmarks open-weight models on standardized academic tasks including reasoning, knowledge, and instruction following.
Independent assessment that combines multiple evaluation methods into a single comprehensive quality score for each model.
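As an illustration of how a composite score can be assembled, here is a hedged sketch. The metric names and weights are hypothetical, not Artificial Analysis's actual methodology:

```python
# Hypothetical composite-score sketch: metric names and weights are
# illustrative only, not Artificial Analysis's actual formula.

WEIGHTS = {"reasoning": 0.4, "knowledge": 0.3, "coding": 0.3}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted mean of per-benchmark scores, each normalized to 0-100."""
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

print(composite_score({"reasoning": 88.0, "knowledge": 74.0, "coding": 81.0}))
# -> 81.7
```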
Evaluates AI models on real-world software engineering tasks drawn from GitHub issues, testing actual code-generation and bug-fixing ability.
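To see what a SWE-Bench task looks like, the public dataset can be inspected with the Hugging Face `datasets` library. The field names below reflect the published princeton-nlp/SWE-bench schema; treat them as assumptions if the dataset has since changed:

```python
# Sketch of inspecting a SWE-Bench task instance via the Hugging Face
# `datasets` library. Field names follow the public princeton-nlp
# dataset schema and may have changed.

from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

task = swebench[0]
print(task["repo"])               # source repository, e.g. "astropy/astropy"
print(task["problem_statement"])  # the GitHub issue text the model must resolve
print(task["FAIL_TO_PASS"])       # tests that must pass after the model's patch
```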
Vectara's hallucination benchmark measures how often models fabricate information when summarizing documents. Lower is better — 7,700+ articles tested.
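A minimal sketch of how a hallucination rate like this can be computed over document/summary pairs. The `is_consistent` judge below is a hypothetical stand-in for a factual-consistency model such as Vectara's HHEM, not their actual API:

```python
# Sketch of a hallucination-rate calculation over document/summary pairs.
# `is_consistent` is a hypothetical placeholder for a factual-consistency
# judge such as Vectara's HHEM model; it is not their actual API.

def is_consistent(document: str, summary: str) -> bool:
    """Placeholder judge: True if the summary is grounded in the document."""
    return summary in document  # naive substring check, illustration only

def hallucination_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of summaries flagged as fabricated. Lower is better."""
    flagged = sum(1 for doc, summ in pairs if not is_consistent(doc, summ))
    return flagged / len(pairs)

pairs = [("The cat sat on the mat.", "The cat sat on the mat."),
         ("The cat sat on the mat.", "The dog barked all night.")]
print(hallucination_rate(pairs))  # -> 0.5
```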