🌍 Multilingual MMLU Benchmark Leaderboard: This leaderboard is dedicated to evaluating and comparing the multilingual capabilities of large language models across different languages and cultures.
🔬 MMMLU Dataset: The dataset used for evaluation is the OpenAI MMMLU benchmark, which spans 57 categories, ranging from elementary-level knowledge to advanced professional subjects such as law, physics, history, and computer science. MMMLU contains 14 languages: AR_XY (Arabic), BN_BD (Bengali), DE_DE (German), ES_LA (Spanish), FR_FR (French), HI_IN (Hindi), ID_ID (Indonesian), IT_IT (Italian), JA_JP (Japanese), KO_KR (Korean), PT_BR (Brazilian Portuguese), SW_KE (Swahili), YO_NG (Yoruba), ZH_CN (Simplified Chinese).
🎯 Our Goal is to raise awareness of the importance of improving LLM performance across languages, with a particular focus on cultural contexts. We strive to make LLMs more inclusive and effective for users worldwide.
- "headers": [
- "T",
- "Model",
- "Average ⬆️",
- "AR",
- "BN",
- "DE",
- "ES",
- "FR",
- "HI",
- "ID",
- "IT",
- "JA",
- "KO",
- "PT",
- "SW",
- "YO",
- "ZH",
- "Type",
- "Architecture",
- "Precision",
- "Hub License",
- "#Params (B)",
- "Hub ❤️",
- "Available on the hub",
- "Model sha"
- "data": [
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/Anthropic/Claude-3.5-Sonnet" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">Anthropic/Claude-3.5-Sonnet</a>",
- 77.39,
- 78.48,
- 74.63,
- 81.74,
- 82.77,
- 82.37,
- 75.96,
- 80.49,
- 81.66,
- 79.43,
- 78.95,
- 82.73,
- 71.36,
- 54.46,
- 78.41,
- "instruction-tuned",
- "?",
- "bfloat16",
- "Claude-3.5-Sonnet",
- 0,
- 10000,
- false,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/AIDC/Macro-72B-Chat" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">AIDC/Macro-72B-Chat</a>",
- 76.06,
- 79.33,
- 76.56,
- 80.67,
- 82.56,
- 80.67,
- 76.86,
- 79.2,
- 81.58,
- 79.16,
- 78.77,
- 81.74,
- 63.67,
- 43.96,
- 80.07,
- "instruction-tuned",
- "?",
- "bfloat16",
- "AIDC",
- 72.7,
- 0,
- false,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">meta-llama/Llama-3.1-70B-Instruct</a>",
- 71.67,
- 71.08,
- 66.51,
- 77,
- 79.27,
- 77.92,
- 72.67,
- 75.69,
- 77.83,
- 73.79,
- 72.74,
- 78.89,
- 63.99,
- 41.16,
- 74.79,
- "instruction-tuned",
- "?",
- "bfloat16",
- "llama3.1",
- 70.6,
- 673,
- false,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/openai/GPT4-0125" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">openai/GPT4-0125</a>",
- 70.78,
- 71.12,
- 64.81,
- 75.72,
- 76.79,
- 75.82,
- 70.13,
- 73.68,
- 75.84,
- 71.64,
- 71.32,
- 76.17,
- 68.08,
- 47.26,
- 72.5,
- "instruction-tuned",
- "?",
- "bfloat16",
- "openai",
- 0,
- 10000,
- false,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/Qwen/Qwen2-72B-Instruct" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">Qwen/Qwen2-72B-Instruct</a>",
- 69.15,
- 72.02,
- 68.26,
- 74.36,
- 77.01,
- 75.63,
- 69.87,
- 73.12,
- 75.26,
- 74.05,
- 72.35,
- 76.83,
- 47.31,
- 34.64,
- 77.45,
- "instruction-tuned",
- "Qwen2ForCausalLM",
- "bfloat16",
- "tongyi-qianwen",
- 72.7,
- 675,
- true,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/Qwen/Qwen2.5-72B-Instruct" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">Qwen/Qwen2.5-72B-Instruct</a>",
- 69.05,
- 74.31,
- 67.15,
- 72.46,
- 77.52,
- 75.98,
- 69.05,
- 73.3,
- 72.54,
- 74.65,
- 71.78,
- 76.85,
- 48.84,
- 35.51,
- 76.71,
- "instruction-tuned",
- "Qwen2ForCausalLM",
- "bfloat16",
- "tongyi-qianwen",
- 72.7,
- 452,
- true,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">meta-llama/Meta-Llama-3-70B-Instruct</a>",
- 64.3,
- 60.63,
- 53.77,
- 71.42,
- 74.3,
- 73.17,
- 65.02,
- 70.59,
- 73.33,
- 65.55,
- 64.51,
- 73.74,
- 51.06,
- 33.62,
- 69.5,
- "instruction-tuned",
- "?",
- "bfloat16",
- "llama3",
- 70.6,
- 1430,
- false,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/openai/GPT4o-mini" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">openai/GPT4o-mini</a>",
- 62.63,
- 62.65,
- 59.95,
- 67.96,
- 68.22,
- 67.49,
- 62.22,
- 66.14,
- 68.27,
- 64.41,
- 63.55,
- 68.84,
- 53.11,
- 38.04,
- 65.92,
- "instruction-tuned",
- "?",
- "bfloat16",
- "openai",
- 0,
- 10000,
- false,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/AIDC/Macro-7B-Chat" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">AIDC/Macro-7B-Chat</a>",
- 60.05,
- 60.57,
- 54.36,
- 65.92,
- 67.74,
- 67.58,
- 54.34,
- 62.35,
- 65.42,
- 64.19,
- 62.95,
- 67.61,
- 43.93,
- 37.18,
- 66.54,
- "instruction-tuned",
- "?",
- "bfloat16",
- "AIDC",
- 7.62,
- 0,
- false,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/CohereForAI/aya-expanse-32b" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">CohereForAI/aya-expanse-32b</a>",
- 58.92,
- 61.57,
- 43.9,
- 64.71,
- 67.53,
- 67.46,
- 58.75,
- 65.43,
- 66.46,
- 64.34,
- 62.43,
- 67.19,
- 38.36,
- 33.39,
- 63.41,
- "instruction-tuned",
- "?",
- "float16",
- "cc-by-nc-4.0",
- 32.3,
- 165,
- false,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">Qwen/Qwen2.5-7B-Instruct</a>",
- 56.08,
- 56.4,
- 45.32,
- 62.06,
- 65.62,
- 64.88,
- 47.39,
- 61.66,
- 65.09,
- 60.75,
- 59.31,
- 64.43,
- 35.38,
- 32.32,
- 64.44,
- "instruction-tuned",
- "Qwen2ForCausalLM",
- "bfloat16",
- "tongyi-qianwen",
- 7.62,
- 255,
- true,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/Qwen/Qwen2-7B-Instruct" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">Qwen/Qwen2-7B-Instruct</a>",
- 51.95,
- 50.71,
- 43.36,
- 57.14,
- 60.16,
- 60.83,
- 45.12,
- 54.12,
- 58.99,
- 56.55,
- 53.98,
- 60.11,
- 34.35,
- 30.17,
- 61.78,
- "instruction-tuned",
- "Qwen2ForCausalLM",
- "bfloat16",
- "tongyi-qianwen",
- 7.62,
- 583,
- true,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/CohereForAI/aya-23-35B" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">CohereForAI/aya-23-35B</a>",
- 50.13,
- 51.84,
- 32.94,
- 55.45,
- 57.99,
- 58.08,
- 47.58,
- 55.5,
- 57.81,
- 54.53,
- 53.7,
- 58.33,
- 33.58,
- 30.4,
- 54.06,
- "instruction-tuned",
- "?",
- "float16",
- "cc-by-nc-4.0",
- 35,
- 264,
- false,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">meta-llama/Llama-3.1-8B-Instruct</a>",
- 50.01,
- 42.19,
- 38.8,
- 55.63,
- 59.14,
- 58.94,
- 45.84,
- 54.27,
- 56.27,
- 52.07,
- 50.78,
- 59.02,
- 40.28,
- 31.36,
- 55.52,
- "instruction-tuned",
- "?",
- "bfloat16",
- "llama3.1",
- 8.03,
- 3020,
- false,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/CohereForAI/aya-expanse-8b" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">CohereForAI/aya-expanse-8b</a>",
- 48.2,
- 48.75,
- 33.36,
- 53.91,
- 56.07,
- 55.5,
- 46.2,
- 53.34,
- 55.28,
- 51.49,
- 50.67,
- 55.83,
- 31.96,
- 29.88,
- 52.52,
- "instruction-tuned",
- "?",
- "float16",
- "cc-by-nc-4.0",
- 8.03,
- 271,
- false,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">meta-llama/Meta-Llama-3-8B-Instruct</a>",
- 46.57,
- 40.54,
- 36.43,
- 53.52,
- 55.8,
- 55.79,
- 41.43,
- 51,
- 53.33,
- 42.31,
- 46.54,
- 55.46,
- 37.5,
- 30.96,
- 51.42,
- "instruction-tuned",
- "?",
- "bfloat16",
- "llama3",
- 8.03,
- 3600,
- false,
- "main"
- [
- "⭕",
- "<a target="_blank" href="https://huggingface.co/CohereForAI/aya-23-8B" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">CohereForAI/aya-23-8B</a>",
- 40.96,
- 42.07,
- 27.43,
- 43.26,
- 47.86,
- 46.87,
- 38.87,
- 46.7,
- 47.07,
- 44.57,
- 43.64,
- 46.86,
- 26.17,
- 26.44,
- 45.7,
- "instruction-tuned",
- "?",
- "float16",
- "cc-by-nc-4.0",
- 8.03,
- 391,
- false,
- "main"
- [
- "metadata": null
💡 About the "Multilingual MMLU Benchmark Leaderboard"
Overview
The Multilingual Massive Multitask Language Understanding (MMMLU) benchmark is a comprehensive evaluation platform designed to assess the general knowledge capabilities of AI models across a wide range of domains. It includes a series of Question Answering (QA) tasks across 57 distinct domains, ranging from elementary-level knowledge to advanced professional subjects such as law, physics, history, and computer science.
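To make the QA format concrete, here is an invented MMLU-style item (for illustration only; it is not taken from the dataset): each question carries four options and a single gold answer letter.

```python
# An illustrative MMLU-style item, invented for demonstration purposes.
# Every question in the benchmark follows this shape: one question, four
# options (A-D), and a single gold answer letter.
item = {
    "Subject": "high_school_physics",
    "Question": "Which quantities are conserved in a perfectly elastic collision?",
    "A": "Only momentum",
    "B": "Only kinetic energy",
    "C": "Both momentum and kinetic energy",
    "D": "Neither momentum nor kinetic energy",
    "Answer": "C",
}
```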
Translation Effort
For this evaluation, we use the OpenAI MMMLU dataset, which has been extensively curated and tested for assessing the multilingual understanding of AI models. The dataset covers 14 languages and is specifically designed to measure how well AI models handle a wide range of general-knowledge tasks across 57 domains.
The test set was translated by OpenAI using professional human translators, which ensures a high level of accuracy and reliability for evaluating multilingual models. By leveraging this pre-existing, professionally curated dataset, we can focus on model performance across languages without producing additional translations ourselves.
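For convenience, the translated test sets can be loaded directly from the Hugging Face Hub. Below is a minimal sketch; the config name and the column names (`Question`, `A`–`D`, `Answer`, `Subject`) are assumptions based on the openai/MMMLU dataset card, so check the card if they differ:

```python
# A minimal sketch of loading one locale of the OpenAI MMMLU test set.
# Config and column names are assumptions based on the openai/MMMLU dataset
# card on the Hugging Face Hub; verify them against the card before relying
# on this.
from datasets import load_dataset

mmmlu_fr = load_dataset("openai/MMMLU", "FR_FR", split="test")

row = mmmlu_fr[0]
print(row["Question"])                          # the translated question
print(row["A"], row["B"], row["C"], row["D"])   # the four answer options
print(row["Answer"], row["Subject"])            # gold letter and MMLU category
```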
Commitment to Multilingual AI
By focusing on human-powered translations and publishing both the translated test sets and evaluation code, we aim to promote the development of AI models that can handle multilingual tasks with greater accuracy. This reflects our commitment to improving AI’s performance in underrepresented languages and making technology more inclusive and effective globally.
Locales Covered
The MMMLU benchmark includes a test set translated into the following locales:
- AR_XY: Arabic
- BN_BD: Bengali
- DE_DE: German
- ES_LA: Spanish
- FR_FR: French
- HI_IN: Hindi
- ID_ID: Indonesian
- IT_IT: Italian
- JA_JP: Japanese
- KO_KR: Korean
- PT_BR: Brazilian Portuguese
- SW_KE: Swahili
- YO_NG: Yoruba
- ZH_CN: Simplified Chinese
Purpose
The MMMLU Leaderboard aims to provide a unified benchmark for comparing AI model performance across these multiple languages and diverse domains. With the inclusion of the QA task across 57 domains, it evaluates how well models perform in answering general knowledge questions in multiple languages, ensuring a high standard of multilingual understanding and reasoning.
Goals
Our primary goal is to provide a reliable comparison for AI models across different languages and domains, helping developers and researchers evaluate and improve their models’ multilingual capabilities. By emphasizing high-quality translations and including a broad range of topics, we strive to make AI models more robust and useful across diverse communities worldwide.
🤗 How it works
Submit a model for automated evaluation on our clusters on the "Submit here" tab!
📈 Tasks
We evaluate models on a set of key benchmarks, with a focus on Massive Multitask Language Understanding (MMLU) and its variants, including MMMLU, C-MMLU, ArabicMMLU, KoreanMMLU, MalayMMLU, and others. These benchmarks assess general knowledge across 57 categories, such as law, physics, history, and computer science.
The evaluation is performed using the OpenCompass framework, a unified platform for evaluating language models across multiple tasks. OpenCompass allows us to execute these evaluations efficiently and at scale, covering multiple languages and benchmarks.
For detailed information on the tasks, please refer to the "Tasks" tab in the OpenCompass framework.
Notes:
- The evaluations are all 5-shot.
- Results are aggregated by averaging the scores of all tasks for a given language; the leaderboard's "Average ⬆️" column is the mean of these per-language scores (a minimal sketch of this aggregation follows below).
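The following is a minimal sketch of that aggregation, with illustrative function and variable names (this is not the leaderboard's actual code):

```python
# A minimal sketch of the aggregation described in the notes above.
# Per-language scores are the mean over all tasks (the 57 MMLU categories),
# and the leaderboard "Average" is the mean over the per-language scores.
from statistics import mean

def aggregate(scores_by_language: dict[str, dict[str, float]]) -> dict[str, float]:
    """Map locale (e.g. "FR_FR") -> {task_name: accuracy} to per-language means."""
    per_language = {
        locale: mean(task_scores.values())
        for locale, task_scores in scores_by_language.items()
    }
    per_language["Average"] = mean(per_language.values())
    return per_language

print(aggregate({
    "FR_FR": {"law": 0.81, "physics": 0.74},
    "SW_KE": {"law": 0.55, "physics": 0.48},
}))
# {'FR_FR': 0.775, 'SW_KE': 0.515, 'Average': 0.645}
```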
🔎 Results
You can find:
- Detailed numerical results in the results dataset
- Community queries and running status in the requests dataset
✅ Reproducibility
To reproduce the results, you can use OpenCompass. Since many open-source models cannot fully adhere to the instructions for QA tasks, we post-process the results with Qwen2.5-7B-Instruct, which extracts the answer from each model's output. This is a relatively simple task, so we can generally recover the model's intended answer among the options A, B, C, and D. As not all of our PRs are integrated into the main repository yet, please use our fork:
```bash
git clone git@github.com:BobTsang1995/opencompass.git
cd opencompass
pip install -e .
pip install lmdeploy
python run.py --models lmdeploy_qwen2_7b_instruct --datasets mmmlu_gen_5_shot -a lmdeploy
```
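For reference, here is a hedged sketch of the answer-extraction step described above; the prompt wording, generation settings, and helper name are illustrative assumptions, not the leaderboard's exact implementation:

```python
# A sketch of LLM-based answer extraction: a judge model (Qwen2.5-7B-Instruct)
# reads the evaluated model's free-form output and returns the chosen option
# letter. Prompt and settings are illustrative, not the exact leaderboard code.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
judge = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", device_map="auto"  # assumes accelerate is installed
)

def extract_choice(model_output: str) -> str:
    """Return the single option letter (A/B/C/D) chosen in `model_output`."""
    messages = [{
        "role": "user",
        "content": "Which single option (A, B, C, or D) does the following "
                   f"answer choose? Reply with one letter only.\n\n{model_output}",
    }]
    input_ids = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(judge.device)
    out = judge.generate(input_ids, max_new_tokens=4, do_sample=False)
    return tok.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True).strip()[:1]
```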
🙌 Acknowledgements
This leaderboard was independently developed as a non-profit initiative with the support of several academic institutions, which provided valuable assistance to make this effort possible. We extend our heartfelt gratitude to these institutions for their support.
- Technische Universität München (TUM)
- Tsinghua University
- Universiteit van Amsterdam
- Mohamed Bin Zayed University of Artificial Intelligence
- University of Macau
- Cardiff University
- Nara Institute of Science and Technology
- Shanghai Jiao Tong University
- Dublin City University
- Université Grenoble Alpes
- Universidade de Coimbra
- The Ohio State University
- RMIT University
The entities above are ordered chronologically by the date they joined the project. However, the logos in the footer are ordered by the number of datasets donated.
Thank you in particular to: Yi Zhou (Cardiff University), Yusuke Sakai (Nara Institute of Science and Technology), Yongxin Zhou (Université Grenoble Alpes), Haonan Li (MBZUAI), Jiahui Geng (MBZUAI), Qing Li (MBZUAI), Wenxi Li (Tsinghua University/Shanghai Jiaotong University), Yuanyu Lin (University of Macau), Andy Way (Dublin City University), Zhuang Li (RMIT University), Zhongwei Wan (The Ohio State University), Di Wu (University of Amsterdam), Wen Lai (Technische Universität München, TUM)
For information about the dataset authors please check the corresponding Dataset Cards (linked in the "Tasks" tab) and papers (included in the "Citation" section below). We would like to specially thank the teams that created or open-sourced their datasets specifically for the leaderboard (in chronological order):
- MMMLU: OpenAI
We also thank MacroPolo Team, Alibaba International Digital Commerce for sponsoring the inference GPUs.
🚀 Collaborate!
We would like to create a leaderboard that is as diverse as possible; reach out if you would like us to include your evaluation dataset!
Comments and suggestions are more than welcome! Visit the 👏 Multilingual-MMLU-Benchmark-Leaderboard discussions page, tell us what you think about the MMMLU Leaderboard and how we can improve it, or go ahead and open a PR!
Thank you very much! 💛
Some good practices before submitting a model
1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

# `revision` is the commit/branch you want evaluated (e.g. "main")
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
Note: make sure your model is public!
Note: if your model needs trust_remote_code=True, we do not support this option yet, but we are working on adding it; stay posted!
2) Convert your model weights to safetensors
Safetensors is a new format for storing weights that is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the Extended Viewer!
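If your weights are still stored as pickle checkpoints, one simple route (a sketch, assuming a placeholder model name) is to reload the model with transformers and re-save it with safe serialization:

```python
# A minimal sketch: reload a model and re-save its weights as safetensors.
# "your-model-name" is a placeholder for your actual Hub repo or local path.
from transformers import AutoModel

model = AutoModel.from_pretrained("your-model-name")
model.save_pretrained("your-model-name", safe_serialization=True)
```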
3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗
4) Fill out your model card
When we add extra information about models to the leaderboard, it is automatically taken from the model card.
In case of model failure
If your model is displayed in the FAILED category, its execution stopped.
Make sure you have followed the above steps first.
If everything is done, check that you can launch OpenCompass on your model locally, using the command from the Reproducibility section above without modifications (you can reduce the number of examples per task to speed up the check).