If you’ve spent any time looking at Arabic LLM evaluations, you’ve probably noticed something feels off. Benchmarks keep popping up, models keep scoring, but the numbers don’t always match real-world performance. I’ve been suspicious of this for a while, and it turns out the problem isn’t the models—it’s the benchmarks themselves.
A team from TII built QIMMA (قمّة, Arabic for “summit”) to tackle this head-on. Instead of just running models on existing benchmarks and calling it a day, they applied a quality validation pipeline first. What they found should make anyone working with Arabic NLP uncomfortable.
The Benchmark Mess Nobody Talks About
Arabic has over 400 million speakers, but its NLP evaluation ecosystem is a mess. The problems are predictable but persistent:
- Translation artifacts. Lots of Arabic benchmarks are just English ones run through a translator. Questions that make sense in English become awkward or culturally nonsensical in Arabic.
- No quality checks. Even native Arabic benchmarks get released without proper validation. Wrong answers, encoding bugs, cultural bias in labels—it’s all there.
- You can’t reproduce anything. Evaluation scripts and per-sample outputs rarely get published, so you can’t verify results or build on them.
- Fragmented coverage. One benchmark tests grammar, another tests trivia, but nobody looks at the whole picture.
QIMMA tries to be the exception. It's open source, its content is 99% native Arabic (the only exception is the code evaluation, which is language-agnostic), and it combines systematic quality validation, code evaluation, and public per-sample outputs. That’s a combination nobody else in the Arabic space has pulled off.
What’s Actually in There
QIMMA consolidates 109 subsets from 14 source benchmarks into a unified suite of over 52,000 samples across 7 domains:
- Cultural — AraDiCE-Culture, ArabCulture, PalmX
- STEM — ArabicMMLU, GAT, 3LM STEM
- Legal — ArabLegalQA, MizanQA
- Medical — MedArabiQ, MedAraBench
- Safety — AraTrust
- Poetry & Literature — FannOrFlop
- Coding — 3LM HumanEval+, 3LM MBPP+
This is the first Arabic leaderboard with code evaluation, which matters more than most people realize. Arabic-language problem statements for coding tasks are rare, and they test something fundamentally different from English prompts.
The Validation Pipeline That Changes Everything
This is where QIMMA earns its keep. Before any model evaluation happens, every single sample goes through a multi-stage filter.
Stage 1: Two models, one rubric. Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B each score every sample against a 10-point rubric with binary criteria. If both models score it below 7/10, it’s out. If only one flags it, it moves to human review.
Stage 2: Native speakers make the call. Flagged samples go to human annotators who are native Arabic speakers with cultural and dialectal familiarity. They handle the stuff automated systems can’t: cultural nuance, regional variation, subjective interpretation.
This pipeline is smarter than the usual approach of just trusting the benchmark creators. It acknowledges that even “established” resources have problems.
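To make the triage rule concrete, here is a minimal Python sketch of how the two judge scores could be turned into a decision. It is not the team's code. The names `Verdict`, `rubric_score`, and `triage` are hypothetical, and it assumes the 10-point rubric is ten binary criteria worth one point each and that a sample flagged by neither judge is kept as-is.

```python
from enum import Enum

# From the article: a judge "flags" a sample when it scores below 7/10.
PASS_THRESHOLD = 7


class Verdict(Enum):
    KEEP = "keep"
    HUMAN_REVIEW = "human_review"
    DROP = "drop"


def rubric_score(criteria):
    """Sum binary rubric criteria into a 0-10 score.

    Assumption: ten criteria, one point each.
    """
    if len(criteria) != 10:
        raise ValueError("expected exactly 10 binary criteria")
    return sum(bool(c) for c in criteria)


def triage(score_a, score_b, threshold=PASS_THRESHOLD):
    """Combine two judge scores into a single verdict.

    Both judges flag -> drop; exactly one flags -> human review;
    neither flags -> keep (the last case is an assumption).
    """
    flags = int(score_a < threshold) + int(score_b < threshold)
    if flags == 2:
        return Verdict.DROP
    if flags == 1:
        return Verdict.HUMAN_REVIEW
    return Verdict.KEEP


if __name__ == "__main__":
    # Made-up scores, purely to exercise the three branches.
    for a, b in [(9, 8), (6, 9), (5, 6)]:
        print(a, b, triage(a, b).value)
```

The useful property of this rule is that disagreement between the two judges never silently decides a sample's fate; it always ends up in front of a native speaker.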
What They Found Is Embarrassing
The pipeline revealed systematic quality issues across benchmarks. Not isolated typos, but structural problems that would quietly corrupt evaluation results. The team hasn’t published the full breakdown yet, but the implications are clear: many existing Arabic LLM scores are inflated or misleading because the benchmarks themselves are broken.
This is worse than I expected, honestly. I knew translation artifacts were a problem, but systematic issues in native Arabic benchmarks suggest the field needs a reset.
The Rankings Once You Clean Things Up
Once the garbage samples were filtered out, the model rankings shifted. The team published the leaderboard on Hugging Face, and the results are worth studying. Models that performed well on unfiltered benchmarks sometimes dropped significantly after quality validation, while others held steady.
I won’t rehash the full leaderboard here—go check it yourself—but the takeaway is that benchmark quality matters as much as model quality. You can’t trust evaluation results if you don’t trust the evaluation data.
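If you want to see why filtering can reorder a leaderboard, a small Python sketch helps. This is an illustration under assumptions, not QIMMA's evaluation script: the `accuracy` helper is hypothetical, and the per-sample results below are invented toy values, not numbers from the leaderboard.

```python
from typing import Optional


def accuracy(correct_by_id: dict, keep_ids: Optional[set] = None) -> float:
    """Accuracy over all samples, or only over the IDs in keep_ids."""
    ids = set(correct_by_id) if keep_ids is None else set(correct_by_id) & keep_ids
    if not ids:
        raise ValueError("no samples left to score")
    return sum(correct_by_id[i] for i in ids) / len(ids)


# Invented per-sample correctness for two hypothetical models.
model_a = {"s1": True, "s2": True, "s3": False, "s4": True}
model_b = {"s1": True, "s2": False, "s3": True, "s4": False}

# Suppose validation dropped s2 and s4 (say, their gold answers were wrong).
kept = {"s1", "s3"}

if __name__ == "__main__":
    for name, results in [("model_a", model_a), ("model_b", model_b)]:
        print(name, accuracy(results), accuracy(results, kept))
```

In this toy case model_a leads on the unfiltered set and model_b leads once the two bad samples are removed. With public per-sample outputs, anyone can rerun this kind of check on real data instead of taking a single aggregate number on faith.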
Why This Matters Beyond Arabic
QIMMA’s approach is applicable to any low-resource language, and honestly, to English too. The assumption that benchmarks are inherently trustworthy is lazy. Every dataset has biases, errors, and artifacts. The question is whether you bother to find them before publishing results.
The team open-sourced everything: the pipeline, the filtered datasets, the evaluation scripts. That’s how it should be. No more black-box leaderboards with unverifiable numbers.
My Take
This is the kind of work that makes the field better for everyone. It’s not flashy, it doesn’t have a buzzy name, but it fixes a fundamental problem. I wish more leaderboards would adopt similar validation pipelines before publishing results.
The only downside is coverage. QIMMA currently has just over 52,000 samples, which is respectable but not exhaustive, and Arabic has diverse dialects and domains that aren’t represented yet. But it’s a start, and it’s the right kind of start.
If you’re building or evaluating Arabic LLMs, stop trusting the old benchmarks. Start with QIMMA.