How many raters do you actually need for AI benchmarks? Google has answers

How many raters do you actually need for AI benchmarks? Google has answers

6 0 0

If you’ve ever worked on ML evaluation, you know the pain: you need human raters to judge model outputs, but every rater costs money and time. So most teams default to 1-5 raters per item and call it a day. Google Research just published a paper that says this is probably wrong.

The paper, “Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation,” is refreshingly practical. Instead of hand-waving about reproducibility, Flip Korn and Chris Welty built a simulator and stress-tested thousands of configurations against real-world datasets for subjective tasks like toxicity and hate speech detection.

The core question: given a fixed budget, is it better to rate many items with few raters (breadth, the forest) or fewer items with many raters (depth, the tree)?

Infographic showing how human disagreement on 'Toxic' vs. 'Not Toxic' labels is often collapsed into a single plurality label.

Historically, the field has favored breadth. Most benchmarks assume 1-5 raters per item is enough to find a single “correct” truth. But human disagreement isn’t noise — it’s signal. Collapsing multiple ratings into a plurality label throws away information, and that directly hurts reproducibility.

Here’s the intuition: if two researchers run the same evaluation but get different results because their small rater pools happened to disagree, the benchmark is useless. Google’s simulator tested this systematically, varying N (total items) from 100 to 50,000 and K (raters per item) from 1 to 500, looking for configurations that hit statistical significance (p < 0.05) reliably.

The results are eye-opening. For many subjective tasks, the tree approach — more raters per item, even if it means fewer items — produces more stable and reproducible rankings. The sweet spot depends on your budget and the level of disagreement in your data, but the standard 1-3 raters is almost never optimal.

I’ve seen this problem firsthand. A few years back, I was evaluating a content moderation model and got wildly different results depending on which 3 raters I used. I assumed the model was unstable. Turns out, the raters were the unstable part.

Google also released an open-source simulator so you can run your own what-if scenarios. Plug in your estimated annotation costs, expected disagreement rates, and budget, and it tells you the optimal N and K. No more guessing.

The paper is worth reading if you care about benchmark integrity. It’s short, the math is accessible, and the simulator is genuinely useful. My only criticism: they don’t address the practical difficulty of finding and managing large rater pools. Getting 50 raters per item is great in theory, but coordinating that many annotators is a logistical nightmare for most teams.

Still, this is the kind of research that should change how we build benchmarks. Stop assuming 3 raters is enough. Run the numbers. Your future self — and the people trying to reproduce your results — will thank you.

Comments (0)

Be the first to comment!