The Scorecard Broke: Why LLM Benchmarks Stopped Meaning What You Think

Frontier models have saturated MMLU, hit 99% on GSM8K, and scored 93% on HumanEval. The benchmarks aren't measuring what they used to. Here's what actually matters now.