Finn's Take · TL;DR
When artificial intelligence researchers realized that LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, they faced a sobering reality: their tests had become too easy. The solution was radical: create the hardest academic evaluation ever designed for AI systems. The result is Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. The dataset consists of 2,500 challenging questions across over a hundred subjects.
It was created by the Center for AI Safety (CAIS) and Scale AI, with contributions from nearly 1,000 subject-matter experts across more than 500 institutions in 50 countries. The questions span everything from translating ancient inscriptions to analyzing quantum mechanics, all crafted to probe the absolute limits of machine intelligence.
The benchmark's creators had a clear mission: to reveal, precisely and systematically, what AI cannot do, at least not yet. Unlike traditional tests that measure memorization, these questions demand genuine expertise and cannot be answered through quick internet searches.
In February 2026, Google's Gemini 3 Deep Think shattered expectations by scoring 48.4% on Humanity's Last Exam (without tools), demonstrating the ability to resolve high-level, multi-step logical chains that were previously considered 'too human' for AI to solve. This represents a massive leap from the early results when OpenAI's o1 system notched the top spot with a score of just 8.3%.
The achievement is best read against human performance: human experts score around 90% in their respective domains. While Google's AI still trails significantly, the rapid progress is undeniable. In roughly one year, top scores have climbed from single digits to nearly 50%. The pace of improvement is remarkable, but the best model still fails more than half of the questions.
Beyond the headline score, Gemini 3 Deep Think demonstrated remarkable capabilities across multiple domains. The model achieved 84.6% on ARC-AGI-2 (verified by the ARC Prize Foundation), proving it can learn novel tasks and generalize logic rather than relying on memorized training data. It also achieved gold medal-level performance on international science olympiads, suggesting genuine problem-solving ability rather than mere pattern matching.
The implications extend far beyond academic benchmarks. Early testing reveals practical applications that could reshape research workflows. Lisa Carbone, a mathematician at Rutgers University, works on the mathematical structures required by the high-energy physics community to bridge the gap between Einstein's theory of gravity and quantum mechanics. In a field with very little existing training data, she used Deep Think to review a highly technical mathematics paper. Deep Think successfully identified a subtle logical flaw that had previously passed through human peer review unnoticed.
At Duke University, researchers used the system to solve complex materials science challenges, with the model proposing a new chemical synthesis recipe that achieved the desired thickness where previous approaches had failed. These real-world successes suggest the technology is moving beyond test performance toward genuine utility in research environments.
Despite the impressive scores, experts emphasize a crucial distinction. High accuracy on HLE would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or "artificial general intelligence." The test measures structured academic knowledge, not the open-ended creativity and autonomous reasoning that would characterize true AGI.
On Humanity's Last Exam, models show calibration errors ranging from 34% to 89%. Calibration error measures the gap between how confident a model says it is and how often it is actually right, so errors this large mean AI systems are systematically overconfident. That overconfidence reveals a fundamental limitation: these systems don't reliably know the boundaries of their own knowledge, a critical gap that separates current AI from genuine intelligence.
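To make the idea of calibration error concrete, here is a minimal Python sketch of one common way to compute it: group answers by stated confidence, compare each group's average confidence with its actual accuracy, and take a weighted root-mean-square of the gaps. The function name `rms_calibration_error`, the ten equal-width bins, and the toy data are assumptions for illustration only; the benchmark's own scoring procedure may differ in its details.

```python
import math


def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS calibration error (illustrative sketch, not HLE's official scorer).

    confidences: stated confidences in [0, 1], one per answer
    correct:     booleans, True where the answer was right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one answer")

    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's squared gap by the share of answers it holds.
        total += (len(members) / n) * (mean_conf - accuracy) ** 2
    return math.sqrt(total)


# Toy example: a model that claims 90% confidence on ten answers
# but gets only one of them right.
overconfident = rms_calibration_error([0.9] * 10, [True] + [False] * 9)
```

In this toy case the model claims 90% confidence but is right only 10% of the time, so the calibration error comes out at roughly 0.8, or 80%. A well-calibrated model, whose stated confidence matches its hit rate, would score close to zero.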
As researchers continue pushing the boundaries of what machines can accomplish, Humanity's Last Exam serves as both a milestone and a reality check. The rapid progress is undeniable, but the gulf between artificial intelligence and human understanding remains vast. The race isn't just about reaching higher scores—it's about building AI systems that can truly comprehend and reason about the world with the depth and nuance that defines human expertise.