The nonprofit Center for AI Safety (CAIS) and Scale AI, a company that provides data labeling and AI development services, have introduced a challenging new benchmark for cutting-edge AI systems.
Known as Humanity’s Last Exam, the benchmark comprises thousands of crowdsourced questions spanning mathematics, the humanities, and the natural sciences. The questions come in a variety of formats, some incorporating diagrams and images to make them harder to answer.
In a preliminary study, none of the prominent publicly available AI systems scored above 10% on Humanity’s Last Exam.
CAIS and Scale AI plan to open the benchmark to the research community so that researchers can examine it in greater depth and evaluate new AI models against it.