Despite the abundance of datasets for assessing large language models (LLMs), most lack continuous and reliable difficulty labels for individual data points, which limits their ability to benchmark model generalization across different levels of complexity. To address this limitation, we present Easy2Hard-Bench, a collection of six benchmark datasets with standardized difficulty labels spanning a wide range of domains, including mathematics and programming problems, chess puzzles, and reasoning questions, providing a much-needed resource for evaluating LLMs across varying degrees of difficulty. We estimate the difficulty of individual problems by leveraging the performance data of many human subjects and LLMs on prominent leaderboards. Harnessing this rich performance data, we employ widely recognized difficulty ranking systems, including Item Response Theory (IRT) and Glicko-2 models, to uniformly assign difficulty scores to problems. The Easy2Hard-Bench datasets distinguish themselves from previous collections by incorporating a significantly higher proportion of challenging problems, presenting a novel and demanding test for state-of-the-art LLMs. Through extensive experiments with six state-of-the-art LLMs on Easy2Hard-Bench, we offer insights into their performance and generalization across varying degrees of difficulty, setting the stage for future research on LLM generalization.
Easy2Hard-Bench spans six distinct domains, including mathematics, programming, chess, and various reasoning tasks. Together, these tasks cover a broad spectrum of cognitive challenges commonly faced by LLMs. Every problem in each Easy2Hard-Bench dataset is annotated with a numerical difficulty estimate.
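For illustration, a minimal loading sketch is shown below. The Hugging Face Hub dataset id, config name, split name, and the "rating" column are assumptions and may differ from the released files.

```python
# Minimal sketch of inspecting the per-problem difficulty annotations.
# NOTE: the dataset id, config name, split name, and "rating" column are
# assumptions; adjust them to match the released files.
from datasets import load_dataset

dataset = load_dataset("furonghuang-lab/Easy2Hard-Bench", "E2H-AMC", split="train")

# Each record carries a continuous difficulty score alongside the problem itself.
ratings = sorted(dataset["rating"])
print(f"{len(ratings)} problems, difficulty range [{ratings[0]:.2f}, {ratings[-1]:.2f}]")
```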
We source problems from platforms with abundant, publicly available human performance statistics, which serve as a robust basis for difficulty estimation. For datasets where large-scale human performance data is unavailable, we use evaluation results of LLMs from the Open LLM Leaderboard as a surrogate.
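To illustrate the IRT side of this pipeline, the sketch below fits a simplified one-parameter (Rasch) model to a synthetic solver-by-problem response matrix. The actual estimation in Easy2Hard-Bench, which also uses Glicko-2 ratings and real leaderboard records, is more involved.

```python
# Simplified sketch of IRT-style difficulty estimation with a 1-parameter
# (Rasch) model: P(solve) = sigmoid(ability - difficulty).
# The response matrix here is synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_solvers, n_problems = 200, 50
true_ability = rng.normal(0.0, 1.0, n_solvers)
true_difficulty = rng.normal(0.0, 1.0, n_problems)
p_solve = 1.0 / (1.0 + np.exp(-(true_ability[:, None] - true_difficulty[None, :])))
responses = rng.binomial(1, p_solve)              # 1 = solved, 0 = failed

# Joint maximum-likelihood fit by gradient ascent on the Rasch log-likelihood.
ability = np.zeros(n_solvers)
difficulty = np.zeros(n_problems)
for _ in range(500):
    pred = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
    resid = responses - pred                      # gradient of the log-likelihood w.r.t. the logit
    ability += resid.mean(axis=1)                 # more solves than predicted -> higher ability
    difficulty -= resid.mean(axis=0)              # more solves than predicted -> lower difficulty
    difficulty -= difficulty.mean()               # pin down the common location shift

print("correlation with true difficulty:",
      round(float(np.corrcoef(difficulty, true_difficulty)[0, 1]), 3))
```

The recovered difficulty estimates correlate strongly with the ground truth used to generate the responses, which is the property the benchmark relies on when converting raw performance records into difficulty scores.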
Given the novel and challenging problems in Easy2Hard-Bench, we select six state-of-the-art LLMs for evaluation. We begin by presenting their performance on all Easy2Hard-Bench datasets, segmented into easy, medium, and hard difficulty levels. Performance decreases notably as difficulty increases, validating the effectiveness of our difficulty estimates. The newly curated datasets (E2H-AMC, E2H-Codeforces, E2H-Lichess) are considerably more challenging than the pre-existing ones because they extend the difficulty range well beyond that of existing selections.
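The bucketing itself is straightforward; the sketch below splits problems into difficulty tertiles and reports per-bucket accuracy. The `ratings` and `is_correct` arrays are hypothetical placeholders filled with synthetic values; in practice they come from the dataset annotations and an actual evaluation run.

```python
# Sketch of the easy/medium/hard breakdown over hypothetical evaluation results.
import numpy as np

def accuracy_by_tertile(ratings, is_correct):
    """Split problems into difficulty tertiles and return accuracy per bucket."""
    ratings = np.asarray(ratings, dtype=float)
    is_correct = np.asarray(is_correct, dtype=float)
    edges = np.quantile(ratings, [1 / 3, 2 / 3])
    buckets = np.digitize(ratings, edges)         # 0 = easy, 1 = medium, 2 = hard
    return {name: float(is_correct[buckets == i].mean())
            for i, name in enumerate(["easy", "medium", "hard"])}

# Toy usage with synthetic correctness that degrades as difficulty grows.
rng = np.random.default_rng(0)
ratings = rng.uniform(800, 3500, size=300)        # e.g. Codeforces-style ratings
is_correct = rng.random(300) < np.interp(ratings, [800, 3500], [0.9, 0.05])
print(accuracy_by_tertile(ratings, is_correct))
```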
Furthermore, we plot and analyze model behavior against increasing difficulty for each dataset. As evaluation difficulty increases, most models show monotonically decreasing accuracy, further validating the provided difficulty ratings. While performance generally declines with difficulty, the extent of this decline varies significantly across models and datasets.
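A finer-grained view of the same analysis uses equal-frequency difficulty bins; the sketch below reuses the synthetic `ratings` and `is_correct` arrays from the previous example to draw one accuracy-versus-difficulty curve.

```python
# Sketch of an accuracy-versus-difficulty curve over equal-frequency bins,
# reusing the synthetic `ratings` and `is_correct` from the previous sketch.
import numpy as np
import matplotlib.pyplot as plt

def binned_accuracy(ratings, is_correct, n_bins=20):
    """Return (bin centers, accuracies) over equal-frequency difficulty bins."""
    ratings = np.asarray(ratings, dtype=float)
    is_correct = np.asarray(is_correct, dtype=float)
    edges = np.quantile(ratings, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.digitize(ratings, edges[1:-1])       # bin index in 0 .. n_bins - 1
    centers = np.array([ratings[idx == b].mean() for b in range(n_bins)])
    accs = np.array([is_correct[idx == b].mean() for b in range(n_bins)])
    return centers, accs

centers, accs = binned_accuracy(ratings, is_correct)
plt.plot(centers, accs, marker="o")
plt.xlabel("difficulty rating")
plt.ylabel("accuracy")
plt.title("Accuracy vs. difficulty (synthetic data)")
plt.show()
```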
Instead of only assessing the static behavior of specific checkpoints, Easy2Hard-Bench allows for fine-grained profiling of LLMs as they generalize across various training and evaluation difficulties. This also caters to research settings, such as weak-to-strong generalization, that call for simulating increasingly challenging problems. To the best of our knowledge, Easy2Hard-Bench is the first benchmark to deliver detailed easy-to-hard generalization results for LLMs across a continuous, wide range of difficulties.
In our preliminary experimental exploration, we focus on supervised fine-tuning (SFT) with relatively small LLMs, deferring more specialized fine-tuning frameworks to future studies. LLMs are fine-tuned on training subsets of varying difficulty (y-axis) and evaluated across all evaluation difficulties (x-axis). The color gradient represents the performance difference relative to models trained on randomly selected difficulties with subsets of the same size. We observe generalization benefits when training and evaluation difficulties are similar, and that training on more challenging samples poses increased generalization difficulty.
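A sketch of how such a heatmap can be assembled is given below. The `acc` matrix (training-difficulty rows, evaluation-difficulty columns) and the random-difficulty baseline are synthetic placeholders purely for illustration and do not reproduce the paper's numbers.

```python
# Sketch of the training-difficulty x evaluation-difficulty heatmap.
# acc[i, j] = accuracy after SFT on training bucket i, evaluated on bucket j.
# All values below are synthetic placeholders.
import numpy as np
import matplotlib.pyplot as plt

buckets = ["easy", "medium", "hard"]
idx = np.arange(len(buckets))
rng = np.random.default_rng(0)
acc = np.clip(
    0.60
    - 0.15 * idx[None, :]                                  # harder evaluation -> lower accuracy
    - 0.05 * np.abs(np.subtract.outer(idx, idx))           # matched difficulties help
    + rng.normal(0.0, 0.01, (3, 3)),
    0.0, 1.0,
)
baseline = acc.mean(axis=0)                                 # stand-in for random-subset training

delta = acc - baseline[None, :]                             # gain over the random baseline
plt.imshow(delta, cmap="coolwarm", vmin=-0.05, vmax=0.05)
plt.xticks(idx, buckets)
plt.yticks(idx, buckets)
plt.xlabel("evaluation difficulty")
plt.ylabel("training difficulty")
plt.colorbar(label="accuracy difference vs. random-difficulty training")
plt.show()
```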
@inproceedings{
ding2024easyhardbench,
title={Easy2Hard-Bench: Standardized Difficulty Labels for Profiling {LLM} Performance and Generalization},
author={Mucong Ding and Chenghao Deng and Jocelyn Choo and Zichu Wu and Aakriti Agrawal and Avi Schwarzschild and Tianyi Zhou and Tom Goldstein and John Langford and Anima Anandkumar and Furong Huang},
booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=iNB4uoFQJb}
}