Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

University of Maryland, College Park¹     University of Waterloo²     Carnegie Mellon University³     Microsoft⁴     California Institute of Technology⁵
NeurIPS 2024 Track Datasets and Benchmarks

*Equal Contribution

Easy2Hard-Bench (E2H-Bench): a collection of LLM datasets spanning a wide range of domains, featuring standardized difficulty labels for individual problems.

Abstract

Despite the abundance of datasets available for assessing large language models (LLMs), the scarcity of continuous and reliable difficulty labels for individual data points, in most cases, curtails their capacity to benchmark model generalization performance across different levels of complexity. Addressing this limitation, we present Easy2Hard-Bench, an innovative collection of 6 benchmark datasets featuring standardized difficulty labels and spanning a wide range of domains, such as mathematics and programming problems, chess puzzles, and reasoning questions, providing a much-needed resource for assessing LLMs across varying degrees of difficulty. We estimate the difficulty of individual problems by leveraging the performance data of many human subjects and LLMs on prominent leaderboards. Harnessing this rich performance data, we employ widely recognized difficulty ranking systems, including Item Response Theory (IRT) and the Glicko-2 rating system, to uniformly assign difficulty scores to problems. The Easy2Hard-Bench datasets distinguish themselves from previous collections by incorporating a significantly higher proportion of challenging problems, presenting a novel and demanding test for state-of-the-art LLMs. Through extensive experiments with six state-of-the-art LLMs on the Easy2Hard-Bench datasets, we offer valuable insights into their performance and generalization capabilities across varying degrees of difficulty, setting the stage for future research in LLM generalization.

Benchmark Overview

Easy2Hard-Bench comprises six datasets spanning distinct domains, including mathematics, programming, chess, and various reasoning tasks. These diverse tasks encompass a broad spectrum of prevalent cognitive challenges for LLMs. Each problem in every Easy2Hard-Bench dataset is annotated with a numerical difficulty estimate.

We draw problems from sources with abundant, publicly available human performance statistics, which serve as a robust basis for difficulty estimation. For datasets where large-scale human performance data is unavailable, we use evaluation results of LLMs from the Open LLM Leaderboard as a surrogate.
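As a usage sketch, the snippet below loads one dataset and reads its per-problem difficulty labels. The Hugging Face repo id, config name, split, and "rating" column are assumptions for illustration; consult the released dataset card for the exact identifiers.

# Minimal sketch of loading an Easy2Hard-Bench dataset and inspecting its
# difficulty labels. Repo id, config, split, and column name are assumed.
from datasets import load_dataset

ds = load_dataset("furonghuang-lab/Easy2Hard-Bench", "E2H-AMC", split="eval")

difficulties = ds["rating"]  # one continuous difficulty score per problem
print(f"{len(ds)} problems, difficulty range "
      f"[{min(difficulties):.2f}, {max(difficulties):.2f}]")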

[Figure: overview of the six Easy2Hard-Bench datasets, their domains, and their human/LLM performance data sources]

Difficulty Estimation

The difficulty of each problem is estimated as a continuous value using established statistical models, namely Glicko-2 and Item Response Theory (IRT). This methodology utilizes abundant real-world performance results from humans and leaderboard data from LLMs, providing clearer insight into the difficulty structure of each dataset.
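For concreteness, the sketch below estimates continuous difficulties from a binary response matrix with a one-parameter (Rasch) IRT model fit by gradient ascent. It is an illustrative stand-in under simplified assumptions, not the exact Glicko-2/IRT pipeline used to build Easy2Hard-Bench; the synthetic data and hyperparameters are ours.

# One-parameter (Rasch) IRT: P(correct) = sigmoid(ability - difficulty).
# Abilities and difficulties are fit jointly by gradient ascent on the
# log-likelihood of an observed (subjects x items) 0/1 response matrix.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_rasch(responses, n_iters=2000, lr=0.05):
    n_subjects, n_items = responses.shape
    ability = np.zeros(n_subjects)     # one theta per human/LLM subject
    difficulty = np.zeros(n_items)     # one b per problem
    for _ in range(n_iters):
        p = sigmoid(ability[:, None] - difficulty[None, :])
        resid = responses - p                      # gradient of the log-likelihood w.r.t. the logits
        ability += lr * resid.sum(axis=1) / n_items
        difficulty -= lr * resid.sum(axis=0) / n_subjects
        difficulty -= difficulty.mean()            # pin the scale's location (model is shift-invariant)
    return ability, difficulty

# Synthetic check: recovered difficulties should track the true ones.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=200)
true_b = np.linspace(-2.0, 2.0, 50)
obs = (rng.random((200, 50)) <
       sigmoid(true_theta[:, None] - true_b[None, :])).astype(float)
_, est_b = fit_rasch(obs)
print(np.corrcoef(est_b, true_b)[0, 1])  # close to 1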

The difficulty distribution plots for the E2H datasets show that the problems within each domain cover a wide range of difficulties. For E2H-AMC and E2H-Codeforces, we also plot the difficulty distribution of the portions overlapping with the existing benchmarks MATH and APPS, respectively. The comparisons verify that our datasets include more challenging problems.

[Figure: difficulty distributions of the E2H datasets, with comparisons to the overlapping portions of MATH and APPS]

Benchmarking SoTA LLMs

In light of the novel and challenging problems in Easy2Hard-Bench, we select 6 SoTA LLMs for evaluation. We begin by presenting the performance of these LLMs on all Easy2Hard-Bench datasets, segmented into easy, medium, and hard difficulty levels. Performance notably decreases as difficulty increases, validating the effectiveness of our difficulty estimations. The newly curated datasets (E2H-AMC, E2H-Codeforces, E2H-Lichess) are considerably more challenging than the pre-existing ones, as they greatly extend the difficulty range of existing selections.
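The easy/medium/hard split can be reproduced with a simple tertile binning of the difficulty scores; the per-problem difficulty and 0/1 correctness arrays below are hypothetical inputs from one's own evaluation harness.

# Bin problems into difficulty tertiles and report per-bin accuracy.
import numpy as np

def accuracy_by_tertile(difficulty, correct):
    difficulty = np.asarray(difficulty, dtype=float)
    correct = np.asarray(correct, dtype=float)
    lo, hi = np.quantile(difficulty, [1 / 3, 2 / 3])
    bins = {
        "easy": difficulty <= lo,
        "medium": (difficulty > lo) & (difficulty <= hi),
        "hard": difficulty > hi,
    }
    return {name: float(correct[mask].mean()) for name, mask in bins.items()}

# Toy example: accuracy should fall from easy to hard.
print(accuracy_by_tertile(
    difficulty=[1.0, 1.2, 1.5, 2.0, 2.2, 2.5, 3.0, 3.2, 3.5],
    correct=[1, 1, 1, 1, 0, 1, 0, 0, 0],
))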

[Figure: accuracy of the six evaluated LLMs on each Easy2Hard-Bench dataset, split into easy, medium, and hard levels]

Furthermore, we plot and analyze model behavior against increasing difficulty for each dataset. As evaluation difficulty increases, most models show monotonically decreasing accuracy, validating the correctness of the provided difficulty ratings. While performance generally declines with difficulty, the extent of this decline varies significantly across models and datasets.
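A curve of accuracy against difficulty, similar in spirit to these per-dataset profiles, can be obtained with a sliding-window average over problems sorted by difficulty; the window size and input arrays are illustrative assumptions.

# Sliding-window accuracy profile over problems sorted by difficulty.
import numpy as np

def difficulty_profile(difficulty, correct, window=50):
    order = np.argsort(difficulty)
    d = np.asarray(difficulty, dtype=float)[order]
    c = np.asarray(correct, dtype=float)[order]
    kernel = np.ones(window) / window
    centers = np.convolve(d, kernel, mode="valid")   # window-average difficulty
    accuracy = np.convolve(c, kernel, mode="valid")  # window-average correctness
    return centers, accuracy  # plot accuracy (y) against centers (x)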

[Figure: per-dataset accuracy of each LLM as a function of problem difficulty]

Profiling Easy2Hard Generalizations

Instead of only assessing the static behavior of specific checkpoints, Easy2Hard-Bench allows for fine-grained profiling of LLMs as they generalize across various training and evaluation difficulties. It also caters to settings that require difficulty-controlled problems, such as studies of weak-to-strong generalization. To the best of our knowledge, Easy2Hard-Bench is the first benchmark to deliver detailed easy-to-hard generalization results for LLMs across a continuous, wide range of difficulties.

[Figure: easy-to-hard generalization heatmaps over training difficulty (y-axis) and evaluation difficulty (x-axis)]

In our preliminary experimental exploration, we focus on Supervised Fine-Tuning (SFT) with relatively small LLMs, deferring more specialized finetuning frameworks to future studies. LLMs are trained via SFT on subsets of the training splits at varying difficulty levels (y-axis) and evaluated across all evaluation difficulties (x-axis). The color gradient represents the performance difference relative to models trained on randomly selected subsets of the same size. We observe generalization benefits when training and evaluation difficulties are similar, while training on more challenging samples leads to greater generalization difficulty.
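A minimal sketch of constructing such difficulty-controlled SFT subsets, assuming per-example difficulty scores are available: pick a quantile band of training difficulties plus a size-matched random subset as the baseline referenced above. The helper and its parameters are hypothetical.

# Select a training subset from a difficulty quantile band, plus a random
# subset of the same size to serve as the comparison baseline.
import numpy as np

def difficulty_band_indices(difficulties, q_lo, q_hi, size, seed=0):
    rng = np.random.default_rng(seed)
    d = np.asarray(difficulties, dtype=float)
    lo, hi = np.quantile(d, [q_lo, q_hi])
    band = np.flatnonzero((d >= lo) & (d <= hi))
    band_subset = rng.choice(band, size=min(size, band.size), replace=False)
    random_subset = rng.choice(d.size, size=min(size, d.size), replace=False)
    return band_subset, random_subset

# e.g. fine-tune on the 20th-40th percentile of training difficulty and
# compare against a random subset of the same size.
train_idx, baseline_idx = difficulty_band_indices(
    difficulties=np.random.default_rng(1).normal(size=1000),
    q_lo=0.2, q_hi=0.4, size=100)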

Poster

BibTeX

@inproceedings{ding2024easyhardbench,
  title={Easy2Hard-Bench: Standardized Difficulty Labels for Profiling {LLM} Performance and Generalization},
  author={Mucong Ding and Chenghao Deng and Jocelyn Choo and Zichu Wu and Aakriti Agrawal and Avi Schwarzschild and Tianyi Zhou and Tom Goldstein and John Langford and Anima Anandkumar and Furong Huang},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=iNB4uoFQJb}
}