Problem Definition

It is difficult to know whether LLM benchmarks actually measure the capabilities they are intended to measure.

Solution

We build a separate dataset grounded in cognitive science and use it to measure the importance of an LLM’s parameters. We then remove the important parameters and measure the resulting performance degradation on LLM benchmarks. The size of that degradation indicates whether the desired capabilities are reflected in the benchmarks.
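The procedure above can be illustrated with a toy sketch. It is not the paper's implementation: the linear "model", the importance proxy (absolute weight times mean absolute input on the probe set), and all function names are hypothetical stand-ins chosen only to show the score, ablate, and re-evaluate loop.

```python
# Toy illustration only: a linear classifier stands in for an LLM, and the
# importance score is an assumed proxy, not the measure used in the paper.

def importance_scores(weights, probe_inputs):
    """Score each parameter as |weight| * mean |input| over the probe dataset."""
    n = len(probe_inputs)
    return [
        abs(w) * sum(abs(x[j]) for x in probe_inputs) / n
        for j, w in enumerate(weights)
    ]

def ablate_top_k(weights, scores, k):
    """Zero out the k parameters with the highest importance scores."""
    top = sorted(range(len(weights)), key=lambda j: scores[j], reverse=True)[:k]
    return [0.0 if j in top else w for j, w in enumerate(weights)]

def accuracy(weights, benchmark):
    """Fraction of (input, label) pairs where sign(w . x) matches the label."""
    correct = 0
    for x, label in benchmark:
        pred = 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0
        correct += int(pred == label)
    return correct / len(benchmark)

def degradation(weights, probe_inputs, benchmark, k):
    """Benchmark accuracy lost after removing the k most important parameters."""
    scores = importance_scores(weights, probe_inputs)
    ablated = ablate_top_k(weights, scores, k)
    return accuracy(weights, benchmark) - accuracy(ablated, benchmark)

weights = [2.0, 0.1, -0.1]
benchmark = [([1, 0, 0], 1), ([-1, 0, 0], 0), ([1, 0, 0], 1), ([-1, 0, 0], 0)]

# Probe set A exercises parameter 0, which the benchmark relies on.
drop_a = degradation(weights, [[1, 0, 0], [1, 0, 0]], benchmark, k=1)
# Probe set B exercises parameter 2, which the benchmark does not use.
drop_b = degradation(weights, [[0, 0, 1]], benchmark, k=1)
print(drop_a, drop_b)
```

In this toy setting, a large drop after ablation suggests the benchmark depends on the ability the probe set targets, while a drop near zero suggests it does not.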

Achievements

Revealed that current benchmarks require a diverse range of cognitive abilities.
Accepted to EMNLP 2025 Main (Oral).

My Role

Participated as a co-author of the paper.
Analyzed the experimental data to derive key results and insights, and incorporated them into the paper.

Method overview.