Mechanistic Interpretability
Problem Definition
It is hard to tell whether LLM benchmarks actually measure the capabilities we intend them to measure.
Solution
We build a separate dataset grounded in cognitive science and use it to measure the importance of an LLM’s parameters. We then remove the important parameters and measure how much performance on LLM benchmarks degrades. The size of that drop indicates whether a benchmark reflects the desired capabilities.
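The pipeline above can be illustrated with a toy sketch. This is not the paper's implementation: the linear "model", the importance score (loss increase on the probe set when one weight is zeroed), and every name below are assumptions chosen only to make the idea concrete.

```python
import random

def predict(weights, x):
    # Toy stand-in for an LLM: a single linear layer.
    return sum(w * xi for w, xi in zip(weights, x))

def mse(weights, data):
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)

def importance_scores(weights, probe_data):
    # Assumed importance measure: loss increase on the probe set
    # when a single parameter is zeroed out.
    base = mse(weights, probe_data)
    scores = []
    for i in range(len(weights)):
        ablated = list(weights)
        ablated[i] = 0.0
        scores.append(mse(ablated, probe_data) - base)
    return scores

def ablate(weights, indices):
    # Remove (zero out) the selected parameters.
    return [0.0 if i in indices else w for i, w in enumerate(weights)]

def degradation(weights, indices, benchmark_data):
    # Loss increase on a benchmark after ablation.
    return mse(ablate(weights, indices), benchmark_data) - mse(weights, benchmark_data)

def make_data(weights, active_features, n, rng):
    # Inputs are nonzero only on `active_features`; targets come from the intact model.
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1) if i in active_features else 0.0
             for i in range(len(weights))]
        data.append((x, predict(weights, x)))
    return data

rng = random.Random(0)
weights = [2.0, -1.0, 0.5, 0.25]

# Hypothetical probe set targeting one "cognitive ability" (features 0 and 1).
probe = make_data(weights, {0, 1}, 50, rng)
# Benchmark A relies on that ability; benchmark B does not.
bench_a = make_data(weights, {0, 1, 2}, 50, rng)
bench_b = make_data(weights, {2, 3}, 50, rng)

scores = importance_scores(weights, probe)
top = set(sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)[:2])

deg_a = degradation(weights, top, bench_a)
deg_b = degradation(weights, top, bench_b)
print("ablated parameters:", sorted(top))
print("degradation on A:", deg_a)
print("degradation on B:", deg_b)
```

In this toy setup, a benchmark that degrades after ablation depends on the probed ability, while one that does not degrade is measuring something else. A real study would also need a control, such as ablating the same number of randomly chosen parameters.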
Achievements
- Revealed that current benchmarks require a diverse range of cognitive abilities.
- Accepted to EMNLP 2025 Main (Oral).
My Role
- Participated as a co-author of the paper.
- Analyzed the experimental data to derive the key results and insights, and incorporated them into the paper.
Method overview.