Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild
Abstract
Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received limited attention so far. We present an empirical study of 57 evaluation harnesses, deriving a five-stage harness model and classifying 16,560 issues by workflow stage and root cause. Most harness operational challenges concentrate in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes of operational challenges are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), which together account for 61.7% of classified issues, spanning both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of provisioning issues, whereas algorithmic error (25.9%) and validation gap (22.5%) dominate assessment issues. Together, these contributions establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.
Community
We just dropped a new paper: Towards Evaluation Engineering
If you've ever fought with a broken eval harness, you're not alone, and now there's data to back that up.
Building on our earlier work on Leaderboard Operations (LBOps), which studied how FM leaderboards operate in the wild, we zoom in on the infrastructure underneath: evaluation harnesses themselves.
In this new paper (arXiv:2605.24213), we empirically studied 57 evaluation harnesses (from LM Eval and HELM to SWE-bench and Ragas), analyzing 19,638 GitHub issues to understand where developers actually struggle.
Some key findings:
- We derive a 5-stage harness workflow: Provisioning → Specification → Execution → Assessment → Reporting
- 41.4% of all issues concentrate in the Specification stage, where harnesses integrate models, datasets, and judges
- The top 3 root causes account for 61.7% of all challenges: unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%)
- Only 8.8% of harnesses support regression alerting, and 22.8% support uncertainty quantification
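
To make the findings above concrete, here is a minimal sketch of how the five stages could map onto code. It is an illustration only: every function name, the `EvalSpec` type, the exact-match metric, the bootstrap interval, and the regression tolerance are assumptions made for this example, not definitions or tooling from the paper or from any harness it studies.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class EvalSpec:
    """Hypothetical specification: a model callable plus a labelled dataset."""
    model: Callable[[str], str]
    dataset: List[Tuple[str, str]]  # (prompt, expected answer)


def provision(seed: int) -> random.Random:
    # Provisioning (assumed): prepare the environment; here only a seeded RNG.
    return random.Random(seed)


def specify(model: Callable[[str], str], dataset: List[Tuple[str, str]]) -> EvalSpec:
    # Specification (assumed): bind model and data, validating inputs up front.
    if not callable(model):
        raise TypeError("model must be callable")
    if not dataset:
        raise ValueError("dataset must not be empty")
    return EvalSpec(model=model, dataset=list(dataset))


def execute(spec: EvalSpec) -> List[str]:
    # Execution (assumed): invoke the model once per prompt.
    return [spec.model(prompt) for prompt, _ in spec.dataset]


def assess(spec: EvalSpec, outputs: List[str], rng: random.Random,
           n_boot: int = 1000) -> Dict[str, float]:
    # Assessment (assumed): exact-match accuracy plus a percentile-bootstrap
    # 95% interval as a simple form of uncertainty quantification.
    hits = [float(out == exp) for out, (_, exp) in zip(outputs, spec.dataset)]
    n = len(hits)
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_boot))
    return {
        "accuracy": sum(hits) / n,
        "ci_low": means[int(0.025 * n_boot)],
        "ci_high": means[int(0.975 * n_boot) - 1],
    }


def report(result: Dict[str, float], baseline: float, tolerance: float = 0.05) -> str:
    # Reporting (assumed): format the score and flag a drop below the baseline.
    line = (f"accuracy={result['accuracy']:.2f} "
            f"[{result['ci_low']:.2f}, {result['ci_high']:.2f}]")
    if result["accuracy"] < baseline - tolerance:
        line += " REGRESSION"
    return line


def run_harness(model: Callable[[str], str], dataset: List[Tuple[str, str]],
                baseline: float, seed: int = 0) -> Tuple[Dict[str, float], str]:
    rng = provision(seed)
    spec = specify(model, dataset)
    outputs = execute(spec)
    result = assess(spec, outputs, rng)
    return result, report(result, baseline)
```

For example, `run_harness(str.upper, [("a", "A"), ("b", "B"), ("c", "x")], baseline=0.9)` would score a toy "model" that uppercases its input. Validating in `specify` rather than deep inside `execute` is one way to surface a bad input before any model call is made.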
We argue these findings establish the case for Evaluation Engineering (EvalEng) as a distinct SE concern: the operational counterpart to benchmark design, analogous to how DevOps sits alongside software development.
Would love to hear from harness maintainers and users: what's your biggest pain point?