From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs
Abstract
Formalizing vibe-testing as a personalized evaluation process that combines customized prompts with user-specific judgment criteria can better align model assessment with real-world usefulness than traditional benchmarks alone.
Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.
Introduction
We study how people actually compare LLMs in practice, formalize that process as a structured evaluation framework, and implement a runnable pipeline for coding tasks. The core idea is simple: users personalize both what they test and how they judge responses, and those choices can materially change which model is preferred.
The paper combines empirical grounding with a full evaluation stack:
- Real-world evidence from a survey of user evaluation practices and a curated corpus of in-the-wild comparison reports.
- A formalization of vibe-testing into input dimensions for prompt design and output dimensions for response judgment.
- A reproducible pipeline for user profiling, prompt rewriting, objective evaluation, pairwise comparison, and downstream analysis on coding benchmarks.
Key Results
The website narrative starts with the empirical observation behind the paper: benchmark rankings often miss what users actually care about, so people fall back on ad hoc vibe-testing to compare models in practice.
Empirical findings
- 82% of surveyed users reported that they have vibe-tested models.
- 86% reported that a model had felt meaningfully different from what benchmark scores suggested.
- 83% expressed interest in more structured or automated vibe-testing workflows.
- We use in-the-wild model comparison examples to define vibe-testing input dimensions and output dimensions, grounding the framework in how people already test and judge models in practice.
- The formalization is backed by both survey evidence and a curated corpus of comparison reports from blogs, forums, tech media, and social platforms.
From empirical study to pipeline
Building on those empirical findings, the paper turns vibe-testing into a reproducible evaluation setup. It separates what users test from how users judge outputs, then instantiates that idea as a three-part pipeline: user profiling, vibe dataset construction, and head-to-head model comparison.
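As a rough illustration of that separation, the three stages can be sketched as black-box functions. The names and stub logic below are our own illustrative stand-ins, not the paper's API; each stage would be backed by an LLM in the real pipeline:

```python
def build_profile(description: str) -> dict:
    """Stage 1 (user profiling): derive a structured profile from free text.
    Stubbed here with fixed preferences for illustration."""
    return {"input_prefs": {"domain": "web scraping"},
            "output_prefs": ["clarity", "correctness"]}

def personalize_prompt(prompt: str, profile: dict) -> str:
    """Stage 2 (vibe dataset construction): rewrite a benchmark prompt
    toward the user's input preferences."""
    domain = profile["input_prefs"]["domain"]
    return f"[context: {domain}] {prompt}"

def compare(resp_a: str, resp_b: str, profile: dict) -> str:
    """Stage 3 (model comparison): pick a winner from the user's
    perspective. Toy heuristic: the shorter response wins on 'clarity'."""
    return "A" if len(resp_a) <= len(resp_b) else "B"

profile = build_profile("Hobbyist who scrapes sites with Python")
prompt = personalize_prompt("Write a function that parses a URL.", profile)
winner = compare("def f(u): ...", "def parse(url):\n    ...", profile)
```

The point of the decomposition is that personalization enters twice: once when the prompt is rewritten, and again when the responses are judged.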
Experimental results
The main experimental takeaway is that personalization is not just presentation. In coding experiments, changing both the prompt framing and the comparison criteria can change which model wins. The table below shows the win rate of GPT-5.1 against GPT-4o across the four personas reported in the paper.
| Persona | Original prompts | Personalized prompts |
|---|---|---|
| Beginner | 0.09 | 0.94 |
| Intermediate | 0.16 | 0.77 |
| Researcher | 0.63 | 0.97 |
| Advanced | 0.88 | 0.82 |
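To make the flip concrete, here is a small check (win rates copied from the table above) of which personas switch their preferred model, reading a GPT-5.1 win rate above 0.5 as "GPT-5.1 preferred":

```python
# (original-prompt win rate, personalized-prompt win rate) for GPT-5.1 vs GPT-4o
win_rates = {
    "Beginner":     (0.09, 0.94),
    "Intermediate": (0.16, 0.77),
    "Researcher":   (0.63, 0.97),
    "Advanced":     (0.88, 0.82),
}

# Personas where personalization changes which model is preferred
flips = [persona for persona, (orig, pers) in win_rates.items()
         if (orig > 0.5) != (pers > 0.5)]
print(flips)  # prints ['Beginner', 'Intermediate']
```

For the two less experienced personas, personalization reverses the verdict outright; for the other two, it shifts the margin without changing the winner.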
Additional findings from the paper:
- LLM judges showed credible agreement with humans for pairwise preference judgments.
- Control paraphrases mostly preserved original rankings, suggesting the strongest shifts come from personalized framing rather than generic rewriting.
- Personalization effects were strong for some model pairs and weaker for others, which is exactly the kind of user-model interaction the framework is designed to reveal.
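The judge-agreement finding can be operationalized as a simple agreement rate over pairwise verdicts. This sketch uses toy verdicts, not the paper's data, purely to show the computation:

```python
# Hypothetical pairwise verdicts ("A" wins, "B" wins, or "tie")
human = ["A", "B", "A", "A", "tie", "B"]
judge = ["A", "B", "A", "B", "tie", "B"]

# Fraction of items where the LLM judge matches the human annotator
agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
print(round(agreement, 3))  # prints 0.833
```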
Pipeline Overview
The paper presents the proof-of-concept pipeline as three stages in Figure 3:
- (A) User profiling: build a structured user profile `P` from a short natural-language description, including input preferences `P_in` and output preferences `P_out`.
- (B) Vibe dataset construction: rewrite benchmark samples into personalized prompts aligned with `P_in`, while checking semantic preservation so the task intent remains intact.
- (C) Model comparison: compare two model responses from the same user perspective using `P_out`, producing per-dimension head-to-head comparisons.
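One way to picture the profile `P` from stage (A) is as a small structured record holding both preference sets. The class and field names below are our own illustration of the idea, not the paper's schema:

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Hypothetical stand-in for the structured profile P."""
    description: str  # short natural-language self-description
    input_prefs: dict = field(default_factory=dict)   # P_in: how prompts should look
    output_prefs: dict = field(default_factory=dict)  # P_out: how responses are judged

beginner = UserProfile(
    description="New to Python, learning by building small scripts",
    input_prefs={"tone": "plain language", "context": "small hobby projects"},
    output_prefs={"clarity": "step-by-step explanations",
                  "style": "commented code"},
)
```

Stage (B) would then condition prompt rewriting on `input_prefs`, and stage (C) would score each comparison dimension against `output_prefs`.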
Resources
- Paper: arXiv
- Project website: project website
- Experiment orchestrator docs: `scripts/README.md`
- Example experiment config: `configs/experiments/example_experiment.yaml`
- Project page source: `website/index.html`
Citation
@misc{feelings_to_metrics_2026,
  title = {From Feelings to Metrics: Understanding and Formalizing How Users VIBE-TEST LLMs},
  author = {Itzhak Itay and Eliya Habba and Gabriel Stanosky and Yonatan Belinkov},
  year = {2026},
  note = {Under review},
  url = {https://arxiv.org/abs/2604.14137}
}
License
MIT License, Copyright (c) 2026
Contact
For questions or collaborations, please open a GitHub issue or reach out via email:
- Email: itay1itzhakdotgmail.com