
From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Published on Apr 16
Authors: Itzhak Itay, Eliya Habba, Gabriel Stanosky, Yonatan Belinkov

Abstract

Formalizing vibe-testing as a personalized evaluation process that combines customized prompts with user-specific judgment criteria can better align model assessment with real-world usefulness than traditional benchmarks alone.

AI-generated summary

Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal, experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.

Community

From Feelings to Metrics: Understanding and Formalizing How Users VIBE-TEST LLMs

[Paper]
[Website]
[Contact]


📘 Introduction


We study how people actually compare LLMs in practice, formalize that process as a structured evaluation framework, and implement a runnable pipeline for coding tasks. The core idea is simple: users personalize both what they test and how they judge responses, and those choices can materially change which model is preferred.

[Figure 1]

The paper combines empirical grounding with a full evaluation stack:

  • Real-world evidence from a survey of user evaluation practices and a curated corpus of in-the-wild comparison reports.
  • A formalization of vibe-testing into input dimensions for prompt design and output dimensions for response judgment (see the sketch below).
  • A reproducible pipeline for user profiling, prompt rewriting, objective evaluation, pairwise comparison, and downstream analysis on coding benchmarks.

[Figure: vibe-testing in the wild versus formalized vibe-testing]
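
As a rough illustration of this formalization, the sketch below models a user profile with separate input dimensions (what to test) and output dimensions (how to judge). The field names and example values are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Illustrative user profile; field names are assumptions, not the paper's schema."""
    description: str                                   # short natural-language self-description
    input_dims: dict = field(default_factory=dict)     # P_in: how prompts should be personalized
    output_dims: dict = field(default_factory=dict)    # P_out: what to weigh when judging responses

# Hypothetical profile for a beginner-programmer persona.
beginner = UserProfile(
    description="Self-taught beginner writing small Python scripts for data cleanup.",
    input_dims={
        "domain": "data-cleanup scripts",
        "phrasing": "plain language, little jargon",
    },
    output_dims={
        "clarity_of_explanation": 0.4,
        "step_by_step_guidance": 0.4,
        "code_correctness": 0.2,
    },
)
```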

📊 Key Results

The website narrative starts with the empirical observation behind the paper: benchmark rankings often miss what users actually care about, so people fall back on ad hoc vibe-testing to compare models in practice.

[Figure 1: benchmark failures]
[Figure 2: vibe-testing methods]

Empirical findings

  • 82% of surveyed users reported that they have vibe-tested models.
  • 86% reported that a model had felt meaningfully different from what benchmark scores suggested.
  • 83% expressed interest in more structured or automated vibe-testing workflows.
  • We use in-the-wild model comparison examples to define vibe-testing input dimensions and output dimensions, grounding the framework in how people already test and judge models in practice.
  • The formalization is backed by both survey evidence and a curated corpus of comparison reports from blogs, forums, tech media, and social platforms.

From empirical study to pipeline

Building on those empirical findings, the paper turns vibe-testing into a reproducible evaluation setup. It separates what users test from how users judge outputs, then instantiates that idea as a three-part pipeline: user profiling, vibe dataset construction, and head-to-head model comparison.

Experimental results

The main experimental takeaway is that personalization is not just presentation. In coding experiments, changing both the prompt framing and the comparison criteria can change which model wins. The table below shows the win rate of GPT-5.1 against GPT-4o across the four personas reported in the paper.

Persona        Original prompts    Personalized prompts
Beginner            0.09                 0.94
Intermediate        0.16                 0.77
Researcher          0.63                 0.97
Advanced            0.88                 0.82
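
For readers reproducing this kind of table, the snippet below shows one straightforward way to turn per-prompt pairwise judgments into a per-persona win rate; the data layout and function name are illustrative, not taken from the paper's code.

```python
from collections import defaultdict

def win_rates(judgments):
    """judgments: iterable of (persona, winner) pairs, where winner is 'A', 'B', or 'tie'.
    Returns the fraction of non-tie comparisons won by model A for each persona."""
    wins, totals = defaultdict(int), defaultdict(int)
    for persona, winner in judgments:
        if winner == "tie":
            continue
        totals[persona] += 1
        if winner == "A":
            wins[persona] += 1
    return {persona: wins[persona] / totals[persona] for persona in totals}

# Hypothetical judgments with model A = GPT-5.1 and model B = GPT-4o.
example = [
    ("Beginner", "A"), ("Beginner", "B"), ("Beginner", "A"),
    ("Researcher", "A"), ("Researcher", "tie"), ("Researcher", "A"),
]
print(win_rates(example))  # roughly {'Beginner': 0.67, 'Researcher': 1.0}
```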

[Figure: pairwise dimension comparison, objective criteria, original prompts]
[Figure: pairwise dimension comparison, objective criteria, personalized prompts]

Additional findings from the paper:

  • LLM judges showed credible agreement with humans for pairwise preference judgments.
  • Control paraphrases mostly preserved original rankings, suggesting the strongest shifts come from personalized framing rather than generic rewriting.
  • Personalization effects were strong for some model pairs and weaker for others, which is exactly the kind of user-model interaction the framework is designed to reveal.

🧭 Pipeline Overview

The paper presents the proof-of-concept pipeline as three stages in Figure 3 (a minimal code sketch follows the figure):

  • (A) User profiling: build a structured user profile P from a short natural-language description, including input preferences P_in and output preferences P_out.
  • (B) Vibe dataset construction: rewrite benchmark samples into personalized prompts aligned with P_in, while checking semantic preservation so the task intent remains intact.
  • (C) Model comparison: compare two model responses from the same user perspective using P_out, producing per-dimension head-to-head comparisons.

[Figure 3: pipeline overview]
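
A minimal sketch of these three stages is shown below, assuming a generic call_llm(prompt, model) helper standing in for whatever model backend is used; the prompt templates, JSON profile format, and function names are illustrative assumptions, not the paper's exact implementation.

```python
import json

def call_llm(prompt: str, model: str) -> str:
    """Placeholder for an actual model call (API or local); replace with a real backend."""
    raise NotImplementedError

# (A) User profiling: turn a free-text description into structured P_in / P_out.
def build_profile(description: str, judge_model: str) -> dict:
    prompt = ("Summarize this user's preferences as JSON with keys "
              "'input_dims' and 'output_dims':\n" + description)
    return json.loads(call_llm(prompt, judge_model))

# (B) Vibe dataset construction: rewrite a benchmark prompt for this user,
#     keeping it only if the original task intent is preserved.
def personalize(sample: str, profile: dict, judge_model: str):
    rewritten = call_llm(
        f"Rewrite this task for a user with preferences {profile['input_dims']}, "
        f"without changing what is being asked:\n{sample}",
        judge_model,
    )
    same_intent = call_llm(
        "Do these two prompts ask for the same task? Answer yes or no.\n"
        f"1) {sample}\n2) {rewritten}",
        judge_model,
    )
    return rewritten if same_intent.strip().lower().startswith("yes") else None

# (C) Model comparison: judge two responses from the user's perspective, per output dimension.
def compare(prompt: str, resp_a: str, resp_b: str, profile: dict, judge_model: str) -> dict:
    verdicts = {}
    for dim in profile["output_dims"]:
        verdicts[dim] = call_llm(
            f"For a user who cares most about {dim}, which response is better, A or B?\n"
            f"Prompt: {prompt}\nA: {resp_a}\nB: {resp_b}",
            judge_model,
        )
    return verdicts
```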


🔗 Resources

  • Paper: arXiv
  • Project website
  • Experiment orchestrator docs: scripts/README.md
  • Example experiment config: configs/experiments/example_experiment.yaml
  • Project page source: website/index.html

📚 Citation

@misc{feelings_to_metrics_2026,
  title        = {From Feelings to Metrics: Understanding and Formalizing How Users VIBE-TEST LLMs},
  author       = {Itzhak Itay and Eliya Habba and Gabriel Stanosky and Yonatan Belinkov},
  year         = {2026},
  note         = {Under review},
  url          = {https://arxiv.org/abs/2604.14137}
}

📜 License

MIT License, Copyright (c) 2026


📬 Contact

For questions or collaborations, please open a GitHub issue or reach out via email.
