
From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Published on Apr 16
Authors: Itzhak Itay, Eliya Habba, Gabriel Stanosky, Yonatan Belinkov

Abstract

Formalizing vibe-testing as a personalized evaluation process that combines customized prompts with user-specific judgment criteria can better align model assessment with real-world usefulness than traditional benchmarks alone.

AI-generated summary

Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal, experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.

Community

From Feelings to Metrics: Understanding and Formalizing How Users VIBE-TEST LLMs

[Paper]
[Website]
[Contact]


📘 Introduction


We study how people actually compare LLMs in practice, formalize that process as a structured evaluation framework, and implement a runnable pipeline for coding tasks. The core idea is simple: users personalize both what they test and how they judge responses, and those choices can materially change which model is preferred.

[Figure 1]

The paper combines empirical grounding with a full evaluation stack:

  • Real-world evidence from a survey of user evaluation practices and a curated corpus of in-the-wild comparison reports.
  • A formalization of vibe-testing into input dimensions for prompt design and output dimensions for response judgment (see the sketch below).
  • A reproducible pipeline for user profiling, prompt rewriting, objective evaluation, pairwise comparison, and downstream analysis on coding benchmarks.

[Figure: vibe-testing in the wild versus formalized vibe-testing]
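
As a rough illustration of this formalization, the sketch below models a user profile with separate input dimensions (what to test) and output dimensions (how to judge). The field names and example values are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Illustrative user profile; field names are assumptions, not the paper's schema."""
    description: str                                   # short natural-language self-description
    input_dims: dict = field(default_factory=dict)     # P_in: how prompts should be personalized
    output_dims: dict = field(default_factory=dict)    # P_out: what to weigh when judging responses

# Hypothetical profile for a beginner-programmer persona.
beginner = UserProfile(
    description="Self-taught beginner writing small Python scripts for data cleanup.",
    input_dims={
        "domain": "data-cleanup scripts",
        "phrasing": "plain language, little jargon",
    },
    output_dims={
        "clarity_of_explanation": 0.4,
        "step_by_step_guidance": 0.4,
        "code_correctness": 0.2,
    },
)
```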

📊 Key Results

The website narrative starts with the empirical observation behind the paper: benchmark rankings often miss what users actually care about, so people fall back on ad hoc vibe-testing to compare models in practice.

[Figure 1: benchmark failures]
[Figure 2: vibe-testing methods]

Empirical findings

  • 82% of surveyed users reported that they have vibe-tested models.
  • 86% reported that a model had felt meaningfully different from what benchmark scores suggested.
  • 83% expressed interest in more structured or automated vibe-testing workflows.
  • We use in-the-wild model comparison examples to define vibe-testing input dimensions and output dimensions, grounding the framework in how people already test and judge models in practice.
  • The formalization is backed by both survey evidence and a curated corpus of comparison reports from blogs, forums, tech media, and social platforms.

From empirical study to pipeline

Building on those empirical findings, the paper turns vibe-testing into a reproducible evaluation setup. It separates what users test from how users judge outputs, then instantiates that idea as a three-part pipeline: user profiling, vibe dataset construction, and head-to-head model comparison.

Experimental results

The main experimental takeaway is that personalization is not just presentation. In coding experiments, changing both the prompt framing and the comparison criteria can change which model wins. The table below shows the win rate of GPT-5.1 against GPT-4o across the four personas reported in the paper.

Persona        Original prompts    Personalized prompts
Beginner            0.09                 0.94
Intermediate        0.16                 0.77
Researcher          0.63                 0.97
Advanced            0.88                 0.82
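
For readers reproducing this kind of table, the snippet below shows one straightforward way to turn per-prompt pairwise judgments into a per-persona win rate; the data layout and function name are illustrative, not taken from the paper's code.

```python
from collections import defaultdict

def win_rates(judgments):
    """judgments: iterable of (persona, winner) pairs, where winner is 'A', 'B', or 'tie'.
    Returns the fraction of non-tie comparisons won by model A for each persona."""
    wins, totals = defaultdict(int), defaultdict(int)
    for persona, winner in judgments:
        if winner == "tie":
            continue
        totals[persona] += 1
        if winner == "A":
            wins[persona] += 1
    return {persona: wins[persona] / totals[persona] for persona in totals}

# Hypothetical judgments with model A = GPT-5.1 and model B = GPT-4o.
example = [
    ("Beginner", "A"), ("Beginner", "B"), ("Beginner", "A"),
    ("Researcher", "A"), ("Researcher", "tie"), ("Researcher", "A"),
]
print(win_rates(example))  # roughly {'Beginner': 0.67, 'Researcher': 1.0}
```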

[Figure: pairwise dimension comparison, objective criteria, original prompts]
[Figure: pairwise dimension comparison, objective criteria, personalized prompts]

Additional findings from the paper:

  • LLM judges showed credible agreement with humans for pairwise preference judgments.
  • Control paraphrases mostly preserved original rankings, suggesting the strongest shifts come from personalized framing rather than generic rewriting.
  • Personalization effects were strong for some model pairs and weaker for others, which is exactly the kind of user-model interaction the framework is designed to reveal.

🧭 Pipeline Overview

The paper presents the proof-of-concept pipeline as three stages in Figure 3 (a minimal code sketch follows the figure):

  • (A) User profiling: build a structured user profile P from a short natural-language description, including input preferences P_in and output preferences P_out.
  • (B) Vibe dataset construction: rewrite benchmark samples into personalized prompts aligned with P_in, while checking semantic preservation so the task intent remains intact.
  • (C) Model comparison: compare two model responses from the same user perspective using P_out, producing per-dimension head-to-head comparisons.

[Figure 3: pipeline overview]
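
A minimal sketch of these three stages is shown below, assuming a generic call_llm(prompt, model) helper standing in for whatever model backend is used; the prompt templates, JSON profile format, and function names are illustrative assumptions, not the paper's exact implementation.

```python
import json

def call_llm(prompt: str, model: str) -> str:
    """Placeholder for an actual model call (API or local); replace with a real backend."""
    raise NotImplementedError

# (A) User profiling: turn a free-text description into structured P_in / P_out.
def build_profile(description: str, judge_model: str) -> dict:
    prompt = ("Summarize this user's preferences as JSON with keys "
              "'input_dims' and 'output_dims':\n" + description)
    return json.loads(call_llm(prompt, judge_model))

# (B) Vibe dataset construction: rewrite a benchmark prompt for this user,
#     keeping it only if the original task intent is preserved.
def personalize(sample: str, profile: dict, judge_model: str):
    rewritten = call_llm(
        f"Rewrite this task for a user with preferences {profile['input_dims']}, "
        f"without changing what is being asked:\n{sample}",
        judge_model,
    )
    same_intent = call_llm(
        "Do these two prompts ask for the same task? Answer yes or no.\n"
        f"1) {sample}\n2) {rewritten}",
        judge_model,
    )
    return rewritten if same_intent.strip().lower().startswith("yes") else None

# (C) Model comparison: judge two responses from the user's perspective, per output dimension.
def compare(prompt: str, resp_a: str, resp_b: str, profile: dict, judge_model: str) -> dict:
    verdicts = {}
    for dim in profile["output_dims"]:
        verdicts[dim] = call_llm(
            f"For a user who cares most about {dim}, which response is better, A or B?\n"
            f"Prompt: {prompt}\nA: {resp_a}\nB: {resp_b}",
            judge_model,
        )
    return verdicts
```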


🔗 Resources

  • Paper: arXiv
  • Project website
  • Experiment orchestrator docs: scripts/README.md
  • Example experiment config: configs/experiments/example_experiment.yaml
  • Project page source: website/index.html

📚 Citation

@misc{feelings_to_metrics_2026,
  title        = {From Feelings to Metrics: Understanding and Formalizing How Users VIBE-TEST LLMs},
  author       = {Itzhak Itay and Eliya Habba and Gabriel Stanosky and Yonatan Belinkov},
  year         = {2026},
  note         = {Under review},
  url          = {https://arxiv.org/abs/2604.14137}
}

📜 License

MIT License, Copyright (c) 2026


📬 Contact

For questions or collaborations, please open a GitHub issue or reach out via email.
